
PROCEEDINGS OF THE SIXTH INTERNATIONAL CONFERENCE ON

ADVANCES IN PATTERN RECOGNITION

Indian Statistical Institute, Kolkata, India
2-4 January 2007

EDITOR
PINAKPANI PAL
Indian Statistical Institute, India

World Scientific
NEW JERSEY • LONDON • SINGAPORE • BEIJING • SHANGHAI • HONG KONG • TAIPEI • CHENNAI


Published by

World Scientific Publishing Co. Pte. Ltd.

5 Toh Tuck Link, Singapore 596224

USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601

UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

ADVANCES IN PATTERN RECOGNITION Proceedings of the Sixth International Conference on Advances in Pattern Recognition (ICAPR 2007)

Copyright © 2007 by World Scientific Publishing Co. Pte. Ltd.

All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.

ISBN-13 978-981-270-553-2 ISBN-10 981-270-553-8

Printed in Singapore by B & JO Enterprise


PREFACE

The Electronics and Communication Sciences Unit (ECSU) of the Indian Statistical Institute is organizing the sixth International Conference on Advances in Pattern Recognition (ICAPR 2007) at the Indian Statistical Institute, Kolkata, from 2nd to 4th January, 2007. Since the advent of the knowledge-based computing paradigm, pattern recognition has become an active area of research involving scientists and engineers from different disciplines of the physical and earth sciences. A number of conferences are organized every year, which act as platforms to present and exchange ideas on different facets of pattern recognition. It is needless to mention that ICAPR has carved out a unique niche within this list of conferences on pattern recognition, particularly for its continued success in focusing upon application-driven research. We are confident that the programme of this ICAPR will be as exciting as the previous ones.

You may be aware of the overwhelming response we received since the publication of the call for papers for ICAPR 2007 in February 2006. We received 123 papers from 32 different countries. Given the constraint on time, like any other three-day conference, it was indeed difficult for us to select only a few out of these high-quality technical contributions. I am thankful to the learned members of the programme committee whose untiring effort helped me to ultimately select a total of 68 papers for oral presentation. The selected papers represent a number of important frontiers in pattern recognition, ranging from Biometrics, Document Analysis, and Image Registration & Transmission to traditional areas like Image Segmentation, Multimedia Object Retrieval, Shape Recognition and Speech & Signal Analysis. I am happy that we shall see an excellent balance of theory- and application-focused research in the programmes of ICAPR 2007. Another important aspect of the programme will be a series of invited talks by renowned exponents in the field of pattern recognition and related areas. We look forward to listening to plenary speakers Prof. K. V. Mardia, Prof. T. Tan and Prof. E. J. Delp. I am confident that it will also be a rewarding experience for all of us to interact with our invited speakers Prof. I. Bloch and Prof. V. Lakshmanan.

Publication of a proceedings of this standard requires tremendous infrastructural support. I am fortunate to have a very active advisory committee who extended their support whenever we required it. The organizing committee is working hard to make the event a grand success. The editorial workload was huge, but Bibhas Chandra Dhara and Partha Pratim Mohanta made it easy for me through their hard work and dedication.

I must acknowledge the staff of the ECSU for their untiring support to the conference secretariat. I am particularly thankful to N. C. Deb, D. K. Gayen and S. K. Shaw for their support of the technical work. The administrative responsibilities are being handled by S. K. Seal, S. Sarkar, D. Mitra, R. Chatterjee, D. Shaw and S. S. Das, supported by S. Deb.

I am also thankful to the WebReview Team, Katholieke Universiteit Leuven, ESAT/COSIC, for letting me use their WebSubmission and WebReview software. This made our job easier. Of course, the World Scientific editorial team lent a graceful touch to the printed format of this publication. I also acknowledge the help of Subhasis Kumar Pal for keeping our webserver problem-free.

We also thank our sponsors for their kind help and support. I conclude with my heartfelt thanks to the contributors for submitting their papers, which they will now present before the august audience of ICAPR. I am sure that this collection of papers and their presentation will motivate us to explore further the research and advances made in pattern recognition. Thank you.

Pinakpani Pal Electronics and Communication Sciences Unit Indian Statistical Institute


INTERNATIONAL ADVISORY COMMITTEE

Chairman

Sankar Kumar Pal, India

Members

Shun-ichi Amari, Japan Gabriella Sanniti di Baja, Italy

Horst Bunke, Switzerland Bulusu Lakshmana Deekshatulu, India

S.C. Dutta Roy, India Vito Di Gesu, Italy J.K. Ghosh, India

Anil K. Jain, USA Nikola Kasabov, New Zealand

Rangachar Kasturi, USA M. Kunt, Switzerland

C.T. Lin, Taiwan M.G.K. Menon, India

A.P. Mitra, India Heinrich Niemann, Germany

Witold Pedrycz, Canada V.S. Ramamurthy, India

C.R. Rao, USA Erkki Oja, Finland

Lipo Wang, Singapore Jacek M. Zurada, USA


General Chair

D. Dutta Majumder, ISI, Kolkata

Plenary Chair

Nikhil Ranjan Pal, ISI, Kolkata

Tutorial Chair

Bhabatosh Chanda, ISI, Kolkata

Organizing Committee

Arun Kumar De (Chairman), ISI, Kolkata Partha Pratim Mohanta (Convener), ISI, Kolkata

B. D. Acharya, DST, New Delhi Aditya Bagchi, ISI, Kolkata

Bhabatosh Chanda, ISI, Kolkata Bidyut Baran Chaudhuri, ISI, Kolkata Narayan Chandra Deb, ISI, Kolkata Malay Kumar Kundu, ISI, Kolkata

Jharna Majumdar, ADE, Bangalore Dipti Prasad Mukherjee, ISI, Kolkata

C. A. Murthy, ISI, Kolkata Nikhil Ranjan Pal, ISI, Kolkata

Srimanta Pal, ISI, Kolkata S. Rakshit, CARE, Bangalore

Kumar Sankar Ray, ISI, Kolkata Bimal Roy, ISI, Kolkata

S. K. Sarkar, NPL, New Delhi Swapan Kumar Seal, ISI, Kolkata

Bhabani Prasad Sinha, ISI, Kolkata


INTERNATIONAL PROGRAMME COMMITTEE

Abhik Mukherjee, BESU, Shibpur Abraham Kandel, University of South Florida Tampa

Amit Das, BESU, Shibpur Amita Pal, ISI, Kolkata

Amitabha Mukerjee, IIT, Kanpur Anup Basu, University of Alberta

Basabi Chakraborty, Iwate Prefectural University, Japan Bhabatosh Chanda, ISI, Kolkata

Bhargab Bhattacharya, ISI, Kolkata Bidyut Baran Chaudhuri, ISI, Kolkata

Bimal Roy, ISI, Kolkata Brian C. Lovell, The University of Queensland, Australia

C. A. Murthy, ISI, Kolkata C. V. Jawahar, IIIT, Hyderabad

Dipti Prasad Mukherjee, ISI, Kolkata Hisao Ishibuchi, Osaka Prefecture University, Japan

Irina Perfiljeva, University of Ostrava, Czech Republic Isabelle Bloch, ENST, France

Jayanta Mukhopadhyay, IIT, Kharagpur Koczy T. Laszlo, Hungary

Kumar Shankar Ray, ISI, Kolkata Lipo Wang, Nanyang Technological University, Singapore

Malay Kundu, ISI, Kolkata Mrinal Mondal, University of Alberta, Canada

Nikhil Ranjan Pal, ISI, Kolkata Niladri Chatterjee, IIT, Delhi

Okyay Kaynak, Bogazici University, Turkey Olli Simula, Helsinki University of Technology, Finland Oscar Castillo, Tijuana Institute of Technology, Mexico

Punam Saha, University of Pennsylvania, USA Ryszard S. Choras, Institute of Telecommunications, Poland

Sanjoy Saha, JU, Kolkata Sansanee Auephanwiriyakul, Chiang Mai University, Thailand

Scott Acton, University of Virginia, USA Sid Ray, Monash University, Australia

Somnath Sengupta, IIT, Kharagpur Soo-Young Lee, Korea Advanced Institute of Sc. & Technology, Korea

Subhashis Banerjee, IIT, Delhi Subhasis Choudhury, IIT, Bombay

Sukhendu Das, IIT, Madras Sung-Bae Cho, Yonsei University, Korea

Takeshi Furuhashi, Nagoya University, Japan Visvanathan Ramesh, Siemens Corporate Research Inc., USA

Yutaka Hata, University of Hyogo, Japan Pinakpani Pal (Chairman), ISI, Kolkata


ADDITIONAL REVIEWERS

Aditya Bagchi, ISI, Kolkata Arijit Bishnu, IIT, Kharagpur

Arun K. De, ISI, Kolkata Ashish Ghosh, ISI, Kolkata

Bibhas Chandra Dhara, Jadavpur University Debrup Chakraborty, CINVESTAV, IPN, Mexico

Durga Prasad Muni, ISI, Kolkata Mandar Mitra, ISI, Kolkata

Nilanjan Ray, University of Alberta, Canada Oscar Montiel, Tijuana Institute of Technology, Mexico Patricia Melin, Tijuana Institute of Technology, Mexico

Roberto Sepulveda, Tijuana Institute of Technology, Mexico Somitra Kumar Sanadhya, ISI, Kolkata

Srimanta Pal, ISI, Kolkata Subhamay Maitra, ISI, Kolkata

Swapan Kumar Parui, ISI, Kolkata Umapada Pal, ISI, Kolkata Utpal Garain, ISI, Kolkata


SPONSORS

Adobe

RD INDIA SCIENCE LAB

Reserve Bank of India


CONTENTS


Preface v

International Advisory Committee vii

Organizing Committee viii

International Programme Committee ix

Additional Reviewers x

Sponsors xi

Part A Plenary Lecture 1

Why Statistical Shape Analysis is Pivotal to the Modern Pattern Recognition? 3 Kanti V. Mardia

Part B Invited Lectures 13

On the Interest of Spatial Relations and Fuzzy Representations for Ontology-Based Image Interpretation 15 Isabelle Bloch, Celine Hudelot and Jamal Atif

A Technique for Creating Probabilistic Spatio-Temporal Forecasts 26

V. Lakshmanan and Kiel Ortega

Part C Biometrics 33

An Efficient Measure for Individuality Detection in Dynamic Biometric Applications 35 B. Chakraborty and Y. Manabe

Divide-and-Conquer Strategy Incorporated Fisher Linear Discriminant Analysis: An Efficient Approach for Face Recognition 40

S. Noushath, G. Hemantha Kumar, V. N. Manjunath Aradhya and P. Shivakumara

Ear Biometrics: A New Approach 46 Anupam Sana, Phalguni Gupta and Ruma Purkait

Face Detection using Skin Segmentation as Pre-Filter 51 Shobana L., Anil Kr. Yekkala and Sameen Eajaz

Face Recognition Using Symbolic KDA in the Framework of Symbolic Data Analysis 56 P. S. Hiremath and C. J. Prabhakar


Minutiae-Orientation Vector Based Fingerprint Matching 62 Li-min Yang, Jie Yang and Yong-liang Zhang

Recognition of Pose Varied Three-Dimensional Human Faces Using Structured Lighting Induced Phase Coding 66

Debesh Choudhury

Writer Recognition by Analyzing Word Level Features of Handwritten Documents 73 Prakash Tripathi, Bhabatosh Chanda and Bidyut Baran Chaudhuri

Part D Clustering Algorithms 79

A New Symmetry Based Genetic Clustering Technique for Automatic Evolution of Clusters 81 Sriparna Saha and Sanghamitra Bandyopadhyay

A Non-Hierarchical Clustering Scheme for Visualization of High Dimensional Data 88 G. Chakraborty, B. Chakraborty and N. Ogata

An Attribute Partitioning Approach to Correlation Connected Clusters 93

Vijaya Kumar Kadappa and Atul Negi

Part E Document Analysis 99

A Hybrid Scheme for Recognition of Handwritten Bangla Basic Characters Based on HMM and MLP Classifiers 101

U. Bhattacharya, S. K. Parui and B. Shaw

An Efficient Method for Graphics Segmentation from Document Images 107 S. Mandal, S. P. Chowdhury, A. K. Das and B. Chanda

Identification of Indian Languages in Romanized Form 112 Pratibha Yadav, Girish Mishra and P. K. Saxena

Online Bangla Handwriting Recognition System 117

K. Roy, N. Sharma, T. Pal and U. Pal

Oriya Off-Line Handwritten Character Recognition 123 U. Pal, N. Sharma and F. Kimura

Recognition of Handwritten Bangla Vowel Modifiers 129 S. K. Parui, U. Bhattacharya and S. K. Ghosh

Template-Free Word Spotting in Low-Quality Manuscripts 135 Huaigu Cao and Venu Govindaraju

Unconstrained Handwritten Digit Recognition: Experimentation on MNIST Database 140 V. N. Manjunath Aradhya, G. Hemantha Kumar and S. Noushath


Part F Image Registration and Transmission 145

An Adaptive Background Model for Camshift Tracking with a Moving Camera 147 R. Stolkin, I. Florescu, G. Kamberov

Colour and Feature Based Multiple Object Tracking Under Heavy Occlusions 152 Pabboju Sateesh Kumar, Prithwijit Guha and Amitabha Mukerjee

DCT Properties as Handle for Image Compression and Cryptanalysis 157 Anil Kr. Yekkala, C. E. Veni Madhavan and Narendranath Udupa

Genetic Algorithm for Improvement in Detection of Hidden Data in Digital Images 164 Santi P. Maity, Prasanta K. Nandi and Malay K. Kundu

High Resolution Image Reconstruction from Multiple UAV Imagery 170 Jharna Majumdar, B. Vanathy and Lekshmi S.

Image Registration and Object Tracking via Affine Combination 175 Nilanjan Ray and Dipti Prasad Mukherjee

Progressive Transmission Scheme for Color Images Using BTC-PF Method 180 Bibhas Chandra Dhara and Bhabatosh Chanda

Registration Algorithm for Motion Blurred Images 186 K. V. Arya and P. Gupta

Part G Image Segmentation 191

Aggregation Pheromone Density Based Change Detection in Remotely Sensed Images 193 Megha Kothari, Susmita Ghosh and Ashish Ghosh

Automatic Brain Tumor Segmentation Using Symmetry Analysis and Deformable Models 198 Hassan Khotanlou, Olivier Colliot and Isabelle Bloch

Edge Recognition in MMWave Images by Biorthogonal Wavelet Decomposition and Genetic Algorithm 203

C. Bhattacharya and V. P. Dutta

Extended Markov Random Fields for Predictive Image Segmentation 208

R. Stolkin, M. Hodgetts, A. Greig and J. Gilby

External Force Modeling of Snakes Using DWT for Texture Object Segmentation 215 Surya Prakash and Sukhendu Das

I-FISH: Increasing Detection Efficiency for Fluorescent Dot Counting in Cell Nuclei 220 Shishir Shah and Fatima Merchant


Intuitionistic Fuzzy C Means Clustering in Medical Image Segmentation 226 T. Chaira, A. K. Ray and O. Salvetti

Remote Sensing Image Classification: A Wavelet-Neuro-Fuzzy Approach 231 Saroj K. Meher, B. Uma Shankar and Ashish Ghosh

Part H Multimedia Object Retrieval 237

An Efficient Cluster Based Image Retrieval Scheme Using Localized Texture Pattern 239

Saumen Mandal, Sanjoy Kumar Saha, Amit Kumar Das and Bhabatosh Chanda

Feature Selection Based on Human Perception of Image Similarity for Content Based Image Retrieval 244 P. Narayana Rao, Chakravarthy Bhagvati, R. S. Bapi, Arun K. Pujari and B. L. Deekshatulu

Identification of Team in Possession of Ball in a Soccer Video Using Static and Dynamic Segmentation 249 V. Pallavi, Jayanta Mukherjee, A. K. Majumdar, Shamik Sural

Image Retrieval Using Color, Texture and Wavelet Transform Moments 256

R. S. Choras

Integrating Linear Subspace Analysis and Iterative Graphcuts For Content-Based Video Retrieval 263 P. Deepti, R. Abhilash and Sukhendu Das

Organizing a Video Database Around Concept Maps 268 K. Shubham, L. Dey, R. Goyal, S. Gupta and S. Chaudhury

Statistical Bigrams: How Effective Are They in Text Retrieval? 274 Prasenjit Majumder, Mandar Mitra and Kalyankumar Datta

Part I Pattern Recognition 279

Adaptive Nearest Neighbor Classifier 281 Anil K. Ghosh

Class-Specific Kernel Selection for Verification Problems 285 Ranjeeth Kumar and C. V. Jawahar

Confidence Estimation in Classification Decision: A Method for Detecting Unseen Patterns 290

Pandu R. Devarakota and Bruno Mirbach

ECG Pattern Classification Using Support Vector Machine 295 S. S. Mehta and N. S. Lingayat

Model Selection for Financial Distress Classification 299 Srinivas Mukkamala, Andrew H. Sung, Ram B. Basnet, Bemadette Ribeiro and Aarmando S. Vieira


Optimal Linear Combination for Two-Class Classifiers 304 0. Ramos Terrades, S. Tabbone and E. Valveny

Support Vector Machine Based Hierarchical Classifiers for Large Class Problems 309 Tejo Krishna Chalasani, Anoop M. Namboodiri and C. V. Jawahar

Unsupervised Approach for Structure Preserving Dimensionality Reduction 315 Amit Saxena and Megha Kothari

Part J Shape Recognition 319

A Beta Mixture Model Based Approach to Text Extraction from Color Images 321 Anandarup Roy, Swapan Kumar Parui and Utpal Roy

A Canonical Shape-Representation for a Polygon 327 Sukhamay Kundu

A Framework for Fusion of 3D Appearance and 2D Shape Cues for Generic Object Recognition 332

Manisha Kalra and Sukhendu Das

Constructing Analyzable Models by Region Based Technique for Object Category Recognition 338 Yasunori Kamiya, Yoshikazu Yano and Shigeru Okuma

DRILL: Detection and Representation of Isothetic Loosely Connected Components without Labeling 343

P. Bhowmick, A. Biswas and B. B. Bhattacharya

Pattern Based Bootstrapping Method for Named Entity Recognition 349

Asif Ekbal and Sivaji Bandyopadhyay

SCOPE: Shape Complexity of Objects using Isothetic Polygonal Envelope 356

Arindam Biswas, Partha Bhowmick and Bhargab B. Bhattacharya

Segmental K-Means Algorithm Based Hidden Markov Model for Shape Recognition and its Applications 361

Tapan Kumar Bhowmik, Swapan Kumar Parui, Manika Kar and Utpal Roy

Part K Speech and 1-D Signal Analysis 367

Automatic Continuous Speech Segmentation Using Level Crossing Rate 369

Nagesha and G. Hemantha Kumar

Automatic Gender Identification Through Speech Analysis 375 Anu Khosla and Devendra Kumar Yadav

Error-Driven Robust Particle Swarm Optimization for Fuzzy Rule Extraction and Structure Estimation 379

Sumitra Mukhopadhyay and Ajit K. Mandal


HMM Based POS Tagger and Rule-Based Chunker for Bengali 384 Sivaji Bandyopadhyay and Asif Ekbal

Non-Contemporary Robustness in Text-Dependent Speaker-Recognition Using Multi-Session Templates in a One-Pass Dynamic-Programming Framework 391

V. Ramasubramanian, V. Praveen Kumar and S. Thiyagarajan

Some Experiments on Music Classification 396 Debrup Chakraborty

Text Independent Identification of Regional Indian Accents in Spoken Hindi 401 Kamini Malhotra and Anu Khosla

Part L Texture Analysis 405

An Efficient Approach for Texture Classification with Multi-Resolution Features by Combining Region and Edge Information Using a Modified CSNN 407

Lalit Gupta and Sukhendu Das

Upper Bound in Model Order Selection of MRF with Application in Texture Synthesis 413 Arnab Sinha and Sumana Gupta

Wavelet Features for Texture Classification and Their Use in Script Identification 419

P. S. Hiremath and Shivashankar S.

Author Index


PART A

Plenary Lecture


Why Statistical Shape Analysis is Pivotal to the Modern Pattern Recognition?

Kanti V. Mardia

Department of Statistics, University of Leeds,
Leeds, West Yorkshire LS2 9JT, UK
E-mail: [email protected]
www.maths.leeds.ac.uk

There have been great strides in shape analysis in this decade. Pattern recognition, image analysis, and morphometrics have been the major contributors to this area, but now bioinformatics is driving the subject as well, and new challenges are emerging; also the methods of pattern recognition are evolving for bioinformatics. Shape analysis for labelled landmarks is now moving to the new challenges of unlabelled landmarks motivated by these new applications. ICP, EM algorithms, etc. are well used in image analysis, but now Bayesian methods are coming into the arena. Dynamic Bayesian networks are another development. We will discuss the problems of averaging, image deformation, projective shape and Bayesian alignment. The aim of this talk will be to convince scientists that statistical shape analysis is pivotal to modern pattern recognition.

Keywords: Bayesian analysis; Bioinformatics; Protein gel; Deformation; Average image; Discrimination; Penalized likelihood.

1. Introduction

We have reviewed the topic over the years starting from two volumes1,2 in 1993, 1994. The subsequent reviews until 2001 include papers3-6. Since then the subject has grown, especially for shapes on manifolds, e.g. two recent workshops in the USA of the American Institute of Mathematics in 2005 and the Institute of Mathematical Applications in 2006. Also our Leeds Annual Statistical Research (LASR) Workshops (http://www.maths.leeds.ac.uk/Statistics/workshop) have been keeping abreast of the field, especially in relation to shapes and images. An excellent treatment of recent developments in shape analysis including shape manifolds can be found in the edited volume of Krim and Yezzi.7 Further stride has been due to its new connections with Bioinformatics - a field bursting with challenges. The field of shape analysis as covered until 1998 by Dryden and Mardia8 has been dominated mainly by labelled shape analysis. New innovations are now emerging in unlabelled shape analysis. Perhaps Cross and Hancock9 is one of the early statistical papers in the image area via the EM algorithm. Glasbey and Mardia10 gave some different perspectives through penalized likelihood methods for images. A cross-over to Bioinformatics can also be seen, for example, in Richmond et al.11

A Bayesian hierarchical model for unlabelled shape is proposed in Green and Mardia,12 which has not yet been tried on images (only bioinformatics). Mardia et al.13 have given a hybrid approach for image deformation and discrimination where some landmarks are labelled. One of the tools for deformation has been a part of the thin plate spline (TPS), but many other radial functions can be used. Mardia et al.14 have shown how TPS gives advantages over various radial functions using Brodatz type texture images. Thus, it is important to distinguish between labelled and unlabelled configurations, finite and 'infinite' numbers of points, outline or solid shape, linear or nonlinear transformations, parametric or non-parametric methods, and so on. We now describe some special topics.

2. Labelled Shape Analysis

Consider a configuration of points in $\mathbb{R}^m$. For pattern recognition applications, generally $m = 2$ or 3. "Shape" deals with the residual structure of this configuration when certain transformations are filtered out. More specifically, the shape of a configuration consists of its equivalence class under a group of transformations. Important groups for machine vision are the similarity group, the affine group and the projective group. Here the group action describes the way in which an image is captured. For instance, if two different images of the same scene are obtained using a pinhole camera, the corresponding transformation between the two images is the composition of two central projections, which is a projective transformation. If the two central projections can be approximated by parallel projections, which is the case for remote views of the same planar scene, the projective transformation can be approximated by an affine transformation. Further, if these parallel projections are orthogonal projections on the plane of the camera, this affine transformation can be approximated by a similarity transformation. Therefore, the relationships between these shapes are as follows: if two configurations have the same similarity shape then they automatically have the same affine shape; if they have the same affine shape they will have the same projective shape. For example, two squares of different sizes have the same similarity, affine and projective shape, whereas a square and a rectangle have the same affine and projective shape but not the same similarity shape. On the other hand, a square and a kite have the same projective shape but not the same affine shape.

The word "shape" in statistics often refers to similarity shape, where only the effects of translation, scale and rotation have been filtered out (see, for example, Dryden and Mardia8). In recent years, substantial progress has been made in similarity shape analysis since appropriate shape spaces (e.g. Kendall's space) and shape coordinates (e.g. Bookstein coordinates) have become available. A simple example of Bookstein coordinates is for the shape of a triangle, where the shape coordinates are obtained after taking one of the vertices as the origin, rotating the triangle so that the base lies on the x-axis, and then rescaling the base to unit size. The motivation behind such coordinate systems is similar to that in directional statistics, where to analyze spherical data one requires a coordinate system such as longitude and latitude (see, for example, Mardia and Jupp15). Similar types of coordinates are available for affine shape (Goodall and Mardia16). For affine shape in 2-D, we can obtain shape coordinates by using three points that determine the direction and the origin of the axes, and the unit length between the points on each of these two axes.
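As a small illustration, the shape coordinates of a triangle under the origin-and-unit-base convention described above can be computed with complex arithmetic. This is a minimal sketch in Python (the function name is ours; Bookstein's own convention places the baseline at (-1/2, 0) and (1/2, 0), which differs only by a fixed rescaling):

```python
import numpy as np

def bookstein_coordinates(tri):
    """Shape coordinates of a triangle (3x2 array): vertex 1 goes to the
    origin, vertex 2 to (1, 0), and we return the image of vertex 3 after
    this translation, rotation and rescaling."""
    z = tri[:, 0] + 1j * tri[:, 1]       # vertices as complex numbers
    u = (z[2] - z[0]) / (z[1] - z[0])    # filters translation, rotation, scale
    return np.array([u.real, u.imag])

# Two similar triangles: the second is the first rotated, doubled and shifted
t1 = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.5]])
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
t2 = 2.0 * t1 @ R.T + np.array([5.0, -3.0])

print(bookstein_coordinates(t1))   # [0.5  0.75]
print(bookstein_coordinates(t2))   # same shape coordinates
```

Both calls print the same coordinates (0.5, 0.75), since the two triangles differ only by a similarity transformation.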

A convenient projective shape space, as well as an appropriate coordinate system for this shape space, has been put forward by Mardia and Patrangenaru,17 where in 2-D the coordinate frame consists of the four points (0, 0), (0, 1), (1, 0) and (1, 1). This allows reconstruction of a three-dimensional image given multiple two-dimensional views of a scene. A "symmetrical" approach for projective shape space has been given in Kent and Mardia.18

3. Unlabelled Shape Analysis and Bioinformatics

Various new challenging problems in shape matching have been appearing from different scientific areas including Bioinformatics and Image Analysis. In one class of problems in Shape Analysis, one assumes that the points in two or more configurations are labelled and these configurations are to be matched after filtering out some transformation. Usually the transformation is a rigid transformation or similarity transformation. Several new problems are appearing where the points of a configuration are either not labelled or the labelling is ambiguous, and in which some points do not appear in each of the configurations. An example of ambiguous labelling arises in understanding the secondary structure of proteins, where we are given not only the 3-dimensional molecular configuration but also the type of molecule (amino acid) at each point. A generic problem is to match two such configurations, where the matching has to be invariant under some transformation group.

There are other related examples from Image Analysis, such as matching buildings when one has multiple 2-dimensional views of 3-dimensional objects (see, for example, Cross and Hancock9). The problem here requires filtering out the projective transformations before matching. Other examples involve matching outlines or surfaces (see, for example, Chui and Rangarajan19). Here there is no labelling of points involved, and we are dealing with a continuous contour or surface rather than a finite number of points. Duta et al.20 give a specific example of matching unlabelled solid shapes.

Green and Mardia12 build a hierarchical Bayesian model for the point configurations and derive inferential procedures for its parameters. In particular, modelling hidden point locations as a Poisson process leads to a considerable simplification. They discuss in particular the problem when only a linear or affine transformation has to be filtered out. They also provide an implementation of the resulting methodology by means of Markov chain Monte Carlo (MCMC) samplers. Under a broad parametric family of loss functions, an optimal Bayesian point estimate of the matching matrix has been constructed, which turns out to depend on a single parameter of the family. Also discussed there is a modification to the likelihood in their model to make use of partial label ('colour') information at the points. The principal innovations in this approach are (a) the fully model-based approach to alignment, (b) the model formulation allowing integrating out of the hidden point locations, (c) the prior specification for the rotation matrix, and (d) the MCMC algorithm.

We now give some details together with an example.

3.1. Notation

Consider again two configurations of unlabelled landmarks in $d$ dimensions, $x_j$, $j = 1, \ldots, J$ and $y_k$, $k = 1, \ldots, K$, represented as matrices $x$ ($J \times d$) and $y$ ($K \times d$), where $J$ is not necessarily the same as $K$. The objective is to find suitable subsets of each configuration and a suitable transformation such that the two subconfigurations become closely matched. One of the key parameters is the matching matrix of the configurations, represented by $M$ ($J \times K$), where $M_{jk}$ indicates whether the points $x_j$ and $y_k$ are matched or not. That is, $M_{jk} = 1$ if $x_j$ matches $y_k$, and 0 otherwise. Note that $M$ is the adjacency matrix for the bipartite graph representing the matching, and that $\sum_{j,k} M_{jk} = L$, the number of matches.

The transformation $g$, say, lies in a specified group $\mathcal{G}$ of transformations. Depending on the application, suitable choices for $\mathcal{G}$ include (a) translations, (b) rigid body transformations, (c) similarity transformations, (d) affine transformations and (e) projective transformations.

It is sometimes notationally helpful to add an extra column ($k = 0$) to $M$ to yield $M_0$, where $m_{j0} = 1 - \sum_{k=1}^{K} m_{jk}$, so that $m_{j0} = 1$ if $x_j$ is not matched to any $y_k$, and 0 otherwise. The matrix $M$ (or equivalently $M_0$) is called a "hard" labelling because each element is either 0 or 1. It is also helpful to consider "soft" labellings given by a $J \times (K+1)$ matrix $M^*$, say, where $0 \le m^*_{jk} \le 1$ and $\sum_{k=0}^{K} m^*_{jk} = 1$. There is now no constraint on $\sum_j m^*_{jk}$. Note that $M$ is symmetric in the roles of $j$ and $k$, but $M_0$ and $M^*$ are not.

Thus the overall matching objective can be restated as finding a matrix $M$ and transformation $g$ such that $x_j \approx g(y_k)$ for $j, k$ with $m_{jk} = 1$, as measured, e.g., by a sum of squares criterion. In computational geometry, the problem is termed the largest common point set (LCP) problem. We consider two related statistical models to tackle the problem.

3.2. Some statistical approaches

Model 1: Regression models. In this approach we condition on the landmark positions for one configuration $y$ and on the matching matrix $M$ (or equivalently $M_0$), and then model the distribution of the landmarks of the other configuration $x$. In the hard version of the model, the landmarks $x_j$, $j = 1, \ldots, J$ are taken to be conditionally independent with

$$x_j \sim N_d(g(y_k), \sigma^2 I_d), \quad \text{when } m_{jk} = 1 \text{ for some } k, \qquad (1)$$
and $x_j \sim N_d(g(y_0), \sigma_0^2 I_d)$ when $m_{j0} = 1$.

Here $\sigma_0^2 \gg \sigma^2$ is large, and $y_0$ is located in the center of the $y$ configuration; the latter distribution is meant to represent a broad distribution for those $x$ landmarks not matched to any $y$ landmarks. The matching parameters $M$ and transformation $g \in \mathcal{G}$ are to be estimated.

There is also a soft model under which the landmarks $\{x_j\}$, $j = 1, \ldots, J$ are treated as conditionally independent identically distributed observations from a mixture distribution,
$$x_j \sim \pi_0 N_d(g(y_0), \sigma_0^2 I_d) + \sum_{k=1}^{K} \pi_k N_d(g(y_k), \sigma^2 I_d),$$
where $\sum_{k=0}^{K} \pi_k = 1$. The hard membership function $M_0$ does not appear in this formulation, but the posterior probabilities that $x_j$ comes from the class for $y_k$ form a soft membership matrix $M^*$.

This soft model has been used by several authors including Cross and Hancock,9 Luo and Hancock,21 Walker,22 Chui and Rangarajan,19 and Kent et al.23 In this case the EM algorithm can be used to compute the MLEs (at least locally), and it takes the form of a simple explicit iterative updating algorithm. In general it converges quite quickly in this context, though the solution depends heavily on the starting value for $g$ (Kent et al.23), as well as on the parameters $\sigma$ and $\sigma_0$. Dryden et al.24 have treated multiple configurations.
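To make the E- and M-steps concrete, the following is a minimal Python/NumPy sketch of one EM pass for the soft model above, with $g$ restricted to a pure translation and the mixture weights held fixed (both simplifications, and all function names, are our illustrative assumptions rather than details taken from the papers cited):

```python
import numpy as np

def em_step(x, y, a, sigma, sigma0, weights):
    """One EM iteration for the soft matching mixture, with g(y) = y + a.
    x: (J, d) landmarks; y: (K, d) landmarks; weights: (K+1,) mixture
    weights (pi_0, ..., pi_K). Returns the updated translation and the
    soft membership matrix M* of shape (J, K+1), column 0 = background."""
    J, d = x.shape
    y0 = y.mean(axis=0)                          # broad background class at the centre
    centres = np.vstack([y0, y]) + a             # class centres g(y_k)
    sd = np.array([sigma0] + [sigma] * len(y))   # class standard deviations

    # E-step: responsibilities proportional to pi_k * N_d(x_j; centre_k, sd_k^2 I)
    sq = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    log_dens = -0.5 * sq / sd**2 - d * np.log(sd) + np.log(weights)
    m_star = np.exp(log_dens - log_dens.max(axis=1, keepdims=True))
    m_star /= m_star.sum(axis=1, keepdims=True)

    # M-step for the translation: weighted average of residuals x_j - y_k,
    # ignoring the background class (column 0) for simplicity
    w = m_star[:, 1:]
    resid = x[:, None, :] - y[None, :, :]
    a_new = (w[:, :, None] * resid).sum(axis=(0, 1)) / w.sum()
    return a_new, m_star
```

Iterating this step until the translation and $M^*$ stabilise gives a local MLE; as noted above, the result can depend strongly on the starting value of the transformation.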

Model 2: Bayesian Hierarchical Model. The matching problem can also be given a formulation which is symmetric in the roles of $j$ and $k$ by introducing a set of unknown latent sites $\{\mu_i\}$ to represent a collection of "true" locations. In particular, given $m_{jk} = 1$ for some $j$ and $k$, the key assumption in this approach is that there is a value of $i$ such that
$$x_j \sim N(\mu_i, \sigma^2 I), \qquad g(y_k) \sim N(\mu_i, \sigma^2 I).$$

This approach has been implemented in a Bayesian framework by Green and Mardia12 with inference carried out by MCMC. In particular, the hidden points $\{\mu_i\}$ are generated by a Poisson process with some rate $\lambda$ and the proportion of matches is governed by a parameter $p$. When $g$ comes from the rigid body group, $g(y) = Ay + a$, where $A$ is a rotation matrix and $a$ is a translation vector, the likelihood can be shown to take the form
$$\text{likelihood} \propto \prod_{j,k:\, m_{jk}=1} p\, \phi\big((x_j - A y_k - a)/(\sigma\sqrt{2})\big), \qquad (2)$$
where $\phi(\cdot)$ is the density of $N(0,1)$, after integrating out the $\{\mu_i\}$. This construction defines a hard version of Model 2; it is also possible to define a soft version. We add suitable priors on $A$, $a$ and the remaining parameters (see Green and Mardia12).

Estimation for the hard versions of Models 1 and 2 is difficult to carry out analytically. Model 2 feels more natural for this situation and an elegant and powerful MCMC algorithm, which avoids the need for reversible jump steps, has been developed by Green and Mardia.12 The soft versions can be tackled by EM, though the choice of starting point is critical since the likelihood (or posterior) will generally be highly multimodal.
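As a concreteness check, the hard-version likelihood (2) is easy to evaluate directly for a candidate matching. The sketch below (our own, not the implementation of Green and Mardia12) returns its logarithm, treating $\phi$ coordinate-wise in $d$ dimensions, so that a sampler could compare proposed changes to $M$, $A$ and $a$:

```python
import numpy as np

def log_lik_hard(x, y, M, A, a, sigma, p):
    """Log of the likelihood (2) for a hard matching.
    x: (J, d); y: (K, d); M: (J, K) 0/1 matching matrix;
    A: (d, d) rotation; a: (d,) translation; p: match proportion."""
    j_idx, k_idx = np.nonzero(M)
    resid = x[j_idx] - y[k_idx] @ A.T - a        # x_j - A y_k - a over matches
    n, d = len(j_idx), x.shape[1]
    # each matched pair contributes a N(0, 2 sigma^2 I_d) density term
    z = (resid ** 2).sum() / (4.0 * sigma ** 2)
    return (n * np.log(p) - z
            - n * d * (np.log(sigma * np.sqrt(2.0)) + 0.5 * np.log(2.0 * np.pi)))
```

In an MCMC sampler this quantity would be combined with the priors; proposals that add, delete or swap matches change $M$ one entry at a time, so the sum over residuals can be updated cheaply.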

3.3. Example: Matching protein gels

The objective in this example is to match two electrophoretic gels automatically, given two gel images (see Fig. 1). However, we will assume that the images have been preprocessed and we are given the locations of the centres of 35 proteins on each of the two gels. The correspondence between pairs of proteins, one protein from each gel, is unknown, so our aim is to match the two gels based on these sets of unlabelled points. We suppose that it is known that the transformation between the gels is affine. In this case, experts have already identified 10 points; see Horgan et al.25 Based on these 10 matches, the linear part of the transformation is estimated a priori (Dryden & Mardia,8 pp. 20-1, 292-6).

Here, we have only to make inference on the translation parameter and the unknown matching between certain of the proteins. Figure 2 gives the matches from this method. Note that all of the expert-identified matches, points 1 to 10 in each set, are declared to be matches with high probability in the Bayesian analysis.

4. Image Deformations

Deformation is a basic tool of image analysis (see, for example, Toga,26 Glasbey and Mardia,10 Hajnal et al.,27 and Singh et al.28), which maps a region $S$ to a region $T$ in $\mathbb{R}^d$. Consider a landmark-based example. Let $t_i$, $i = 1, \ldots, k$ be the configuration of landmarks in the "starting" or "source" image $S$, and $x_i$, $i = 1, \ldots, k$, be their homologues in a "target" image $T$. A useful deformation that takes the $t_i$'s onto the $x_i$'s and maps every point of $S$ onto some point of $T$ is the thin-plate spline (Bookstein29; Dryden and Mardia8). In the applied literature, e.g. radiology or evolutionary biology, this is often called a "model", though it is not actually serving that role: it is a prediction function. In Toga,26 for instance, none of the deformations introduced to relate pairs of neuroanatomical images are assigned standard errors, as they would necessarily be if they were estimates of some underlying model.

One common method of fitting a deformation is to use a thin-plate spline for each coordinate of the deformation. It is well known that the thin-plate spline can also be given a stochastic interpretation as a predictor for a certain (intrinsic) stochastic process conditioned to match the observed values at the source landmarks. An advantage of the stochastic approach is that confidence limits on the predicted values can be provided. Mardia et al.30 reviewed the stochastic framework and associated prediction error in detail.
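For concreteness, here is a minimal NumPy sketch of thin-plate spline interpolation in 2-D: fit one spline per coordinate to the source landmarks and predict the deformation at new points. It is pure interpolation with no smoothing term and no standard errors; the function names are ours:

```python
import numpy as np

def _U(r2):
    """TPS kernel U(r) = r^2 log r, written in terms of squared distances."""
    with np.errstate(divide='ignore', invalid='ignore'):
        return np.where(r2 > 0.0, 0.5 * r2 * np.log(r2), 0.0)

def tps_fit(src, dst):
    """Fit a 2-D thin-plate spline sending the (k, 2) landmarks src to dst."""
    k = len(src)
    K = _U(((src[:, None, :] - src[None, :, :]) ** 2).sum(-1))
    P = np.hstack([np.ones((k, 1)), src])        # affine part: 1, x, y
    L = np.zeros((k + 3, k + 3))
    L[:k, :k], L[:k, k:], L[k:, :k] = K, P, P.T
    rhs = np.vstack([dst, np.zeros((3, 2))])
    return src, np.linalg.solve(L, rhs)          # spline coefficients

def tps_map(model, pts):
    """Evaluate a fitted spline at the (n, 2) points pts."""
    src, coef = model
    U = _U(((pts[:, None, :] - src[None, :, :]) ** 2).sum(-1))
    P = np.hstack([np.ones((len(pts), 1)), pts])
    return np.hstack([U, P]) @ coef
```

Fitting the spline this way reproduces the source landmarks exactly; the stochastic (kriging-type) reading of the same equations is what supplies prediction errors.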

There are two common strategies for fitting deformations given information at a set of landmarks. One involves minimizing a roughness penalty, e.g. for a thin-plate spline, and another involves prediction for a stochastic process, e.g. for a self-similar intrinsic random field. The stochastic approach allows parameter estimation and confidence limits for the predicted deformation. An application is presented in Mardia et al.30 from a study of breast data, examining how the images deform as a function of the imaging procedure.

Fig. 1. Gel images: (a) and (b).

Mardia et al.14 address the problem of the distortion effect produced by different types of non-linear deformation strategies on textured images. The images are modelled by a Gaussian random field. They give various examples to illustrate that the model generates realistic images. They consider two types of deformations: a deterministic deformation and a landmark based deformation. The latter includes various radial basis type deformations, including the thin-plate spline based deformation. The effects of deformations are assessed through the Kullback-Leibler divergence measure. The measure is estimated by statistical sampling techniques. It is found empirically that this divergence measure is approximately distributed as a lognormal distribution under various different deformations. Thus a coefficient of variation based on log-divergence provides a natural criterion to compare different types of deformations. They find that the thin-plate spline deformation is almost optimal over the wider class of radial type deformations.

Fig. 2. The 17 most probable matches in the gel data: + symbols signify x points, o symbols the y points, linearly transformed by premultiplication by the fixed affine transformation. The solid line for each of the 17 matches joins the matched points, and represents the inferred translation plus noise.

A new compositional approach to multiscale image deformation is given in de Souza et al.31 Here a general framework is presented for the application of image deformation techniques, in which a smooth continuous mapping is modelled by a piecewise linear transformation, with separate components at a number of discrete scales. These components are combined by composition, which has several advantages over addition. The ideal transformation between two images is the best compromise between matching criteria and smoothness constraints, and has been estimated using both deterministic and stochastic analyses.


5. Penalized Image Averaging and Discrimination

The importance of statistically based objective discrimination methods is difficult to overstate, and the need for them arises in many areas. We describe here the semi-landmark procedure of Mardia et al.13

Glasbey & Mardia10 provide a landmark-free method based on a penalized likelihood to discriminate. However, where landmark information is readily available it is judicious to make full use of it in any discrimination procedure. Mardia et al.13 expand upon Glasbey & Mardia's10 method by incorporating landmarks. The approach combines Glasbey and Mardia's method on the full image with Procrustes methods for landmark data; it is a fusion of landmark and standardized image information. The use of the extra landmark information improves the discrimination, giving a wider separation in the studentized difference in means of within- and between-group measures. In addition, the use of landmarks significantly improves the computational speed compared with the landmark-free approach.

The penalized likelihood comprises similarity and distortion parts. The likelihood measures the similarity between images after warping, and the penalty is a measure of the distortion of a warping. For the images they discriminate, the measures of similarity consist of normalized image information in the two- and three-dimensional settings respectively.

We now give some details together with an example.

5.1. Image averaging

The basic strategy to obtain an average of a sample of images is as follows. We assume the images possess easily identifiable landmarks. The strategy is:

(1) Obtain the mean shape of landmarks using the Procrustes mean.
(2) Register the sample of images to the mean shape.
(3) Warp all the images to the mean shape using, say, a thin plate spline.
(4) Interpolate the images (if necessary) to give a homologous coordinate system over all the sample images.
(5) Average the images over the homologous coordinate system.

A code sketch of these five steps is given below.
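The following is a minimal sketch of steps (1)-(5) in Python, reusing the hypothetical tps_fit/tps_map from the Section 4 sketch. It assumes 2-D greyscale images of a common size, uses nearest-neighbour lookup for the interpolation of step (4), and assumes the mean shape has been placed in pixel coordinates; a production version would also include a reflection check in the Procrustes fit:

```python
import numpy as np

def procrustes_align(X, Y):
    """Full Procrustes fit of configuration X (k, 2) onto Y (k, 2):
    the translated, rotated and scaled copy of X closest to Y
    (reflection check omitted for brevity)."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    U, S, Vt = np.linalg.svd(Xc.T @ Yc)
    R = U @ Vt                               # optimal rotation
    s = S.sum() / (Xc ** 2).sum()            # optimal scale
    return s * Xc @ R + Y.mean(0)

def procrustes_mean(landmarks, n_iter=20):
    """Step (1): iteratively align all configurations to the current
    mean and re-average."""
    mu = landmarks[0]
    for _ in range(n_iter):
        mu = np.mean([procrustes_align(X, mu) for X in landmarks], axis=0)
    return mu

def average_image(images, landmarks):
    """Steps (2)-(5): warp every image to the mean shape and average."""
    mu = procrustes_mean(landmarks)          # assumed in pixel coordinates
    H, W = images[0].shape
    grid = np.stack(np.meshgrid(np.arange(W), np.arange(H)), -1).reshape(-1, 2)
    warped = []
    for img, lm in zip(images, landmarks):
        model = tps_fit(mu, lm)              # mean-shape coords -> this image
        src = tps_map(model, grid.astype(float))
        ij = np.clip(np.round(src).astype(int), 0, [W - 1, H - 1])
        warped.append(img[ij[:, 1], ij[:, 0]].reshape(H, W))
    return np.mean(warped, axis=0)
```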

5.2. Image distance and discrimination

We now describe a general strategy for image discrimination based on penalized likelihood.

5.2.1. Image distance

Define a measure of distance between two images $f_i$ and $f_j$, both in $m$ dimensions, to be
$$P(f_i, f_j) = T(f_i, f_j) + \lambda L(X_i, X_j), \qquad (3)$$
where $T(f_i, f_j)$ is a standardised difference in texture/surface information, $L(X_i, X_j)$ is a landmark based distance between images and $\lambda$ is a weighting parameter. Next, for a sample of $n$ images $f_1, \ldots, f_n$ in $m$ dimensions with corresponding landmarks $X_1, \ldots, X_n$, define an average image by minimising
$$F = \sum_{i=1}^{n} T(f_i, \bar{f}) + \lambda \sum_{i=1}^{n} L(X_i, \mu), \qquad (4)$$
with respect to $\lambda$ and $\mu$. Here $\bar{f}$ is the mean image and $T$ and $L$ are as defined in quantity (3). An algorithm to minimise this quantity is described in Section 5.3. The optimum average image is referred to as the perturbed Procrustes average when it is evaluated by an algorithmic method.

Discriminant analysis. Consider the case where we wish to allocate an individual image to one of two populations, say $\Pi_A$ and $\Pi_B$. Suppose we are given samples of training data from the distinct populations: $f_1, \ldots, f_m$ from $\Pi_A$ and $g_1, \ldots, g_n$ from $\Pi_B$. Firstly, we obtain the optimum image averages $\bar{f}$ and $\bar{g}$ via Section 5.3. Then we carry out discriminant analysis on the variables $P(h, \bar{f})$ and $P(h, \bar{g})$, where $P$ is given in (3) and $h$ is an individual image.

In order to use (3) we need to select an appropriate weighting parameter $\lambda$. We select the $\lambda$ which attains the maximum separation between the training samples according to $P(f_i, \bar{f})$, $P(g_j, \bar{f})$, $P(f_i, \bar{g})$ and $P(g_j, \bar{g})$, $i = 1, \ldots, m$; $j = 1, \ldots, n$, that is, the distance between each image and each average image. We obtain the optimal choice of $\lambda$ by selecting that which optimises the studentised difference between the within-group and between-group means of $P$. Here the within group consists of $P(f_i, \bar{f})$ and $P(g_j, \bar{g})$ and the between group consists of $P(f_i, \bar{g})$ and $P(g_j, \bar{f})$, $i = 1, \ldots, m$; $j = 1, \ldots, n$. Mardia et al.13 have given an explicit expression for the optimum $\lambda$.
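Mardia et al.13 give the explicit expression; a simple grid search over the studentised criterion conveys the idea. The sketch below is our own illustrative formulation, and it assumes the texture terms $T$ and landmark terms $L$ have been precomputed as arrays for all within-group pairs and all between-group pairs:

```python
import numpy as np

def studentised_separation(P_within, P_between):
    """Studentised difference between the between-group and within-group
    mean distances; larger values mean better separation."""
    P_w, P_b = np.asarray(P_within), np.asarray(P_between)
    se = np.sqrt(P_w.var(ddof=1) / len(P_w) + P_b.var(ddof=1) / len(P_b))
    return (P_b.mean() - P_w.mean()) / se

def choose_lambda(T_w, L_w, T_b, L_b, lambdas):
    """Grid search for the weighting parameter: P = T + lambda * L,
    with T_w, L_w over within-group pairs and T_b, L_b over
    between-group pairs."""
    scores = [studentised_separation(T_w + lam * L_w, T_b + lam * L_b)
              for lam in lambdas]
    return lambdas[int(np.argmax(scores))]
```

The selected $\lambda$ is then fixed, and a new image $h$ is allocated by comparing $P(h, \bar{f})$ and $P(h, \bar{g})$.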


Assuming that training data has been used to obtain the population mean images and the optimum $\lambda$ has been derived, we can define a discriminant rule, based on image distances, to allocate a new image $h$ in the classical way, e.g. Fisher's rule.

5.2.2. Specific image distances in the two dimensional setting

Full Procrustes with texture information. We define a measure of distance between two images $f_i(t)$ and $f_j(t)$, via warping one to the other, to be
$$P_I(f_i, f_j) = \sum_t \big(f_i(\Psi_{X_j \to X_i}(t)) - f_j(t)\big)^2 + \lambda\, d_F(X_i, X_j). \qquad (5)$$
The measure of distance $P_I$ given in equation (5) is comprised of a similarity part and a distortion-of-shape part through the Procrustes distance $d_F$. The parameter $\lambda$ determines the relative weighting between similarity and distortion. Also, it is worth commenting that we could have alternatively used $d_F^2$ in (5) for balance in the measure, but it is still valid to use $d_F$. Note, this measure does not detect size differences. However, a size measure could be added to the distance measure $P_I$. A suitable size term is the ratio of centroid sizes of the landmark configurations.

Bending energy with texture information. Here, an alternative approach to discrimination is given, where the bending energy matrix is used instead of the Procrustes distance. Define the second measure of distance between two images $f_i(t)$ and $f_j(t)$, via warping one to the other, to be
$$P_{II}(f_i, f_j) = \sum_t \big(f_i(\Psi_{X_j \to X_i}(t)) - f_j(t)\big)^2 + \lambda\, \mathrm{vec}(X_i)^T\, \mathrm{Block}\big(B(X_j), B(X_j)\big)\, \mathrm{vec}(X_i), \qquad (6)$$
where $B(X_j)$ is the bending energy matrix for the deformation with $X_j$ as the source. The measure of distance $P_{II}$ given in equation (6) is comprised of a similarity part and a distortion part. Again the parameter $\lambda$ determines the relative weighting between similarity and distortion. Note that $P_{II}$ does not account for any affine differences between images, because the bending energy is affine invariant. Therefore, it is only appropriate where very little of the warp from one configuration to another can be accounted for by an affine transformation. As with $P_I$ given in quantity (5), $P_{II}$ in (6) does not detect size differences.
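The bending energy matrix $B(X)$ is a byproduct of the thin-plate spline system of Section 4: it is the upper-left $k \times k$ block of the inverse of the matrix $L$ built there. A short self-contained sketch follows, with the quadratic form at the end giving the distortion term of (6); the function names are ours:

```python
import numpy as np

def bending_energy_matrix(src):
    """Bending energy matrix B(X) of the TPS with source landmarks src (k, 2):
    the upper-left k x k block of the inverse of the TPS system matrix L."""
    k = len(src)
    r2 = ((src[:, None, :] - src[None, :, :]) ** 2).sum(-1)
    with np.errstate(divide='ignore', invalid='ignore'):
        K = np.where(r2 > 0.0, 0.5 * r2 * np.log(r2), 0.0)   # U(r) = r^2 log r
    P = np.hstack([np.ones((k, 1)), src])
    L = np.zeros((k + 3, k + 3))
    L[:k, :k], L[:k, k:], L[k:, :k] = K, P, P.T
    return np.linalg.inv(L)[:k, :k]

def distortion_term(Xi, Xj, lam):
    """Distortion term of (6): vec(X_i)^T blockdiag(B, B) vec(X_i),
    written coordinate by coordinate."""
    B = bending_energy_matrix(Xj)
    return lam * (Xi[:, 0] @ B @ Xi[:, 0] + Xi[:, 1] @ B @ Xi[:, 1])
```

Since $B$ annihilates affine functions of the landmarks, affine differences between $X_i$ and $X_j$ contribute nothing to this term, which is the affine invariance noted above.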

5.3. Perturbed Procrustes averages

The optimum average image can be arrived at by perturbation of the Procrustes mean shape via the following optimisation procedure.

Algorithm 1
1. Set $\lambda = \lambda_1$.
2. Obtain $\mu$, the Procrustes mean shape of $X_1, \ldots, X_n$.
3. Obtain $\bar{f} = \frac{1}{n}\sum_{i=1}^{n} \Psi_{\mu \to X_i}(f_i)$; evaluate $F$.
4. Perturb $\mu$ to give $\mu_{pert}$.
5. Obtain $\bar{f} = \frac{1}{n}\sum_{i=1}^{n} \Psi_{\mu_{pert} \to X_i}(f_i)$; evaluate $F_{new}$ for $\mu_{pert}$. If $F_{new} < F$, accept $\mu_{pert}$, i.e. set $\mu = \mu_{pert}$.
6. Repeat steps 3, 4 and 5 until $F$ cannot be reduced further.
7. Set $\lambda = \lambda_{j+1}$ and repeat steps 2 to 6 until $F$ cannot be reduced further.

Here $\Psi_{\mu \to X_i}(f_i)$ denotes the $i$th image warped to the mean shape $\mu$. The final $\bar{f}$ is the average image. Note that in Algorithm 1, $\mu$ is only changed if the objective function improves. This algorithm possesses an MCMC-like character; however, unlike MCMC methods, in this case only improvements due to perturbation are accepted.

In this algorithm various choices of distance could be used. However, the objective function $F$ must penalise affine transformations in the perturbation, otherwise severe shearing occurs. This will not occur where the Procrustes distance is used. We propose Gaussian updates from a bivariate normal distribution on each landmark as the perturbation method. Mardia et al.13 have used a systematic grid search for the choice of $\lambda$.
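A compact sketch of the inner loop of Algorithm 1 for a fixed $\lambda$ follows. The callable objective F standing in for (4), the perturbation scale and the fixed number of proposals are our illustrative assumptions; procrustes_mean is the sketch from Section 5.1:

```python
import numpy as np

def perturbed_procrustes_average(images, landmarks, lam, F, n_pert=200,
                                 scale=0.5, rng=None):
    """Steps 2-6 of Algorithm 1 for fixed lambda: perturb the Procrustes
    mean shape with Gaussian updates on each landmark, keeping a
    perturbation only if the objective F of (4) improves.
    F(mu, images, landmarks, lam) -> (value, average_image)."""
    rng = np.random.default_rng(rng)
    mu = procrustes_mean(landmarks)                  # step 2
    val, fbar = F(mu, images, landmarks, lam)        # step 3
    for _ in range(n_pert):                          # steps 4-6
        mu_new = mu + rng.normal(scale=scale, size=mu.shape)
        val_new, fbar_new = F(mu_new, images, landmarks, lam)
        if val_new < val:                            # accept only improvements
            mu, val, fbar = mu_new, val_new, fbar_new
    return fbar, mu, val
```

The outer loop over $\lambda$ (steps 1 and 7) is the systematic grid search mentioned above.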

5.4. Example: Fish

Here we consider photographic images, obtained under controlled conditions, of two species of fish, haddock and whiting. These are part of a data set of ten haddock and ten whiting images. Eight of each species are randomly selected as training data, leaving two from each species as test data. In total sixteen corresponding landmarks are defined on each of the fish in this example (see Figures 3, 4).

Effect of perturbation on the average. The extent of the improvement in the perturbed average has been assessed under a criterion based on the distance $P_I$ from equation (5), and $P_I$ was found to be smaller.


Effect of perturbation on discrimination. Here we wish to assess whether the perturbed averages improve the discrimination procedure. In order to do this, Mardia et al.13 consider the studentised differences between $P_I$ within species and $P_I$ between species for $\lambda = 48.1$ million. This value attains the optimal separation for the Procrustes average images. Taking $\lambda = 48.1$ million is not the optimal value for discriminating using the perturbed average; however, parity enables comparison between the studentised differences. We find that the perturbed average gives a greater separation between species.

Another example of reconstruction through a stochastic deformation template related to degraded fish is given in de Souza et al.,32 which may be applicable for other fish images - rather than photographs as here.

Fig. 3. Haddock 1.

Fig. 4. Whiting 1.

The details of the allocation for the training and test data are given in Mardia et al.13 All the fish were correctly allocated.

6. Discussion

We have given here a personal view of why shape analysis is pivotal to pattern recognition in a very broad sense. Still, the field is young and we hope to see many new developments coming from interdisciplinary research. Mardia and Gilks33 have identified three themes for statistics in the 21st century. First, statistics should be viewed in the broadest way for scientific explanation or prediction of any phenomenon. Second, the future of statistics lies in a holistic approach to interdisciplinary research. Third, a change of attitude is required by statisticians - a paradigm shift - for the subject to go forward.

References

1. Mardia, K.V. and Kanji, G. (eds.) (1993). Statistics and Images: Vol. I. Carfax Publishing Co. Ltd., Abingdon, Oxfordshire.

2. Mardia, K.V. (ed.) (1994). Statistics and Images: Vol. II. Carfax Publishing Co. Ltd., Abingdon, Oxfordshire.

3. Dryden, I.L., Mardia, K.V. and Walder, A.N. (1997). Review of the use of context in statistical image analysis. Journal of Applied Statistics, 24, pp. 513-538.

4. Mardia, K.V. (1997). Bayesian image analysis. J. Theoretical Medicine, 1, pp. 63-77.

5. Glasbey, C.A. and Mardia, K.V. (1998). A review of image warping methods. J. Appl. Statist., 25, pp. 155-171.

6. Mardia, K.V. (2001). Shapes in Images. In Pattern Recognition: From Classical to Modern Approaches, edited by Pal, S.K. and Pal, A., World Scientific, pp. 147-167.

7. Krim, H. and Yezzi, Jr., A. (eds.) (2006). Statistics and Analysis of Shapes. Birkhauser, Boston.

8. Dryden, I.L. and Mardia, K.V. (1998). Statistical Shape Analysis. J. Wiley, Chichester.

9. Cross, A.D.J. and Hancock, E.R. (1998). Graph matching with a dual-step EM algorithm. IEEE Trans. Patt. Anal. Mach. Intell., 20, pp. 1236-1253.

10. Glasbey, C.A. and Mardia, K.V. (2001). A penalized likelihood approach to image warping (with discussion). Journal of the Royal Statistical Society, Series B, 63, pp. 465-514.

11. Richmond, N.J., Willett, P. and Clark, R.D. (2004). Alignment of three-dimensional molecules using an image recognition algorithm. Journal of Molecular Graphics and Modelling, 23, pp. 199-209.

12. Green, P.J. and Mardia, K.V. (2006). Bayesian alignment using hierarchical models, with applications in protein bioinformatics. Biometrika, 93, pp. 235-254.

13. Mardia, K.V., McDonnell, P. and Linney, A.D. (2006). Penalised image averaging and discrimination with facial and fishery applications. Journal of Applied Statistics, 33, pp. 339-369.

14. Mardia, K.V., Angulo, J.M. and Goitia, A. (2006). Synthesis of image deformation strategies. Image and Vision Computing, 24, pp. 1-12.

15. Mardia, K.V. and Jupp, P.E. (2000). Directional Statistics, 2nd edition. J. Wiley, Chichester.

16. Goodall, C.R. and Mardia, K.V. (1993). Multivariate aspects of shape theory. Ann. Statist., 21, pp. 848-866.

17. Mardia, K.V. and Patrangenaru, V. (2005). Directions and projective shapes. Annals of Statistics, 33, pp. 1666-1699.

18. Kent, J.T. and Mardia, K.V. (2006). A new representation for projective shape. In LASR 2006 Proceedings, edited by Barber, S., Baxter, P.D., Mardia, K.V. and Walls, R.E., Leeds University Press, pp. 75-78.

19. Chui, H. and Rangarajan, A. (2003). A new point matching algorithm for non-rigid registration. Computer Vision and Image Understanding, 89, pp. 114-141.

20. Duta, N., Jain, A.K. and Mardia, K.V. (2002). Matching of palm prints. Pattern Recognition Letters, 23, pp. 477-485.

21. Luo, B. and Hancock, E.R. (2001). Structural matching using the EM algorithm and singular value decomposition. IEEE Trans. PAMI, 23, pp. 1120-1136.

22. Walker, G. (2000). Robust, non-parametric and automatic methods for matching spatial point patterns. PhD thesis, University of Leeds.

23. Kent, J.T., Mardia, K.V. and Taylor, C.C. (2004). Matching problems for unlabelled configurations. In Bioinformatics, Images and Wavelets, Proceedings of LASR 2004, edited by Aykroyd, R.G., Barber, S. and Mardia, K.V., Leeds University Press, pp. 33-36.

24. Dryden, I.L., Hirst, J.D. and Melville, J.L. (2006). Statistical analysis of unlabelled point sets: comparing molecules in chemoinformatics. Biometrics, to appear.

25. Horgan, G.W., Creasey, A. and Fenton, B. (1992). Superimposing two-dimensional gels to study genetic variation in malaria parasites. Electrophoresis, 13, pp. 871-875.

26. Toga, A.W. (1999). Brain Warping. Academic Press, San Diego.

27. Hajnal, J.V., Hill, D.L.G. and Hawkes, D.J. (eds.) (2001). Medical Image Registration. CRC Press, London.

28. Singh, A.J., Goldgof, D. and Terzopoulos, D. (eds.) (1998). Deformable Models in Medical Image Analysis. IEEE Computer Soc., Los Alamitos, California.

29. Bookstein, F.L. (1991). Morphometric Tools for Landmark Data: Geometry and Biology. Cambridge University Press, Cambridge.

30. Mardia, K.V., Bookstein, F.L., Kent, J.T. and Meyer, C.R. (2006b). Intrinsic random fields and image deformations. Journal of Mathematical Imaging and Vision, in press.

31. de Souza, K.M.A., Kent, J.T., Mardia, K.V. and Glasbey, C. (2007). A compositional approach to multiscale deformation. In preparation.

32. de Souza, K.M.A., Kent, J.T. and Mardia, K.V. (1999). Stochastic templates for aquaculture images and a parallel pattern detector. J. Roy. Statist. Soc. C, 48, pp. 211-227.

33. Mardia, K.V. and Gilks, W. (2005). Meeting the statistical needs of 21st-century science. Significance, 2, pp. 162-165.


PART B

Invited Lectures



On the Interest of Spatial Relations and Fuzzy Representations for Ontology-Based Image Interpretation

Isabelle Bloch, Celine Hudelot1 and Jamal Atif2

Ecole Nationale Superieure des Telecommunications - GET-Telecom Paris
CNRS UMR 5141 LTCI - Signal and Image Processing Department

46 rue Barrault, 75013 Paris, France E-mail: Isabelle. [email protected]

1 Current address: Ecole Centrale Paris, Grande Voie des Vignes, 92 295 Chatenay-Malabry Cedex, France. E-mail: [email protected]

2 Current address: Universite des Antilles et de la Guyane, Guyane, France. E-mail: [email protected]

In this paper we highlight a few features of the semantic gap problem in image interpretation. We show that semantic image interpretation can be seen as a symbol grounding problem. In this context, ontologies provide a powerful framework to represent domain knowledge, concepts and their relations, and to reason about them. They are likely to be more and more developed for image interpretation. A lot of image interpretation systems rely strongly on descriptions of objects through their characteristics such as shape, location, and image intensities. However, spatial relations are very important too and provide a structural description of the imaged phenomenon, which is often more stable and less prone to variability than pure object descriptions. We show that spatial relations can be integrated in domain ontologies. Because of the intrinsic vagueness we have to cope with, at different levels (image objects, spatial relations, variability, questions to be answered, etc.), fuzzy representations are well adapted and provide a consistent formal framework to address this key issue, as well as the associated reasoning and decision making aspects. Our view is that ontology-based methods can be very useful for image interpretation if they are associated with operational models relating the ontology concepts to image information. In particular, we propose operational models of spatial relations, based on fuzzy representations.

Keywords: Image interpretation, semantic gap, ontology, spatial relations, fuzzy sets, fuzzy relations, brain imaging.

1. Introduction

The literature acknowledges several attempts towards formalization of some domains. For instance in medicine, noticeable efforts have led to the development of the NeuroNames Brain Hierarchy (http://braininfo.rprc.washington.edu/) and the Foundational Model of Anatomy (FMA) (http://sig.biostr.washington.edu/projects/fm/AboutFM.html) at the University of Washington, or Neuranat (http://www.chups.jussieu.fr/ext/neuranat) in Paris at CHU La Pitié-Salpêtrière. Generic formalizations of spatial concepts were also developed and specified in different fields, for spatial reasoning in artificial intelligence, for Geographic Information Systems, etc.

In a parallel domain, well formalized theories for image processing and recognition appeared in the image and computer vision community.

Noticeably, both types of developments still remain quite disjoint, and very few approaches try to use the abstract formalizations to guide image interpretation. The main reason is to be found in the so-called "semantic gap", expressing the difficulty of linking abstract concepts with image features. This problem is also related to the symbol grounding problem.

In this paper we highlight a few features of the semantic gap problem in image interpretation. We show in Section 2 that semantic image interpretation can be seen as a symbol grounding problem. In this context, ontologies provide a powerful framework to represent domain knowledge, concepts and their relations, and to reason about them; they are therefore likely to be more and more developed for image interpretation. We briefly explain the potential of ontologies towards this aim in Section 3. Many image interpretation systems rely strongly on descriptions of objects through characteristics such as shape, location and image intensities. However, spatial relations are very important too, as explained in Section 4: they provide a structural description of the imaged phenomenon, which is often more stable and less prone to variability than pure object descriptions. We show that spatial relations can be integrated in domain ontologies. Because of the intrinsic vagueness we have to cope with, at different levels (image objects, spatial relations, variability, questions to be answered, etc.), fuzzy representations are well adapted and provide a consistent formal framework to address this key issue, as well as the associated reasoning and decision making aspects. This question is addressed in Section 5. Our view is that ontology-based methods can be very useful for image interpretation if they are associated with operational models of spatial relations (and other concepts), in particular based on fuzzy representations. These operational models contribute to reducing the semantic gap. We provide some hints on this integration in Section 6.

As a typical application where all these issues are raised, we illustrate our purpose with examples in brain image interpretation.

2. Semantic gap in image interpretation and symbol grounding

The symbol grounding problem was first introduced in artificial intelligence by Harnad1 as an answer to Searle's famous criticisms of artificial systems.2 It is defined in1 through the fundamental question: how is symbol meaning to be grounded in something other than just more meaningless symbols? As underlined in the literature, symbol grounding is still an unsolved problem (see e.g.3).

In the robotics community, this problem was addressed as the anchoring problem:4 a special form of symbol grounding needed in robotic systems that incorporate a symbolic component and a reasoning process. The anchoring process is defined as the problem of creating and maintaining the correspondence between symbols and sensor data that refer to the same physical object.

In our case, the artificial systems are not robotic systems but image interpretation systems. Like the former, they incorporate a symbolic component. Some similarities between anchoring and pattern recognition have been underlined in5 in order to assess the potential of using ideas and techniques from anchoring to solve the pattern recognition problem and vice versa. Similarly, we argue that image interpretation could greatly benefit from such a correspondence. Indeed, the image interpretation problem can be defined as the automatic extraction of the meaning of an image. The image semantics cannot be considered as being included explicitly in the image itself. It rather depends on prior knowledge of the domain and the context of the image. It is therefore necessary to ground the digital representation of an image (perceptual level) with the semantic interpretation that a user associates to it (linguistic level). In the image indexing and retrieval community, this problem is called the semantic gap problem, i.e. the lack of coincidence between the information that one can extract from the visual data and the interpretation of these data by a user in a given situation.6

Our view is that image interpretation can be seen as a symbol grounding problem, i.e. the dynamical process of associating image data to human interpretations, taking into account the influence of external factors such as the social environment (application domain, interpretation goal, ...) or the physical environment of the interpretation. Indeed, image interpretation is the process of finding semantic and symbolic interpretations of image content. This problem has the same nature as the physical grounding of linguistic symbols in visual information in natural language processing systems.7,8 In our case, linguistic symbols are application domain concepts defined by their linguistic names and their definitions.

Example: In cerebral image interpretation, concepts can be: brain: part of the central nervous system located in the head, caudate nucleus: a deep gray nucleus of the telencephalon involved with control of voluntary movement, glioma: tumor of the central nervous system that arises from glial cells,...

Rather than being constrained by a grammar and a syntax as in a formal or natural language, the concepts are organized in a semantic knowledge base which describes their semantics and their hierarchical and structural dependencies.

Example: The human brain is a structured scene and spatial relations are highly used in the anatomical brain description (e.g. the left thalamus is to the left of the third ventricle and below the lateral ventricle).

This structural component, in the form of spatial relations, plays a major role in image interpretation. This aspect is detailed in Section 4.

Ontologies are useful to represent the semantic knowledge base. They entail some sort of world view, i.e. a set of concepts, their definitions and their relational structure, which can be used to describe and reason about a domain. This aspect is detailed in Section 3.

As underlined by Cangelosi,9 a symbol grounding mechanism, like language itself, has both an individual and a social component. The individual component, called physical symbol grounding, refers to the ability of a system to create an intrinsic link between perceptions and symbols. The social symbol grounding refers to the ability to communicate with other systems by the creation of a shared lexicon of perceptually-grounded symbols. It is strongly related to research on human language origins and evolution, where external factors such as cultural and biological evolution are primordial.

Fig. 1. Physical and external symbol grounding for image interpretation.

In the case of image interpretation systems, these two components of symbol grounding are also essential and take the following form: on the one hand, physical symbol grounding consists of the internal creation of the link between visual percepts (image level) and a known semantic model of the part of the real world that concerns the application domain (domain semantic level). On the other hand, in order to enable communication and interoperability with humans or other systems, this grounded interpretation must capture consensual information accepted by a group. As a consequence, a social, external symbol grounding component arises for image interpretation. Moreover, image interpretation systems operate in a dynamic environment which is prone to changes and variations. The interpretation process is highly influenced by external factors such as the environmental context, the perception system or the interpretation goal, and it has to adapt itself to these external factors. As a consequence, image interpretation is a distributed and adaptive process between physical symbol grounding and external symbol grounding, as shown in Figure 1.

3. Ontologies for image interpretation

In knowledge engineering, an ontology is defined as a formal, explicit specification of a shared conceptualization.10 An ontology encodes a partial view of the world, with respect to a given domain. It is composed of a set of concepts, their definitions and their relations which can be used to describe and reason about a domain. Ontological modeling of knowledge and information is crucial in many real world applications such as medicine for instance.11

Let us mention a few existing approaches involving jointly ontologies and images. When ontologies are used, physical symbol grounding consists in ontology grounding,12 i.e. the process of associating abstract concepts to concrete data in images. This approach is widely used in the image retrieval community to narrow the semantic gap. In,13 the author proposes to ground, in the image domain, a query vocabulary language used for content-based image retrieval, using supervised machine learning techniques. A supervised photograph annotation system is described in,14 using an annotation ontology describing the structure of an annotation, irrespective of the application domain, and a second ontology, specific to the domain, which describes image contents. Another example concerns medical image annotation, in particular for breast cancer,15 and deals mainly with reasoning issues. But image information is not directly involved in these two systems. Other approaches propose to ground intermediate visual ontologies with low-level image descriptors,16-18 and are therefore closer to the image interpretation problem. In,19 the enrichment of the WordNet lexicon by mapping its concepts to visual-motor information is proposed.

As the main ontology language OWL is based on description logics, a usual way to implement the grounding between domain ontologies (or visual ontologies) and image features is the use of concrete domains as shown in Figure 2.

Description logics20 are a family of knowledge-based representation systems mainly characterized by a set of constructors that make it possible to build complex concepts and roles from atomic ones.


Table 1. Description logics syntax and semantics.

Constructor             | Syntax    | Example                          | Semantics
atomic concept          | A         | Human                            | Aᴵ ⊆ Δᴵ
individual              | a         | Lea                              | aᴵ ∈ Δᴵ
Top                     | ⊤         | Thing                            | ⊤ᴵ = Δᴵ
Bottom                  | ⊥         | Nothing                          | ⊥ᴵ = ∅
atomic role             | r         | has-age                          | rᴵ ⊆ Δᴵ × Δᴵ
conjunction             | C ⊓ D     | Human ⊓ Male                     | Cᴵ ∩ Dᴵ
disjunction             | C ⊔ D     | Male ⊔ Female                    | Cᴵ ∪ Dᴵ
negation                | ¬C        | ¬Human                           | Δᴵ \ Cᴵ
existential restriction | ∃r.C      | ∃has-child.Girl                  | {x ∈ Δᴵ | ∃y ∈ Δᴵ : (x,y) ∈ rᴵ ∧ y ∈ Cᴵ}
universal restriction   | ∀r.C      | ∀has-child.Human                 | {x ∈ Δᴵ | ∀y ∈ Δᴵ : (x,y) ∈ rᴵ ⇒ y ∈ Cᴵ}
value restriction       | ∃r.{a}    | ∃has-child.{Lea}                 | {x ∈ Δᴵ | ∃y ∈ Δᴵ : (x,y) ∈ rᴵ ∧ y = aᴵ}
number restriction      | (≥ n r)   | (≥ 3 has-child)                  | {x ∈ Δᴵ | |{y | (x,y) ∈ rᴵ}| ≥ n}
                        | (≤ n r)   | (≤ 1 has-mother)                 | {x ∈ Δᴵ | |{y | (x,y) ∈ rᴵ}| ≤ n}
Subsumption             | C ⊑ D     | Man ⊑ Human                      | Cᴵ ⊆ Dᴵ
Concept definition      | C ≡ D     | Father ≡ Man ⊓ ∃has-child.Human  | Cᴵ = Dᴵ
Concept assertion       | a : C     | John : Man                       | aᴵ ∈ Cᴵ
Role assertion          | (a,b) : r | (John,Helen) : has-child         | (aᴵ,bᴵ) ∈ rᴵ

Fig. 2. Importance of concrete domains in image interpretation: a concept of the abstract domain (e.g. Rose) is linked through a role (e.g. hasColor) to values in a concrete domain computed from the image (e.g. RGB values for Pink, or shape descriptors).

A semantics is associated with concepts, roles and individuals using an interpretation I = (Δᴵ, ·ᴵ), where Δᴵ is a non-empty set and ·ᴵ is an interpretation function that maps a concept C to a subset Cᴵ of Δᴵ and a role r to a subset rᴵ of Δᴵ × Δᴵ. Concepts correspond to classes: a concept C represents a set of individuals (a subset of the interpretation domain). Roles are binary relations between objects. Table 1 describes the main constructors of description logics, their syntax and their semantics.
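As a worked instance of this semantics (our own illustration, built only from the constructors in Table 1), the definition of Father in the table unfolds as

\[
\mathrm{Father}^{\mathcal{I}} = (\mathrm{Man} \sqcap \exists\,\mathrm{has\text{-}child}.\mathrm{Human})^{\mathcal{I}} = \mathrm{Man}^{\mathcal{I}} \cap \{\, x \in \Delta^{\mathcal{I}} \mid \exists y \in \Delta^{\mathcal{I}} : (x,y) \in \mathrm{has\text{-}child}^{\mathcal{I}} \wedge y \in \mathrm{Human}^{\mathcal{I}} \,\}
\]

so that the assertions John:Man, (John,Helen):has-child and Helen:Human together entail John:Father.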

Concrete domains are expressive means of description logics to describe concrete properties of real world objects, such as their size, their spatial extension or their color. They are of particular interest for image interpretation, as illustrated in Figure 2. Indeed, they allow anchoring to be performed for a particular application, hence reducing the semantic gap. This grounding approach using description logics and concrete domains has been used by several authors21,22 for the automation of semantic multimedia annotation.
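As a sketch of this grounding (the concept and predicate names here are our own hypothetical example echoing Figure 2, not drawn from any specific ontology), a concrete domain lets one write

\[
\mathrm{PinkRose} \equiv \mathrm{Rose} \sqcap \exists\,\mathrm{hasColor}.\mathit{pink}
\]

where pink is not an abstract concept but a predicate of the concrete domain, e.g. a subset (possibly fuzzy, as discussed in Section 5) of RGB space that is evaluated directly on the pixel values of a segmented region.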

4. Importance of spatial relations

The spatial relations between the objects of a scene or an image are of prime importance, as highlighted in domains such as perception, cognition, spatial reasoning, Geographic Information Systems and computer vision. In particular, the spatial arrangement of objects provides important information for recognition and interpretation tasks, especially when the objects are embedded in a complex environment, as in medical or remote sensing images.23,24 Human beings make extensive use of spatial relations to describe, detect and recognize objects: these relations help resolve ambiguities between objects having a similar appearance, and they are often more stable than the characteristics of the objects themselves (this is typically the case for anatomical structures).

Many authors have stressed the importance of topological relations, but distances and directional relative position are also important, as well as more complex relations such as "between", "surround", "among", etc. Freeman25 distinguishes the following primitive relations: left of, right of, above, below, behind, in front of, near, far, inside, outside, surround. Kuipers24,26 considers topological relations (set relations, but also adjacency which was not considered by Freeman) and metrical relations (distances and directional relative position).


Spatial reasoning can be defined as the domain of spatial knowledge representation, in particular of spatial relations between spatial entities, and of reasoning about these entities and relations (hence the importance of relations). This field has been largely developed in artificial intelligence, in particular using qualitative representations based on logical formalisms. In image interpretation and computer vision, it is much less developed and is mainly based on quantitative representations. In most domains, one has to be able to cope with qualitative knowledge, with imprecise and vague statements, with polysemy, etc. This calls for a common framework which is both general enough to cover large classes of problems and potential applications, and able to give rise to instantiations adapted to each particular application. Ontologies appear as an appropriate tool towards this aim. This shows the interest of associating ontologies and spatial relations for symbol grounding and image interpretation. Figure 3 illustrates a part of an ontology of spatial relations.27

Fig. 3. Excerpt of the hierarchical organization of spatial relations in the ontology proposed in.27 Spatial Relation divides into Topological Relation (e.g. Adjacent) and Metric Relation; Metric Relation divides into Directional Relation (binary or ternary, e.g. Right of, Left of, In front of) and Distance Relation (e.g. Close to, Far from).

As mentioned in,28 several ontological frameworks for describing space and spatial relations have been developed recently. In spatial cognition and linguistics, the OntoSpace project (http://www.ontospace.uni-bremen.de/twiki/bin/view/Main/WebHome) aims at developing a cognitively-based commonsense ontology for space. Some interesting works on spatial ontologies can also be found in Geographic Information Science29 or in medicine, concerning the formalization of anatomical knowledge.30-32 All these ontologies concentrate on the representation of spatial concepts according to the application domains. They do not provide an explicit and operational mathematical formalism for all the types of spatial concepts and spatial relations. For instance, in medicine, these ontologies are often restricted to concepts from mereology theory.31 They are therefore useful for qualitative and symbolic reasoning on topological relations, but there is still a gap to fill before they can be used for image interpretation.

Example: internal brain structures are often described through their spatial relations, such as: the left caudate nucleus is inside the left hemisphere; it is close to the lateral ventricle; it is outside (left of) the left lateral ventricle; it is above the thalamus, etc. In case of pathologies, these relations are quite stable, but more flexibility should be allowed in their semantics.33

This example raises the problem of assigning semantics to these spatial relations, according to the application domain: what do concepts such as "close to" or "left" mean when dealing with brain images? Should this meaning be adapted depending on the context (possible pathology, etc.)? These questions can be addressed by using fuzzy models.

5. Importance of fuzzy representations

Usually, vision and image processing make use of quantitative representations of spatial relations. In a purely quantitative framework, spatial relations are well defined for some classes of relations, but unfortunately not for intrinsically vague relations (such as directional ones). Moreover, they need precise knowledge of the objects and of the types of questions we want to answer. These two constraints can be relaxed in a semi-qualitative framework, using fuzzy sets. This makes it possible to deal with imprecisely defined objects and with imprecise questions such as are these two objects near each other?, and to provide evaluations that may be imprecise too, which is useful for several applications where spatial reasoning under imprecision has to be considered. Note that this type of question also raises the question of polysemy, hence the need for semantics adapted to the domain. This is an important question to be solved in the symbol grounding and semantic gap problems.
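As a minimal sketch of such a semi-qualitative representation (the trapezoidal shape and the distance thresholds are our own illustrative choices; in practice their semantics must be adapted to the context, as discussed below):

    import numpy as np

    def near_membership(d, d_full=5.0, d_zero=30.0):
        # Trapezoidal membership for the linguistic value "near":
        # degree 1 up to d_full, degree 0 beyond d_zero, linear in
        # between (distances here are in pixels).
        return float(np.clip((d_zero - d) / (d_zero - d_full), 0.0, 1.0))

    # "Are these two objects near each other?" gets a graded answer:
    print(near_membership(4.0))    # 1.0  (clearly near)
    print(near_membership(12.0))   # 0.72 (fairly near)
    print(near_membership(40.0))   # 0.0  (not near)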

Fuzzy set theory finds in spatial information processing a growing application domain. This may be explained not only by its ability to model the inherent imprecision of such information (as in image processing, vision, mobile robotics, ...) together with expert knowledge, but also by the large and powerful toolbox it offers for dealing with spatial information under imprecision. This is highlighted in particular when spatial structures or objects are directly represented by fuzzy sets. If even less information is available, we may have to reason about space in a purely qualitative way, and the symbolic setting is then more appropriate. In artificial intelligence, mainly symbolic representations are developed, and several works have addressed the question of qualitative spatial reasoning (see34 for a survey). For instance, in the context of mereotopology, powerful representation and reasoning tools have been developed, but they are mainly concerned with topological and part-whole relations, not with metric ones.

Limitations of purely qualitative spatial reasoning have already been stressed in,35 as well as the interest of adding a semi-quantitative extension to qualitative values (as done in fuzzy set theory for linguistic variables36,37) for deriving useful and practical conclusions (as for recognition). Purely quantitative representations are limited in the case of imprecise statements and of knowledge expressed in linguistic terms. As another advantage of fuzzy representations, both quantitative and qualitative knowledge can be integrated, using a semi-quantitative (or semi-qualitative) interpretation of fuzzy sets. These representations can also cope with different levels of granularity of the information, from a purely symbolic level to a very precise quantitative one. As already mentioned in,25 this allows us to provide a computational representation and interpretation of imprecise spatial constraints, expressed in a linguistic way, possibly including quantitative knowledge. Therefore the fuzzy set framework appears as a central one in this context. Several spatial relations have led to fuzzy modeling, as reviewed in.23

Spatial reasoning often implies the combination of various types of information, in particular different spatial relations. Again, the fuzzy set framework is appropriate since it offers a large variety of fusion operators38,39 allowing for the combination of heterogeneous information (such as spatial relations with different semantics) according to different fusion rules, and without any assumption on an underlying metric on the information space. They also apply to various types of spatial knowledge representations (degree of satisfaction of a spatial relation, fuzzy representation of a spatial relation as a fuzzy interval, as a spatial fuzzy set, etc.). These operators can be classified according to their behavior, the possible control of this behavior according to the information to combine, their properties, and their specificities in terms of decision.40 For instance, if an object has to satisfy several spatial constraints at the same time, expressed as relations to other objects, the degrees of satisfaction of these constraints will be combined in a conjunctive manner, using a t-norm. If the constraints provide disjunctive information, operators such as t-conorms are appropriate; this is the case, for example, for symmetrical anatomical structures that can be found in the left or right parts of the human body. Operators with variable behavior, such as some symmetrical sums, are interesting if the aim is a reinforcement of the dynamics between low degrees and high degrees of satisfaction of the constraints. In particular, this facilitates the decision, since different situations will be better discriminated.
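A minimal sketch of these three behaviours (plain scalar versions; the symmetric sum shown is one classical choice among many):

    def t_norm(a, b):
        # Conjunctive fusion: all constraints must hold simultaneously.
        return min(a, b)

    def t_conorm(a, b):
        # Disjunctive fusion: satisfying either constraint is enough
        # (e.g. a structure lying in the left OR the right body half).
        return max(a, b)

    def symmetric_sum(a, b):
        # Variable behaviour: reinforces agreement, pushing two high
        # degrees higher and two low degrees lower, easing the decision.
        # Undefined at (0,1) and (1,0); we return 0.5 there by convention.
        num, den = a * b, a * b + (1.0 - a) * (1.0 - b)
        return num / den if den > 0 else 0.5

    print(t_norm(0.8, 0.7))         # 0.7
    print(symmetric_sum(0.8, 0.8))  # ~0.94: mutual reinforcement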

Let us come back to ontologies from the point of view of uncertain knowledge and imprecise information. A major weakness of usual ontological technologies is their inability to represent and reason with uncertainty and imprecision. As a consequence, extending ontologies in order to cope with these aspects is a major challenge. This problem has recently been stressed in the literature, and several approaches have been proposed to deal with uncertainty and imprecision in ontology engineering tasks.41,42 The first approach is based on probabilistic extensions of the standard OWL ontology language (http://www.w3.org/TR/owl-features/) using Bayesian networks.43,44 The probabilistic approach proposes to first enhance the OWL language to allow additional probabilistic markups, and then to convert the probabilistic OWL ontology into the directed acyclic graph of a Bayesian network with translation rules. As the main ontology language OWL is based on description logics,20 another approach to deal with uncertainty and imprecision is to use fuzzy description logics.45-48 Fuzzy description logics can be classified according to the way fuzziness is introduced into the description logics formalism; a good review can be found in.49 In particular, a common way for description logics with concrete domains is to introduce fuzziness by using fuzzy predicates in the concrete domains, as described in.50

Another approach is to introduce fuzziness directly in the concrete domains, which then become fuzzy concrete domains. This is particularly interesting for image interpretation.

Example: Using fuzzy representations of spatial relations in the image domain leads to a restricted search area for the caudate nucleus, based on the knowledge that it is to the right of and close to the lateral ventricles. This is illustrated in Figure 4.

Fig. 4. (a) The right ventricle is superimposed on one slice of the original image (an MRI here). The search space for the object "caudate nucleus" corresponds to the conjunctive fusion of the spatial relations "to the right of the right ventricle" (b) and "close to the right ventricle" (c). The fusion result is shown in (d).
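The following sketch reproduces the spirit of this fusion on a synthetic image. It is a simplification under our own assumptions: "close to" is derived from a distance transform, and "to the right of" from a crude angle criterion around the reference centroid, standing in for the fuzzy morphological dilations used in the literature; the decay parameter is arbitrary:

    import numpy as np
    from scipy import ndimage

    def close_to(mask, d_zero=40.0):
        # Fuzzy landscape "close to the reference object": degree 1 on
        # the object, decreasing linearly with Euclidean distance to it.
        d = ndimage.distance_transform_edt(~mask)
        return np.clip(1.0 - d / d_zero, 0.0, 1.0)

    def right_of(mask):
        # Coarse fuzzy landscape "to the right of the reference object":
        # degree decreases with the angle between the direction from the
        # object's centroid and the horizontal "right" axis.
        cy, cx = ndimage.center_of_mass(mask)
        yy, xx = np.indices(mask.shape)
        angle = np.abs(np.arctan2(yy - cy, xx - cx))   # 0 rad = due right
        return np.clip(1.0 - 2.0 * angle / np.pi, 0.0, 1.0)

    # Hypothetical reference object (e.g. a ventricle) on a 128x128 grid;
    # the conjunctive fusion (t-norm = min) of the two constraints yields
    # a restricted search region for the target structure, as in Fig. 4(d).
    ref = np.zeros((128, 128), dtype=bool)
    ref[50:70, 40:50] = True
    search_region = np.minimum(close_to(ref), right_of(ref))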

Example: Typically, brain image interpretation may have to cope with abnormalities such as tumors. Our system allows instantiating generic knowledge expressed in the ontology to adapt to the specific patient's case. The fuzzy representations provide an efficient way to represent inter-individual variability, which is a key point in such situations. They can be further revised or specified according to the visual features extracted from the image and matched with the symbolic representation.

Using fuzzy representations, it is possible to deal with such cases, for instance by enlarging the areas where an object can be found, which amounts to relaxing the definition of the fuzzy relation.

In summary, fuzzy representations have several advantages:

• they allow representing the imprecision which is inherent in the definition of a concept; for instance, the concept "close to" is intrinsically vague and imprecise, and its semantics depends on the context in which the objects are embedded, on the scale of the objects and on their environment;

• they allow managing imprecision related to the expert knowledge in the concerned domain;

• they constitute an adequate framework for knowledge representation and reasoning, reducing the semantic gap between symbolic concepts and numerical information.

6. Towards the integration of ontologies, spatial relations and fuzzy models

To conclude this presentation, we summarize ongoing developments carried out in our team, towards the construction of a spatial relation ontology enhanced with fuzzy representations and its use for image interpretation. This work aims at integrating all important features underlined in this paper. A global scheme of our approach is provided in Figure 5.

Our recent work addresses the important problems highlighted in this paper in several ways.27,52

We propose to reduce the semantic gap between numerical information contained in the image and higher level concepts by enriching ontologies with a fuzzy formalism layer. More specifically, we introduce an ontology of spatial relations and propose to enrich it by fuzzy representations of these relations in the spatial (image) domain. The choice of spatial relations is motivated on the one hand by the importance of structural information in image interpretation, and on the other hand by the intrinsically ambiguous nature of most spatial relations. This ontology has been linked to the part of FMA related to brain structures, as illustrated in Figure 5.

As another contribution, this enriched ontology can support the reasoning process in order to recognize structures in images, in particular in medical imaging. Different types of reasoning then become possible: (i) quite general reasoning may consist in classifying or filtering ontological concepts to answer some queries; (ii) in a more operational way, the ontology and the fuzzy representations can be used to deduce spatial reasoning operations in the images and to guide image interpretation tasks such as localization of objects, segmentation, and recognition. An illustration is provided in Figure 6 for the recognition of internal brain structures.


Fig. 5. Overview of our framework. Ontological engineering is used to represent the symbolic knowledge useful to interpret cerebral images. In particular, a spatial relation ontology is used to enrich the brain ontology with a description of the spatial structure of the brain. A graph-based representation of the brain, including learned fuzzy representations of spatial relations, is derived from the generic model and from an image database. This graph is used to guide the segmentation and the recognition of cerebral structures. This framework is also useful to deal with pathological cases by an adaptation of the knowledge and the reasoning process. The second scheme displays a part of an ontology of brain anatomy (excerpt of the FMA51) enhanced with our fuzzy spatial relation ontology.



Fig. 6. The right lateral ventricle corresponds to the spatial region R1 in the image. The domain ontology describes spatial relations between several grey nuclei and the lateral ventricles. These relations are exploited to identify each individual structure.

Another enrichment of the model consists of the representation of domain knowledge by graphs, which include fuzzy models of spatial relations, used to guide the recognition of individual structures in images.53 The inclusion of such structural models, as intermediate representation domains between symbols and images, addresses the physical symbol grounding problem and also contributes to reducing the semantic gap. However, pathological cases may deviate substantially from generic knowledge. We propose to adapt the knowledge representation to take into account the possible influence of pathologies on the spatial organization, based on learning procedures. We also adapt the reasoning process, based on graph-based propagation and updating. These features of our approach are detailed in.52 A result is illustrated in Figure 7.

Fig. 7. An axial slice of a 3D MRI, with segmented tumor and some anatomical structures.

The enriched ontology contributes to reducing the semantic gap and to answering some symbol grounding questions, which are difficult and still open problems in image interpretation. It provides tools both for knowledge acquisition and representation and for their operational use. It has an important potential in model-based recognition that deserves to be further explored, in particular for medical image interpretation. The framework described in this section focuses on spatial relations, but similar principles can be applied to other types of information that could be involved in image interpretation.

Acknowledgments: This work has been partly supported by grants from Région Ile-de-France, GET and ANR.

References

1. S. Harnad, Physica D 42, 335 (1990).
2. J. Searle, Behavioral and Brain Sciences 3, 417 (1980).
3. M. Taddeo and L. Floridi, Journal of Experimental and Theoretical Artificial Intelligence (2006).
4. S. Coradeschi and A. Saffiotti, Robotics and Autonomous Systems 43, 85 (2003), special issue on perceptual anchoring. Online at http://www.aass.oru.se/Agora/RAS02/.
5. I. Bloch and A. Saffiotti, Some similarities between anchoring and pattern recognition concepts, in AAAI Fall Symposium on Anchoring Symbols to Sensor Data in Single and Multiple Robots Systems, 2001.
6. A. Smeulders, M. Worring, S. Santini, A. Gupta and R. Jain, IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 1349 (2000).
7. D. Roy, Computer Speech and Language 16 (2002).
8. P. Vogt, Artificial Intelligence 167, 206 (2005).
9. A. Cangelosi, Pragmatics and Cognition, Special Issue on Distributed Cognition (2006), in press.
10. T. R. Gruber, Towards Principles for the Design of Ontologies Used for Knowledge Sharing, in Formal Ontology in Conceptual Analysis and Knowledge Representation, eds. N. Guarino and R. Poli (Kluwer Academic Publishers, Deventer, The Netherlands, 1993).
11. P. Zweigenbaum, B. Bachimont, J. Bouaud, J. Charlet and J. Boisvieux, Meth Inform Med 34, p. 2 (1995).
12. A. Jakulin and D. Mladenic, Ontology grounding, in Conference on Data Mining and Data Warehouses (Ljubljana, Slovenia, 2005).
13. C. Town, Machine Vision and Applications (2006).
14. A. Schreiber, B. Dubbeldam, J. Wielemaker and B. Wielinga, IEEE Intelligent Systems 16, 66 (2001).
15. B. Hu, S. Dasmahapatra, P. Lewis and N. Shadbolt, Ontology-based medical image annotation with description logics, in 15th IEEE International Conference on Tools with Artificial Intelligence, 2003.
16. C. Hudelot, N. Maillot and M. Thonnat, Symbol grounding for semantic image interpretation: from image data to semantics, in Proceedings of the Workshop on Semantic Knowledge in Computer Vision, ICCV (Beijing, China, 2005).
17. W. Z. Mao and D. A. Bell, Integrating visual ontologies and wavelets for image content retrieval, in DEXA Workshop, 1998.
18. V. Mezaris, I. Kompatsiaris and M. G. Strintzis, Eurasip Journal on Applied Signal Processing 2004, 886 (2004).
19. G. Guerra-Filho and Y. Aloimonos, Towards a sensorimotor WordNet: Closing the semantic gap, in Proceedings of the Third International WordNet Conference, January 2006.
20. F. Baader, D. Calvanese, D. McGuinness, D. Nardi and P. Patel-Schneider, The Description Logic Handbook: Theory, Implementation and Applications (Cambridge University Press, 2003).
21. M. Aufaure and H. Hajji, Multimedia Information Systems, 38 (2002).
22. K. Petridis, D. Anastasopoulos, C. Saathoff, N. Timmermann, I. Kompatsiaris and S. Staab, Engineered Applications of Semantic Web Session (SWEA) at the 10th International Conference on Knowledge-Based & Intelligent Information & Engineering Systems (KES 2006) (2006).
23. I. Bloch, Image and Vision Computing 23, 89 (2005).
24. B. J. Kuipers and T. S. Levitt, AI Magazine 9, 25 (1988).
25. J. Freeman, Computer Graphics and Image Processing 4, 156 (1975).
26. B. Kuipers, Cognitive Science 2, 129 (1978).
27. C. Hudelot, J. Atif and I. Bloch, Ontologie de relations spatiales floues pour l'interprétation d'images, in Rencontres francophones sur la Logique Floue et ses Applications, LFA 2006 (Toulouse, France, 2006).
28. J. Bateman and S. Farrar, Towards a generic foundation for spatial ontology, in International Conference on Formal Ontology in Information Systems (FOIS-2004) (Trento, Italy, 2004).
29. R. Casati, B. Smith and A. Varzi, Ontological Tools for Geographic Representation, in Formal Ontology in Information Systems, ed. N. Guarino (IOS Press, Amsterdam, 1998) pp. 77-85.
30. O. Dameron, Symbolic model of spatial relations in the human brain, in Mapping the Human Body: Spatial Reasoning at the Interface between Human Anatomy and Geographic Information Science (University of Buffalo, USA, 2005).
31. M. Donnelly, T. Bittner and C. Rosse, Artificial Intelligence in Medicine 36, 1 (January 2006).
32. S. Schulz, U. Hahn and M. Romacker, Modeling anatomical spatial relations with description logics, in Annual Symposium of the American Medical Informatics Association. Converging Information, Technology, and Health Care (AMIA 2000) (Los Angeles, CA, 2000).
33. J. Atif, H. Khotanlou, E. Angelini, H. Duffau and I. Bloch, Segmentation of Internal Brain Structures in the Presence of a Tumor, in MICCAI (Copenhagen, 2006).
34. L. Vieu, Spatial Representation and Reasoning in Artificial Intelligence, in Spatial and Temporal Reasoning, ed. O. Stock (Kluwer, 1997) pp. 5-41.
35. S. Dutta, International Journal of Approximate Reasoning 5, 307 (1991).
36. L. A. Zadeh, Information Sciences 8, 199 (1975).
37. D. Dubois and H. Prade, Fuzzy Sets and Systems: Theory and Applications (Academic Press, New York, 1980).
38. D. Dubois and H. Prade, Information Sciences 36, 85 (1985).
39. D. Dubois, H. Prade and R. Yager, Merging Fuzzy Information, in Handbook of Fuzzy Sets Series, Approximate Reasoning and Information Systems, eds. J. Bezdek, D. Dubois and H. Prade (Kluwer, 1999).
40. I. Bloch, IEEE Transactions on Systems, Man, and Cybernetics 26, 52 (1996).
41. P. da Costa et al. (eds.), Proceedings of the ISWC Workshop on Uncertainty Reasoning for the Semantic Web (2005).
42. E. Sanchez (ed.), Fuzzy Logic and the Semantic Web (Elsevier, 2006).
43. Z. Ding, Y. Peng and R. Pan, A Bayesian Approach to Uncertainty Modelling in OWL Ontology, in International Conference on Advances in Intelligent Systems - Theory and Applications (AISTA 2004) (Luxembourg-Kirchberg, Luxembourg, 2004).
44. Y. Yang and J. Calmet, OntoBayes: An ontology-driven uncertainty model, in International Conference on Intelligent Agents, Web Technology and Internet Commerce (IAWTIC05), 2005.
45. S. Holldobler, T. Khang and H. Storr, Proceedings InTech/VJFuzzy 2002, 25 (2002).
46. Y. Li, B. Xu, J. Lu, D. Kang and P. Wang, A family of extended fuzzy description logics, in 29th Annual International Computer Software and Applications Conference (COMPSAC'05) (IEEE Computer Society, Los Alamitos, CA, USA, 2005).
47. G. Stoilos, G. Stamou and J. Pan, Handling imprecise knowledge with fuzzy description logic, in International Workshop on Description Logics (DL 06) (Lake District, UK, 2006).
48. U. Straccia, Description logics with fuzzy concrete domains, in 21st Conference on Uncertainty in Artificial Intelligence (UAI-05), eds. F. Bachus and T. Jaakkola (AUAI Press, Edinburgh, Scotland, 2005).
49. M. d'Aquin, J. Lieber and A. Napoli, Étude de quelques logiques de description floues et de formalismes apparentés, in Rencontres Francophones sur la Logique Floue et ses Applications (Nantes, France, 2004).
50. U. Straccia, A fuzzy description logic for the semantic web, in Fuzzy Logic and the Semantic Web, ed. E. Sanchez, Capturing Intelligence (Elsevier, 2006) pp. 73-90.
51. C. Rosse and J. L. V. Mejino, Journal of Biomedical Informatics 36, 478 (2003).
52. J. Atif, C. Hudelot, I. Bloch and E. Angelini, From Generic Knowledge to Specific Reasoning for Medical Image Interpretation using Graph-based Representations, in International Joint Conference on Artificial Intelligence IJCAI'07 (Hyderabad, India, 2007).
53. O. Colliot, O. Camara and I. Bloch, Pattern Recognition 39, 1401 (2006).


A Technique for Creating Probabilistic Spatio-Temporal Forecasts

V. Lakshmanan

University of Oklahoma and National Severe Storms Laboratory

E-mail: [email protected]

Kiel Ortega

School of Meteorology, University of Oklahoma

Keywords: Probability; Atmospheric Science

Probabilistic forecasts can capture uncertainty better and provide significant economic benefits because the users of the information can calibrate risk. For forecasts in earth-centered domains to be useful, the forecasts have to be clearly demarcated in space and time. We discuss the characteristics of a good probability forecast, reliability and sharpness, and describe well-understood techniques in the literature for generating good probability forecasts. We then describe our approach to the problem of creating good probabilistic forecasts when the entity to be forecast can move and morph. In this paper, we apply the technique to severe weather prediction by formulating the weather prediction problem as one of estimating the probability of an event at a particular spatial location within a given time window. The technique involves clustering Doppler radar-derived fields such as low-level shear and reflectivity to form candidate regions. Assuming stationarity, the spatial probability distribution of the regions is estimated, conditioned on the level of organization within the regions and combined with the probability that a candidate region becomes severe. For example, the probability that a candidate region becomes tornadic is estimated using a neural network with a sigmoid output node, trained on historical cases.

1. Motivation

A principled estimate of the probability that a threat will materialize can be more useful than a binary yes/no prediction because a binary prediction hides the uncertainty inherent in the data and predictive model from users who will make decisions based on that prediction. A principled probabilistic prediction can enable users of the information to calibrate their risk and can aid decision making beyond what simple binary approaches yield.1 Probabilistic forecasts can capture uncertainty better and provide significant economic benefits because the users of the information can calibrate risk.

Techniques to create good probabilistic forecasts are well understood, but only in situations where the predictive model is a direct input-output relationship. If the threats in consideration move and change shape, as with short-term weather forecasts, the well-understood techniques cannot be used directly. For forecasts in earth-centered domains to be useful, the forecasts have to be clearly demarcated in space and time. This paper presents a data mining approach to the problem of creating principled probabilistic forecasts when the entity to be forecast can move and change shape.

The rest of the paper is organized as follows. The characteristics of good probabilistic forecasts, and standard data mining approaches to create such forecasts, are described in Section 2. The limitations of the standard data mining approaches in creating clearly demarcated forecasts in space and time are explained, and the first part of our predictive model (to create principled spatial forecasts) is presented in Section 3. The second part of the model, to create principled temporal forecasts that can be tied to the spatial forecasts, is explained in Section 4. The way of tying the two probabilities together is explained in Section 5. Results of applying this approach to predicting liquid content are described in Section 6.

2. Probabilistic Forecasts

A good probability forecast has two characteristics (see Figure 1): (a) it is reliable: for example, of all the times that a threat is forecast as 30% likely to occur, the threat should actually occur 30% of the time; and (b) it is sharp, i.e. the probability distribution function of the forecast probabilities should not be clustered around the a priori probability. Instead, there should be many low- and high-probability events and relatively few mid-probability events.

Fig. 1. A good probability forecast needs to be both reliable (left) and sharp (right).

In many cases, there are three probabilities of interest: (a) the probability that an event will occur, (b) the probability that the event will occur at a particular location, and (c) the probability that the event will occur at a particular location within a specified time window. When stated like this, most decision makers will aver that it is the third probability that is of interest to them,2 but that is not what they are commonly provided. Even though research studies implicitly place spatial and temporal bounds on their training sets, it has long been unclear how to explicitly form the spatial and temporal variability of the probability field. Thus, if probabilities are presented to a decision maker, those probabilities are usually of the first type. The probability that an event will occur is commonly estimated through the formation of a mapping between input features and the desired result on a set of historical cases.3 This mapping function may take the form of a neural network, a support vector machine or a decision tree. In the case of a neural network, the choice of a sigmoid function as the output node is sufficient to ensure that the output number is a probability, provided that the training set captures the a priori probabilities of the true input feature space.4 In the case of maximum-margin methods such as support vector machines or bagged, boosted decision trees, scaling the output using a sigmoid function yields the desirable property that the output is a probability.5
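A minimal sketch of the sigmoid-scaling step (a bare-bones version of Platt scaling fitted by gradient descent on held-out classifier scores; Platt's original procedure uses a more careful optimizer and regularized target labels, neither shown here):

    import numpy as np

    def fit_sigmoid(scores, labels, iters=5000, lr=0.1):
        # Fit P(y=1 | f) = 1 / (1 + exp(A*f + B)) by maximizing the
        # log-likelihood of held-out (score, outcome) pairs.
        f = np.asarray(scores, dtype=float)
        y = np.asarray(labels, dtype=float)
        A, B = 0.0, 0.0
        for _ in range(iters):
            p = 1.0 / (1.0 + np.exp(A * f + B))
            g = y - p                     # gradient of the log-likelihood
            A -= lr * np.mean(g * f)      # descend the negative
            B -= lr * np.mean(g)          # log-likelihood
        return A, B

    # Usage: turn raw maximum-margin scores into probabilities.
    A, B = fit_sigmoid([-2.1, -0.3, 0.4, 1.7], [0, 0, 1, 1])
    prob = 1.0 / (1.0 + np.exp(A * 0.9 + B))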

3. Spatial probability forecast

Thus, techniques to estimate the probability of an event are well understood. The estimation of spatial probabilities when the threat under consideration is stationary can be performed using standard formalisms, as done for example on soil variability.6 If the threat is stationary, then the value of each of the input fields at a pixel, or statistical or morphological values computed in the neighborhood of a pixel, may be used as the input features to a classifier. If the classifier is a neural network with a sigmoid output node or a Platt-scaled maximum-margin technique, then the output of the classifier yields a spatial probability.

One approach to probabilistic forecasts when the threat areas are not stationary is to use kriging,7 a geospatial interpolation approach. Naturally, such an approach is limited to studies where the kriging resolution can be finer than the rate of transformation. This condition holds in slow-moving systems such as diseases (studied in8), but not in fast-moving systems such as thunderstorms.

For fast-moving threats, no principled approach to estimating probability fields exists. This is because the input features at a point in space and time affect the threat potential at a different point N minutes later. Thus, a simple input/output mapping is insufficient, because which location will be affected, and when it will be affected, needs to be known. Yet, in practical situations, the time and location that will be affected are not known with certainty. Consequently, it is necessary to assume a spatial probability distribution associated with the locations that will be affected at a given time instant. Once this second probability is introduced, prior work on principled probability estimates is no longer applicable. A new formulation is needed: a spatiotemporal framework to estimate the probability of occurrence of moving or spreading threats that accounts for the dynamics of the spatial probability field of the area at risk over time.

One can develop the spatiotemporal formulation first by building an ontology of threat precursors with identified features. Each of the features signals a probability distribution of threats in space and time. Threat probabilities can be combined from multiple features and the dynamics among these features to estimate the probability of a threat occurring at a particular location within a specific time window. One factor that needs to be considered is that even if the spatial and temporal distribution of the locations that will be affected by a particular feature is estimated, whether the feature will lead to the threat still needs to be estimated. This, of course, is a problem that has been thoroughly addressed in the literature on data mining algorithms.

The formulation can be described through an analogy with a set of billiard balls (see Figure 2), asking: what is the future position of any ball on the table? Each grid point in the set of continuous fields is considered akin to a ball, and the driving force of a set of points, such as a storm cell boundary, is considered akin to the ring.

Fig. 2. The probabilistic formulation assumes rigid bodies operated upon by a driving force that permits individual degrees of motion.

We make the simplifying assumption of space-time stationarity: the probability distribution of a particle in space is identical to the probability distribution of the particle in time (see Figure 3). Loosely, this assumption is similar to assuming that the probability distribution obtained by tossing 50 coins is the same as the probability distribution obtained by tossing the same coin 50 times. This assumption has to be made on faith in situations where one does not have 50 coins; we will never have 50 identical thunderstorms or hazardous events.

By making the assumption of space-time stationarity and from the motion estimated by the hybrid technique described in the next section, one can formulate the spatial probability distribution of a ball on the billiards table.

If motion estimates of a moving potential threat are available, the variability of the motion estimates themselves can be used to gauge the probability distribution of the motion estimates (Figure 3). Under stationarity assumptions, the historical probability distribution can be used to estimate the future spatial probability distribution of this current billiard ball.
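A minimal sketch of this step under our own simplifying assumptions (Gaussian fits to the u and v histories and independence of the two components, mirroring the p = P(u)P(v) construction of Figure 3):

    import numpy as np

    def displacement_pdf(u_history, v_history, t, offsets=np.arange(-50, 51)):
        # Fit the temporal spread of the motion estimates (u, v in km/min)
        # and, invoking space-time stationarity, reuse it as the spatial
        # distribution of a grid cell's displacement t minutes ahead.
        def gaussian(x, mu, sd):
            sd = max(sd, 1e-6)                       # guard degenerate spread
            return np.exp(-0.5 * ((x - mu) / sd) ** 2)
        pu = gaussian(offsets, np.mean(u_history) * t, np.std(u_history) * t)
        pv = gaussian(offsets, np.mean(v_history) * t, np.std(v_history) * t)
        pdf = np.outer(pv / pv.sum(), pu / pu.sum())  # independence assumed
        return pdf                                    # indexed [dy, dx], in km

    # e.g. 30-minute-ahead displacement distribution of one grid cell
    pdf = displacement_pdf([0.5, 0.6, 0.4, 0.55], [0.1, 0.0, 0.2, 0.1], t=30)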

There is one further complication, however. The future expected position of a billiard ball depends on how tightly the balls are packed or on how much control the forcing function exerts. If the balls are loosely packed, the movements of different balls are independent. If the balls are tightly packed, all the balls move together. The actual movement of the balls will be somewhere in between the two extremes. The actual balance depends on the problem at hand and will likely be estimated from the data.

If X is the event of interest (such as the probability that there will be a tornado at a particular location within the next 30 minutes), then simple Bayesian analysis yields:

P(X) = P(X|packed) × P(packed) + P(X|notpacked) × (1 − P(packed))    (1)

The first quantity, P(X|packed), the probability of lightning given that all the grid points associated with a thunderstorm move together, can be estimated quite readily using numerical integration of the probability distribution over the motion vectors that would yield a lightning strike at this location, based on where the threats are currently present.

Fig. 3. By assuming that the probability distribution in space is identical to the probability distribution in time (stationarity), the probability distribution estimated in time from historical data can be used to create a spatial probability field. The figure shows how the data values of one component of the velocity vector in time (top) are used to estimate a probability distribution of that component (bottom left). This probability distribution, though estimated from the time field, is considered the spatial distribution in order to formulate the probabilistic location of a grid cell (bottom right) at time T into the future.

The third quantity, P(X|notpacked), the probability of lightning given that the grid points associated with a thunderstorm move independently (in other words, the state transition from a cell to its neighbor is independent of other cells in a thunderstorm), can be computed through probability analysis:

P(X|notpacked) = 1 − (1 − P(x1)) × (1 − P(x2)) × ... × (1 − P(xN))    (2)

where P(x1), P(x2), etc. are the spatial probabilities that the individual grid points will end up impacting this location. The weighting factor P(packed) needs to be estimated from historical data. We anticipate that P(packed) will depend upon the type of event under consideration and needs to be estimated from the data; a single number may not work for all problems.
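Equations (1) and (2) combine directly; a minimal sketch (the numbers in the usage line are made up for illustration):

    def threat_probability(p_packed, p_event_if_packed, p_cell_impacts):
        # Equation (2): the grid points move independently, so the location
        # is hit unless every one of them misses it.
        p_miss = 1.0
        for p in p_cell_impacts:
            p_miss *= (1.0 - p)
        p_event_if_not_packed = 1.0 - p_miss
        # Equation (1): weight the two regimes by P(packed).
        return (p_event_if_packed * p_packed
                + p_event_if_not_packed * (1.0 - p_packed))

    # e.g. P(packed)=0.4, P(X|packed)=0.7, three cells that could drift here
    print(threat_probability(0.4, 0.7, [0.10, 0.05, 0.20]))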

4. Estimating Movement

There are two broad methods of estimating movement: (a) optical flow methods, including cross-correlation and spectral methods, that minimize a displacement error over sub-grids of two frames of a sequence of images;9,10 and (b) object tracking methods that identify objects in the frames of a sequence and correlate these objects across time.11 The object tracking methods provide the ability to track small-scale features and extract trends associated with those features, but miss large-scale transformations. In comparison, the optical flow methods yield more accurate estimates over large scales, but miss fine-scale movements and do not provide the ability to extract trends.12

Nevertheless, trends are very important in a data-mining predictive approach. A measured value may bear no correlation with whether a threat will materialize, but sharp increases or decreases of some measurements are often reliable indicators of coming threats. Thus, even though optical flow methods can yield more accurate estimates of movement than object-tracking methods, optical flow methods cannot be directly applied to problem domains in which object-specific motion and trend information is critical.

One way to achieve the high accuracy of optical flow methods while retaining the trending capability of object-based tracking is to create a hybrid technique. Objects can be found by clustering the input features using a multi-scale hierarchical approach.12 This hybrid technique does not correlate objects across frames as in a typical object-tracking scenario. Instead, the objects in the current frame are correlated with the images in the previous frame, using the current shape of the object itself as the domain over which the displacement error is minimized. To extract trends, the object domain can then simply be displaced by the temporally adjusted amount in previous frames, and the change in statistical properties of the object computed and used for prediction. Such a hybrid technique, as Figure 4 illustrates, neatly sidesteps the problems associated with splitting and merging that are commonly associated with object-tracking methods. Motion estimates from pairs of frames are smoothed across time using a Kalman filter.13
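A minimal sketch of the matching step (exhaustive search over integer shifts, with the object's own support as the matching domain; a real system would add the multi-scale clustering of12 and the Kalman smoothing of,13 neither of which is shown):

    import numpy as np

    def object_displacement(obj_mask, prev_frame, curr_frame, max_shift=10):
        # Find the integer (dy, dx) such that the object's pixels in the
        # current frame best match the previous frame shifted by (dy, dx),
        # minimizing mean absolute error over the object's support only.
        ys, xs = np.nonzero(obj_mask)
        h, w = prev_frame.shape
        best, best_err = (0, 0), np.inf
        for dy in range(-max_shift, max_shift + 1):
            for dx in range(-max_shift, max_shift + 1):
                py, px = ys - dy, xs - dx
                ok = (py >= 0) & (py < h) & (px >= 0) & (px < w)
                if not ok.any():
                    continue
                err = np.mean(np.abs(curr_frame[ys[ok], xs[ok]]
                                     - prev_frame[py[ok], px[ok]]))
                if err < best_err:
                    best, best_err = (dy, dx), err
        return best   # displacement per frame interval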

Page 49: 01.AdvancesinPatternRecognition

30 A Technique for Creating Probabilistic Spatio- Temporal Forecasts

Fig. 4. A hybrid tracking technique of correlating objects in the current frame to pixels in the previous frame can yield trend information as well as the high accuracy associated with optical flow methods.

Fig. 5. Top row: values of Vertical Integrated Liquid (VIL) in kg/m2 and reflectivity (Z) in dBZ at T = t0. Middle row: probabilistic forecasts of VIL ≥ 20, Z ≥ 0 and Z ≥ 30 at T = t0 + 30. Bottom row: actual values of VIL and Z at T = t0 + 30.

5. Creating a spatio-temporal probability forecast

In the short term (under 60 minutes), extrapolation forecasts of storm location often provide reasonable results. Therefore, a data mining approach to probabilistic forecasts of severe weather hazards holds promise of being able to predict where already-occurring severe weather is likely to be. At the same time, research studies have shown that it is possible to predict, with a high degree of skill, which storms are likely to initiate lightning (one measure of storm severity) and where new storms are likely to form. What has been missing is a formal mechanism for combining these two disparate sets of studies. Our spatiotemporal formalism provides the framework to combine the two probabilities to yield, for example, a lightning probability forecast.

The suggested approach is to apply the various components described above to train the system as follows:

• Cluster the input remotely sensed data into candidate regions.

• Using the hybrid motion estimation technique, associate current severe weather to threat signals that occurred N minutes ago.

• Train the associated threat signals to the severe weather activity using a neural network capable of providing predictive probabilities.


Fig. 6. Reliability and sharpness of Vertical Integrated Liquid (VIL in kg/m2) and radar reflectivity (Z in dBZ) forecasts for a Mar 8, 2002 case in the central United States.

• Use the spatiotemporal framework and the motion estimates derived from the hybrid technique to estimate spatial probabilities.

• Estimate the weighting factor P(packed) to create optimal (in terms of reliability and sharpness) probability forecasts.

The trained model and pattern recognition techniques can be applied in real time on routinely arriving satellite, radar and model data to predict up to N minutes into the future. The resulting probabilistic forecasts can be evaluated using three metrics: (a) reliability, (b) sharpness, and (c) receiver operating characteristic (ROC) curves, created by setting risk thresholds that vary from 0 to 1 and determining the probability of detection and false alarm rate on a grid-point by grid-point basis.
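As an indication of how the reliability and sharpness metrics can be computed, here is a minimal sketch under the assumption of gridded forecast probabilities and binary observations; the ten-bin partition is a conventional choice, not one prescribed by the paper.

```python
import numpy as np

def reliability_sharpness(forecast_prob, observed, n_bins=10):
    """Per-bin (mean forecast probability, observed frequency, count).

    A reliable forecast has observed frequency close to the mean forecast in
    every bin; a sharp forecast concentrates counts near probabilities 0 and 1.
    """
    f = np.asarray(forecast_prob, dtype=float).ravel()
    o = np.asarray(observed, dtype=float).ravel()
    bins = np.clip((f * n_bins).astype(int), 0, n_bins - 1)
    table = []
    for b in range(n_bins):
        sel = bins == b
        if sel.any():
            table.append((f[sel].mean(), o[sel].mean(), int(sel.sum())))
    return table
```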

6. Results and Conclusion

The technique of the paper was applied to predicting the spatial distribution of vertically integrated liquid (VIL14) and radar reflectivity 30 minutes into the future. Examples of the original, forecast and actual images are shown in Figure 5. The reliability and sharpness diagrams, evaluated over a 6-hour period, are shown in Figure 6.

Reflectivity forecasts were sharp, but not reliable: the reflectivity forecasts were usually underestimates of the true probability. The VIL probability forecasts were both reliable and sharp, so a probabilistic forecast of VIL would be of high utility to users of weather information. The lead time of 30 minutes is longer than is possible using deterministic forecast techniques. Thus, the technique described in this paper works for severe weather diagnostics (such as VIL), but not for any weather variable (as highlighted by the poor performance on reflectivity). Future work will involve testing against more useful diagnostics of severe weather: the initiation of lightning and the potential for a tornado.

Acknowledgments

Funding for this research was provided under NOAA-OU Cooperative Agreement NA17RJ1227. The statements, findings, conclusions, and recommendations are those of the authors and do not necessarily reflect the views of the National Severe Storms Laboratory (NSSL) or the U.S. Department of Commerce.

References

1. A. H. Murphy, The value of climatological categories and probabilistic forecast in the cost-loss ratio situations, Monthly Weather Review, 803 (1977).

2. I. Adrianto, T. M. Smith, K. A. Scharfenberg and T. Trafalis, Evaluation of various algorithms and display concepts for weather forecasting, in 21st Int'l Conf. on Inter. Inf. Proc. Sys. (IIPS) for Meteor., Ocean., and Hydr. (Amer. Meteor. Soc., San Diego, CA, Jan. 2005).

3. M. Richard and R. P. Lippmann, Neural network classifiers estimate Bayesian a posteriori probabilities, Neural Computation 3, 461 (1991).

4. C. Bishop, Neural Networks for Pattern Recognition (Oxford, 1995).

5. J. Platt, Advances in Large Margin Classifiers (MIT Press, 1999), ch. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods.

6. J. Prevost and R. Popescu, Constitutive relations for soil materials, Electronic Journal of Geotechnical Engineering (1996).


7. M. Oliver and R. Webster, Kriging: a method of interpolation for geographical information systems, Int. J. Geographical Information Systems 4, 313 (1990).

8. P. Goovaerts, Geostatistical analysis of disease data: visualization and propagation of spatial uncertainty in cancer mortality risk using poisson kriging and p-field simulation, Int. J. Health Geography 5 (2006).

9. J. Barron, D. Fleet and S. Beauchemin, Performance of optical flow techniques, Int'l J. Comp.Vis. 12, 43 (1994).

10. J. Tuttle and R. Gall, A single-radar technique for estimating the winds in tropical cyclones, Bull. Amer. Met. Soc. 80 (Apr. 1999).

11. M. Dixon, Automated storm identification, tracking and forecasting - a radar-based method, PhD thesis, University of Colorado and National Center for Atmospheric Research, 1994.

12. V. Lakshmanan, R. Rabin and V. DeBrunner, Multi-scale storm identification and forecast, J. Atm. Res., 367 (July 2003).

13. R. Kalman, A new approach to linear filtering and prediction problems, Trans. ASME - J. Basic Engr., 35 (March 1960).

14. D. R. Greene and R. A. Clark, Vertically integrated liquid water - A new analysis tool, Mon. Wea. Rev. 100, 548 (1972).


PART C

Biometrics


An Efficient Measure for Individuality Detection in Dynamic Biometric Applications

B. Chakraborty

Faculty of Software and Information Science Iwate Prefectural University

Japan 020-0193 E-mail: [email protected]

Y. Manabe

Graduate School of Software and Information Science Iwate Prefectural University

Japan 020-0193

Individuality detection from dynamic information is needed to counter the threat of forgery in biometric authentication systems. In this work a new similarity measure is proposed for discriminating genuine online handwriting from forgeries using the dynamic time series signal. A simulation experiment shows that the measure is more effective than DP matching for measuring intraclass and interclass variation among several writers for writer verification by online handwriting.

Keywords: Dynamic Biometrics; Individuality Detection; Translation Error; Similarity Measure.

1. Introduction

Biometric technologies are rapidly gaining importance with the growing need for information security. Current biometric technologies for personal identity verification are mostly based on static information such as the face, fingerprint, iris, blood vessels, etc. Due to the increasing threat of forgery it is becoming necessary to extend biometric technologies to include dynamic information like hand gesture, body movement or online handwriting.

The main point of an efficient authentication system is to minimize the false acceptance rate as well as the false rejection rate. One of the key problems in biometrics is intraclass variation: nearly no two structures are completely identical, although they are of the same origin. To achieve successful identity detection, intraclass variation should be as small as possible while interclass variation is as large as possible. In the case of dynamic biometric technologies such as writer verification by online handwriting, one generally has to deal with multivariate time series in order to include dynamical information. In the process of verification, we need to find the similarity or dissimilarity between two time series (signals) in order to discriminate genuine handwriting from forgeries. The similarity measures used for static feature values (expressed as multidimensional vectors) in classification/identification problems cannot be used directly for calculating similarities between two multivariate time series signals.

In this work a similarity measure for comparing two trajectories has been proposed which can be applied to individuality detection by distinguishing the online handwriting of different persons based on dynamic information. The proposed measure is based on the translation error defined by Wayland et al.1 for detecting determinism in a time series. Simulation experiments have been done for the problem of writer identification by online handwriting, and the usefulness of the measure has been shown in comparison to the popular DP matching method in measuring intraclass and interclass variation among several writers. The next section presents a brief introduction to writer verification by online handwriting, followed by the proposed algorithm for measuring similarity between trajectories generated by pen-tip movement on a writing tablet. Section 3 contains the simulation experiments and results. Section 4 presents conclusions and future directions.

2. Dynamic Biometrics: Online Handwriting

Online signature verification as a means of person identification has been under intense investigation for a long time, and many research papers have been published in this area.2,3,4 Automatic writer authentication methods using online handwriting are broadly classified


into two categories: function based and parameter based.5 In parametric approaches, only the parameters abstracted from the signal are used. Though they are simpler, with high computation speed, the task of selecting the right parameters is difficult, and the dynamical information is lost during processing. In the functional approach the two complete signals are compared, which generally yields better results.6 For the past two decades the use of DTW (Dynamic Time Warping), based on a DP matching algorithm that finds the best matching path, in terms of the least global cost, between an input signal and a template, has become a major technique in signature verification.7,8 Though DTW is quite successful, it has two main drawbacks: heavy computational load and warping of forgeries. Other popular recent methods are based on HMM.9

In this paper we consider online handwriting as a multivariate time series signal generated by the pen movement on the writing pad; the variables are the x, y coordinates of the pen-tip movement, writing pressure, pen inclination with respect to the x and y co-ordinates, etc. The path of the hand movement during the pen-up position is also interpolated in the time series along with the movement in the pen-down position. A similarity measure for measuring the similarity between the generated time series signals corresponding to two samples of the same writing has been proposed and is explained in the following section.

2.1. Proposed Algorithm of Measuring Similarity

The time series signal generated by online handwriting is considered to originate from the dynamics of the hand movement. The trajectories of a given piece of writing should be similar for the same writer but different for different writers, as handwriting is considered to depend on the individual. Here we propose a novel measure for evaluating the similarity of time series signals based on the translation error originally proposed by Wayland. The Wayland test1 is a widely used method for detecting determinism in a time series. We have modified Wayland's algorithm in order to evaluate similarity between handwriting time series. The proposed measure is based on the following concepts.

(1) A multi-dimensional trajectory can be constructed by the delay coordinate embedding method from an online handwriting signal.

(2) The constructed trajectory reflects the individuality of the time series.

(3) Several multi-dimensional trajectories constructed from several handwriting time series of the same characters generated by a specific writer have almost the same dynamics, within a certain error.

(4) Thus, the similarity between time series signals can be evaluated by the translation error of the constructed trajectories.

A deterministic time series signal {s(t)}, t = 1, ..., n, can be embedded as a sequence of time-delay coordinate vectors v_s(t), known as an experimental attractor, with an appropriate choice of embedding dimension m and delay time τ for reconstruction of the original dynamical system, as follows:

v_s(t + 1) = f(v_s(t)),   v_s(t) = (s(t), s(t + τ), ..., s(t + (m - 1)τ))

where f denotes the reconstructed system, which has a one-to-one correspondence with the original system. Though the present work is not concerned with detecting determinism in a time series, delay vector embedding is used here for extracting the local wave pattern of the time series, as shown in Fig. 1, which acts as a local feature.

Fig. 1. Delay Coordinate Embedding and Local Wave Form


Fig. 2. Translation in Multi-Dimensional Phase space

We propose a measure for calculating the distance between trajectories for the a-th sample and the b-th sample of the same piece of writing, from the translation of delay vectors in the reconstructed space, based on the translation error defined in the Wayland test, as follows:

(1) Let s_i(t) be the time series signal generated from online handwriting and v_si(t) the corresponding embedded vector for the i-th sample; v_sa(t) and v_sb(t) denote the delay vectors for the a-th and b-th samples respectively.

(2) A random vector v_sa(k) is chosen from v_sa(t). Let the nearest vector to v_sa(k) from v_sb(t) be v_sb(k'), where |k - k'| < T_th, with T_th a threshold value ensuring a small region for the nearest-neighbour search (shown in Fig. 2).

(3) For the vectors v_sa(k) and v_sb(k'), the transition in each orbit after one step is calculated as follows:

V_sa(k) = v_sa(k + 1) - v_sa(k),    (1)

V_sb(k') = v_sb(k' + 1) - v_sb(k').    (2)

(4) The translation error e_trans is calculated from V_sa(k) and V_sb(k') as

e_trans = (1/2) ( |V_sa(k) - V̄| / |V̄| + |V_sb(k') - V̄| / |V̄| ),    (3)

where V̄ denotes the average of V_sa(k) and V_sb(k').

(5) e_trans is calculated L times for different selections of the random vector v_sa(k), and the median of e^i_trans (i = 1, 2, ..., L) is calculated as

M(e_trans) = Median(e^1_trans, ..., e^L_trans).    (4)

The final translation error E_trans is calculated by taking the average over M repetitions of the above procedure, to suppress the statistical error generated by the random sampling in the previous step:

E_trans = (1/M) Σ_{j=1}^{M} M_j(e_trans).

E_trans will be different if we interchange the a-th and b-th samples in step 2 of the above algorithm, i.e. if the random vector is chosen from the delay vectors of the b-th sample. Wayland demonstrated that the translation error tends to 0 if the time series is deterministic and tends to 1 if it is random.
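For clarity, a compact Python sketch of the proposed measure follows; it assumes the constraint |k - k'| < T_th restricts the nearest-neighbour search to a temporal window around k (our reading of step 2), and the parameter defaults echo the values reported in Section 3.

```python
import numpy as np

def delay_embed(s, m=3, tau=1):
    """Delay-coordinate embedding: v(t) = (s(t), s(t+tau), ..., s(t+(m-1)tau))."""
    n = len(s) - (m - 1) * tau
    return np.stack([np.asarray(s)[i * tau:i * tau + n] for i in range(m)], axis=1)

def translation_error(sa, sb, m=3, tau=1, t_th=10, L=50, M=10, seed=0):
    """Translation error between two handwriting time series (a sketch)."""
    rng = np.random.default_rng(seed)
    va, vb = delay_embed(sa, m, tau), delay_embed(sb, m, tau)
    n = min(len(va), len(vb)) - 1           # k + 1 must exist in both orbits
    medians = []
    for _ in range(M):
        errs = []
        for _ in range(L):
            k = int(rng.integers(0, n))     # step 2: random vector from sample a
            lo, hi = max(0, k - t_th + 1), min(n, k + t_th)
            k2 = lo + int(np.argmin(np.linalg.norm(vb[lo:hi] - va[k], axis=1)))
            Va = va[k + 1] - va[k]          # Eq. (1)
            Vb = vb[k2 + 1] - vb[k2]        # Eq. (2)
            Vbar = (Va + Vb) / 2.0
            denom = np.linalg.norm(Vbar) + 1e-12
            errs.append(0.5 * (np.linalg.norm(Va - Vbar)
                               + np.linalg.norm(Vb - Vbar)) / denom)  # Eq. (3)
        medians.append(np.median(errs))     # Eq. (4)
    return float(np.mean(medians))          # final E_trans
```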

2.2. Measure for Individuality Detection

The translation error E_trans defined above is used here to define a measure of variation between the handwriting samples of the same and different writers, as follows:

(1) Intra-sample variation of the similarity measure for a single writer.

(2) The distance between the intra-sample distribution and the inter-sample distribution of several writers.

Now, for the calculation of (1), online handwriting samples of a particular piece of writing by a particular writer have to be taken N times. The translation error between the N(N - 1) ordered pairs of samples has to be computed to get the distribution of individual variation for that writer. The coefficient of variation for N samples is defined as

CV = σ_Sim / μ_Sim,

μ_Sim = (1 / N(N - 1)) Σ_{i=1}^{N(N-1)} Sim_i,

σ_Sim = sqrt( (1 / N(N - 1)) Σ_{i=1}^{N(N-1)} (Sim_i - μ_Sim)^2 )

where σ_Sim and μ_Sim denote the standard deviation and mean of the distribution of translation error over the N samples.

For the calculation of (2), online handwriting samples of a particular piece of writing by several (K) writers have to be taken N times each. The difference between the distribution of translation error for the same writer


(number of samples N(N - 1), group P) and the distribution of translation error between this writer and the other (K - 1) writers (number of samples NK, group Q) is calculated by the F-ratio as follows:

MS_inter = SS_inter / (2 - 1),

MS_intra = SS_intra / (N(N - 1) + NK - 2),

SS_inter = N(N - 1)(μ_P - μ_PQ)^2 + NK(μ_Q - μ_PQ)^2,

SS_intra = N(N - 1)σ_P^2 + NKσ_Q^2,

with F_ratio = MS_inter / MS_intra.

μ_P, μ_Q and μ_PQ represent the averages of group P, group Q, and groups P and Q pooled together as one group, while σ_P^2 and σ_Q^2 represent the variances of groups P and Q respectively. For correct identification of a writer, MS_inter should be as high as possible while MS_intra should be as low as possible, making the F-ratio high for good authentication results.
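The two statistics are simple to compute once the translation errors have been collected; the sketch below assumes the intra-writer errors (group P) and inter-writer errors (group Q) are passed as flat arrays.

```python
import numpy as np

def coefficient_of_variation(intra_errors):
    """CV = sigma_Sim / mu_Sim over the N(N-1) intra-writer translation errors."""
    e = np.asarray(intra_errors, dtype=float)
    return e.std() / e.mean()

def f_ratio(group_p, group_q):
    """F-ratio between the intra-writer (P) and inter-writer (Q) distributions."""
    p, q = np.asarray(group_p, float), np.asarray(group_q, float)
    n_p, n_q = len(p), len(q)               # N(N-1) and NK in the paper
    mu_pq = np.concatenate([p, q]).mean()   # pooled mean of P and Q together
    ss_inter = n_p * (p.mean() - mu_pq) ** 2 + n_q * (q.mean() - mu_pq) ** 2
    ss_intra = n_p * p.var() + n_q * q.var()
    ms_inter = ss_inter / (2 - 1)
    ms_intra = ss_intra / (n_p + n_q - 2)
    return ms_inter / ms_intra
```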

3. Simulation Experiment and Results

3.1. Data Preparation

In order to evaluate the efficiency of the proposed similarity measure, a small-scale simulation experiment has been done with 5 writers (A, B, C, D, E), all of whom wrote the word 'Software' in katakana, a Japanese alphabet system known to be difficult for capturing the individuality of handwriting. A sample is shown in Fig. 3. The writers used a WACOM "Intuos 3" tablet for writing, and each writer wrote the word 10 times in each of two sittings, producing 100 samples in all. In this study we used only the x-y coordinates of the pen-tip movement. The tablet produces 200 points per second, which is too many for computation.

Fig. 3. Handwriting Character samples

3.2. Pre-processing

Preprocessing has been done to sample non-equidistant points such that points in curved regions are taken into account more than points in relatively straight portions. If the angle between the line joining (x_{t-1}, y_{t-1}) and (x_t, y_t) and the line joining (x_{t-1}, y_{t-1}) and (x_{t+1}, y_{t+1}) is less than a certain threshold (θ_th), the point (x_t, y_t) is dropped. Secondly, the time series values of the co-ordinates x(t) and y(t) are normalized to lie between 0 and 1.
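A sketch of this preprocessing step follows; the treatment of the first and last points, which the paper does not specify, is an assumption.

```python
import numpy as np

def preprocess(points, theta_th=1.0):
    """Drop points on nearly straight segments, then normalize to [0, 1].

    points   : (T, 2) array of pen-tip (x, y) samples
    theta_th : angle threshold in degrees (the paper uses theta_th = 1.0)
    """
    pts = np.asarray(points, dtype=float)
    keep = [0]                                   # always keep the first point
    for t in range(1, len(pts) - 1):
        v1 = pts[t] - pts[t - 1]                 # line (p_{t-1}, p_t)
        v2 = pts[t + 1] - pts[t - 1]             # line (p_{t-1}, p_{t+1})
        c = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12)
        angle = np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))
        if angle >= theta_th:                    # keep points where the curve bends
            keep.append(t)
    keep.append(len(pts) - 1)                    # always keep the last point
    pts = pts[keep]
    # Normalize x(t) and y(t) independently to lie between 0 and 1.
    return (pts - pts.min(axis=0)) / (np.ptp(pts, axis=0) + 1e-12)
```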

3.3. Simulation Experiment

Translation error (TE) and DP matching distance values are calculated for the 90 trajectory pairs corresponding to the 10 samples of each particular writer, and the 400 trajectory pairs between a particular writer and the other writers. That is, group P has 90 samples and group Q has 400 samples. For the calculation of translation error (TE), the delay vectors of the trajectories are constructed with the simple assumption of embedding dimension m = 3 and delay τ = 1. The other parameters are set as follows:

θ_th = 1.0, L = 50 and M = 10 (in step 5 of the proposed algorithm).

3.4. Simulation Results

Table 1 presents the intra-writer variability values using translation error and DP matching as the measures. The ratio of the standard deviation to the average value of the distribution of both measurements is taken for calculating the index. It is seen that for both time series, translation error is a better measure for identity detection. Table 2 presents the average F_ratio of each writer in comparison with the rest of the writers. Here also translation error is found to be better than DP matching. Especially for writers C and D, the low values of F_ratio using the DP measure for the x co-ordinate time series indicate that the intra-writer variation compared to the inter-writer variation is too high to discover any individuality, and thus DP cannot be used for personal identification. The F_ratio for the y(t) values seems better, but the F_ratio based on the TE measure for both time series is far better than the values based on the DP measure. Thus translation error can be used to detect the individuality of writers C and


D, as the F_ratio corresponding to TE for writers C and D is quite high. In fact, in our experiment writers C and D were novices in using a writing tablet, and their sample writings show greater variation from sample to sample compared to writers A, B and E. It seems that in spite of the variation in sample writing, TE is a better measure than DP for identifying the particular writer.

4. Conclusion

In this work translation error (TE), a measure for detecting determinism in a time series, is used to define a distance measure, and an algorithm for calculating the distance between the trajectories of two online handwriting samples of the same piece of writing has been proposed. The effectiveness of the measure for possible use as a dynamic biometric technology for identity detection has been examined in comparison to one of the popular techniques, the DP matching algorithm, on the writer verification problem with online handwriting.

Table 1. Intra-writer variability for different writers.

            Time Series X       Time Series Y
Writer      TE      DP          TE      DP
A           0.096   0.213       0.104   0.252
B           0.152   0.232       0.139   0.301
C           0.160   0.334       0.126   0.212
D           0.118   0.223       0.102   0.193
E           0.126   0.237       0.127   0.253
Average     0.130   0.248       0.120   0.242

In the simple experiment conducted here it is found that translation error based similarity or distance calculation of trajectories is better than the DP matching based measure. This measure can also be used for feature evaluation for multivariate time series: in our experiment the values in the tables indicate that x trajectories are better than y trajectories for identity detection. The measure can also be extended to multivariate time series for authentication problems, using all the time series information combined together. At present we are conducting experiments with a larger number of writers and would like to apply this measure to the writer verification problem from online handwriting.

Table 2. F_ratio of one writer with the rest of the writers.

            Time Series X        Time Series Y
Writer      TE       DP          TE       DP
A           346.88   178.24      235.61   104.11
B           210.11   185.85      144.34   141.84
C           104.76   4.68        91.88    107.30
D           167.06   0.14        142.75   39.49
E           244.39   85.04       92.91    210.85
Average     212.84   90.79       141.30   120.72

References

1. R. Wayland et al., Physical Review Letters 70(5), 580 (1993).
2. V. S. Nalwa, Proc. of the IEEE 85, 215 (1997).
3. A. Khalmatov and B. Yanikoglu, Pattern Recognition Letters 26(15), 2400 (2005).
4. http://iris.usc.edu/Vision-Notes/Bibliography/char1012.html
5. F. Leclerc and R. Plamondon, International Journal of Pattern Recognition and Artificial Intelligence 8(3), 643 (1994).
6. R. Plamondon and G. Lorette, Pattern Recognition 22, 107 (1989).
7. P. Zhao et al., IEICE Trans. Inf. & Syst. E79-D(5), 535 (1996).
8. H. Feng and C. C. Wah, Pattern Recognition Letters 24(16), 2943 (2003).
9. M. M. Shafie and H. R. Rabiee, Proc. of ICDAR 2003 (2003).


Divide-and-Conquer Strategy Incorporated Fisher Linear Discriminant Analysis: An Efficient Approach for Face Recognition

S. Noushath*, G. Hemantha Kumar and V. N. Manjunath Aradhya

Department of Studies in Computer Science University of Mysore

Mysore-570006, INDIA E-mail: [email protected]*

P. Shivakumara

Department of Computer Science School of Computing

National University of Singapore Singapore

E-mail: [email protected]

Fisher linear discriminant analysis (FLD) is one of the most popular feature extraction methods in pattern recognition; it obtains a set of so-called projection directions such that the ratio of the between-class and the within-class scatter matrices reaches its maximum. However, in reality the dimension of the patterns is so high that the conventional way of obtaining Fisher projections makes the computation a tedious task. To alleviate this problem, in this paper, divide-and-conquer strategy incorporated FLD (dcFLD) is presented with two objectives: one is to sufficiently utilize the contribution made by local parts of the whole image, and the other is to still follow the same simple mathematical formulation as FLD. In contrast to the traditional FLD method, which operates directly on the whole pattern represented as a vector, dcFLD first divides the whole pattern into a set of subpatterns and acquires a set of projection vectors for each partition to extract corresponding local sub-features. These local sub-features are then conquered to obtain global features. Experimental results on several image databases comprising faces and objects reveal the feasibility and effectiveness of the proposed method.

Keywords: Divide-and-conquer Strategy; Fisher Linear Discriminant Analysis; Principal Component Analysis; Face Recognition; Object Recognition

1. Introduction

Principal Component Analysis (PCA)1 and Fisher Linear Discriminant analysis (FLD),2 respectively known as the eigenface and Fisherface methods, are the two state-of-the-art subspace methods in face recognition. Using these techniques, a face image is efficiently represented as a feature vector of low dimensionality. The features in such a subspace provide more salient and richer information for recognition than the raw image. It is this success that has made face recognition (FR) based on PCA and FLD very active, although they have been investigated for decades.

To further exploit the potential of the PCA and FLD methods, new techniques called 2DPCA5 and 2DLDA3,4 were proposed. (Refs. 3 and 4 are two similar methods found in the literature, referred to respectively as 2DFLD and 2DLDA; in all our discussions, whenever we refer to Ref. 4 it also implies Ref. 3 and vice versa.) Although these methods proved to be efficient in terms of both computational time and accuracy, a vital unresolved problem is that they require a huge feature matrix for the representation. To alleviate this problem, the (2D)2LDA6 method was proposed, which gave the same or even higher accuracy than the 2DLDA method. Further, it has been shown in Ref. 6 that 2DLDA essentially works in the row direction of images. In this way, alternative 2DLDA (A2DLDA)6 was also proposed, which works in the column direction of images. By simultaneously combining both row and column directions of images, 2-directional 2DLDA, i.e. (2D)2LDA, was proposed. Unlike the 2DLDA and A2DLDA methods, the DiaFLD7 method seeks optimal projection vectors along the diagonal direction of face images, incorporating both row and column information at the same instant. Furthermore, 2DFLD and DiaFLD were combined in Ref. 7 to achieve efficiency in terms of both accuracy and storage requirements.

In spite of the success achieved by the above-mentioned variations3,4,6,7 of the original FLD method, there are still some serious flaws that need


to be addressed. The main disadvantage of the FLD method is that, when the dimensionality of the given pattern is very large, extracting features directly from these large-dimensional patterns causes processing difficulties, such as the computational complexity of the large-scale scatter matrices constructed from the training set. Furthermore, because it utilizes only the global information of images, it is not effective under extreme facial expression, illumination conditions, pose, etc. Hence, in this paper, we have made a successful attempt to overcome the aforesaid problems by first partitioning a face image into several smaller sub-patterns (sub-images), and then applying a single FLD to each of them. It has been reported that changes due to lighting conditions and facial expressions emphasize specific parts of the face more than others.8 Consequently, variations in illumination or expression will only affect some sub-images rather than the whole image, and therefore the local information of a face image may be better represented. Furthermore, the sub-pattern dividing process can also help in increasing the diversity, making the common and class-specific local features easier to identify.9 This means that the different contributions made by different parts of images are more emphasized, which in turn helps to enhance the robustness to both illumination and expression variation. These are the reasons which motivated us to adopt the subpattern dividing process to overcome the aforementioned drawbacks of the original FLD method.

In the first step of this method, an original whole pattern denoted by a vector is partitioned into a set of equally sized sub-patterns in a non-overlapping way, and all the sub-patterns sharing the same original feature components are collected from the training set to compose a corresponding sub-pattern training set. In the second step, FLD is performed on each such sub-pattern training set to extract its features. At last, a single global feature is synthesized by concatenating the FLD-projected features of all sub-patterns. Finally, a nearest neighbor classifier is used for the subsequent recognition. Experiments on different image databases will provide a comparison of the classification performance of dcFLD with other linear discrimination methods.

The rest of the paper is organized as follows: the algorithm is detailed in Section 2; experiments are carried out in Section 3 to evaluate dcFLD and other subspace analysis methods using a wide range of image databases; finally, conclusions are drawn in Section 4.

2. Proposed dcFLD

There are three main steps in the proposed dcFLD algorithm: (1) partition the face image, denoted by a vector, into sub-patterns; (2) perform FLD subpattern-by-subpattern to extract features from the set of large-dimensional patterns; and (3) classify an unknown image.

2.1. Image Partition

Suppose that there are N training samples A_k (k = 1, 2, ..., N), denoted by m-by-n matrices, which contain C classes, and the i-th class C_i has n_i samples. Now, an original whole pattern denoted by a vector is partitioned into K d-dimensional subpatterns in a non-overlapping way and reshaped into a d-by-K matrix A_i = (A_i1, A_i2, ..., A_iK), with A_ij being the j-th subpattern of A_i, for i = 1, ..., N and j = 1, ..., K. Now, to form the j-th training subpattern set TS_j, we collect the j-th subpatterns of all A_i, i = 1, ..., N. In this way, K separate subpattern sets are formed.

2.2. Apply FLD on K subpatterns

Now, according to the second step, conventional FLD is applied to the j-th subpattern set TS_j to seek the corresponding projection sub-vectors U_j = (u_j1, u_j2, ..., u_jt), selecting the t eigenvectors corresponding to the t largest eigenvalues obtained by maximizing the ratio of the determinants of the between-class and within-class scatter matrices of the projected samples. Analogous to the Fisherface method,2 define the j-th between-class and within-class sub-scatter matrices, G_bj and G_wj, respectively as follows:

G_bj = Σ_{i=1}^{C} n_i (Ā_ij - Ā_j)(Ā_ij - Ā_j)^T    (1)

G_wj = Σ_{i=1}^{C} Σ_{A_kj ∈ C_i} (A_kj - Ā_ij)(A_kj - Ā_ij)^T    (2)

Here, Ā_j = (1/N) Σ_{i=1}^{N} A_ij, j = 1, 2, 3, ..., K, are the sub-pattern means, and Ā_ij = (1/n_i) Σ_{A_kj ∈ C_i} A_kj is the i-th class j-th


subpattern mean, and A_kj is the j-th subpattern of the k-th sample belonging to the i-th class. After obtaining all the individual projection sub-vectors from the partitioned sub-patterns, extract the corresponding sub-features Y_j from each subpattern of a given whole pattern Z = (Z_1, Z_2, ..., Z_K) using the following equation:

Y_j = U_j^T Z_j    (3)

Now synthesize them into a global feature as follows:

Y = (Y_1^T, ..., Y_K^T)^T = (Z_1^T U_1, ..., Z_K^T U_K)^T    (4)

It is interesting to note that when K = 1 and d = m x n, dcFLD reduces to the standard Fisherface method. Thus we can say that the Fisherface method is a special case of the proposed dcFLD method.
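For concreteness, a minimal sketch of the dcFLD training and feature-synthesis steps follows; the use of a pseudo-inverse to solve the generalized eigenproblem is an assumption for numerical safety, since the paper does not state how the Fisher projections are computed.

```python
import numpy as np

def dcfld_train(X, y, K, t):
    """Learn per-subpattern FLD projections (Eqs. 1 and 2).

    X : (N, D) training patterns as row vectors, with D divisible by K
    y : (N,) class labels;  K : number of subpatterns;  t : vectors kept
    Returns a list of K projection matrices U_j, each of shape (d, t).
    """
    N, D = X.shape
    d = D // K
    Us = []
    for j in range(K):
        S = X[:, j * d:(j + 1) * d]               # subpattern training set TS_j
        mean_all = S.mean(axis=0)                 # sub-pattern mean
        Gb = np.zeros((d, d)); Gw = np.zeros((d, d))
        for c in np.unique(y):
            Sc = S[y == c]
            mc = Sc.mean(axis=0)                  # class j-th subpattern mean
            Gb += len(Sc) * np.outer(mc - mean_all, mc - mean_all)   # Eq. (1)
            Z = Sc - mc
            Gw += Z.T @ Z                                            # Eq. (2)
        w, V = np.linalg.eig(np.linalg.pinv(Gw) @ Gb)
        order = np.argsort(-w.real)               # t largest eigenvalues
        Us.append(V[:, order[:t]].real)
    return Us

def dcfld_features(z, Us):
    """Per-subpattern projection (Eq. 3) concatenated into the global feature (Eq. 4)."""
    d = Us[0].shape[0]
    return np.concatenate([Us[j].T @ z[j * d:(j + 1) * d] for j in range(len(Us))])
```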

2.3. Classification

In this process, in order to classify an unknown face image I, the image is first vectorized and then partitioned into K sub-patterns (I_1, I_2, ..., I_K) in the same way as explained in Section 2.1. Using the projection sub-vectors, the sub-features of the test sample I are extracted as follows:

F_j = U_j^T I_j,   ∀ j = 1, 2, ..., K    (5)

Since one classification result for the unknown sample is generated independently in each subpattern, there are in total K results from the K subpatterns. To combine the K classification results from all subpatterns of this face image I, a distance matrix D(I) = (d_ij)_{N×K} is constructed, where d_ij is set to 1 if the identity computed from the j-th subpattern of I and the i-th person's identity are identical, and 0 otherwise. Consequently, a total confidence value that the test image I finally belongs to the i-th person is defined as:

TC_i(I) = Σ_{j=1}^{K} d_ij    (6)

And the final identity of the test image I is determined as follows:

Identity(I) = arg max_i TC_i(I),   1 ≤ i ≤ N    (7)
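The decision rule of Eqs. (5)-(7) can be sketched as follows, assuming (as stated in Section 3) a nearest-neighbour decision within each subpattern.

```python
import numpy as np

def classify(test_subfeats, train_subfeats, train_ids):
    """Majority vote over the K subpatterns (Eqs. 5-7).

    test_subfeats  : list of K sub-feature vectors F_j of the unknown image
    train_subfeats : one list of K sub-feature vectors per training sample
    train_ids      : person identity of each training sample
    """
    ids = np.asarray(train_ids)
    votes = {}
    for j in range(len(test_subfeats)):
        # Nearest-neighbour decision using subpattern j alone.
        d = [np.linalg.norm(test_subfeats[j] - f[j]) for f in train_subfeats]
        winner = ids[int(np.argmin(d))]
        votes[winner] = votes.get(winner, 0) + 1    # d_ij = 1 for the winner
    return max(votes, key=votes.get)                # arg max_i TC_i(I)
```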

2.4. Image Reconstruction

In the whole-pattern based approach, the feature vector and the eigenvectors can be used to reconstruct the image of a face. Similarly, in this sub-pattern based approach, a face image can be reconstructed in the following way:

Ã_j = U_j F_Vj + Ā_j,   ∀ j = 1, ..., K    (8)

where U_j indicates the projection vectors of the j-th sub-pattern obtained through PCA, F_Vj is the feature vector of the image which we are seeking to reconstruct, and Ā_j is the j-th sub-pattern mean.

3. Experiments

In this section, a series of experiments is presented to evaluate the performance of the proposed dcFLD by comparing it with existing methods.1,2,4,7,10,11 All our experiments are carried out on a PC with a P4 3 GHz CPU and 512 MB RAM under the Matlab 7 platform. For all the experiments, the nearest neighbor classifier is employed for classification. Unless stated otherwise, each experiment is repeated 25 times by varying the number of projection vectors t (where t = 1, 2, ..., 20, 25, 30, 35, 40, 45). Since t has a considerable impact on classification performance as well as on the dimension of the subpattern (for the proposed dcFLD method), we choose the t which corresponds to the best classification result on the image set.

3.1. Image databases

The aforementioned algorithms are tested on several image databases comprising faces and objects. We carried out the experiments on two face databases, ORL12 and Yale,2 and also on an object database, namely COIL-20.13 The ORL database contains 400 images of 40 adults, 10 images per person, while the Yale database contains 165 images of 15 adults, 11 images per person. Images in the Yale database feature frontal-view faces with different facial expressions and illumination conditions. Besides these variations, images in the ORL database also vary in facial details (with or without glasses) and head pose.

The COIL-20 is a database of 1440 gray-scale images of 20 objects. The objects were placed on a motorized turntable against a black background. The turntable was rotated through 360 degrees to vary the object pose with respect to a fixed camera. Images of the objects were taken at pose intervals of 5 degrees, which corresponds to 72 images per object.


Table 1. Best recognition accuracy (%) for varying numbers of training samples (rows) and dimension of subpattern d-K (columns); values in parentheses are the numbers of projection vectors used.

p   92-112      112-92      161-64      322-32      644-16      2576-4
2   88.25 (11)  87.00 (12)  87.00 (06)  87.00 (10)  85.50 (07)  82.00 (20)
4   94.50 (11)  94.75 (18)  94.75 (13)  94.50 (14)  94.25 (18)  92.75 (19)
5   97.00 (08)  96.00 (08)  97.25 (15)  96.50 (09)  96.50 (12)  94.00 (18)
6   99.00 (06)  98.50 (12)  99.00 (09)  98.25 (09)  98.00 (09)  95.75 (08)
8   99.75 (04)  100.0 (06)  99.75 (05)  100.0 (06)  99.75 (08)  99.25 (09)

3.2. Results on the face databases

We first conduct an experiment with the ORL database. As noted above, 40 people with 10 images each, of size 112 × 92, are used. Our preliminary experiments show that the classification performance of the proposed method is affected by the dimension of the subpattern (d). In order to determine the effect of the subpattern dimension on the available data, we check the classification performance by varying both the number of training samples and the dimension of the subpattern. For this, we randomly chose p images from each class to construct the training database, the remaining images being used as test images. To ensure sufficient training, a value of at least 2 is used for p. It can be ascertained from Table 1 that the recognition accuracy of the proposed method is greatly influenced by the size of the subpattern dimension (d): the smaller d is (or the larger the number of subpatterns K), the better the recognition accuracy. Values in parentheses denote the number of projection vectors used to attain the best recognition accuracy. It is also observed that the recognition accuracy is comparatively better when d and K take the values 161 and 64 respectively; hence in all our later experiments on ORL we use these values for d and K.

The next set of experiments on the ORL database is conducted by each time randomly selecting 5 images per person for training, with the remaining 5 per person used for testing. This experiment is independently carried out 40 times, and the averages of the results are tabulated in Table 2.

Experiments on the Yale database are carried out by adopting the leave-one-out strategy, that is, leaving out one image per person each time for testing while all of the remaining images are used for training. Each image is manually cropped and resized to 235 × 195 pixels in our experiment. This experiment is repeated 11 times by leaving out a different image per person every time. Results depicted in Table 2 are the average of the 11 runs. Here the values of d and K are empirically fixed at 235 and 195 respectively, to yield optimal recognition accuracy. In Table 2, μ is defined as follows:

μ = (number of selected eigenvectors / number of all the eigenvectors) × 100

Here all the eigenvectors are sorted in the descending order of their corresponding eigenvalues, and selected eigenvectors are associated with the largest eigenvalues.

It can be determined from Table 2 that on ORL, dcFLD achieves a better performance improvement over FLD and the other methods. Moreover, the experiments on the Yale database significantly exhibit the efficiency of the proposed method under varied facial expressions and lighting configurations. It can be seen from the table that, on the Yale database, dcFLD achieves up to 4-10% performance improvement over PCA, up to 4-9% improvement over the Fisherface method, and up to 1-4% improvement over the A2DLDA method. In contrast to the 2DLDA and (2D)2LDA methods, a significant improvement in accuracy is achieved (up to 13%). Thus we can say that not only does dcFLD remain stable on ORL, but it also exhibits efficiency and high robustness when there are wide variations in both lighting configuration and facial expression (for Yale).

Finally, to further exhibit the essence of this sub-pattern based approach over the conventional whole-pattern based approach, we conduct a simple reconstruction test. Taking one image from the ORL as


Table 2. Accuracy comparison of various approaches.

                       ORL (μ)               Yale (μ)
Method           10.0   12.5   20.0    6.66   10.00  13.33  16.66  20.00
PCA              90.43  92.46  93.61   83.47  86.77  85.95  87.60  87.60
FLD              93.73  94.00  93.93   88.42  88.42  86.77  88.42  86.77
2DLDA            93.80  94.73  92.86   88.42  85.12  85.12  85.12  84.29
A2DLDA           93.67  94.88  93.88   93.38  92.56  92.56  92.56  92.56
(2D)2LDA         93.33  92.71  93.88   90.08  91.73  88.42  85.12  80.16
2DPCA            93.95  93.78  93.00   86.77  87.60  87.60  87.60  87.60
DiaFLD           93.80  94.66  94.00   89.25  89.25  88.42  87.60  87.60
DiaFLD+2DFLD     93.73  92.48  93.00   92.56  92.56  92.56  90.90  89.25
dcFLD            94.63  94.70  94.15   93.38  92.56  95.04  93.38  93.38

Fig. 1. Five reconstructed images by the whole-pattern based approach (top row) and the sub-pattern based approach (bottom row).

an example, we can compute its five reconstructed images for varying numbers of projection vectors t (where t = 10, 20, 30, 40, 50). These images are shown in Fig. 1. It is quite evident that the sub-pattern dividing approach yields higher quality images than the whole-pattern based approach when using a similar number of principal components. Note that our objective is to demonstrate the effectiveness of the subpattern based approach over the whole-pattern based approach; hence, in both approaches, we considered the PCA features for reconstruction. It is also well known that PCA has better image reconstruction accuracy than the FLD method.

3.3. Results on the object database

Inspired by the conviction that successful methods developed for FR1,5 should be extendable to object recognition, as in Refs. 10 and 11 respectively, in this section we verify the applicability of the dcFLD method to objects by considering the COIL-20 database. This database contains gray-level images of size 128 × 128 pixels. In this experiment, we empirically fixed the values of both d and K to 128.

For a comparative analysis of the various approaches, we consider the first 36 views of each object for training and the remaining views for testing; thus the size of both the training and the testing database is 720. Table 3 gives the comparison of nine methods on top recognition accuracy, the corresponding dimension of the feature vectors/matrices, and running time costs. It reveals that the top recognition accuracy of the proposed dcFLD method is higher than that of the other methods. These results certify the pertinence of the dcFLD method to object databases apart from face images. The only demerit of the proposed method is that it consumes more time compared to other contemporary methods such as 2DPCA, 2DLDA, etc. This is due to the fact that the proposed method involves the drudgery of subpattern set formation and then obtaining the corresponding projection vectors and features. But, as present-day computers have ample processor speed, this has no practical influence on the applicability of the proposed method.

4. Conclusions and future work

In this paper, a novel subspace analysis method called dcFLD is proposed for efficient and robust face/object recognition. The proposed method utilizes the separately extracted local information from each sub-pattern set and thereby possesses robustness and better recognition performance. Indeed, the practicality of dcFLD is well evidenced in the experimental results for the Yale face database. In that,


Table 3. Comparison of different methods on the COIL-20 database.

Method           Top Recognition Accuracy (%)   Running Time (s)   Dimension
PCA10            93.89                          157.91             35
FLD              91.25                          177.29             40
2DPCA11          94.30                          30.45              128 × 5
2DLDA            93.05                          33.69              128 × 7
A2DLDA           88.88                          29.94              9 × 128
(2D)2LDA         94.72                          59.69              11 × 11
DiaFLD           93.75                          29.91              128 × 6
DiaFLD+2DFLD     94.10                          61.64              9 × 9
dcFLD            95.77                          129.52             19 × 128

dcFLD improves recognition accuracy by a minimum of 3-4% over the other discrimination methods. Furthermore, the results on the COIL-20 object database also show that the proposed method is feasible and effective. We also believe that this method would be equally effective in scenarios where a person's face is occluded by sunglasses, a scarf, etc., which is an interesting issue for future work.

Nevertheless, there are still some aspects of the dcFLD method that deserve further study. Is the proposed approach the best way of subdividing the full pattern? In addition, dcFLD needs more coefficients for image representation than FLD: is there any way to reduce the number of coefficients and in the meantime keep up the same accuracy? Finally, it is still unclear how to choose the optimal number of subpatterns to obtain the best recognition accuracy. These are some crucial issues which give scope for future work.

References

1. M. Turk and A. Pentland, Journal of Cognitive Neuroscience 3, 71 (1991).
2. P. Belhumeur, J. Hespanha and D. Kriegman, IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 711 (1997).
3. H. Xiong, M. Swamy and M. Ahmed, Pattern Recognition 38, 1121 (2005).
4. M. Li and B. Yuan, Pattern Recognition Letters 26, 527 (2005).
5. J. Yang, D. Zhang, A. Frangi and J. Yang, IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 131 (2004).
6. S. Noushath, G. H. Kumar and P. Shivakumara, Pattern Recognition 39, 1396 (2006).
7. S. Noushath, G. Kumar and P. Shivakumara, Neurocomputing 69, 1711 (2006).
8. A. M. Martinez, IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 748 (2002).
9. X. Tan, S. Chen, Z. Zhou and F. Zhang, IEEE Transactions on Neural Networks 16, 875 (2005).
10. H. Murase and S. Nayar, International Journal of Computer Vision 14, 5 (1995).
11. P. Nagabhushan, D. Guru and B. Shekar, Pattern Recognition 39, 721 (2006).
12. www.uk.research.att.com/facedatabase.html
13. www1.cs.columbia.edu/CAVE/research/softlib/coil-20.html


Ear Biometrics: A New Approach

Anupam Sana and Phalguni Gupta

Indian Institute of Technology Kanpur Kanpur(U.P.), India-208016

E-mail: {sanupam,pg}@iitk.ac.in

Ruma Purkait

Department of Anthropology Saugor University

Saugor-470003, India E-mail:r. [email protected]

Abstract. The paper presents an efficient ear biometrics system for human recognition based on the discrete Haar wavelet transform. In the proposed approach the ear is detected from a raw image using a template matching technique. The Haar wavelet transform is used to decompose the detected image and compute the wavelet coefficient matrices, which are clustered into a feature template. The decision is made by matching one test image with 'n' trained images using a Hamming distance approach. The system has been implemented and tested on two image databases pertaining to 600 individuals from IITK and 350 individuals from Saugor University, India. The accuracy of the system is more than 96%.

Keywords: Haar wavelet; Wavelet Coefficient Matrices; Ear detection; Template matching; Hamming distance.

1. Introduction

Biometrics is the automated method of identifying or verifying the identity of an individual on the basis of physiological and behavioral characteristics. The ear, which is easily detectable in profile images, can be advocated as a recognition tool. The ear has a few advantages over facial recognition technology. It is more consistent compared to the face as far as variability due to expression, orientation of the face and the effect of aging (especially in the cartilaginous part) is concerned. Its location on the side of the head makes detection easier. Data collection is convenient in comparison to more invasive technologies like iris, retina, fingerprint, etc.

The possibility of the ear as a tool for human recognition was first recognized by the French criminologist Alphonse Bertillon.1 Nearly a century later, Alfred Iannarelli2 devised a non-automatic ear recognition system in which more than ten thousand ears were studied and no two ears were found to be exactly alike. Burge and Burger3,4 have proposed an ear biometrics approach based on building a neighborhood graph from Voronoi diagrams of the detected ear edges. Its main disadvantage is the detection of erroneous curve segments, i.e. the system may not be able to differentiate real ear edges from non-edge curves. Choras5 has also used an approach for feature extraction based on contour detection, but that method too suffers from the same disadvantage of erroneous curve detection. Hurley et al.6 have proposed an approach based on the force field transformation to determine the energy lines, wells and channels of the ear. Although it has been experimented with on a small dataset, the results are found to be quite promising. Victor et al.7 have presented an approach based on Principal Component Analysis (PCA) for face and ear recognition; the face-based system gave better performance than the ear. In a similar experiment by Chang et al.,8 no significant difference between the performance of ear and face was found. Moreno et al.9 have analysed the ear using neural classifiers and macro-features extracted by compression networks, achieving a better recognition result (without considering rejection thresholds) using only the compression network. Chen and Bhanu10 have worked on 3D ear recognition using a local surface descriptor for representing the ear in 3D; the system performance is evaluated on a real range image database of 52 subjects.

This paper proposes a novel approach for feature extraction from the ear image. The wavelet is an emerging technique for image analysis, and the Haar wavelet is simple and reliable in this field. So in this paper the discrete Haar wavelet transformation is applied to ear images and wavelet coefficients are extracted. Section 2 discusses the proposed approach.


The experimental results are presented in Section 3. Conclusions are given in the last section.

2. The Proposed Approach

This section discusses a new approach for feature extraction in ear biometrics. Ear images generally have a large background, so the ear first needs to be detected, which is done by template matching; the detected image is then scaled to a constant size. From these ear images, features are extracted as Haar wavelet coefficients from the wavelet-decomposed image, and a feature template is stored for each training image. A testing feature template, extracted in the above-mentioned way, is matched against the large set of training templates, giving the best set of matches based on the individual matching scores. The present approach is implemented in four major steps. Step 1 is image acquisition, where the ear image is captured by a camera in the laboratory environment. Step 2 is image preprocessing, where the RGB image is converted to a gray-scale image, the ear is detected, and the scale is normalized. Step 3 is feature extraction, where the unique features of the ear are extracted and stored as a trained template. Then in Step 4 matching is performed to get the best matches.

2.1. Image Acquisition

Ear images are collected at two centers, the Indian Institute of Technology Kanpur (IITK) and Saugor University. The images are captured in the laboratory environment where moisture and humidity are normal and illumination is constrained. One bright light source is used to illuminate the whole ear, to reduce the shading of ear edges that generally appears in the image. A chin rest is used to reduce the rotational effect on the image. At IITK, a CCD camera (JAI CV-M90 3-CCD RGB color camera) at a distance of 28 cm has been used to take the images of the ear. Using the ITI camera configurator, three images of each ear have been acquired for the six hundred subjects. Another set of images has been taken of three hundred and fifty subjects at the Anthropology department laboratory at Saugor University. These have been captured using a digital camera (Kodak EasyShare CX7330, 3.2 megapixel) at a distance of 24 to 28 cm from the subject. For each of these subjects, three images of the ear are taken. Out of the three images so captured, two have been used for training and one for testing.


Fig. 1. Image database (a) IITK database, (b) Saugor University database.

2.2. Image Preprocessing

The raw image may not be suitable for feature extraction due to its large background, so some preprocessing is required. The important steps involved are: grayscale conversion, ear detection, and scale normalization.

2.2.1. Grayscale Conversion

RGB images are converted into grayscale by using the following formula:

I_g = 0.2989 · I(R) + 0.5870 · I(G) + 0.1140 · I(B)    (1)

where I(R) is the red channel value of the RGB image, I(G) the green channel value, I(B) the blue channel value, and I_g an intensity image with integer values ranging from a minimum of zero to a maximum of 255.

2.2.2. Ear Detection

Ear detection is implemented using a simple template matching technique. First, a set of images is manually cropped to get ear images of different sizes. These images are decomposed to level 2 using the Haar wavelet, and the decomposed images are stored as templates. The input raw image is also decomposed to level 2 using the same technique. Thereafter, each template is retrieved from the



Fig. 3. Haar scaling function φ(x).

trained set and matched with the same-sized overlapping blocks of the decomposed input image. Thus for each trained template the best-matched block in the input image is traced. Among those blocks the best-matched one is chosen and the corresponding region of the original image is extracted (Fig. 2).
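A minimal sketch of this block search is given below; the mean-squared-error match score is an assumption, as the paper does not specify the similarity measure used during template matching.

```python
import numpy as np

def locate_ear(decomposed_img, templates):
    """Slide each level-2 decomposed template over the level-2 decomposed input
    image and return the best-matching block as (top, left, height, width)."""
    H, W = decomposed_img.shape
    best, best_score = None, np.inf
    for tpl in templates:                     # templates of different ear sizes
        h, w = tpl.shape
        for top in range(H - h + 1):
            for left in range(W - w + 1):
                block = decomposed_img[top:top + h, left:left + w]
                score = np.mean((block - tpl) ** 2)   # assumed match score
                if score < best_score:
                    best_score, best = score, (top, left, h, w)
    return best
```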


Fig. 2. (a) Raw ear image, (b) ear traced, (c) detected ear.

2.2.3. Scale Normalization

The cropped ear image may be of varying size, so the feature sets of images may also vary. Hence the images are normalized to a constant size by a resizing technique. If two images are of different sizes, e.g. one of size (x' × y') and the other of size (x'' × y''), then both images are mapped to the constant size:

I(x', y') = I(x, y)    (2)

I(x'', y'') = I(x, y)    (3)

2.3. Wavelet Transform and Feature Extraction

This section discusses the wavelet transform and introduces the method for feature extraction from the wavelet coefficients.11 Every wavelet is defined by one scaling function and one wavelet function. The Haar scaling function φ(x) has value one on the closed interval of time (x) from 0 to 1 and zero for other intervals. φ(x) is formulated as follows:

φ(x) = 1 if 0 ≤ x < 1, and 0 otherwise.    (4)

The Haar wavelet function Ψ(x) is a step function: it has value 1 for time (x) greater than or equal to 0 and less than 1/2, value -1 for time (x) greater than or equal to 1/2 and less than 1, and value 0 for other intervals. Ψ(x) is discontinuous at time (x) equal to 0, 1/2 and 1. Ψ(x) is defined by

Ψ(x) = 1 ∀ x ∈ [0, 1/2);  -1 ∀ x ∈ [1/2, 1);  0 otherwise.    (5)


Fig. 4. The Haar wavelet Ψ(x).

A standard decomposition of an image (a two-dimensional signal) is easily done by first performing a one-dimensional transform on each row, followed by a one-dimensional transform on each column. Thus the input ear image is decomposed (Fig. 5) into approximation (CA), vertical (CV), horizontal (CH) and diagonal (CD) coefficients using the wavelet transformation, and the approximation coefficient (CA) is further decomposed into four coefficients. This sequence of steps is repeated to get the coefficients of the four-level wavelet transform. After analyzing the coefficients at all four levels, it is found that the coefficients of level 1 to level 4 are almost the same. In order to reduce redundancy among the coefficients of different levels, only one of them is chosen. Thus all the diagonal, vertical and horizontal coefficients of the fourth level are chosen, to reduce space complexity and discard the redundant information. The feature vector comprises the coefficients [CD4 CV4 CH4]. This coefficient matrix, which represents the unique ear


l"LevelHaar WTavefet transform

Oil CD1 CHI OT

2"1 Level Haar wsvelet transform

CA2 CD2 CH2

Fig. 5. Wavelet decomposition levels (the first-level Haar wavelet transform produces CA1, CD1, CH1 and CV1; the second level decomposes CA1 further).

pattern is binarised by comparing the sign (negative or positive) of its elements. The binary feature template of the approximation coefficient matrix from the level-two wavelet transform is shown in Fig. 6.
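For illustration, a hand-rolled sketch of the level-4 Haar decomposition and the sign binarisation follows; the averaging normalisation and the quadrant naming convention are our assumptions, and the image dimensions are assumed divisible by 2^4.

```python
import numpy as np

def haar_step(a):
    """One level of the standard 2-D Haar transform: rows first, then columns."""
    def one_d(x):  # pairwise averages (first half) and differences (second half)
        return np.concatenate([(x[..., ::2] + x[..., 1::2]) / 2,
                               (x[..., ::2] - x[..., 1::2]) / 2], axis=-1)
    out = one_d(one_d(a).T).T
    h, w = a.shape
    CA = out[:h // 2, :w // 2]; CH = out[:h // 2, w // 2:]
    CV = out[h // 2:, :w // 2]; CD = out[h // 2:, w // 2:]
    return CA, CH, CV, CD

def ear_template(img, levels=4):
    """Decompose to the given level and binarise [CD4 CV4 CH4] by sign."""
    CA = np.asarray(img, dtype=float)
    for _ in range(levels):                   # repeatedly decompose CA
        CA, CH, CV, CD = haar_step(CA)
    feats = np.concatenate([CD.ravel(), CV.ravel(), CH.ravel()])
    return (feats >= 0).astype(np.uint8)      # 1 for non-negative coefficients
```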


Fig. 6. Wavelet transform of an ear into level two.

2.4. Matching

The testing binary template (S) is matched with a trained template (T) from the database using the Hamming distance. The Hamming distance (HD) between two templates of size n x m is calculated using the equation

$$HD = \frac{1}{n \times m} \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij} \oplus S_{ij} \quad (6)$$

Here templates T and S are XOR-ed element-wise and HD is computed, which is the matching score between the training and testing templates. Therefore, for each trained template there is a matching score, and the best matches can be chosen. The matching scores of all the templates are statistically analyzed in the next section.
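A minimal NumPy sketch of the matching score of Eq. (6), assuming equal-sized binary templates:

```python
import numpy as np

def hamming_distance(T, S):
    """Normalized Hamming distance of Eq. (6): fraction of positions
    where the two binary templates (n x m) disagree."""
    assert T.shape == S.shape
    return np.logical_xor(T, S).mean()
```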

3. Experimental Results

The experiment is conducted on two different databases, from IITK and Saugor University. The IITK image database has images belonging to 600 individuals, with three images acquired per person (600x3), while the database collected at Saugor University consists of 350 individuals (350x3). In Fig. 7 the false acceptance rate (FAR) and false rejection rate (FRR) are plotted at different threshold values (between 0 and 1). From the two curves in this figure it is found that the system gives equal error rates (EER) of 3.4% and 2.4% at thresholds 0.303 and 0.323 for the IITK and Saugor University image databases respectively. In Fig. 8 the accuracy is plotted at different threshold values; at the EER threshold the accuracy is 96.6% and 97.6% respectively, and the system gives maximum accuracies of 97.8% and 98.2% at threshold values 0.29 and 0.28 for the IITK and Saugor University image databases respectively. The Receiver Operating Characteristic (ROC) is shown in Fig. 9. From the two curves in this figure it is observed that the system gives more than 95% genuine acceptance at an FAR lower than 1%.

Fig. 7. FAR and FRR curves for the IITK database and the Saugor University database.

4. Conclusion

The proposed ear biometric approach has certain advantages over those reported earlier. This paper has proposed a new approach to ear biometrics. As two training images are sufficient for database preparation, the time taken to test against a testing image is significantly reduced. The small size of the ear pattern template makes it easy to handle. For these reasons the approach is applicable at both large and small scale. Our approach achieved more than 96% accuracy on the two databases, indicating a fairly reliable system.


Fig. 8. Accuracy curves for the IITK database and the Saugor University database.

Fig. 9. Receiver Operating Characteristic curves for the IITK database and the Saugor University database (genuine acceptance rate (%) versus false acceptance rate (%), logarithmic scale; legend: ROC-IITK, ROC-Saugor).

5. Acknowledgement

The study is supported by the Ministry of Communication and Information Technology and the University Grants Commission, Govt. of India. The authors acknowledge Ms. Hunny Mehrotra for preparing the manuscript and for assistance in conducting the experiments, and Mr. Pradeep Nayak for assistance in conducting the experiments.

References

1. A. Bertillon, La photographie judiciaire, avec un appendice sur la classification et identification anthropometriques, Gauthier-Villars, Paris, 1890.

2. A. Iannarelli, Ear identification, in Proceedings of International Workshop Frontiers in Handwriting Recognition, (Paramont Publishing Company, Fremont, California, 1989).

3. M. Burge and W. Burger, Personal identification based on ear, in Proceedings of the 21st Workshop of the Austrian Association for Pattern Recognition, 1997.

4. M. Burge and W. Burger, Ear biometrics in computer vision, in Proceedings of the 15th International Conference on Pattern Recognition, 2000.

5. M. Choras, Ear biometrics based on geometrical feature extraction, in Electronic Letters on Computer Vision and Image Analysis, 2005.

6. D. J. Hurley, M. S. Nixon and J. N. Carter, A new force field transform for ear and face recognition, in Proceedings of the IEEE International Conference of Image Processing, 2000.

7. B. Victor, K. Bowyer and S. Sarkar, An evaluation of face and ear biometrics, in Proc. Intl Conf. Pattern Recognition, 2002.

8. K. Chang, S. S. K. W. Bowyer and B. Victor, Comparison and combination of ear and face images in appearance-based biometrics, in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2003.

9. B. Moreno, A. Sanchez and J. Velez, On the use of outer ear images for personal identification in security applications, in Proceedings of the IEEE 33rd Annual International Carnahan Conference on Security Technology, 1999.

10. H. Chen and B. Bhanu, Contour matching for 3D ear recognition, in Seventh IEEE Workshop on Applications of Computer Vision (WACV/MOTION'05), 2005.

11. S. Pittner and S. Kamarthi, Feature extraction from wavelet coefficients for pattern recognition tasks, in IEEE Transactions on Pattern Analysis and Machine Intelligence, 1999.


Face Detection using Skin Segmentation as Pre-Filter

Shobana L., Anil Kr. Yekkala and Sameen Eajaz

Philips Electronics India Ltd. E-mail: {shobana.lakshminarasimhan, anil.yekkala, sameen.eajaz}

Face Detection has been a topic of research for several years due to its vast range of applications, varying from security surveillance to photo organization. Face detection algorithms are also used by face recognition algorithms for locating faces in an image before performing the recognition. In the available literature on Face Detection, the algorithm provided by Viola and Jones is very robust in terms of detection rate, but its speed is not very satisfactory. In this paper, we propose a pre-filtering step using skin segmentation and some minor modifications to the algorithm provided by Viola and Jones. The proposed pre-filtering step and modifications improve the speed of the algorithm by approximately 2.8 times. They also improve the precision of the algorithm by reducing the number of false detections.

Keywords: Face Detection; Viola Jones; Skin Segmentation; Pre-Filter

1. Introduction

Several methods have been proposed in the literature for performing Face Detection. The available methods can be broadly categorized under two approaches. The first approach1,4,9,12 is based on conducting an exhaustive search of the entire image using windows of varying sizes and finding the face regions using features like Haar cascades, eigen features and others. The second approach2,3 is based on skin segmentation using color properties and simple texture measures, then finding connected regions and finally filtering out the non-face regions using face profiles and facial features. Even though the first approach, based on exhaustive search, provides a very robust algorithm in terms of detection, its speed is not very satisfactory; for example, the Viola Jones approach takes almost 2-3 seconds to detect a face in an image of size 1024x768 on a Pentium processor. The second approach, based on skin segmentation, gives very satisfactory results in terms of speed but fails to provide a robust algorithm, since it fails for images with complex backgrounds; its false detection rate is also quite high, since skin-colored objects are often detected as faces. In this paper, we present an improvement to the Viola Jones based Face Detection algorithm that improves its speed and its precision by reducing the number of false detections, without affecting the recall rate. These improvements are achieved by using skin segmentation as a Pre-Filter and with some modifications to the approach followed by Viola Jones.

The following sub-sections provide a brief overview of the skin segmentation techniques in the available literature on Face Detection, and also an overview of the Viola Jones Face Detection algorithm.

1.1. Skin Segmentation based Face Detection

Based on several studies6,13,14 conducted in the past, it has been found that in color images the skin regions can be detected based on their color properties. It has been observed that in the case of RGB color models, the proportions of red, green and blue lie within some prescribed limits for skin regions. Similarly, in the case of the YCbCr model, the chrominance components Cb and Cr lie within prescribed limits for skin regions. Simple texture measures are used to improve the skin segmentation performance. Skin segmentation based Face Detection is done in three steps: (1) skin segmentation based on the RGB color properties of the skin; (2) finding the connected regions; (3) filtering out the non-face regions. The advantages of skin segmentation based Face Detection are its simplicity and the fact that it is not affected by geometric variations of the face. Its main disadvantage is its sensitivity to illumination conditions.

1.2. Viola Jones Method

The Viola Jones1 approach is based on conducting an exhaustive search of the image using varying size windows, starting from a base window of size 24x24, gradually increasing the window size by a fixed factor of 1.25 (or any value ranging from 1.2 to 1.3).


Fig. 2. Integral sum (rectangles A, B, C and D, with corner reference points 1, 2, 3 and 4).

Each window region is classified as a face or non-face region based on simple features. The features used are reminiscent of the Haar basis functions proposed in Ref. 11. Three kinds of features are used, namely two-rectangular, three-rectangular and four-rectangular features. In two-rectangular features, the difference between the sums of pixel intensities of the two rectangular regions is considered. In three-rectangular features, the sum of the pixel intensities within the outside rectangles is subtracted from that of the center rectangle. Finally, in four-rectangular features, the difference between the sums of the pixel intensities of the diagonal pairs of rectangles is considered. Figure 1 shows the set of rectangular features used.

Fig. 1. Rectangular features.

In order to make the computations of rectangular features faster, an intermediate representation of the image called integral image is used. The integral image at any point (x,y) is the sum of the pixels above and to the left of (x,y).

$$ii(x, y) = \sum_{j=0}^{x} \sum_{k=0}^{y} I(j, k)$$

Using integral image, the sum of the pixel intensities within any rectangular region in the image can

be computed with just three sum operations. For example, in Figure 2 the sum of the pixel intensities of rectangle D can be computed with four array references. The value of the integral image at location 1 is the sum of the pixel intensities in rectangle A. The integral image value at location 2 is A + B, at location 3 is A + C, and at location 4 is A + B + C + D. The intensity sum within D can therefore be computed as 4 + 1 - (2 + 3).
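The integral-image trick can be sketched as follows in NumPy; the one-pixel zero padding and integer pixel intensities are implementation assumptions.

```python
import numpy as np

def integral_image(I):
    """ii[r, c] = sum of I[:r, :c]; padded with a zero row/column so the
    four-reference rectangle sum below needs no boundary checks."""
    ii = np.zeros((I.shape[0] + 1, I.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = I.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, top, left, h, w):
    """Sum over I[top:top+h, left:left+w] via four array references,
    i.e. the 4 + 1 - (2 + 3) computation described in the text."""
    return (ii[top + h, left + w] + ii[top, left]
            - ii[top, left + w] - ii[top + h, left])
```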

Viola Jones Face Detector uses a boosting cascade of rectangular features. The cascade contains several stages. The stages are trained in such a way that they progressively eliminate the non-face candidates. The initial stages in the cascade are simple so that they can be evaluated fast and can reject maximum number of non-face candidates. The later stages are comparatively complex.

Each stage consists of a set of features. Each input window is evaluated on the first stage of the cascade, by computing the set of features belonging to the first stage. If the total feature sum of the stage is below a pre-defined threshold, then the window is discarded, otherwise it passes to the next stage. The same procedure is repeated at each stage. Since most of the windows in an image do not contain facial features, most of the windows get discarded in the initial stages, hence reducing the number of computations.

As the Viola Jones face detector is insensitive to small variations in scale and position of the windows, a large number of detections occur around a face region. To combine these regions into a single region, a grouping algorithm is applied. The grouping algorithm groups all the overlapping rectangles that lie within an offset of 25% of their size into a single rectangle. A region is detected as a face only when a minimum of three face rectangles are obtained around that face region. This confidence check is used to reduce the number of false detections. For a non-face region, the probability of getting three or more overlapped rectangles is relatively low.
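The early-rejection logic of the cascade can be sketched as below; `stages`, a list of (feature functions, threshold) pairs, is a hypothetical stand-in for a trained cascade, not the library or the authors' data structure.

```python
# Sketch of cascade evaluation: cheap early stages reject most windows.
def window_passes_cascade(window, stages):
    for features, threshold in stages:
        stage_sum = sum(f(window) for f in features)
        if stage_sum < threshold:
            return False        # rejected: no further stages evaluated
    return True                 # survived all stages: face candidate
```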

2. Proposed Approach for Face Detection

We present an improvement to the Viola Jones algorithm by using skin segmentation as a pre-filter and modifying the search strategies and merge and group methods. With the proposed modification, the algorithm is performed in three steps. Figure 3 gives the overview of the proposed algorithm.


$$ii_{ss}(x, y) = \sum_{j=0}^{x} \sum_{k=0}^{y} I_{ss}(j, k)$$

Fig. 3. Overview of the proposed algorithm (input image → finding skin regions → feature-based filtering).

2.1. Skin Segmentation

In the proposed method, skin segmentation is used as the pre-filtering step to reduce the number of windows to be searched. Skin segmentation is performed by detecting the skin pixels using color properties in the RGB color space. For each pixel, the algorithm checks whether the proportions of red and green are within the prescribed limits. The method can be described by the following equation.

if ((r > 0.34) and (r < 0.58) and (g > 0.22) and (g < 0.38))
    Iss = 1
else
    Iss = 0

where r = R/(R+G+B) and g = G/(R+G+B).

A pixel with skin segmentation output Iss = 1 is a skin pixel, and one with Iss = 0 is a non-skin pixel. The skin segmentation is designed in such a way that rejecting a skin region is considered costlier than accepting a non-skin region. Once the skin segmentation is performed, the output binary image Iss is represented as an integral image, similar to the intensity sum. This representation allows the number of skin pixels in any rectangular region to be computed with just three sum operations. Since skin segmentation is based on color properties, this pre-filtering step can be used only for color images.
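A minimal sketch of the pre-filter, assuming the thresholds reconstructed in the equation above; the resulting binary mask is integrated exactly like the intensity image.

```python
import numpy as np

def skin_mask(img_rgb):
    """Binary skin map Iss from normalized r and g (the threshold values
    follow the reconstructed equation above and are an assumption)."""
    rgb = img_rgb.astype(np.float64)
    s = rgb.sum(axis=2) + 1e-9                 # avoid division by zero
    r, g = rgb[..., 0] / s, rgb[..., 1] / s
    return ((r > 0.34) & (r < 0.58) & (g > 0.22) & (g < 0.38)).astype(np.uint8)

def skin_integral(Iss):
    """Integral image of the binary mask for fast skin-pixel counting."""
    ii = np.zeros((Iss.shape[0] + 1, Iss.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = Iss.cumsum(axis=0).cumsum(axis=1)
    return ii
```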

2.2. Searching For Face Regions

Unlike the Viola Jones algorithm, the proposed approach does not perform an exhaustive search to locate the face regions. Instead, it finds the probable face regions based on the number of skin pixels present in the region. If the number of skin pixels present in a region is above a pre-defined threshold, the region is considered to be a candidate face region. Based on experimental results, this threshold was found to be 60%. The Haar features are computed for each candidate face region using the same approach as Viola Jones detection to check whether the region is a face region or not. Unlike the Viola Jones approach, the Face Detection starts with the largest possible window size depending on the image size and gradually scales down to a window size of 24x24. The advantage of starting from the largest window size is that if a bigger window is detected as a face region, the smaller windows within that face region that are lying at an offset of more than 25% of the bigger window size can be discarded. Note that the regions that are lying within an offset of 25% are not discarded because these windows are required for the Merge and Group algorithm.
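Candidate screening then costs only four array references per window, as sketched below with the 60% threshold found experimentally; this reuses the `skin_integral` output from the previous sketch.

```python
def is_candidate(ii_skin, top, left, size, min_fraction=0.60):
    """A window is searched further only if at least min_fraction of its
    pixels are skin, counted from the skin integral image."""
    skin = (ii_skin[top + size, left + size] + ii_skin[top, left]
            - ii_skin[top, left + size] - ii_skin[top + size, left])
    return skin >= min_fraction * size * size
```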

2.3. Merge and Group

The merge and group technique is similar to that followed in the Viola Jones Face Detection. Once the merging is performed according to the Viola Jones method, the algorithm checks for overlapping face regions with a percentage of overlap greater than 50%. If such overlapping regions exist, the region with the lower average feature sum is eliminated. This results in improved precision.

3. Results

A comparative study on the performance in terms of speed of the proposed method with respect to standard Viola Jones method is shown in Table 1.

The study was conducted on a set of 452 images obtained from the Caltech image database. The comparison is based on the percentage of windows processed in each stage by the proposed method and by the standard Viola Jones method.


Table 1. Percentage of regions processed in each stage.

Stage   No. of    % of windows searched   % of windows searched
        features  (Viola Jones method)    (Proposed method)
1       9         100                     36.738
2       16        68.15636                25.92015
3       27        37.44727                14.18722
4       32        23.80884                8.981527
5       52        12.71177                4.884061
6       53        5.70148                 2.211218
7       62        2.563128                1.001229
8       72        0.885991                0.341391
9       83        0.49522                 0.19646
10      91        0.300651                0.122056
11      99        0.200682                0.083506
12      115       0.117771                0.049768
13      127       0.083492                0.037018
14      135       0.05391                 0.024601
15      136       0.036424                0.017609
16      137       0.026771                0.013751
17      155       0.021732                0.011666
18      159       0.015506                0.00891
19      169       0.012858                0.007694
20      181       0.010599                0.00662
21      196       0.008351                0.005504
22      197       0.007104                0.004849
23      199       0.006146                0.004308
24      200       0.005452                0.003919
25      211       0.004961                0.003656

The comparison clearly shows that in each stage the number of features computed using the proposed method is approximately 2.8 times smaller than the number computed using the standard Viola Jones method; hence there is an improvement in speed by a factor of 2.8. It can be further verified that the average number of features computed using the standard Viola Jones method is 2.8 times the average number computed using the proposed method, so it can be concluded that the proposed method improves speed by a factor of 2.8. This has been further verified by the performance analysis.

Table 2 shows that the proposed method has not affected the recall rate much, but has improved the precision rate.

Table 2. Detection rate.

Method              Recall  Precision
Viola Jones method  0.97    0.86
Proposed method     0.96    0.95

The Precision and Recall are computed as follows:

Recall = (No. of faces detected x 100) / (No. of faces detected + No. of faces missed)

Precision = (No. of correct detections x 100) / (No. of correct detections + No. of wrong detections)

The results of detection can be seen in Figure 4.

4. Conclusions

In this paper we have proposed modifications to the Viola Jones approach to Face Detection. The modifications include the addition of a pre-filtering step using skin segmentation, and a modification of the search algorithm that starts the search from the largest window and excludes smaller windows from the search if they lie within a detected face region. In addition, a modification to the merging algorithm has been suggested. With the proposed modifications, the speed of the algorithm improves by a factor of 2.8 and the precision rate improves by 10%, without affecting the recall rate much. Hence it can be concluded that using skin segmentation as a pre-filter for detecting faces in color images, and starting the search from a larger window size, are useful extensions to the Viola Jones method. The proposed approach of using skin segmentation as a pre-filter can easily be extended to any face detection algorithm that uses sliding-window techniques, such as face detection based on eigen features.

References

1. P. Viola and M. J. Jones, "Robust Real-Time Face Detection", International Journal of Computer Vision, 57(2), 137-154 (2004).


Fig. 4. Face detection results.

2. Pedro Fonseca, Jan Nesvadba, "Face Detection in the Compressed domain", International Conference on Image Processing , (2004).

3. R. L. Hsu, M. Abdel-Mottaleb and A. K. Jain, "Face Detection in Color Images", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 696-706, May 2002.

4. Diedrik Marius, Sumita Pennathur and Klint Rose, "Face Detection using color thresholding and Eigen-image template matching".

5. Raphael Feraud, Olivier J. Bernier, Jean-Emmanuel Viallet and Michel Collobert, "A Fast and Accurate Face Detector Based on Neural Networks", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, No. 1 (Jan 2001).

6. Jean-Christophe Terrillon, Mahdad N. Shirazi, Hideo Fukamachi and Shigeru Akamatsu, "Comparative Performance of Different Skin Chrominance Models and Chrominance Spaces for the Automatic Detection of Human Faces in Color Images".

7. Sung K and Poggio T, "Example-based learning for view-based face detection", IEEE Pattern Analysis Machine Intelligence, (1998).

8. Rowley H, Baluja S and Kanade T, "Neural network-based face detection", IEEE Pattern Analysis Machine Intelligence, (1998).

9. Schneiderman H and Kanade T, "A statistical method to 3D object detection applied to faces and cars", International Conference on Computer Vision, (2000).

10. Roth D, Yang M and Ahuja N, "A SNoW-based face detector", Advances in Neural Information Processing Systems, (2000).

11. Papageorgiou C, Oren M and Poggio T, "A general framework for object detection", International Conference on Computer Vision. , (1998).

12. M. Turk, A. Pentland, "Eigenfaces for Recognition", Journal of Cognitive Neuroscience, Vol. 3, No. 1, 71-86 (1991).

13. Son Lam Phung, Abdesselam Bouzerdoum and Douglas Chai, "Skin Segmentation using Color Pixel Classification: Analysis and Comparison", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 1, (Jan 2005).

14. Filipe Tomaz, Tiago Candeias and Hamid Shahbazkia, "Improved Automatic Skin Detection in Color Images", Proc. VIIth Digital Image Computing: Techniques and Applications, 10-12 (Dec 2003).


Face Recognition using Symbolic KDA in the Framework of Symbolic Data Analysis

P. S. Hiremath

Department of Computer Science Gulbarga University,

Gulbarga-585106, Karnataka, India E-mail: [email protected]

C. J. Prabhakar

Department of Computer Science Kuvempu University

Shankarghatta-577451, Karnataka, India E-mail: [email protected]

In this paper we present a symbolic factor analysis method, called symbolic kernel discriminant analysis (symbolic KDA), for face recognition in the framework of symbolic data analysis. Classical factor analysis methods (specifically classical KDA) extract features that are single valued in nature to represent face images. These single valued variables may not be able to capture the variation of each feature across all the images of the same subject, which leads to loss of information. The symbolic KDA algorithm extracts the most discriminating non-linear interval type features, which optimally discriminate among the classes represented in the training set. The proposed method has been successfully tested for face recognition using two databases, the ORL and Yale face databases. The effectiveness of the proposed method is shown in terms of comparative performance against popular classical factor analysis methods such as the eigenface and fisherface methods. Experimental results show that symbolic KDA outperforms the classical factor analysis methods.

Keywords: Symbolic Data Analysis; Face Recognition; Interval Type Features; Symbolic Factor Analysis Methods.

1. Introduction

Of the appearance based face recognition methods,4,5,12,16,26,29 those utilizing LDA techniques10,11,28,35 have shown promising results. However, statistical learning methods, including the LDA based ones, often suffer from the so-called small sample size (SSS) problem encountered in high dimensional pattern recognition tasks, where the number of training samples available for each subject is smaller than the dimensionality of the sample space. Therefore numerous modified versions of LDA were proposed, and these modified versions have shown promising results.3,34,6,30,20,23,33 There are two ways to address the problem. One option is to apply linear algebra techniques to solve the numerical problem of inverting the singular within class scatter matrix. For example, Tian et al. utilize the pseudo inverse to complete this task. Also, some researchers15,34 recommended the addition of a small perturbation to the within class scatter matrix so that it becomes non-singular. However, the above methods are typically computationally expensive since the scatter matrices are very large. The second option is a subspace approach, such as the one followed in the development of the Fisherfaces method,3 where PCA is first used as a preprocessing step to remove the null space of the within class scatter matrix, and LDA is then performed in the lower dimensional PCA subspace. However, it has been shown that the discarded null spaces may contain significant discriminatory information;19 to prevent this from happening, solutions without a separate PCA step, called direct LDA methods, have been proposed recently.6,30,23

Although successful in many cases, linear methods fail to deliver good performance when face patterns are subject to large variations in viewpoint, which result in a highly non-convex and complex distribution. The limited success of these methods should be attributed to their linear nature. As a result, it is reasonable to assume that a better solution to this non-linear problem could be achieved using non-linear methods, such as the so-called kernel machine techniques.17,25 Among them, kernel principal component analysis (KPCA)27 and kernel Fisher discriminant analysis (KFD)24 have aroused considerable interest in the fields of pattern recognition and machine learning. KPCA was originally developed by Scholkopf et al. in 1998, while KFD was first


proposed by Mika et al. in 1999.24 Subsequent research saw the development of a series of KFD algorithms.2,24,31,32,35

The defining characteristic of KFD based algorithms is that they directly use the pixel intensity values in a face image as the features on which to base the recognition decision. The pixel intensities that are used as features are represented using single valued variables. However, in many situations the same face is captured in different orientations, lighting, expressions and backgrounds, which leads to image variations, and the pixel intensities change because of these variations. The use of single valued variables may not be able to capture the variation of feature values across the images of the same subject. In such a case, we need to consider symbolic data analysis (SDA),1,7,8,9,18 in which interval-valued data are analyzed.

In this paper, a new appearance based method is proposed in the framework of Symbolic Data Analysis (SDA),1,9 namely symbolic KDA for face recognition, which is a generalization of classical KDA to symbolic objects. In the first step, we represent the face images as symbolic objects (symbolic faces) of interval type variables. The representation of face images as symbolic faces accounts for the image variations of human faces under different lighting conditions, orientations and facial expressions; it also drastically reduces the dimension of the image space without losing a significant amount of information. Each symbolic face summarizes the variation of feature values through the different images of the same subject. In the second step, we apply the symbolic KDA algorithm to extract interval type non-linear discriminating features. In the first phase of this algorithm, a kernel function is applied to the symbolic faces, so that a pattern in the original input space is mapped into a potentially much higher dimensional feature vector in the feature space, where the subspace dimension is chosen carefully. In the second phase, symbolic KDA is applied to obtain interval type non-linear discriminating features, which are robust to variations due to illumination, orientation and facial expression. Finally, a minimum distance classifier with a symbolic dissimilarity measure1 is employed for classification. The proposed method has been successfully tested using two standard databases, the ORL and Yale face databases.

The remainder of this paper is organized as follows. In Section 2, the idea of constructing the symbolic faces is given. Symbolic KDA is developed in Section 3. In Section 4, experiments are performed on the ORL and Yale face databases, whereby the proposed algorithm is evaluated and compared to other methods. Finally, a conclusion and discussion are offered in Section 5.

2. Construction of Symbolic Faces

Consider the face images Γ1, Γ2, ..., Γn, each of size N × M, from a face image database. Let Ω = {Γ1, Γ2, ..., Γn} be the collection of n face images of the database, which are first order objects. Each object Γl ∈ Ω, l = 1, ..., n, is described by a feature vector (Y1, ..., Yp) of length p = NM, where each component Yj, j = 1, ..., p, is a single valued variable representing the intensity values of the face image Γl.

An image set is a collection of face images of m different subjects; each subject has the same number of images but with different orientations, expressions and illuminations. There are m second order objects (face classes), denoted E = {c_1, ..., c_m}, each consisting of the individual images Γl ∈ Ω of one subject. We assume that the images belonging to a face class are arranged from right side view to left side view. The view range of each face class is partitioned into q sub face classes, and each sub face class contains r images. The feature vector of the kth sub face class $c_i^k$ of the ith face class $c_i$, where k = 1, 2, ..., q, is described by a vector of p interval variables $Y_1, \ldots, Y_p$, of length p = NM. The interval variable $Y_j$ of the kth sub face class $c_i^k$ of the ith face class is described as

$$Y_j(c_i^k) = [\underline{x}_{ij}^k, \overline{x}_{ij}^k] \quad (1)$$

where $\underline{x}_{ij}^k$ and $\overline{x}_{ij}^k$ are the minimum and maximum intensity values, respectively, among the jth pixels of all the images of sub face class $c_i^k$. This interval incorporates information on the variability of the jth feature inside the kth sub face class $c_i^k$. We denote

$$X_i^k = (Y_1(c_i^k), \ldots, Y_p(c_i^k)), \quad i = 1, \ldots, m, \; k = 1, \ldots, q. \quad (2)$$

The vector $X_i^k$ of interval variables is recorded for each kth sub face class $c_i^k$ of the ith face class. This vector is called a symbolic face and is represented as

$$X(c_i^k) = (a_1^k, \ldots, a_p^k). \quad (3)$$


where $a_j^k = Y_j(c_i^k) = [\underline{x}_{ij}^k, \overline{x}_{ij}^k]$, j = 1, ..., p, k = 1, ..., q and i = 1, ..., m. We represent the qm symbolic faces by a matrix X of size p × qm, consisting of column vectors $X_i^k$, i = 1, ..., m and k = 1, ..., q.
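A minimal NumPy sketch of Eqs. (1)-(3): every pixel of a sub face class becomes the interval of its minimum and maximum intensities over the r images of that sub class; the (r, N, M) array layout is an assumption.

```python
import numpy as np

def symbolic_face(images):
    """images: (r, N, M) array of one sub face class; returns the p x 2
    symbolic face, one [lower, upper] interval per pixel (p = N*M)."""
    flat = images.reshape(images.shape[0], -1)
    return np.stack([flat.min(axis=0), flat.max(axis=0)], axis=1)
```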

3. Acquiring Non-Linear Subspace Using the Symbolic KDA Method

Let us consider the matrix X containing the qm symbolic faces pertaining to the given set Ω of images belonging to m face classes. The centers $x_{ij}^{kc} \in \mathbb{R}$ of the intervals $a_j^k = Y_j(c_i^k) = [\underline{x}_{ij}^k, \overline{x}_{ij}^k]$ are given by

$$x_{ij}^{kc} = \frac{\underline{x}_{ij}^k + \overline{x}_{ij}^k}{2} \quad (4)$$

where j = 1, ..., p, k = 1, ..., q and i = 1, ..., m. The p × qm data matrix $X^c$ contains the centers $x_{ij}^{kc} \in \mathbb{R}$ of the intervals for the qm symbolic faces. The p-dimensional vectors $X_i^{kc} = (x_{i1}^{kc}, \ldots, x_{ip}^{kc})$, $\underline{X}_i^k = (\underline{x}_{i1}^k, \ldots, \underline{x}_{ip}^k)$ and $\overline{X}_i^k = (\overline{x}_{i1}^k, \ldots, \overline{x}_{ip}^k)$ represent the centers, lower bounds and upper bounds of the qm symbolic faces, respectively. Let $\Phi : \mathbb{R}^p \to F$ be a nonlinear mapping between the input space and the feature space; the nonlinear mapping Φ usually defines a kernel function. Let $K \in \mathbb{R}^{qm \times qm}$ define a kernel matrix by means of the dot product in the feature space:

$$K_{ij} = (\Phi(X_i) \cdot \Phi(X_j)) \quad (5)$$

In general, the Fisher criterion function in the feature space F can be defined as

$$J(V) = \frac{V^T S_b^{\Phi} V}{V^T S_w^{\Phi} V} \quad (6)$$

where V is a discriminant vector, and $S_b^{\Phi}$ and $S_w^{\Phi}$ are the between class scatter matrix and the within class scatter matrix in the feature space F, defined below:

$$S_b^{\Phi} = \frac{1}{m} \sum_{i=1}^{m} (m_i^{\Phi} - m^{\Phi})(m_i^{\Phi} - m^{\Phi})^T \quad (7)$$

$$S_w^{\Phi} = \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{q_i} (\Phi(X_i^{kc}) - m_i^{\Phi})(\Phi(X_i^{kc}) - m_i^{\Phi})^T \quad (8)$$

where $X_i^{kc}$ denotes the kth symbolic face of the ith face class, $q_i$ is the number of training symbolic faces in face class i, $m_i^{\Phi}$ is the mean of the mapped symbolic faces in face class i, and $m^{\Phi}$ is the mean across all mapped qm symbolic faces. From the above definitions, we have $S_t^{\Phi} = S_b^{\Phi} + S_w^{\Phi}$. The discriminant vectors with respect to the Fisher criterion are actually the eigenvectors of the generalized equation $S_b^{\Phi} V = \lambda S_t^{\Phi} V$. According to the theory of reproducing kernels, V is an expansion of all symbolic faces in the feature space, that is, there exist coefficients $b_L$ ($L = 1, \ldots, qm$) such that

$$V = \sum_{L=1}^{qm} b_L \Phi(X_L) = HA \quad (9)$$

where $H = [\Phi(X_1^{1c}), \ldots, \Phi(X_1^{qc}), \ldots, \Phi(X_m^{1c}), \ldots, \Phi(X_m^{qc})]$ and $A = (b_1, \ldots, b_{qm})^T$. Substituting equation (9) into equation (6), we obtain

$$J(A) = \frac{A^T KWK A}{A^T KK A} \quad (10)$$

where K is the kernel matrix, $W = \mathrm{diag}(W_1, \ldots, W_m)$, and $W_i$ is a $q_i \times q_i$ matrix whose elements are $\frac{1}{q_i}$. From the definition of W, it is easy to verify that W is a $qm \times qm$ block matrix. In fact, it is often necessary to find s discriminant vectors, denoted by $a_1, \ldots, a_s$, to extract features. Let $V = [a_1, \ldots, a_s]$. The matrix V should satisfy the following condition:

$$V = \arg\max \frac{|V^T S_b V|}{|V^T S_t V|} \quad (11)$$

where $S_b = KWK$ and $S_t = KK$. Since each symbolic face $X_i^k$ is located between the lower bound symbolic face $\underline{X}_i^k$ and the upper bound symbolic face $\overline{X}_i^k$, it is possible to find the most discriminating non-linear interval type features $[\underline{B}_l^k, \overline{B}_l^k]$. The lower bound features of each symbolic face $X_i^k$ are given by

$$\underline{B}_l^k = V_l^T \Phi(\underline{X}_i^k), \quad l = 1, \ldots, s. \quad (12)$$

where $\Phi(\underline{X}_i^k) = [\Phi(X_1^{1c}) \cdot \Phi(\underline{X}_i^k), \ldots, \Phi(X_m^{qc}) \cdot \Phi(\underline{X}_i^k)]$. Similarly, the upper bound features of each symbolic face $X_i^k$ are given by

$$\overline{B}_l^k = V_l^T \Phi(\overline{X}_i^k), \quad l = 1, \ldots, s. \quad (13)$$
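The training computation of Eqs. (10)-(11) can be sketched as follows, assuming NumPy/SciPy, the interval centers as input, and a degree-three polynomial kernel as used in the experiments; the ridge regularization and the kernel's additive constant are implementation choices, not part of the authors' derivation.

```python
import numpy as np
from scipy.linalg import eigh

def train_symbolic_kda(Xc, class_sizes, s):
    """Xc: (qm, p) matrix of interval centers; class_sizes: [q_1..q_m].
    Returns the top-s coefficient vectors A solving KWK a = lambda KK a."""
    K = (Xc @ Xc.T + 1.0) ** 3                 # polynomial kernel, degree 3
    W = np.zeros_like(K)
    start = 0
    for q in class_sizes:                      # block-diagonal W, 1/q_i blocks
        W[start:start + q, start:start + q] = 1.0 / q
        start += q
    Sb, St = K @ W @ K, K @ K
    # a small ridge keeps St positive definite for the generalized solver
    vals, vecs = eigh(Sb, St + 1e-6 * np.eye(len(K)))
    return vecs[:, ::-1][:, :s]                # eigenvectors, largest first
```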

Let $C_{test} = [\Gamma_1, \ldots, \Gamma_t]$ be the test face class, containing face images of the same subject with different expressions, lighting conditions and orientations. The test


symbolic face $X_{test}$ is constructed for the test face class $C_{test}$ as explained in Section 2. The lower bound test symbolic face of $X_{test}$ is described as $\underline{X}_{test} = (\underline{x}_1^{test}, \ldots, \underline{x}_p^{test})$; similarly, the upper bound test symbolic face is described as $\overline{X}_{test} = (\overline{x}_1^{test}, \ldots, \overline{x}_p^{test})$.

The interval type features $[\underline{B}^{test}, \overline{B}^{test}]$ of the test symbolic face $X_{test}$ are computed as:

$$\underline{B}_l^{test} = V_l^T \Phi(\underline{X}_{test}). \quad (14)$$

$$\overline{B}_l^{test} = V_l^T \Phi(\overline{X}_{test}). \quad (15)$$

where $l = 1, \ldots, s$.

4. Experimental Results

The face recognition system using the symbolic KDA method identifies a face by computing the nearest face image for a given unknown face image; a minimum distance classifier with the Minkowski symbolic dissimilarity measure proposed by De Carvalho and Diday [1] is employed for classification. The proposed symbolic KDA method with polynomial kernels is experimented with the face images of the ORL and Yale databases. The effectiveness of the proposed method is shown in terms of comparative performance against five popular face recognition methods; in particular, we compare our algorithm with eigenfaces,29 fisherfaces,3 symbolic PCA,13 symbolic KPCA14 and symbolic LDA. The experimentation is done on a system with a 2.5 GHz Pentium CPU.
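The classification step can be sketched as below. The paper uses the Minkowski symbolic dissimilarity of De Carvalho and Diday [1]; a plain order-1 Minkowski (L1) distance on the interval bounds is substituted here as an assumption, not the authors' exact measure.

```python
import numpy as np

def interval_distance(B_test, B_train):
    """B_*: (s, 2) arrays of [lower, upper] interval features; an L1
    stand-in for the symbolic dissimilarity measure (assumption)."""
    return np.abs(B_test - B_train).sum()

def classify(B_test, gallery):
    """gallery: dict mapping a subject id to its interval feature array."""
    return min(gallery, key=lambda sid: interval_distance(B_test, gallery[sid]))
```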

4.1. Experiments using ORL database

We assess the feasibility and performance of the proposed symbolic KDA on the face recognition task using the ORL database. The ORL face database is composed of 400 images, with ten different images for each of the 40 distinct subjects. All 400 images from the ORL database are used to evaluate the face recognition performance of the proposed method. We have manually arranged the face images of each subject from right side view to left side view. Six images are randomly chosen from the ten available for each subject for training, while the remaining images are used to construct the test symbolic face for each trial. Table 1 presents the experimental results for each method on the ORL database. The experimental results show that the proposed method with a polynomial kernel of degree three outperforms the classical factor analysis methods.

Table 1. Comparison of classification performance using the ORL database.

Method         Training time (sec)  Feature dimension  Recognition rate (%)
Fisherfaces    98                   86                 92.8
Eigenfaces     102                  189                87.65
Symbolic PCA   38                   71                 94.85
Symbolic KPCA  87                   109                89.15
Symbolic LDA   85                   34                 96.00
Symbolic KDA   19                   28                 98.50

4.2. Experiments on the Yale Face database

The experiments are conducted using the Yale database to evaluate the performance of symbolic KDA on the face recognition problem. The Yale Face database consists of a total of 165 images obtained from 15 different people, with 11 images from each person. In our experiments, 9 images are randomly chosen from each class for training, while the remaining two images are used to construct the test symbolic face for each trial. The recognition rates, training times and optimal subspace dimensions are listed in Table 2. From Table 2, we note that the symbolic KDA method with a polynomial kernel of degree three, using a smaller number of features, outperforms the classical factor analysis methods with larger numbers of features.

Table 2. Comparison of classification performance using the Yale Face database.

Method         Training time (sec)  Feature dimension  Recognition rate (%)
Fisherfaces    59                   23                 89.95
Eigenfaces     85                   110                82.04
Symbolic PCA   35                   41                 91.15
Symbolic KPCA  43                   32                 92.00
Symbolic LDA   98                   56                 94.55
Symbolic KDA   12                   15                 98.15

5. Conclusion

In this paper, we introduce a novel symbolic KDA method for face recognition. Symbolic data representation of face images as symbolic faces, using interval variables, yields facial features able to cope with the variations due to illumination, orientation and facial expression changes. The feasibility of


the symbolic KDA has been tested successfully on frontal face images of the ORL and Yale databases. Experimental results show that the symbolic KDA method with a polynomial kernel of degree three leads to a superior recognition rate compared to classical factor analysis methods. The proposed symbolic KDA outperforms symbolic PCA, symbolic LDA and symbolic KPCA under variable lighting conditions, orientations and expressions.

The proposed symbolic KDA has many advantages compared to classical factor analysis methods. The drawback of classical factor analysis methods is that, in order to recognize a face seen from a particular pose and under a particular illumination, the face must have been previously seen under the same conditions. The symbolic KDA overcomes this limitation by representing the faces by interval type features, so that even faces previously seen in different poses, orientations and illuminations are recognized. Another important merit is that more than one probe image, with the inherent variability of a face, can be used for face recognition. Therefore, symbolic KDA improves the recognition accuracy compared to classical factor analysis methods at reduced computational cost, as is clearly evident from the experimental results. Further, the symbolic KDA yields significantly better results than other symbolic factor analysis methods.

References

1. Bock, H. H., Diday, E. (Eds.) 2000. Analysis of Symbolic Data. Springer Verlag.

2. Baudat, Anouar, "Generalized Discriminant Analysis Using a Kernel Approach", Neural Computation, 12, 2385-2404, 2000.

3. P. Belhumeur, J. Hespanha, D. Kriegman,1997. Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. IEEE Transaction on PAMI.19 (7): 711-720.

4. Brunelli and Poggio: Face Recognition: Features versus Templates, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 15, (1993) 1042-1052.

5. Chellappa, Wilson, Sirohey, 1995. Human and machine recognition of faces: A survey, Proc. IEEE, vol. 83(5), 705-740.

6. Chen, Liao, Ko, Lin, Yu, 2000. A new LDA-based face recognition system which can solve the small sample size problem. Pattern Recognition 33, 1713-1726.

7. Choukria, Diday, Cazes, 1995. Extension of the principal component analysis to interval data. Presented at NTTS'95: New Techniques and Technologies for statistics, Bonn.

8. Choukria, Diday, Cazes, 1998. Vertices Principal Component Analysis with an Improved Factorial Representation. In: A. Rizzi, M. Vichi, H. Bock (eds.): Advances in Data Science and Classification. Pp.397-402, Springer Verlag.

9. Diday, 1993. An Introduction to symbolic data analysis. Tutorial at IV Conf. IFCS.

10. K. Etemad and R. Chellappa, 1997. "Discriminant Analysis for Recognition of Human Face Images", J. Optical Soc. Am. vol-14, pp 1724-1733.

11. Fisher, 1938. The statistical utilization of multiple measurements, Ann. Eugenics, 8, 376-386.

12. M. A. Grudin, 2000. "On internal representations in face recognition systems", Pattern recognition, vol.33, no.7, pp.1161-1177.

13. Hiremath. P. S, Prabhakar. C. J, 2005. "Face Recognition Technique using Symbolic PCA Method", Proc. Int. Conf. on Pattern Recognition and Machine Intelligence (PreMI'05), Kolkata, 266-271, Springer Verlag.

14. Hiremath. P. S, Prabhakar. C. J, "Face Recognition Technique using Symbolic kernel PCA Method", Proc. Int. Conf. on Cognition and Recognition (COGREC'05), Mysore, 801-805, Allied Publishers (2005).

15. Hong, Yang,1991. Optimal discriminant plane for a small number of samples and design method of classifier on the plane. Pattern Recognition 24(4),317-324.

16. Kirby, Sirovich, 1990. Applications of the Karhunen-Loeve procedure for the characterization of human faces, IEEE Trans. Pattern Anal. Machine Intell., vol. 12(1), 103-108.

17. Kernel Machines, http://www.kernel-machines.org, 2000.

18. Lauro, Verde, Palumbo, 1997. Analysis of symbolic data, Bock and Diday (Eds), Springer Verlag.

19. Liu, Cheng, Yang, 1992a. A generalized optimal set of discriminant vectors. Pattern Recognition 25,731-739.

20. Liu, Wechsler, 2002. Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition. IEEE Trans. Image Process. 11(4), 467-476.

21. Liu, Wechsler, 2000: Robust coding schemes for indexing and retrieval from large face databases, IEEE Trans. On image processing 9, 132-137.

22. Liu, Cheng, Yang, 1993. Algebraic feature extraction for image Recognition based on an optimal discriminant criterion, Pattern Recognition, 26, 903-911.

23. Lu, Plataniotis, Venetsanopoulos, 2003b. Face Recognition using LDA based algorithms. IEEE Trans Neural Networks, 1(1), 195-200.

24. Mika, Ratsch, Scholkopf, Muller, "Fisher Discriminant Analysis with kernels" Proc. IEEE Int Workshop Neural Networks for signal Processing, 41-48,1999.

25. Muller, Mika, Ratsch, Tsuda, Scholkopf, 2001. "An Introduction to Kernel-Based Learning Algorithms", IEEE Trans. Neural Networks, vol. 12, 181-201.

26. A. Pentland, B. Moghaddam and T. Starner, 1994. "View based and modular Eigenfaces for Face Recognition" , Proc. Computer Vision and Pattern Recognition, pp.84-91.

27. Scholkopf and A. Smola, and K. Muller, "Nonlinear Component Analysis as a kernel Eigenvalue Problem", Neural Computation, vol.10, pp.1299-1319, 1998.

28. D. Swets, J. Weng, 1996. Using discriminant eigen-features for image retrieval. IEEE. Transactions on PAMI.18, 831-836.

29. Turk, Pentland, 1991. Eigenfaces for Recognition, J. Cognitive Neuroscience, vol. 3, 71-86.

30. Yu, Yang, 2001. A direct LDA algorithm for high-dimensional data with application to face recognition, Pattern Recognition, 34(7), 2067-2070.

31. M. H. Yang, "Kernel Eigenfaces vs. Kernel Fisherfaces: Face Recognition Using Kernel Methods", Proc. Fifth IEEE Int'l Conf. Automatic Face and Gesture Recognition, pp. 215-220, 2002.

32. M. H. Yang, N. Ahuja and D. Kriegman, " Face Recognition Using Kernel Eigenfaces", Proc. IEEE Int'l Conf. Image Processing, 2000.

33. Ye, Li, 2004. LDA/QR. An efficient and effective dimension reduction algorithm and its theoretical foundation. Pattern Recognition, 27(9), 1209-1230.

34. Zhao, Chellappa, Phillips, 1999. Subspace linear discriminant analysis for face recognition. Technical Report, CS-TR4009, University of Maryland.

35. Zhao, Chellappa, Phillips, Rosenfeld, 2003. Face Recognition: A literature survey, ACM Comput. Surveys, 35(4), 399-458.


Minutiae-Orientation Vector Based Fingerprint Matching

Li-min Yang*, Jie Yang and Yong-liang Zhang

Institute of Image Processing and Pattern Recognition Shanghai JiaoTong University (SJTU),

Shanghai, 200240, P.R. China E-mail: [email protected]

Fingerprint matching is an important problem in fingerprint identification. In this paper, we propose a novel minutiae-orientation vector (MOV) for fingerprint matching. MOV combines both orientation information and neighborhood minutiae information. A set of reference point pairs is identified based on MOV, alignment is done coordinately and directionally based on these reference point pairs, and a matching score is computed. The experimental results show that the proposed matching scheme is very competitive compared with the algorithms that participated in FVC2004 DB2_A.

Keywords: Fingerprint alignment; Fingerprint matching; Minutiae-orientation vector.

1. Introduction

Fingerprints have been used for identifying individuals since ancient times, and fingerprint identification is the most widely used biometric authentication method at present. Many fingerprint matching methods have been reported in the literature; among them, the minutiae-based methods are the most popular. A fingerprint can be represented by a set of minutiae, including ridge endings and bifurcations. Each minutia can be described by its spatial location associated with its direction and the minutia type.

A novel fingerprint feature named the adjacent feature vector (AFV) was defined in [1]. An AFV consists of four adjacent relative orientations and six ridge counts of a minutia, and an AFV-based fuzzy scoring strategy is introduced to evaluate similarity levels between matched minutiae pairs. [2] proposed a triangular matching method to cope with deformation and validates the matching by dynamic time warping. [3] used an orientation-based minutia descriptor for minutiae matching. [4] integrated point pattern and texture pattern by first aligning two fingerprints using the point pattern and then matching their texture features. [5] defined a novel feature vector for each fingerprint minutia based on the global orientation field; these features are used to identify corresponding minutiae between two fingerprint impressions by computing the Euclidean distance between vectors. [6] proposed a minutiae matching technique which uses both local and global structural information. [7] developed a method to evenly align two sets of minutiae based on multiple reference minutiae (MRM): the MRM are distributed in different regions of the fingerprint and the two sets of minutiae are globally and evenly aligned. Thus a pair of corresponding minutiae that remains unpaired in the SRM-based method, because it is far away from the reference minutiae, may be paired in the MRM-based method.

In this paper, we propose a novel minutiae-orientation vector (MOV) for each minutia, which combines both neighborhood minutiae and surrounding orientation information. Based on MOV, one main-reference point pair and a set of associate-reference point pairs are identified. Alignment is performed according to these reference point pairs, and the matching score is computed thereafter. Experiments are conducted on the public domain collection of fingerprint images, DB2_A of FVC2004. Compared with earlier works like [1], the strength of the proposed scheme lies in the fact that the MOV structure combines both neighborhood minutiae and surrounding orientation information. MOV has limited dimensions, which means shorter computational time. Furthermore, the multi-reference point pair mechanism is more robust to deformations than a one-reference point pair method.

The rest of this paper is organized as follows. Section 2 describes the proposed minutiae-orientation vector. Our fingerprint matching method is presented in Sec. 3. The experimental results are reported in Sec. 4. Section 5 concludes this paper.


2. Definition of the Novel Minutiae-Orientation Vector(MOV)

In general, a minutia point M from a fingerprint minutiae set can be described by a feature vector given by:

$$F = (x, y, \omega). \quad (1)$$

where (x, y) is its coordinate and ω is its orientation. Generally, the orientation of a minutia is in the range [-π/2, π/2]. Considering that a difference of 30° may turn into a difference of 150° due to the effect of rotation of the fingerprint image on the directions of ridges, the orientation difference of two minutiae ω1 and ω2 is calculated as follows:5

$$d(\omega_1, \omega_2) = \begin{cases} \omega_1 - \omega_2 & \text{if } -\pi/2 \le \omega_1 - \omega_2 \le \pi/2 \\ \omega_1 - \omega_2 + \pi & \text{if } -\pi \le \omega_1 - \omega_2 < -\pi/2 \\ \omega_1 - \omega_2 - \pi & \text{if } \pi/2 < \omega_1 - \omega_2 \le \pi \end{cases}$$

Let M(x, y, ω) denote an arbitrary minutia from a fingerprint image. We define our novel minutiae-orientation vector (MOV) as follows.

Firstly, draw a circle C around M with center M(x, y) and radius 6T (T is the local average ridge distance, empirically determined). Let θ1 = ω, θ2 = ω + π/4, θ3 = ω + π/2 and θ4 = ω + 3π/4. We plot four lines l1, l2, l3 and l4 through the minutia point M along the angles θ1, θ2, θ3 and θ4 with respect to the X axis. Label the eight intersection points of circle C with the lines l1, l2, l3 and l4 as $C_1^M, C_2^M, \ldots, C_8^M$; $C_i^M$ has coordinate $(x_{C_i^M}, y_{C_i^M})$ and orientation $\omega_{C_i^M}$, where i = 1, 2, ..., 8 (illustrated in Fig. 1).

Fig. 1. Illustration of the minutiae-orientation vector (MOV).

Secondly, search around M to find 8 selective minutiae $M^1, M^2, \ldots, M^8$, where $M^i$ has coordinate $(x^i, y^i)$ and orientation $\omega^i$, i = 1, 2, ..., 8. Each selective minutia $M^i$ should satisfy the following two conditions (illustrated in Fig. 1):

$$\omega_{MM^i} \in f(i) = [\omega + (i-1) \times 45°, \; \omega + i \times 45°), \qquad D_{MM^i} = \min_k D_{MM^k} \quad (2)$$

where the minimum is taken over the minutiae $M^k$, k = 1, 2, ..., N, whose direction from M falls in f(i); $\omega_{MM^i}$ denotes the direction from M to minutia $M^i$, $D_{MM^i}$ denotes the distance between M and minutia $M^i$, and N is the number of minutiae on this fingerprint.

The proposed novel minutiae-orientation vector (MOV) at minutia point M can be defined as:

$$MOV(M) = \{ d(\omega, \omega_{MC_i^M}), \; d(\omega, \omega^i), \; d(\omega, \omega_{MM^i}), \; D_{MM^i} \}_{i=1,2,\ldots,8} \quad (3)$$

where $\omega_{MC_i^M} \in (-\pi/2, \pi/2)$ is the direction from M to $C_i^M$.

This vector MOV(M) of the minutia M is invariant to rotation and translation of the fingerprint images. As we know, if two minutiae from two images of the same finger are corresponding minutiae, they will have similar MOVs.

3. Fingerprint Matching Based on MOV

A minutia can be described by its position, direction and type (ridge ending or ridge bifurcation). Only minutia position and minutia direction are used in this paper. In order to align two point sets before calculating the matching score, we need to identify a set of reference point pairs.

3.1. Reference Point Pairs Identification

Suppose $T_k$ and $Q_l$ (where k = 1, 2, ..., K, l = 1, 2, ..., L; K and L are the numbers of minutiae detected on T and Q, respectively) are two minutiae from the template fingerprint and the query fingerprint, respectively. The similarity level between $T_k$ and $Q_l$ is calculated as:

$$S(T_k, Q_l) = \begin{cases} \dfrac{Thr - \mathcal{L}(T_k, Q_l)}{Thr} & \text{if } \mathcal{L}(T_k, Q_l) < Thr \\ 0 & \text{otherwise} \end{cases} \quad (4)$$

where $\mathcal{L}(T_k, Q_l) = |MOV(T_k) - MOV(Q_l)|$ is the Euclidean distance between the MOVs of $T_k$ and $Q_l$, and Thr is a threshold. The similarity level $S(T_k, Q_l)$, $0 \le S(T_k, Q_l) \le 1$, describes a matching certainty level of two MOVs instead of a simple matched/not-matched decision: $S(T_k, Q_l) = 1$ implies a perfect match, while $S(T_k, Q_l) = 0$ implies a total mismatch.

Note that not all dimensions of the MOV necessarily exist. When calculating the similarity level between $T_k$ and $Q_l$, only the dimensions that exist in both minutiae are employed to decide the similarity level.
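A minimal sketch of Eq. (4) with a validity mask for the possibly missing MOV dimensions noted above; array-based MOVs and the mask representation are assumptions.

```python
import numpy as np

def similarity(mov_t, mov_q, valid_t, valid_q, thr):
    """Fuzzy similarity level of Eq. (4) over the dimensions present in
    both minutiae (boolean validity masks)."""
    mask = valid_t & valid_q
    if not mask.any():
        return 0.0
    dist = np.linalg.norm(mov_t[mask] - mov_q[mask])  # Euclidean distance
    return (thr - dist) / thr if dist < thr else 0.0
```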

The best matched minutiae pair (A, B) is obtained by maximizing the similarity level as:

$$S(A, B) = \max_{k, l} S(T_k, Q_l). \quad (5)$$

Minutiae pair (A, B) is identified as the main-reference point pair, while the surrounding intersection point pairs of A and B, $(C_i^A, C_i^B)$, i = 1, 2, ..., 8, are identified as associate-reference point pairs.

3.2. Fingerprint Alignment

Aligning two minutiae sets T and Q means rotating and shifting one of them so that minutiae in T approximately overlap their corresponding counterparts in Q. After obtaining the main-reference point pair $(A(x_a, y_a, \omega_a), B(x_b, y_b, \omega_b))$ and the associate-reference point pairs $(C_i^A(x_i^a, y_i^a, \omega_i^a), C_i^B(x_i^b, y_i^b, \omega_i^b))$, i = 1, 2, ..., 8, the rotation parameter $\Delta\omega$ between template minutiae set T and query minutiae set Q is computed by:

$$\Delta\omega = \alpha(\omega_b - \omega_a) + \sum_i \beta_i (\omega_i^b - \omega_i^a) \quad (6)$$

where $\alpha$ and $\beta_i$ are weights, and $\alpha + \sum \beta_i = 1$, i = 1, 2, ..., 8.

Rotate $C_i^B$ by angle $\Delta\omega$ with respect to reference point B, according to the following formula:

$$\begin{pmatrix} X_i^B \\ Y_i^B \end{pmatrix} = \begin{pmatrix} \cos\Delta\omega & \sin\Delta\omega \\ \sin\Delta\omega & -\cos\Delta\omega \end{pmatrix} \begin{pmatrix} x_i^b - x_b \\ y_i^b - y_b \end{pmatrix} \quad (7)$$

where $(X_i^B, Y_i^B)$ is the coordinate of the rotated $C_i^B$. Then the translation vector $(\Delta x, \Delta y)^T$ between T and Q can be computed by:

$$\begin{pmatrix} \Delta x \\ \Delta y \end{pmatrix} = \gamma \begin{pmatrix} x_a - x_b \\ y_a - y_b \end{pmatrix} + \sum_i \eta_i \begin{pmatrix} x_i^a - X_i^B \\ y_i^a - Y_i^B \end{pmatrix} \quad (8)$$

where $\gamma$ and $\eta_i$ are weights, and $\gamma + \sum \eta_i = 1$, i = 1, 2, ..., 8.

After obtaining the translation and rotation parameters $(\Delta x, \Delta y)$ and $\Delta\omega$, we translate and rotate all the minutiae of the query set Q according to the following formula:

$$\begin{pmatrix} x_i' \\ y_i' \\ \omega_i' \end{pmatrix} = \begin{pmatrix} \cos\Delta\omega & \sin\Delta\omega & 0 \\ \sin\Delta\omega & -\cos\Delta\omega & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x_i \\ y_i \\ \omega_i \end{pmatrix} + \begin{pmatrix} \Delta x \\ \Delta y \\ \Delta\omega \end{pmatrix} \quad (9)$$

where $(x_i, y_i, \omega_i)^T$ represents a minutia of the query minutiae set and $(x_i', y_i', \omega_i')^T$ represents the corresponding aligned minutia.
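A NumPy sketch of the alignment of Eq. (9) as reconstructed above; the reflection-like rotation matrix and the additive translation follow that reconstruction and are therefore assumptions.

```python
import numpy as np

def align_minutiae(minutiae, dw, dx, dy):
    """minutiae: (n, 3) array of (x, y, w); returns the aligned set Q'."""
    R = np.array([[np.cos(dw),  np.sin(dw)],
                  [np.sin(dw), -np.cos(dw)]])
    xy = minutiae[:, :2] @ R.T + np.array([dx, dy])   # rotate then shift
    w = minutiae[:, 2] + dw                           # orientations shift
    return np.column_stack([xy, w])
```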

Let Q' denote the new minutiae set of the query fingerprint after transformation with the estimated translation and rotation parameters.

3.3. Matching Score Computation

For the transformed minutiae set Q', we re-compute the MOVs of each minutia. By calculating and comparing the similarity levels of the minutiae's MOVs, we find the corresponding minutiae pairs between the transformed minutiae set $\{Q_l'\}_{l=1,2,\ldots,L}$ and the originally extracted minutiae set $\{T_k\}_{k=1,2,\ldots,K}$. Let $\{T_{c_n}, Q'_{c_n}\}_{n=1,2,\ldots,N'}$, $N' \le \min(K, L)$, denote the corresponding minutiae pairs; then the final matching score $M_{score}$ between the query and template fingerprints can be calculated according to the following equation:

$$M_{score} = \sum_{n=1}^{N'} S(T_{c_n}, Q'_{c_n}). \quad (10)$$

where the similarity level $S(T_{c_n}, Q'_{c_n})$ is computed according to Eq. (4).


4. Experimental Results

The experiments reported in this paper have been conducted on the public domain collection of fingerprint images, DB2_A of FVC2004.8 The fingerprints of DB2_A are acquired through the optical sensor "U.are.U 4000" by Digital Persona, and each fingerprint has a size of 328x364 pixels at 500 dpi. This database contains 800 fingerprints captured from 100 different fingers, with eight impressions per finger. According to the FVC rules,8 each fingerprint is matched against the remaining fingerprints of the same finger in genuine matches, so the total number of genuine tests is (8x7x100)/2 = 2800. In impostor matches, the first fingerprint of each finger is matched against the first fingerprint of the remaining fingers, so the total number of false acceptance tests is (100x99)/2 = 4950.

Fig. 2. ROC curve on FVC2004 DB2_A obtained by the proposed algorithm.

Table 1. Comparison of the proposed algorithm with P039 and P071 on DB2_A.

Algorithm      EER (%)  FMR100 (%)  FMR1000 (%)  Average match time (s)
P039           1.58     2.18        5.79         0.83
Our Algorithm  2.04     2.23        3.30         0.45
P071           2.59     3.14        4.96         0.57

We tested the relationship between FMR (False Match Rate) and FNMR (False Non-Match Rate) on DB2_A. The Receiver Operating Characteristic (ROC) curve obtained by the proposed algorithm is illustrated in Fig. 2. In Table 1, we compare the performance of the proposed scheme with the results of the two best algorithms, called "P039" and "P071", on DB2_A of FVC2004. According to the ranking rule in terms of EER in FVC2004, the proposed scheme is in second place on DB2_A.

5. Conclusions

In this paper, we propose a MOV-based matching scheme. For each minutia, a minutiae-orientation vector (MOV) is constructed, which is rotation and translation invariant. Based on MOV, a set of reference point pairs between the template and query fingerprint minutiae sets is identified, alignment is done according to these reference point pairs, and fingerprint matching based on the proposed MOVs is then performed. Experimental results on the public domain collection of fingerprint images, DB2_A of FVC2004, show that fingerprint matching based on the proposed MOV achieves good performance.

References

1. Xifeng Tong, Jianhua Huang, Xianglong Tang, Daming Shi, Fingerprint minutiae matching using the adjacent feature vector, Pattern Recognition Letters 26 (2005) 1337-1345

2. Z. M. Kovács-Vajna, A fingerprint verification system based on triangular matching and dynamic time warping, IEEE Trans. Pattern Anal. Mach. Intell. 22 (11) (2000) 1266-1276

3. M. Tico, P. Kuosmanen, Fingerprint matching using an orientation-based minutia descriptor, IEEE Trans. Pattern Anal. Mach. Intell. 25 (8) (2003) 1009-1014

4. A. Ross, A. K. Jain, J. Reisman, A hybrid fingerprint matcher, Pattern Recognition 36 (7) (2003) 1661-1673

5. Jin Qi, Suzhen Yang, Yangsheng Wang, Fingerprint matching combining the global orientation field with minutia, Pattern Recognition Letters 26 (2005) 2424-2430

6. Jiang, X., Yau, W. Y., Fingerprint minutiae matching based on the local and global structures, in: Proc. Internat. Conf. Pattern Recognition, ICPR2000, Barcelona, Spain, vol. 2, pp. 1042-1045

7. En Zhu, Jianping Yin, Guomin Zhang, Fingerprint matching based on global alignment reference minutiae, Pattern Recognition 38 (2005) 1685-1694

8. Biometric Systems Lab., Pattern Recognition and Image Processing Lab., Biometric Test Center, [Online] Available: http://bias.csr.unibo.it/fvc2004/



Recognition of Pose Varied Three-Dimensional Human Faces Using Structured Lighting Induced Phase Coding

Debesh Choudhury

Photonics Division, Instruments Research and Development Establishment,
Raipur Road, Dehradun 248008, Uttaranchal, India
E-mail: [email protected]

Face recognition has received a lot of attention in the recent past. Most of the reported techniques are based on the use and analysis of the two-dimensional (2D) information available in 2D facial images. Since human faces are originally three-dimensional (3D) objects, association of 3D sensed information can make face recognition more robust and accurate. This paper reports a method for recognition of faces by utilizing phase codes obtained from structured light patterns projected onto the faces. The phase differences associated with the distorted projected patterns are detected, and the computed phase maps are utilized to synthesize complex signature functions, the spatial frequency distributions of which are directly proportional to the computed phase maps and hence to the original 3D face shape. The synthesized signature functions of the test faces are compared with that of the target face by digital cross-correlation. Analyses of the cross-correlation intensities (squared modulus) complete the recognition process. Preliminary experimental results are presented for faces with wide variations of pose (out-of-plane head rotations).

Keywords: Biometrics; face recognition; 3D object recognition.

1. Motivation

Out of all the biometric authentication areas, the topic of human face recognition research has received perhaps the most attention.1,2 A lot of techniques exist in the literature for face recognition utilizing feature analysis, the eigenface approach, neural processing, global two-dimensional (2D) matching of 2D images, etc.3,4 All these techniques involve utilization or analysis of information obtained from 2D facial images. Since a human face is originally a 3D object, direct association of 3D shape information would make face recognition more accurate and robust.

2. Introduction

Three-dimensional face recognition is a relatively recent trend, although face recognition based on 3D modeling of the 2D facial image has been well addressed. Face recognition research, in general, is experiencing a paradigm shift towards 3D face recognition that utilizes directly (or indirectly) sensed 3D shape information.2 A recent literature review5 points to a variety of methods that utilized range images (and/or stereo images) and applied several techniques, such as segmentation, feature analysis, principal component analysis, the Hausdorff distance method, the iterative closest point approach, etc., for the purpose of 3D face recognition.

Face recognition in the 3D regime has several advantages. 3D face recognition overcomes the limitations due to pose (viewpoint or orientation) and lighting variations. It may also solve the problem of discriminating a live face from fake faces,6 because faking a 3D face would be more difficult than faking a 2D image. However, different 3D recognition schemes may require different forms of 3D signatures. Moreover, standard 3D face databases are yet to be freely available. Nevertheless, the superior performance of 3D face recognition techniques over image-based ones calls for more studies in this area. In what follows we consider a technique for recognition of 3D faces using directly sensed 3D shape induced phase coding.7

A structured light pattern (SLP) is projected onto the faces. The 3D depth variations in the faces induce proportional distortions in the projected patterns. The SLP projected face images are captured using a CCD camera. The phase differences associated with the distortions of the SLP projections are extracted by using a Fourier-fringe analysis method.8 The extracted phase values characterize the 3D shape of the faces. The detected phase maps are utilized as shape codes to synthesize 2D spatial harmonic functions. The encoded spatial functions can be used as unique signatures for the face objects. The synthesized signature functions of the test faces are compared with that of the target face by digital cross-correlation (CrossCorr).



Fig. 1. Schematic diagram of the proposed face recognition system along with the algorithm of the phase coded cross-correlation.

The CrossCorr intensities (squared modulus) are analyzed. The signal-to-noise ratio (SNR) of the CrossCorr results gives a quantitative measure for recognition or rejection of a face class.
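The Fourier-fringe phase extraction step described above can be sketched with a few FFT calls; the following NumPy fragment is an illustrative simplification in which the first-order lobe is isolated by a crude rectangular window around an assumed, known carrier column of the projected grating, rather than by the careful spatial filtering of Ref. 8.

    import numpy as np

    def phase_difference(distorted, undistorted, carrier_col, half_width):
        # Select the first-order spectrum of each SLP image, invert it,
        # and take the phase of the product of the distorted signal with
        # the conjugate of the undistorted one (wrapped to (-pi, pi]).
        def first_order(img):
            F = np.fft.fft2(img)
            mask = np.zeros(F.shape)
            mask[:, carrier_col - half_width:carrier_col + half_width + 1] = 1.0
            return np.fft.ifft2(F * mask)
        a_d = first_order(distorted)
        a_u = first_order(undistorted)
        return np.angle(a_d * np.conj(a_u))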

3. Theory

Figure 1 represents the schematic of the proposed face recognition system. The configuration consists of a white-light projector connected to a computer. A SLP from the computer, say a sinusoidal grating, is projected on the human face object positioned in front of a uniform background screen. A CCD camera connected to the same computer captures the projected patterns with and without the face object. The captured images are processed to compute the associated phase difference using Fourier-fringe analysis.8 In short, the first order Fourier spectra of the distorted and the undistorted SLP images are selected by spatial filtering and their inverse Fourier transforms are computed. The associated phase difference is then extracted by computing the arctangent of the product of the inverse Fourier transform of the first order spectrum of the distorted SLP and the conjugate of that of the undistorted SLP.8 This computed phase map characterizes the 3D shape of the face, and the phase values are, in general, wrapped, i.e., the phase values may have 2π ambiguity due to large depth variations in the face objects. The wrapped phase maps are used for synthesizing the signature functions by spatial coding. The phase coding algorithm and the associated computational process are described in the lower dashed block of Fig. 1. If φ_t(x, y) and φ_o(x, y) represent the computed phase maps corresponding to the SLP projected target face image and a general test object face image respectively, we define the signature functions s_t(x, y) and s_o(x, y) of the target and the test object faces respectively as

s_t(x, y) = A \exp\{ i2\pi \beta \phi_t(x, y) f(x, y) \},   (1)

s_o(x, y) = A \exp\{ i2\pi \beta \phi_o(x, y) f(x, y) \},   (2)

where (x, y) represent the rectangular spatial coordinates, f(x, y) is a function of the coordinates, i = √(−1), β is a positive factor (fraction or whole number) and A is a constant. The function f(x, y) is used for spatial coding, and can be any suitable function of the coordinates, linear or powered. The signature functions s_t(x, y) and s_o(x, y) represent the 3D shape information of the face objects spatially encoded in terms of spatial modulations. The spatial frequency distributions of the signatures are proportional to the computed phase maps corresponding to the face objects. The signature functions of the faces contain high spatial frequency information; hence we prefer to compare them using the frequency domain correlation pattern recognition method9 instead of the well-known image domain techniques such as principal component analysis. The cross-correlation operation between the signature functions of the target face and the test face may be expressed as

c(x, y) = \iint S_t(u, v) S_o^{*}(u, v) N(u, v) \exp[-i2\pi(ux + vy)]\, du\, dv   (3)

where (u, v) are the spatial frequency coordinates corresponding to the spatial coordinates (x, y), S_t(u, v)



and S_o(u, v) represent the Fourier transforms of the signature functions s_t(x, y) and s_o(x, y) respectively, * signifies the complex conjugate, and N(u, v) is a notch type spatial filter that blocks the zero order component of S_t(u, v) and passes the higher orders, satisfying the relation

N(u, v) = 1 - \mathrm{circ}\left( \frac{\sqrt{u^2 + v^2}}{\beta |\phi_t|_{\min}} \right)   (4)

The minimum value of |φ_t| is used, because the spatial filter distribution pertaining to the computed phase map corresponding to the target face falls within that range. The β factor may be used to control the sensitivity of the spatial coding, and may be suitably selected after some test experiments. The Fourier transform S_t(u, v) may explicitly be written as10

S_t(u, v) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} s_t(x, y) \exp\{-i2\pi(ux + vy)\}\, dx\, dy
          = A \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} \exp\{ i2\pi [\beta \phi_t(x, y) f(x, y) - ux - vy] \}\, dx\, dy.   (5)

Referring to the well-known method of stationary phase,11 it can be noticed that the main contribution to the integral of equation (5) comes from the points where the phase

\psi(x, y) = \beta \phi_t(x, y) f(x, y) - ux - vy   (6)

is stationary, i.e.,

\frac{\partial \psi}{\partial x} = \frac{\partial \psi}{\partial y} = 0.   (7)

Now, partially differentiating equation (6) with respect to x and y and applying the method of stationary phase,11 we obtain

u = \beta \left[ \phi_t(x, y) \frac{\partial f(x, y)}{\partial x} + f(x, y) \frac{\partial \phi_t(x, y)}{\partial x} \right],   (8)

v = \beta \left[ \phi_t(x, y) \frac{\partial f(x, y)}{\partial y} + f(x, y) \frac{\partial \phi_t(x, y)}{\partial y} \right].   (9)

The face signature points (x, y) that give a dominant contribution to the spectrum distribution at (u, v) are governed by equations (8) and (9). Similarly, the spectrum distribution S_o(u, v) of the signature function s_o(x, y) due to the general test face can also be expressed. If φ_o ≠ φ_t, the spatial frequency spectrum distributions of the general test face signature and the target face signature are different, and therefore they are uncorrelated. On the other hand, if φ_o ≈ φ_t, the spectrum distributions of the test face signature and the target face signature overlap in the spatial frequency domain and they are correlated.
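The signature synthesis of Eqs. (1)-(2) and the notch-filtered correlation of Eqs. (3)-(4) translate directly into NumPy; the sketch below assumes the coding function f(x, y) = x − y and β = 0.05 used in the experiments reported later, and an illustrative notch radius, so it should be read as a schematic rather than the exact implementation.

    import numpy as np

    def signature(phi, beta=0.05, A=1.0):
        # Eqs. (1)-(2) with the coding function f(x, y) = x - y.
        ny, nx = phi.shape
        x, y = np.meshgrid(np.arange(nx), np.arange(ny))
        return A * np.exp(1j * 2.0 * np.pi * beta * phi * (x - y))

    def cross_correlation(s_t, s_o, notch_radius):
        # Eq. (3): frequency-domain correlation; N(u, v) = 1 - circ(.)
        # zeroes a disc around the origin to block the zero order.
        St, So = np.fft.fft2(s_t), np.fft.fft2(s_o)
        U, V = np.meshgrid(np.fft.fftfreq(s_t.shape[1]),
                           np.fft.fftfreq(s_t.shape[0]))
        N = (np.sqrt(U**2 + V**2) > notch_radius).astype(float)
        return np.fft.ifft2(St * np.conj(So) * N)

A sharp peak in the squared modulus of c(x, y) then indicates a true class match, and a flat noise floor a false class.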

3.1. Personalized spatial coding

It is interesting to point out the possibility of assigning different spatial coding functions to different persons. For example, the coding functions f_1(x, y), f_2(x, y), f_3(x, y) may be used to synthesize the signature functions of persons 1, 2, 3 respectively. The signature function of the same face (say person 1) will not be the same for a different coding function [say with f_2(x, y)], because according to equations (8) and (9), the spatial frequency distribution of a face signature function depends on the coding function f(x, y). Person-specific coding functions will render the spatial coding more stringent and will help to reinforce secure face recognition in an authentication scenario.

4. Results and discussions

The feasibility is tested experimentally with the faces of 20 persons. We put f(x, y) = x − y in equations (1) and (2). Movies of SLP projected face sequences are captured with variations in pose (i.e., out-of-plane head rotations) in both left-right and up-down directions (about 90 degrees) using a commercial digital camera (Sony Cybershot).



Fig. 2. Experimental results: (a) SLP projected 1st face; (b) same with rotated head; (c) SLP projected 2nd face; (d) phase map of (a); (e) phase map of (b); (f) phase map of (c); (g) signature of (a); (h) signature of (b); (i) signature of (c); (j) CrossCorr of (g) and (h); (k) CrossCorr of (h) and (i); (l) CrossCorr of (i) and (g).

The extracted frames are processed using the proposed spatial phase coding algorithm (Fig. 1). The SLP projected face images are of 128×128 pixel size. The value of β is kept at 0.05 throughout. Experimental results with two face samples are shown in Fig. 2. The SLP projected face images are shown in Fig. 2(a) and Fig. 2(b) (faces of the same person with different poses, i.e., faces of the same class), and Fig. 2(c) shows that of a different person. The phase maps corresponding to the depth-induced distortions in Fig. 2(a), Fig. 2(b) and Fig. 2(c) are computed and shown in Fig. 2(d), Fig. 2(e) and Fig. 2(f) respectively. These phase maps are utilized to synthesize the signature functions. The real parts of the signature functions of the face images of Fig. 2(a), Fig. 2(b) and Fig. 2(c) are shown in Fig. 2(g), Fig. 2(h) and

Fig. 2(i) respectively. The normalized intensities of the CrossCorr function of Fig. 2(g) and Fig. 2(h), that of Fig. 2(h) and Fig. 2(i), and that of Fig. 2(i) and Fig. 2(g) are shown respectively in Fig. 2(j), Fig. 2(k) and Fig. 2(l). The high peak in Fig. 2(j) exemplifies correct recognition between face signatures of pose varied faces of the same class, whereas Fig. 2(k) and Fig. 2(l) contain no well-defined peak but only noise, which clearly demonstrates discrimination of faces of the false class.

We present some more results to show the effects of pose variation on the cross-correlations. Figure 3 shows some sample results for two face objects. The SLP projected pose varied faces of a person are shown in Fig. 3(a) - Fig. 3(f). The SLP projected frontal face image of the same person is shown in Fig. 3(g), whereas Fig. 3(h) shows the same for another person. The CrossCorr between the signature functions of the faces of Fig. 3(a) - Fig. 3(f) and that of the face of Fig. 3(g) are respectively shown in Fig. 3(i) - Fig. 3(n), which contain sharp correlation peaks. These sharp peaks are evidence of correct recognition with pose varied faces of the same class. Figures 3(o) - 3(t) show the CrossCorr between the signature functions of the faces of Fig. 3(a) - Fig. 3(f) and that of the face of Fig. 3(h). The CrossCorr in Fig. 3(o) - Fig. 3(t) contain no well-defined peak but only noise, which signifies mismatch for faces of a different class. Therefore, the proposed spatially coded signature functions of 3D faces can be utilized to recognize true class faces and to reject false class faces.

We also show an example of personalized spatial coding by using two different spatial coding functions f(x, y); the results are shown in Fig. 4. The SLP projected face images are shown in Fig. 4(a), Fig. 4(b) and Fig. 4(c), all of which belong to the same class, i.e., they are faces of the same person. The signature functions corresponding to the faces of Fig. 4(a) and Fig. 4(b) are shown in Fig. 4(d) and Fig. 4(e) respectively, with the coding function f(x, y) = x − y. Figure 4(f) shows the signature function corresponding to the face image of Fig. 4(c) but with the coding function f(x, y) = 0.5x − y. The normalized intensities of the CrossCorr function of Fig. 4(d) and Fig. 4(e), that of Fig. 4(e) and Fig. 4(f), and that of Fig. 4(f) and Fig. 4(d) are shown respectively in Fig. 4(g), Fig. 4(h) and Fig. 4(i). In these CrossCorr results, only Fig. 4(g) shows a sharp peak, but the other




Fig. 3. More experimental results with pose variations: (a) - (f) SLP projected pose varied faces of a person; (g) SLP projected frontal face of the same person; (h) SLP projected frontal face of another person; (i) - (n) CrossCorr of signatures of (a) - (f) with that of (g); (o) - (t) CrossCorr of signatures of (a) - (f) with that of (h).

CrossCorr show only noise, i.e., the faces of the same person with different coding functions do not correlate and match. Therefore, the signature functions can be made different by using different spatial coding functions. This gives a choice for personalized coding that can render recognition more secure.

Since a correct recognition in our technique is evidenced by a high correlation peak around the centre, we define the SNR as the ratio of the maximum correlation peak intensity around the centre to the mean noise in a rectangular area (128×128 pixels) around the centre (excluding a 21×21 pixel area at the centre where the correct peak is situated). The SNR of CrossCorr is computed for the pose variant face signatures. The frontal face signature is cross-correlated with the pose varied face signatures of the same person (true class) for head rotations in both left-right and up-down directions. The frontal face signature of a second person (false class) is also cross-correlated with the pose varied face signatures of the first person. Plots of the computed SNR values versus the pose varied (left-right) face signature numbers for two human objects are shown in Fig. 5. The SNR of CrossCorr between faces of the same person shows high values (more than 100), whereas the SNR of CrossCorr for different persons' faces is low (less than 20). Therefore, it is possible to recognize pose varied faces of the same class (person) and reject the pose varied faces of a different class (person) by analyzing the SNR of CrossCorr. Similar plots can be obtained with pose variation in the up-down direction. The rate of false recognition is nil in our feasibility experiments using face objects of 20 persons with wide variations of out-of-plane head rotations (about 90 degrees) in both the left-right direction (25 poses approx.) and the up-down direction (25 poses approx.).
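The SNR defined above is straightforward to compute from the correlation intensity array; the sketch below follows the stated 128×128 noise window and 21×21 central exclusion, with the peak assumed to lie near the array centre.

    import numpy as np

    def cross_corr_snr(corr_intensity, box=128, exclude=21):
        # Ratio of the peak intensity near the centre to the mean noise
        # in a box x box region, excluding the central exclude x exclude
        # patch where the correct correlation peak sits.
        cy, cx = corr_intensity.shape[0] // 2, corr_intensity.shape[1] // 2
        b, e = box // 2, exclude // 2
        region = corr_intensity[cy - b:cy + b, cx - b:cx + b].astype(float).copy()
        peak = region[b - e:b + e + 1, b - e:b + e + 1].max()
        region[b - e:b + e + 1, b - e:b + e + 1] = np.nan  # mask the peak
        return peak / np.nanmean(region)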





Fig. 4. Experimental results with different coding functions: (a) SLP projected 1st face; (b) same with rotated head; (c) almost same as (a); (d) signature of (a); (e) signature of (b); (f) signature of (c) with a different coding function; (g) CrossCorr of (d) and (e); (h) CrossCorr of (e) and (f); (i) CrossCorr of (f) and (d).


Fig. 5. SNR versus pose varied face signature plot.

We have utilized the correlation pattern recognition9 method for matching, although other popular techniques, such as principal component analysis (PCA) or the FisherFaces method, might have been tried. While PCA and FisherFaces work in the image domain, correlation pattern recognition works in the frequency domain and offers advantages such as shift-invariance and the ability to accommodate in-class image variability.12 Also, the spatial coding in our proposed

method creates rather high spatial frequencies in the coded signatures, which calls for a spatial frequency sensitive matching technique; the frequency domain correlation pattern recognition method satisfies that. In-plane rotation and scale are not considered in the present study; these can be surmounted using mature techniques based on wavelet and circular transforms.13,14

Although the results presented here are based on test experiments carried out on a very limited database of our own (standard 3D face databases are yet to be freely available), the feasibility of the proposed 3D face recognition algorithm is successfully demonstrated. Preliminary tests with wide pose variations show the promise of the proposed technique. It is worth mentioning that explicit reconstruction of the 3D shape, which is a tedious job, is not required. However, the system has to be calibrated like any other structured lighting based system.15 More rigorous testing is required under changed conditions, such as expression variations, with and without glasses, beards, intentional disguise, etc. Since the 3D shape of a face may change with age (over years), because the flesh and tissues may change drastically, the 3D signature database must be updated regularly for authentication purposes.

5. Conclusion

A technique for recognition of pose varied three-dimensional faces is presented. It utilizes the three-dimensional shape cues of the faces obtained by structured light pattern projection induced spatial phase coding. Experimental feasibility is tested using pattern projected face images with wide variations in pose, captured by a commercial digital camera. The test experiments bring out the novelty of the method, with excellent results of true class face recognition and false class face rejection.

Acknowledgments

All the people who agreed to expose their faces for the experiments deserve sincere thanks and gratitude. Some help from Mr. S. K. Chaukiyal is acknowledged. The author is indebted to Dr. A. K. Gupta and the Director, IRDE, for permission to present this work.



References

1. S. Z. Li and A. K. Jain, eds., Handbook of Face Recognition (Springer, 2005).

2. Reports on Face Recognition Grand Challenge initiatives, http://www.frvt.org/FRGC.

3. W. Zhao, R. Chellappa, A. Rosenfeld and P. J. Phillips, ACM Comput. Surv. 35, 399 (2003).

4. S. K. Zhou and R. Chellappa, J. Opt. Soc. Am. A 22, 217 (2005).

5. K. W. Bowyer, K. Chang and P. Flynn, Comput. Vis. Imag. Underst. 101, 1 (2006).

6. J. Li, Y. Wang, T. Tan and A. K. Jain, in Biometric Technology for Human Identification, A. K. Jain, N. K. Ratha, Eds., Proc. SPIE 5404, 296 (2004).

7. D. Choudhury and M. Takeda, Opt. Lett. 27, 1466 (2002).

8. M. Takeda and K. Mutoh, Appl. Opt. 22, 3977 (1983).

9. B. V. K. Vijaya Kumar, A. Mahalanobis and R. D. Juday, Correlation Pattern Recognition (Cambridge, New York, 2006).

10. D. Choudhury and M. Takeda, in Optical Information Systems, B. Javidi and D. Psaltis, eds., Proc. SPIE 5202, 168 (2003).

11. M. Born and E. Wolf, Principles of Optics (Pergamon, 1989).

12. K. Venkataramani, S. Qidwai and B. V. K. Vijayakumar, IEEE Trans. Systems, Man, and Cybernetics: Part C 35, 411 (2005).

13. Y. Sheng and D. Roberge, Appl. Opt. 38, 5541 (1999).

14. S. Roy, H. H. Arsenault and D. Lefebvre, Opt. Eng. 42, 813 (2003).

15. R. Legarda-Saenz, T. Bothe and W. P. Juptner, Opt. Eng. 43, 464 (2004).


Writer Recognition by Analyzing Word Level Features of Handwritten Documents

Prakash Tripathi and Bhabatosh Chanda

Electronics and Communication Sciences Unit, Indian Statistical Institute

Kolkata 700108, India E-mail: [email protected]

Bidyut Baran Chaudhuri

Computer Vision and Pattern Recognition Unit, Indian Statistical Institute

Kolkata 700108, India

Writer recognition based on handwriting is important from the security as well as the judiciary point of view. In this paper, based on static off-line data, we propose a novel writer recognition methodology using word level micro-features and a simple pattern recognition technique. The result is reasonably good and encouraging, and shows the usefulness of the computational features proposed here.

Keywords: writer recognition, handwritten document, word bounding box, word level feature, K-nearest neighbour.

1. Introduction

Analysis of handwritten documents is taken up by the research community to deal with two kinds of problems: (i) what is written, and (ii) who has written the concerned document. In this work we deal with the second problem, which has immense application potential in security systems and the judicial framework. For example, a sample of handwriting may be considered as a biometric of the writer.2 So this may be used as a means of authentication. Analysis of handwriting from the viewpoint of determining the writer has great bearing on the criminal justice system. Writer individuality rests on the hypothesis that each individual has components of handwriting that are distinct from those of another individual. Handwriting has long been considered individual property, as evidenced by the importance of signatures in documents. However, this hypothesis has not been subjected to rigorous scrutiny with the accompanying experimentation, testing, and peer review.

Each writer can be characterized by his own handwriting, by the reproduction of details and unconscious practices. This is why in certain cases the handwriting samples may be treated as a biometric, like fingerprints. The problem of writer identification arises frequently in the court of justice where the judge has to come to a conclusion about the authenticity of a document (e.g. a will, bill or receipt). Need also arises in banks for signature verification, or in

some institutes that analyze manuscripts of ancient authors and are interested in the genesis of the texts. In order to come to a conclusion about the identity of an unknown writer, two tasks may be considered:

(1) The writer identification task concerns the retrieval of handwritten samples from a database using the sample under study as a graphical query. It provides a subset of relevant candidate documents, on which the expert(s) will concentrate.

(2) The writer verification task must come to a conclusion about two samples of handwriting and determine whether they are written by the same writer or not.

When dealing with large databases, the writer identification task can also be viewed as a filtering step prior to the verification task.

Analysis of handwriting by computer has a long history3 and one of its first applications was in signature verification.4 However, that was a very constrained search, and templates of handwritten word(s), which are usually distinctive from person to person, were used for verification. Identifying and/or verifying a writer based on handwriting style (not limited to a constrained set of words) is a relatively new area, with pioneering work done by Srihari.5 The basic tasks are (i) extraction of style or forensic dependent features6 from documents, and (ii) analysis of the features with the help of some pattern



classification/recognition algorithms.7,8 For example, clustering and HMM based methodologies are proposed by Bensefia et al.9,10 These are mainly off-line methods based on static data. On-line methods are primarily based on dynamic features like pen pressure, direction of strokes, etc.1,11 Some approaches adopt both static and dynamic features.12

The objective of this work is to develop a novel writer recognition system for handwritten documents using static features relevant to writing style. Writing style consists of character shape (e.g., loop formation, curvature, corners, etc.), word formation (e.g., inter-character gap, aspect ratio, inter-word gap, etc.) and line format (e.g., skew, straightness, inter-line spacing, etc.). In this paper we have particularly explored the capabilities of word level features and employed a simple pattern recognition methodology for writer recognition. The experiment is carried out on documents written in an Indian language named Bangla.

The rest of this paper is organized as follows. Section 2 describes the data acquisition strategy. The proposed methodology is described in Section 3. Experimental results and discussion are given in Section 4, followed by concluding remarks in Section 5.

2. Data Acquisition

For such an experiment, data acquisition is very critical for drawing any meaningful conclusion. Thus, our objective was to obtain a set of samples that would capture variations in handwriting among and within writers. This means that we need handwriting samples from multiple writers, as well as multiple samples from each writer. The handwriting samples of the sample population should have the following properties.

(1) They should be sufficient in number to exhibit normal writing habits and to portray the consistency with which particular habits are executed.

(2) For comparison purposes, they should have similarity in texts, in writing circumstances, and in writing purposes.

(3) The text content should be meaningful and it should have continuity.

Several factors may influence handwriting style, e.g., gender, age, ethnicity, handedness, the learning system of handwriting, the subject matter (content), writing protocol (written from memory, dictated, or copied out), writing instrument (pen and paper), changes in the handwriting of an individual over time, etc.5 We tried to ensure that the document content captures as many features as possible. Only some of these factors were considered in the experimental design; the other factors will have to be part of a different study. However, the same experimental methodology can be used to determine the influence of the factors not considered.

2.1. Source Documents

One of the two almost similar source documents in Bangla script, which were to be copied by each writer, is shown in Fig. 1. They are concise, containing 158 words and 120 words respectively, and are composed of the most frequently occurring words in the Bangla language. A table of some such words and their frequencies, obtained from a moderately large Bangla corpus, is shown in Fig. 2. In addition, the source documents also contain punctuation, distinctive letters and a general document structure that allowed extracting macro-document attributes such as word and line spacing, line skew, etc. Each participant (writer) was required to copy one of the source documents four times in his/her most natural handwriting, using plain, unlined sheets and a ballpoint pen. The repetition was to determine, for each writer, the variation of handwriting from one occasion to the next. We collected documents from 30 writers, resulting in a total of 120 documents. Each handwritten document was scanned and converted into a two-tone image at a resolution of 300 dpi. An example of such an image is shown in Fig. 3.



Fig. 1. A portion of the original document used to collect handwriting.

Fig. 2. Some most frequently used Bangla words and their frequencies obtained from a corpus.

3. Proposed methodology

A digital binary image I is defined over a discrete domain Z², obtained by sampling the continuous domain along two orthogonal directions, such that the value of I(i, j) is either 0 or 1. In this section we describe the preprocessing, feature extraction and pattern recognition techniques adopted for the development of the intended writer recognition system.

3.1. Preprocessing

Preprocessing comprises the following basic steps.

(1) Noise removal: Stray black pixels (positive noise) of the image are removed by binary morphological opening with a 3 × 3 structuring element (SE). Let the result be I.

(2) Estimation of line count: Connected components of I are labeled. Statistics of the run-lengths of 0-pixels within and between the words are obtained. A rectangular SE is designed based on these statistics, and the image is closed with this SE to form word-blobs (see Fig. 4). Let the result be I_C. Then I_C is vertically scanned to estimate the line count and line spacing.13

Fig. 4. Illustrates the word blobs obtained by closing.

(3) Extraction of word bounding box: Based on the line count and line spacing, as well as the vertical scan of each word-blob, we determine the word bounding boxes. An example of word bounding boxes obtained by the system is shown in Fig. 5.

Fig. 3. An example of a handwritten document.



The result of this step is not 100% accurate; however, it is not very critical for the final results.13

Fig. 5. Illustrates the word bounding boxes obtained by the system. Note the errors that occurred even though noise was cleaned in step 1.

(4) Thinning: I is thinned for the extraction of structural features. Let the thinned image be I_t.14

(5) Medial axis transform and skeletonization: I also undergoes a medial axis transform, from which local maxima are extracted. Let the results be I_m and I_s, respectively.
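Steps 1 and 2 map naturally onto standard morphological operations; the following SciPy sketch is a simplified stand-in for the actual system, with the closing SE size passed in explicitly rather than derived from the run-length statistics as in the paper.

    import numpy as np
    from scipy import ndimage

    def preprocess(I, se_rows, se_cols):
        # Step 1: remove stray pixels with a 3x3 binary opening.
        I_clean = ndimage.binary_opening(I, structure=np.ones((3, 3)))
        # Step 2: close with a rectangular SE to merge characters into
        # word blobs, then label the blobs.
        I_c = ndimage.binary_closing(I_clean,
                                     structure=np.ones((se_rows, se_cols)))
        blobs, n_blobs = ndimage.label(I_c)
        return I_clean, blobs, n_blobs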

3.2. Feature extraction

We distinguish between two types of features: conventional features and computational features. Conventional features are the handwriting attributes that are commonly used by the forensic document examination community; these features are obtained from the handwriting by the naked eye or under a microscope. Computational features are those having known software/hardware techniques of extraction. Computational features can be divided into macro- and micro-features, depending on whether they pertain globally to the entire handwritten sample, e.g., uniformity in line spacing and straightness of lines, or are extracted locally, e.g., inter-character gaps and loop formation. Macro-features can be extracted at the document level (entire handwritten manuscript) or at least at the paragraph level, while micro-features may be extracted at the line, word, and character levels. As mentioned earlier, in this work we confine ourselves to word level features, which we consider to be of two types:

(1) Box level features (e.g., spread of stroke, depth of stroke, aspect ratio and density) that are computed over the entire word bounding box, and

(2) Cell level features (e.g., isolated points, corners, crossings, branches, end points, pixel density) that are computed over non-overlapping divisions of the word bounding box.

3.3. Box level features

Let the k-th bounding box of a word be represented by W_k, with two diagonal points (r_1, c_1) and (r_2, c_2) in the image. Then the box level features computed from the preprocessed images are as follows.

(1) Thickness of stroke T_s: This roughly refers to the variable thickness of stroke caused by the speed of writing, and is computed as

T_s = \frac{\sum_{(r,c) \in W_k} I(r,c)}{\sum_{(r,c) \in W_k} I_t(r,c)}

(2) Width of stroke W_s: This may be another way of capturing the thickness of stroke due to the speed of writing, and is computed as

W_s = 1 - \frac{\sum_{(r,c) \in W_k} I(r,c)}{\sum_{(r,c) \in W_k} I_m(r,c)}

(3) Spread of stroke S_s: This again roughly refers to the thickness of stroke due to the speed of writing, and is computed as

S_s = 1 - \frac{\sum_{(r,c) \in W_k} I_s(r,c)}{\sum_{(r,c) \in W_k} I_m(r,c)}

(4) Aspect ratio A: This gives an idea of the elongation of a word due to the speed and style of writing, and is computed as

A = (r_2 - r_1)/(c_2 - c_1)
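Since parts of the printed formulas are illegible, the ratios in the sketch below follow the reconstructions given above and should be treated as approximations of the authors' definitions; I, I_t, I_s and I_m are the cleaned binary, thinned, skeleton and medial-axis-transform images.

    import numpy as np

    def box_features(I, I_t, I_s, I_m, box):
        # (r1, c1) and (r2, c2) are the diagonal corners of word box W_k.
        r1, c1, r2, c2 = box
        w = (slice(r1, r2), slice(c1, c2))
        area = I[w].sum()
        Ts = area / max(I_t[w].sum(), 1)                    # thickness
        Ws = 1.0 - area / max(I_m[w].sum(), 1e-9)           # width
        Ss = 1.0 - I_s[w].sum() / max(I_m[w].sum(), 1e-9)   # spread
        A = (r2 - r1) / max(c2 - c1, 1)                     # aspect ratio
        return Ts, Ws, Ss, A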

Fig. 6. Division of a bounding box into cells.

3.4. Cell level features

Each word bounding box is divided into m × n cells. An example of such a division into 3 × 4 cells is shown in Fig. 6. For each cell, six features are computed.

(1) Number of isolated points: Isolated points are those pixels which have no neighbours. This is computed from the thinned image I_t.



(2) Number of end points: End points are those pixels which have only one neighbour. This is computed from I_t.

(3) Number of cross points: Cross points are those pixels where two strokes cross each other. This is computed based on the connectivity number14 using I_t.

(4) Number of branch points: Branch points are those pixels which have a 'T' or 'Y' junction. This is computed based on the connectivity number using I_t.

(5) Pixel density: This is the ratio of the number of black pixels to the cell size. This is computed using the image I.

(6) Number of corners: This is related to the smoothness of the strokes. This is computed using the image I, based on the facet model.15

Thus, a total of N = 6mn + 4 features per word is computed, which will be used for writer recognition. It may be noted that since the features reflect different kinds of geometrical and structural attributes, we do not employ any transformation (e.g., PCA) for feature reduction.
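Most of the cell level counts reduce to classifying skeleton pixels by their number of 8-neighbours; the sketch below lumps branch and cross points together as junctions, whereas the paper separates them via the connectivity number,14 so it is a simplification for illustration.

    import numpy as np
    from scipy import ndimage

    def cell_point_counts(cell):
        # cell: binary array holding the thinned strokes of one cell.
        on = cell.astype(bool)
        k = np.ones((3, 3), dtype=int)
        k[1, 1] = 0
        nbrs = ndimage.convolve(on.astype(int), k, mode='constant')
        isolated = int(np.sum(on & (nbrs == 0)))   # no neighbours
        ends = int(np.sum(on & (nbrs == 1)))       # one neighbour
        junctions = int(np.sum(on & (nbrs >= 3)))  # branch/cross points
        density = on.sum() / on.size               # pixel density
        return isolated, ends, junctions, density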

3.5. Recognition

The training set is a set of document images for which the writers' identities are known. It contains all the features computed for all the words of all the writers. Writer information is attached to the feature vector of every word in the training set. Each word of the training documents is represented by a point in the N-dimensional feature space. For testing, a table consisting of all the writers along with individual counters is created, and the counters are initialized to zero. For a test document, every word is used to recognize the writer using the K-nearest neighbour algorithm, and the counters for the corresponding K writers are incremented. After examining all the words of the test document, the writer having the maximum count is taken as the writer of the document. The K-nearest neighbour approach is simple and gives consistently good results.
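The voting scheme is easy to state in code; the sketch below assumes the word features are stacked as rows of a NumPy array and uses a brute-force K-nearest-neighbour search, which suffices for the data sizes involved.

    import numpy as np
    from collections import Counter

    def recognize_writer(test_words, train_feats, train_writers, K=3):
        # Each test word votes for the writers of its K nearest training
        # words (Euclidean distance); the most voted writer wins.
        votes = Counter()
        for w in test_words:
            d = np.linalg.norm(train_feats - w, axis=1)
            for idx in np.argsort(d)[:K]:
                votes[train_writers[idx]] += 1
        return votes.most_common(1)[0][0]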

4. Experimental Results and Discussion

The data set consists of four copies from each of 30 different writers. For each writer, out of the four documents, three were considered as the training data set and one as test data. Words were extracted automatically from every training document image. For each word, 76 word level features (4 box level and 72 cell level features, considering m = 3 and n = 4) were extracted, so N = 76. Euclidean distance is employed to measure similarity, and K is taken to be 3. Using a leave-one-out strategy, all of the 120 documents are used as the test set, and out of these 120 documents the writer is correctly recognized in 114 cases, giving a recognition rate of 95%. It may be noted that if the value of K is chosen to be 1 or 2, the recognition rate goes down to 67.2% and 70%, respectively. This implies that to make the system invariant to intra-writer variation, a high value of K should be taken. Here K = 3 is a reasonable choice, as we have three handwritten documents of each writer in the training set. The recognition scores of the words for some randomly selected writers are shown in Table 1.

5. Conclusion

Here we have presented a novel algorithm for writer recognition based on handwritten documents. In this work only micro-features at the word level are used. A simple K-nearest neighbour algorithm with a Euclidean distance measure has been used, and the result obtained is very encouraging.

Table 1. Recognition scores for the words for some writers (ts: test, tr: training).

ts \ tr      1      2      3      4      5      6      7      8
  1        539    100     65     48     62     29     39     71
  2         24    502     12    113    115     86     65    127
  3         97     46    505     53     40     42     51     92
  4         14     77     20    687    104     85    128    104
  5         27    107     19    126    958     54    143     54
  6         12     60      3     85     44    345     62     66
  7         18     70     26    134    181     76    591     83
  8         30    105     43     73     56     94     82    575

References

1. K. Yu, Y. Wang and T. Tan, Writer identification using dynamic features, in Proc. Intl. Conf. on Bioinformatics and its Applications (ICBA'04), (Florida, USA, 2004).

2. A. K. Jain, R. Bolle and S. Pankanti (eds), Biometrics: Personal Identification in Networked Society (Kluwer Academic, Boston, 1999).

3. E. C. Greanias, P. F. Meagher, R. J. Norman and P. Essinger, IBM J. Res. Develop. 7, 14 (1963).



4. K. K. Ho, H. Schroder and G. Leedham, Codebooks for signature verification and handwriting recognition, in Proc. of the Australian and New Zealand Intelligent Information Systems Conference, (Australia, 2006).

5. S. N. Srihari, S.-H. Cha, H. Arora and S. Lee, J. Forensic Science 47 (2002).

6. M. Tapiador and J. A. Sigüenza, Writer identification method based on forensic knowledge, in Proc. Intl. Conf. on Bioinformatics and its Applications (ICBA'04), (Florida, USA, 2004).

7. K. Steinke, Pattern Recognition 14, 357 (1981).

8. E. N. Zois and V. Anastassopoulos, Pattern Recognition 33, 101 (2000).

9. A. Bensefia, T. Paquet and L. Heutte, Pattern Recognition Letters 26, 2080 (2005).

10. A. Bensefia, T. Paquet and L. Heutte, Electronic Letters on Computer Vision and Image Analysis 5, 72 (2005).

11. K. P. Zimmerman and M. J. Varady, Pattern Recognition 18, 63 (1985).

12. W. Jin, Y. Wang and T. Tan, Text-independent writer identification based on fusion of dynamic and static features, in Proc. Intl. Workshop on Biometric Recognition Systems, (Beijing, China, 2005).

13. A. K. Das, A. Gupta and B. Chanda, Image Processing and Communication 3, 85 (1997).

14. B. Chanda and D. D. Majumdar, Digital Image Processing and Analysis (Prentice-Hall of India, New Delhi, 2000).

15. R. M. Haralick and L. G. Shapiro, Computer and robot Vision (Addison-Wesley, Reading, 1992).


PART D

Clustering Algorithms



A New Symmetry Based Genetic Clustering Technique for Automatic Evolution of Clusters

Sriparna Saha and Sanghamitra Bandyopadhyay

Machine Intelligence Unit Indian Statistical Institute

Kolkata, India
Email: {sriparna_r, sanghami}@isical.ac.in

In this article, a new symmetry based genetic clustering algorithm is proposed which automatically evolves the number of clusters as well as the proper clustering of any data set. Strings comprise both real numbers and the don't care symbol in order to encode a variable number of clusters. A newly proposed index, Sym-index, is used as a measure of the validity of the clusters, and is optimized by the genetic algorithm. Sym-index uses a new point symmetry based distance. The algorithm is therefore able to detect both convex and non-convex clusters that possess the symmetry property. Kd-tree based nearest neighbor search is used to reduce the complexity of finding the closest symmetric point. The effectiveness of the proposed method is demonstrated for several artificial data sets and one real-life data set.

Keywords: Clustering, Genetic Algorithms, Point Symmetry Based Distance, Real encoding

1. Introduction

Clustering1 is a fundamental problem in data mining with innumerable applications spanning many fields. In order to mathematically identify clusters in a data set, it is usually necessary to first define a measure of similarity or proximity which will establish a rule for assigning patterns to the domain of a particular cluster centroid. One of the basic features of shapes and objects is symmetry. Su and Chou have proposed a point symmetry (PS) distance based similarity measure.2 This work is extended in3 in order to overcome some of the limitations existing in.2 It has further been shown in4 that the PS distance proposed in3 also has some serious drawbacks, and a new PS distance (d_ps) is defined in4 in order to remove these drawbacks. For reducing the complexity of the point symmetry distance computation, a Kd-tree based data structure is used.

K-means is a widely used clustering algorithm that was used in2 in conjunction with the earlier PS based distance. However, K-means has three major limitations: it requires the a priori specification of the number of clusters (K), it often gets stuck at sub-optimal solutions depending on the initial configuration, and it can detect only hyper-spherical shaped clusters. The real challenge in this situation is to be able to automatically evolve a proper value of K as well as provide the appropriate clustering of a data set. There exists a Genetic Algorithm (GA) based clustering technique, GCUK-clustering,5 which is able to automatically evolve the appropriate clustering for hyperspherical data sets. However, for clusters with

shapes other than hyperspherical, this algorithm is likely to fail, as it uses, like K-means, the Euclidean distances of the points from the respective cluster centroids for computing the fitness value. In this article a variable string length GA based clustering technique is proposed. Here the assignment of points to different clusters is done based on a point symmetry based distance rather than the Euclidean distance. This enables the proposed algorithm to automatically evolve the appropriate clustering for all types of clusters, both convex and non-convex, which have some symmetrical structure. The chromosome encodes the centres of a number of clusters, whose value may vary. A new cluster validity index named Sym-index is utilized for computing the fitness of the chromosomes. The effectiveness of the proposed genetic clustering technique for evolving the appropriate partitioning of a data set is demonstrated on five artificial and one real-life data sets having different characteristics.

2. Proposed Algorithm (VGAPS)

In this section the new clustering algorithm is described in detail. It includes the determination of the number of clusters as well as the appropriate clustering of the data set. This genetic clustering technique is subsequently referred to as the variable length genetic clustering technique with point symmetry based distance (VGAPS).

In GAs, the parameters of the search space are encoded in the form of strings (called chromosomes). A collection of such strings is called a population.



Initially a random population is created, which represents different points in the search space. An objective/fitness function is associated with each string, representing the degree of goodness of the solution encoded in the string. Based on the principle of survival of the fittest, a few of the strings are selected and each is assigned a number of copies that go into the mating pool. Biologically inspired operators like crossover and mutation are applied on these strings to yield a new population. The process of selection, crossover, and mutation continues for a fixed number of generations or till a termination condition is satisfied.

2.1. String representation

The chromosomes in VGAPS are made up of real numbers (representing the coordinates of the centres) as well as the don't care symbol '#'. Note that real encoding of the chromosome is adopted since it is more natural to the problem under consideration. The value of K is assumed to lie in the range [K_min, K_max], where K_min is chosen equal to 2 unless specified otherwise. The length of a string is taken to be K_max, where each individual gene position represents either an actual centre or a don't care symbol.

2.2. Population initialization

For each string i in the population (i = 1, ..., P, where P is the size of the population), a random number K_i in the range [K_min, K_max] is generated. This string is assumed to encode the centres of K_i clusters. For initializing these centres, K_i points are chosen randomly from the data set. These points are distributed randomly in the chromosome. Let us consider the following example.

Example: Let Kmin = 2 and Kmax = 10. Let the random number Ki be equal to 4 for chromosome i. Then this chromosome will encode the centers of 4 clusters. Let the 4 cluster centers (4 randomly chosen points from the data set) be (10.0, 5.0) (20.4, 13.2) (15.8, 2.9) (22.7, 17.7). On random distribution of these centers in the chromosome, it may look like # (20.4, 13.2) # # (15.8, 2.9) # (10.0, 5.0) (22.7, 17.7) # # .

2.3. Fitness computation

This is composed of two steps. First, points are assigned to different clusters using the newly developed point symmetry based distance d_ps. Next, the cluster validity index, Sym-index, is computed and used as a measure of the fitness of the chromosome.

2.3.1. Newly developed point symmetry based distance, dps

The proposed point symmetry based distance d_ps(x, c) associated with a point x with respect to a centre c is defined as follows. The symmetrical (reflected) point of x with respect to a particular centre c is 2 × c − x; let us denote it by x*. Let the first and second unique nearest neighbors of x* be at Euclidean distances d_1 and d_2 respectively. Then

d_{ps}(x, c) = \frac{d_1 + d_2}{2} \times d_e(x, c)   (1)

where d_e(x, c) is the Euclidean distance between the point x and c.

Some important observations about the proposed point symmetry distance d_ps(x, c) are as follows:

(1) Instead of computing the Euclidean distance between the original reflected point x* = 2 × c − x and its first nearest neighbor as in2 and,3 here the average distance between x* and its first and second unique nearest neighbors has been taken. Consequently the term (d_1 + d_2)/2 will never be equal to 0, and the effect of d_e(x, c), the Euclidean distance, will always be considered.

(2) Considering both d_1 and d_2 in the computation of d_ps makes the PS-distance more robust and noise resistant. From an intuitive point of view, if both d_1 and d_2 of x with respect to c are small, then the likelihood that x is symmetrical with respect to c increases. This is not the case when only the first nearest neighbor is considered, which could mislead the method in noisy situations.

It is evident that the symmetrical distance computation is very time consuming; computation of d_ps(x_i, c) is of complexity O(n). In order to compute the nearest neighbor distance of the reflected point of a particular data point with respect to a cluster centre efficiently, we have used Kd-tree based nearest neighbor search.



Kd-tree Based Nearest Neighbor Computation: A K-dimensional tree, or Kd-tree, is a space-partitioning data structure for organizing points in a K-dimensional space. A Kd-tree uses only splitting planes that are perpendicular to one of the coordinate axes. ANN (Approximate Nearest Neighbor) is a library written in C++,6 which supports data structures and algorithms for both exact and approximate nearest neighbor searching in arbitrarily high dimensions. The ANN library implements a number of different data structures, based on Kd-trees and box-decomposition trees, and employs a couple of different search strategies; the Kd-tree data structure has been used in this article, and ANN is used to find d_1 and d_2 in Equation 1 efficiently. The Kd-tree structure can be constructed in O(n log n) time and takes O(n) space, while the search complexity is O(log n).

2.3.2. Assignment of points

Here each point x_i, 1 ≤ i ≤ n, is assigned to cluster k iff d_ps(x_i, c_k) ≤ d_ps(x_i, c_j), j = 1, ..., K, j ≠ k, and d_ps(x_i, c_k)/d_e(x_i, c_k) ≤ θ. For d_ps(x_i, c_k)/d_e(x_i, c_k) > θ, point x_i is assigned to cluster m iff d_e(x_i, c_m) ≤ d_e(x_i, c_j), j = 1, 2, ..., K, j ≠ m. In other words, a point x is assigned to the cluster with respect to whose centre its PS-distance is minimum, provided this value is less than some threshold θ. Otherwise the assignment is done based on the minimum Euclidean distance criterion, as normally used in5 or the K-means algorithm. The value of θ is kept equal to the maximum nearest neighbor distance among all the points in the data set. It is to be noted that if a point is indeed symmetric with respect to some cluster centre, then the symmetrical distance computed in the above way will be small, and can be bounded as follows. Let d_NN^max be the maximum nearest neighbor distance in the data set. That is,

d_{NN}^{max} = \max_{i=1,\ldots,N} d_{NN}(x_i),   (2)

where d_NN(x_i) is the nearest neighbor distance of x_i. Assuming that x* lies within the data space, it may be noted that

d_1 \le d_{NN}^{max}.   (3)

Ideally, a point x is exactly symmetrical with respect to some c if d_1 = 0. However, considering the uncertainty of the location of a point as the sphere of radius d_NN^max around x, we have kept the threshold θ equal to d_NN^max. Thus the computation of θ is automatic and does not require user intervention. After the assignments are done, the cluster centres encoded in the chromosome are replaced by the mean points of the respective clusters.
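A compact sketch of the PS-distance and the assignment rule is given below, with SciPy's cKDTree standing in for the ANN library used in the paper; treat it as an illustration of Eq. (1) and the assignment scheme rather than the authors' code.

    import numpy as np
    from scipy.spatial import cKDTree

    def d_ps(x, c, tree):
        # Eq. (1): reflect x about c and average the distances to the
        # two nearest neighbours of the reflected point.
        x_star = 2.0 * c - x
        (d1, d2), _ = tree.query(x_star, k=2)
        return 0.5 * (d1 + d2) * np.linalg.norm(x - c)

    def assign(points, centres, theta):
        # PS-based assignment with a Euclidean fallback when the
        # symmetry evidence is weak (ratio above theta).
        tree = cKDTree(points)
        labels = np.empty(len(points), dtype=int)
        for i, x in enumerate(points):
            dps = np.array([d_ps(x, c, tree) for c in centres])
            de = np.linalg.norm(centres - x, axis=1)
            k = int(np.argmin(dps))
            labels[i] = (k if dps[k] / max(de[k], 1e-12) <= theta
                         else int(np.argmin(de)))
        return labels

    # theta = maximum nearest-neighbour distance in the data set (Eq. 2):
    # theta = cKDTree(points).query(points, k=2)[0][:, 1].max()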

2.3.3. Fitness calculation

The fitness of a chromosome is computed using the newly developed Sym-index. Let the K cluster centres be denoted by c_i, 1 ≤ i ≤ K, and let the number of points in cluster i be n_i, i = 1, ..., K. Then the Sym index is defined as follows:

Sym(K) = \left( \frac{1}{K} \times \frac{1}{E_K} \times D_K \right),   (4)

where K is the number of clusters. Here,

E_K = \sum_{i=1}^{K} E_i   (5)

such that

E_i = \sum_{j=1}^{n_i} d_{ps}^{*}(x_j, c_i)   (6)

and

D_K = \max_{i,j=1}^{K} \| c_i - c_j \|.   (7)

D_K is the maximum Euclidean distance between two cluster centres among all pairs of centres. d_ps^*(x_j, c_i) is computed by Equation 1 with one constraint: the first and second nearest neighbors of the reflected point x_j^* = 2 × c_i − x_j are now searched only among the points which are already in cluster i, i.e., the first and second nearest neighbors of the reflected point of x_j with respect to c_i should belong to the ith cluster. The objective is to maximize this index in order to obtain the actual number of clusters and to achieve proper clustering. The fitness function for chromosome j is defined as 1/Sym_j, where Sym_j is the Sym index computed for that chromosome. Note that minimization of the fitness value ensures maximization of the Sym index.
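Putting Eqs. (4)-(7) together, the fitness of a chromosome can be sketched as below; dps_star is a hypothetical helper that evaluates Eq. (1) with the neighbour search restricted to the cluster's own points, as Eq. (6) requires.

    import numpy as np

    def sym_index(points, centres, labels, dps_star):
        # Eq. (4): Sym(K) = (1/K) * (1/E_K) * D_K.
        K = len(centres)
        E_K = sum(dps_star(x, centres[labels[i]],
                           points[labels == labels[i]])
                  for i, x in enumerate(points))        # Eqs. (5)-(6)
        D_K = max(np.linalg.norm(ci - cj)
                  for ci in centres for cj in centres)  # Eq. (7)
        return (1.0 / K) * (1.0 / E_K) * D_K

    # The GA minimizes 1/Sym, i.e. it maximizes the Sym index.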

Explanation: As formulated in Equation 4, Sym is a composition of three factors: 1/K, 1/E_K and D_K. The first factor increases as K decreases; as Sym needs to be maximized for optimal clustering, it will prefer to decrease the value of K.



The second factor is the within-cluster total symmetrical distance. For clusters which have a good symmetrical structure, the E_K value is small. This, in turn, indicates that the formation of more clusters, which are symmetrical in shape, would be encouraged. Finally the third factor, D_K, measuring the maximum separation between a pair of clusters, increases with the value of K. Note that the value of D_K is bounded above by the maximum separation between a pair of points in the data set. As these three factors are complementary in nature, they are expected to compete with and balance each other critically for determining the proper partitioning.

2.4. Genetic Operations

The following genetic operations are performed on the population of strings for a number of generations.

Selection: Conventional proportional selection is applied on the population of strings. Here, a string receives a number of copies proportional to its fitness in the population.

Crossover: During crossover each cluster centre is considered to be an indivisible gene. Single point crossover, applied stochastically with probability μ_c, is explained below with an example.

Example: Suppose crossover occurs between the following two strings:
# (20.4, 13.2) # # (15.8, 2.9) | # (10.0, 5.0) (22.7, 17.7) # #
and
(13.2, 15.6) # # # (5.3, 13.7) | # (10.5, 16.2) (7.9, 15.3) # (18.3, 14.5).
Let the crossover position be 5, as shown above. Then the offspring are
# (20.4, 13.2) # # (15.8, 2.9) | # (10.5, 16.2) (7.9, 15.3) # (18.3, 14.5)
and
(13.2, 15.6) # # # (5.3, 13.7) | # (10.0, 5.0) (22.7, 17.7) # #.

Mutation: Mutation is applied on each chromosome with probability μ_m and is of three types. (1) Each valid position (i.e., one which is not '#') in a chromosome is mutated with probability μ_m in the following way. A number δ in the range [0, 1] is generated with uniform distribution. If the value at that position is v, then after mutation it becomes v × (1 ± 2δ) if v ≠ 0, and ±2δ if v = 0. The '+' or '−' sign occurs with equal probability. (2) One randomly chosen valid position is removed and replaced by '#'. (3) One randomly chosen invalid position is replaced by a randomly chosen point from the data set. Any one of the above three types of mutation is applied randomly on a particular chromosome if it is selected for mutation.
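A sketch of this mutation operator is given below (Python; the chromosome encoding as a list of centre tuples and '#' symbols follows the text, while the function name and the uniform choice among the three types are our framing — chromosome-level selection with probability μ_m is assumed to happen outside this function):

```python
import random

def mutate(chromosome, mu_m, data):
    # chromosome: list of centres (tuples) and '#' ("don't care") symbols;
    # data: list of data points (tuples). One of the three types is chosen.
    chrom = list(chromosome)
    kind = random.choice((1, 2, 3))
    if kind == 1:                            # type 1: perturb valid positions
        for i, gene in enumerate(chrom):
            if gene != '#' and random.random() < mu_m:
                delta = random.random()      # delta uniform in [0, 1]
                sign = random.choice((1, -1))
                chrom[i] = tuple(v * (1 + sign * 2 * delta) if v != 0
                                 else sign * 2 * delta for v in gene)
    elif kind == 2:                          # type 2: drop one valid centre
        valid = [i for i, g in enumerate(chrom) if g != '#']
        if valid:
            chrom[random.choice(valid)] = '#'
    else:                                    # type 3: revive one '#' position
        empty = [i for i, g in enumerate(chrom) if g == '#']
        if empty:
            chrom[random.choice(empty)] = random.choice(data)
    return chrom
```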

2.5. Termination Criterion

In this article the processes of fitness computation, selection, crossover, and mutation are executed for a maximum number of generations. The best string, having the lowest fitness (i.e., the largest Sym index value) seen up to the last generation, provides the solution to the clustering problem. We have implemented elitism at each generation by preserving the best string seen up to that generation in a location outside the population. Thus, on termination, this location contains the centres of the final clusters.

3. Implementation Results

The experimental results showing the effectiveness of the VGAPS algorithm are provided for five artificial and one real-life data sets. The description of the data sets is given in Table 1. Data_6_2 and Data_4_3 are used in Ref. 7, while the other data sets can be obtained on request from the authors. The Cancer data set is obtained from www.ics.uci.edu/~mlearn/MLRepository.html. Each pattern of the Cancer data set has nine features. There are two categories in the data: malignant and benign. The two classes are known to be linearly separable. There are a total of 683 data points in the data set.

VGAPS is implemented with the following parameters (determined after some experimentation): μ_c = 0.8, μ_m = 0.02. The population size P is taken to be 100. K_min and K_max are set to 2 and √n respectively, where n is the total number of data points in the particular data set. VGAPS is executed for a total of 30 generations. Note that it is shown in Ref. 8 that if exhaustive enumeration is used to solve a clustering problem with n points and K clusters, then one requires to evaluate

(1/K!) Σ_{j=1}^{K} (−1)^{K−j} C(K, j) j^n

partitions. For a data set of size 50 with 2 clusters this value is 2^49 − 1 (i.e., of the order of 10^15). If the number of clusters is not specified a priori, then the search space becomes even larger and the utility of GAs is all the more evident.
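To get a feel for the size of this search space, the quoted formula can be evaluated directly; the following sketch (function name ours) reproduces the 2^49 − 1 figure for n = 50 and K = 2:

```python
from math import comb, factorial

def num_partitions(n, K):
    """Number of ways to partition n points into K non-empty clusters
    (Stirling number of the second kind), per the formula quoted above."""
    return sum((-1) ** (K - j) * comb(K, j) * j ** n
               for j in range(1, K + 1)) // factorial(K)

print(num_partitions(50, 2))               # 562949953421311
print(num_partitions(50, 2) == 2 ** 49 - 1)  # True
```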


For all the data sets, as is evident from Table 1, VGAPS is able to find the appropriate number of clusters and the proper partitioning. Figures 1, 2, 3, 4 and 5 show the final clustered results obtained after application of VGAPS on Sym_5_2, Sym_3_2, Ring_3_2, Data_6_2 and Data_4_3. For the Cancer data set it is not possible to show the clustered result visually. The obtained cluster centres are (3.013453, 1.266816, 1.378924, 1.304933, 2.056054, 1.293722, 2.080717, 1.215247, 1.105381) and (7.130802, 6.696203, 6.670886, 5.700422, 5.451477, 7.780591, 6.012658, 5.983122, 2.540084) respectively, while the actual cluster centres are (2.9640, 1.3063, 1.4144, 1.3468, 2.1081, 1.3468, 2.0833, 1.2613, 1.0653) and (7.1883, 6.5774, 6.5607, 5.5858, 5.3264, 7.6276, 5.9749, 5.8577, 2.6025) respectively.

Table 1 also shows the performance of GCUK-clustering5 optimizing the Davies-Bouldin index (DB-index) for all the data sets. As is evident, GCUK-clustering is able to detect the proper number of clusters as well as the proper clustering for Data_6_2 and Data_4_3, but it fails for Sym_5_2, Sym_3_2 and Ring_3_2. Figures 6, 7 and 8 show the clustering results obtained by GCUK-clustering on Sym_5_2, Sym_3_2 and Ring_3_2 respectively; GCUK-clustering incorrectly obtained 6, 2 and 5 clusters for these three data sets. Clustering results on Data_6_2 and Data_4_3 obtained by GCUK-clustering are the same as those of VGAPS and are therefore omitted.

To compare the performance of VGAPS with that of GCUK-clustering5 on the real-life data set, the Minkowski Score (MS)9 is calculated after application of both algorithms. MS is a measure of the quality of a solution given the true clustering; the optimum score is 0, with lower scores being "better". For the Cancer data set, the MS is 0.3233 for VGAPS and 0.37 for GCUK-clustering. From the above results it is evident that VGAPS is not only able to find the proper cluster number, but it also provides significantly better clustering (both visually, as in Figures 1-5, and with respect to the MS scores). Moreover, we have also conducted the statistical test ANOVA and found that the difference in the mean MS values over ten runs obtained by VGAPS and GCUK-clustering is statistically significant: for Cancer, the difference in mean MS obtained by the two algorithms over ten runs is -4.67E-02, which is statistically significant (significance value 0.00).

4. Conclusion

In this paper a genetic algorithm based clustering technique, VGAPS clustering, is proposed which assigns the data points to different clusters based on the point symmetry based distance and can automatically evolve the appropriate clustering of a data set. A newly developed symmetry based cluster validity index named Sym index is utilized for computing the fitness of the chromosomes. The proposed algorithm has the capability of detecting both convex and non-convex clusters that possess the symmetry property.

The effectiveness of the clustering technique is demonstrated on several artificial and real-life data sets of varying complexity. The experimental results show that VGAPS is able to detect the proper number of clusters as well as the proper clustering from data sets having any type of clusters, irrespective of their geometrical shape and overlapping nature, as long as the clusters possess the characteristic of symmetry.

Table 1. Results obtained with the different data sets using VGAPS and GCUK. Here, #pts, #dim, #AC and #OC denote respectively the number of points in the data set, the number of dimensions, the actual number of clusters and the obtained number of clusters.

Name       #pts   #dim   #AC   #OC (VGAPS)   #OC (GCUK)
Sym_5_2     850     2     5        5             6
Sym_3_2     600     2     3        3             2
Ring_3_2    350     2     3        3             5
Data_6_2    300     2     6        6             6
Data_4_3    400     3     4        4             4
Cancer      683     9     2        2             2

Fig. 1. Clustered Sym_5_2 using VGAPS where 5 clusters are detected


Fig. 2. Clustered Sym_3_2 using VGAPS where 3 clusters are detected

Fig. 6. Clustered Sym_5_2 using GCUK-clustering where 6 clusters are detected

Fig. 3. Clustered Ring_3_2 using VGAPS where 3 clusters are detected

Fig. 7. Clustered Sym_3_2 using GCUK-clustering where 2 clusters are detected


Fig. 4. Clustered Data_6_2 using VGAPS where 6 clusters are detected

Fig. 8. Clustered Ring_3_2 using GCUK-clustering where 5 clusters are detected

Fig. 5. Clustered Data_4_3 using VGAPS where 4 clusters are detected

References

1. B. S. Everitt, S. Landau and M. Leese, Cluster Analysis (Arnold, London, 2001).

2. M.-C. Su and C.-H. Chou, IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 674 (2001).

3. C. H. Chou, M. C. Su and E. Lai, Symmetry as a new measure for cluster validity, in 2nd WSEAS Int. Conf. on Scientific Computation and Soft Computing, 2002, pp. 209-213.

4. S. Bandyopadhyay and S. Saha, Pattern Recognition (revised).

5. S. Bandyopadhyay and U. Maulik, Pattern Recognition 35, 1197 (2002).

6. D. M. Mount and S. Arya, ANN: A library for approximate nearest neighbor searching (2005), http://www.cs.umd.edu/~mount/ANN.

7. U. Maulik and S. Bandyopadhyay, Pattern Recognition 33, 1455 (2000).

8. M. de Berg, M. V. Kreveld, M. Overmars and O. Schwarzkopf, Cluster Analysis for Application (Academic Press, 1973).

9. A. Ben-Hur and I. Guyon, Detecting stable clusters using principal component analysis, in Methods in Molecular Biology (Humana Press, 2003).


A Non-Hierarchical Clustering Scheme for Visualization of High Dimensional Data

G. Chakraborty and B. Chakraborty

Faculty of Software and Information Science Iwate Prefectural University

Japan 020-0193 E-mail: {goutam, basabi}@soft.iwate-pu.ac.jp

N. Ogata

Graduate School of Software and Information Science Iwate Prefectural University

Japan 020-0193

Clustering algorithms with data visualization capability are needed for discovering structure in multidimensional data. Self Organizing Maps (SOM) are widely used for visualization of multidimensional data. Though SOMs are simple to implement, they need heavy computation as the dimensionality increases. In this work a simple non-hierarchical clustering scheme is proposed for clustering and visualization of high dimensional data in two dimensions. Simple simulation experiments show that the algorithm is effective in clustering and visualization compared to SOM, while taking much less time than SOM.

Keywords: Non-hierarchical clustering, Data visualization, Self Organizing Map, Multidimensional Scaling

1. Introduction

Clustering algorithms1-3 are needed for data mining tasks in the process of understanding and discovering the natural structure and grouping in a data set. Clustering also has a wide range of applications in other areas like data compression, information retrieval and pattern recognition/classification. Data visualization techniques provide further assistance in this process through visual representation of the data. In the case of high dimensional data, understanding the structure is difficult, as humans cannot visualize more than three dimensions. Moreover, most clustering algorithms do not work efficiently for high dimensional data due to the existence of noisy and irrelevant attributes. High dimensional data sets are often sparse in some dimensions and show clustering tendency only in some subspaces.

Subspace clustering4,5 is an extension of traditional clustering that finds clusters in proper subspaces of high dimensional data. Multidimensional Scaling (MDS) is another approach6 that maps high dimensional data into a lower dimensional space for exploring similarities in the data visually. Much of the work on cluster visualization has been done for hierarchical clustering; for non-hierarchical clustering, Self Organizing Maps (SOM) are popular. SOMs, invented by Kohonen,7 reduce the data dimensions by producing a map in 1 or 2 dimensions and display similarities in the data through the use of self organizing neural networks. SOMs are simple to construct and easy to understand, but their major problem is that they are computationally expensive. As the data dimension increases, dimension reducing visualization techniques become more important, but unfortunately the computation time of SOM also increases.

In this work a novel non-hierarchical clustering scheme is proposed with lower computational load and a data visualization effect comparable to SOM. Our technique is simple to implement and applicable to data of any dimension. The proposed algorithm is presented in the next section, followed by simulation experiments and results in the following section. The final section contains conclusions and directions for improvement.

2. Proposed Clustering Scheme

The main idea of the proposed scheme is as follows. Let us consider a point in two dimensions corresponding to each high dimensional point in the data set. Any two points in the two dimensional space are then moved closer together or farther apart (Fig. 1) depending on the similarity or dissimilarity of the corresponding points in the original high dimensional space. This process is initiated for two random points at a time and iterated a large number of times. Eventually the clusters formed in two dimensions produce an image of the clusters present in the data in the original high dimensional space (Fig. 2).


The actual algorithm for computation is as follows.


Fig. 1. Moving two dimensional points according to similarity in original space.


Fig. 2. Start and Final configuration of the two dimensional space.

(1) Let D be the data set containing n m-dimensional data points (X_1, X_2, ..., X_n) to be clustered, where X_i = (x_i1, x_i2, ..., x_im).

(2) Normalization of the data: At the first step, the data points are normalized to (Y_1, Y_2, ..., Y_n) as follows:

y_ij = (x_ij − min_j) / (max_j − min_j),

where y_ij is the normalized value of the jth component (dimension) of the ith data point, and min_j = min(x_1j, x_2j, ..., x_nj) and max_j = max(x_1j, x_2j, ..., x_nj) represent the minimum and maximum values of the jth attribute (dimension) of the data respectively.

(3) Two dimensional data generation: n random two dimensional data points are generated, one corresponding to each data point in the original m dimensions (left portion of Fig. 2).

(4) Similarity calculation: Any two data points Y_k and Y_l in m dimensions are selected and their similarity is calculated via the Manhattan distance

M_kl = |Y_k − Y_l| = Σ_{j=1}^{m} |y_kj − y_lj|.

As the value of the Manhattan distance depends on the number of dimensions, it is scaled to lie between 0 and 0.5 by the following equation:

R_kl = M_kl / (m + M_kl),

where R_kl represents the degree of relatedness of the data points Y_k and Y_l.

(5) Moving the data points in two dimensions: According to the value of R_kl, the data points in two dimensions corresponding to Y_k and Y_l are moved (see the sketch after this list). The actual movement value is calculated as

Movement Value = a × (1/R_kl) + b,

where the parameters a and b are calculated from the following equations:

α_1 = a × (1/0.01) + b    (2)

α_2 = a × (1/0.5) + b    (3)

Here α_1 and α_2 represent the movement of points in two dimensions for high similarity and for low similarity of the original points respectively. These two parameters are to be set by the user in the range 0.01 to 0.5. The movement of the two dimensional points according to the degree of relatedness of the original points is shown in Fig. 1 and Fig. 3.

(6) The above procedure (steps 4 and 5) is repeated a large number of times until the data are clustered in two dimensions, as shown in Fig. 2.
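The whole scheme can be summarized in the following sketch (Python; names are ours). The normalization, Manhattan distance and relatedness follow steps 2-5 above; the direction convention — attracting a 2-D pair when the relatedness is below 0.25 and repelling it otherwise — and the default values of α_1 and α_2 are our assumptions, since the text fixes only the end points of the movement rule:

```python
import numpy as np

def visual_cluster(X, iterations=100_000, alpha1=0.3, alpha2=0.01, seed=0):
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # step 2: min-max normalisation per attribute (assumes no constant column)
    Y = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    # step 3: random companion points in two dimensions
    P = rng.random((n, 2))
    # solve Movement = a*(1/R) + b through the end points of Eqs. (2)-(3)
    a = (alpha1 - alpha2) / (1 / 0.01 - 1 / 0.5)
    b = alpha1 - a / 0.01
    for _ in range(iterations):
        k, l = rng.choice(n, size=2, replace=False)
        M = np.abs(Y[k] - Y[l]).sum()        # step 4: Manhattan distance
        R = M / (m + M)                      # relatedness, scaled into [0, 0.5]
        move = a / max(R, 0.01) + b          # step 5: movement value
        sign = 1.0 if R < 0.25 else -1.0     # attract similar pairs, repel others
        shift = sign * move * (P[l] - P[k])
        P[k] += shift                        # step 5: move the 2-D pair
        P[l] -= shift
    return P                                 # step 6: 2-D image of the clusters
```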

3. Simulation Experiments and Results

Simple simulation experiments have been carried out to check the efficiency of the proposed clustering scheme compared to SOM. Two sets of data are used in the experiments.


Fig. 3. Movement value according to the degree of relatedness.

3.1. Vowel data set

This data set8 contains three dimensional vowel data (three formant frequencies) of 10 vowels uttered several times by different speakers, a total of 300 data points. The three dimensional view of the data set is shown in Fig. 4. Both the proposed algorithm and SOM were simulated with this data. For SOM, learning was run for 10,000 epochs; for the proposed algorithm, 100,000 iterations were performed. Though the number of iterations is larger for the proposed algorithm than for SOM, the total time taken is less, as one iteration of the proposed algorithm is much cheaper than one iteration of SOM. The total time taken by SOM is around 2700 sec, while the proposed algorithm takes nearly 100 sec.

Fig. 4. Vowel data set.

Fig. 5 represents the result of SOM on the Vowel data and Fig. 6 represents the result of the proposed clustering scheme. It is seen that in two dimensions there are ten discrete clusters corresponding to the ten classes of vowels. The results are comparable to the output of SOM, which also clearly shows ten clusters, but the time taken for clustering by the proposed scheme is far less than that of SOM.

Fig. 5. Result of Vowel data by SOM.


Fig. 6. Result of Vowel data for the Proposed Algorithm

3.2. Fisher Iris Data

The proposed algorithm has also been simulated with the Fisher Iris data.9 This is a four dimensional data set containing three classes of Iris plant (Iris Setosa, Iris Versicolour, Iris Virginica), each with 50 sample points, a total of 150 data points. As the data set is four dimensional, it is impossible to visualize directly. Simulation experiments for clustering by SOM and by the proposed scheme have been done for 10,000 epochs and 100,000 iterations respectively, as before. Fig. 7 and Fig. 8 represent the results of SOM and of the proposed scheme respectively. In this case also the clustering result of the proposed algorithm is similar to that obtained by SOM, while the time taken by the proposed clustering scheme is less than that taken by SOM.


Fig. 7. Result of the Iris data by SOM.


Fig. 8. Result of Iris data by Proposed Algorithm.

4. Conclusion

In this work a non-hierarchical clustering scheme has been proposed for data visualization of multidimensional data in two dimensions. The clusters generated in two dimensions retain the properties of the data, i.e., the similarity/dissimilarity between the data points in the original dimension. The algorithm is simple and less time consuming than the more widely used Self Organizing Map for non-hierarchical clustering, and the clustering results obtained by SOM and by the proposed algorithm are almost similar.

In the present work we have done simulations with 3 and 4 dimensional data sets, but the algorithm is applicable to data sets of larger dimension. The computational load of the proposed algorithm grows slowly with dimension compared to that of SOM, partly because the algorithm uses the Manhattan distance for similarity measurement. With the Euclidean distance the computational load increases but the clustering result does not differ much, so we preferred the Manhattan distance as the similarity measure. In these simple experiments, in both cases, the resulting clusters show the data structure as clearly as SOM but take much less time to compute.

Further experiments with data of higher dimension have to be done to justify the efficiency of the algorithm. Moreover, we have only experimented with data sets that have clear clusters; we also need to experiment with data sets with overlapping clusters to see the behaviour of the proposed algorithm. In our simulations we decided at the start to iterate the process 100,000 times; a proper stopping criterion depending on some cluster validity measure needs to be set. We are currently working on this problem and hope to report results in the near future.

References

1. A. K. Jain and R. C. Dubes, Algorithms for Clustering Data (Prentice-Hall, Englewood Cliffs, NJ, 1988).

2. L. P. Wang and X. J. Fu, Data Mining with Computational Intelligence (Springer, Berlin, 2005).

3. V. S. Tseng and C. P. Kao, 'Efficiently Mining Gene Expression Data via a Novel Parameterless Clustering Method', IEEE/ACM Transactions on Computational Biology and Bioinformatics 2, 355-365 (2005).

4. L. Parsons, E. Haque and H. Liu, ACM SIGKDD Explorations Newsletter 6(1), 90 (2004).

5. X. J. Fu and L. P. Wang, 'Data dimensionality reduction with application to simplifying RBF network structure and improving classification performance', IEEE Trans. System, Man, Cybern., Part B 33, 399-409 (2003).


6. J. Kruskal and M. Wish, Multidimensional Scaling (Sage Publications, London, 1978).

7. T. Kohonen, Self-Organizing Maps (Springer-Verlag, New York, 1997).

8. UCI Machine Learning Repository (http://www.ics.uci.edu/~mlearn/databases/).

9. J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms (Plenum Press, NY, 1981).

10. J. Lee and D. Lee, 'An improved cluster labeling method for support vector clustering', IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 461-464 (2005).


An Attribute Partitioning Approach to Correlation Connected Clusters

Vijaya Kumar Kadappa†,* and Atul Negi*

† Department of Computer Applications Vasavi College of Engineering, Hyderabad 500031, India.

Email: [email protected], [email protected]

* Department of Computer and Information Sciences University of Hyderabad, Hyderabad 500046, India. Email: [email protected], [email protected]

Correlation Connected Clusters are objects that are grouped based upon correlations in local subsets of data. The method 'Computing Clusters of Correlation Connected objects' (4C)1 uses DBSCAN and Principal Component Analysis (PCA) to find such objects. In this paper, we present a novel approach that attempts to find correlation connected clusters using an attribute partitioning approach to PCA. We prove that our novel approach is computationally superior to the 4C method.

Keywords: Clustering, DBSCAN, PCA, SubXPCA, correlation connected clusters, attribute partitioning

1. Introduction

Cluster analysis is widely used for data exploration in data mining. A recently proposed clustering technique, 'Computing Clusters of Correlation Connected objects' (the 4C technique), is based on DBSCAN2 and Principal Component Analysis (PCA). The 4C method was shown to be useful on various data sets, e.g., from molecular biology and time sequences. 4C finds correlations which are local to subsets of the data rather than global, since the dependency between features can be different for different subgroups of a data set. The 4C method was shown to be superior to DBSCAN, CLIQUE, etc.1 However, 4C is not suitable for high dimensional data owing to the inability of DBSCAN to handle such data. PCA is a crucial aspect of the 4C method: it is used to find correlations in the neighbourhood of a core object and plays a vital role in finding correlated clusters. A clustering approach based on subsets of attributes was previously presented by Friedman and Meulman.3 Following this line of thinking, SubXPCA4 was proposed to improve PCA. SubXPCA finds local subsets of attributes and combines them globally as PCA does. More importantly, SubXPCA4 was found to be computationally superior to PCA for high dimensional data. In this paper we attempt to improve the 4C method based upon our insight into the advantages of the attribute partitioning approach to PCA (i.e., SubXPCA) for high dimensional data, and we prove the computational superiority of our approach over 4C.

The organization of the paper is as follows. In section 2, we review the salient aspects of 4C,1 and a review of SubXPCA is presented in section 3. We propose 'an Attribute Partitioning approach to Correlation Connected Clusters' (AP-4C) and prove the time efficiency of AP-4C over 4C in section 4. We conclude with some comments in section 5.

2. Computing Clusters of Correlation Connected objects (4C)

We review the 4C method briefly here; a detailed discussion may be found in Böhm et al.'s work.1 Consider X = {X_1, ..., X_N}, the set of N patterns, each of dimensionality d. 4C algorithm: We use the definitions of core object, correlation dimension, direct correlation-reachability, N_ε^M(X_i), etc., as given by Böhm et al.1 and restate them here. Correlation dimension: Let S ⊆ X, λ < d, let {λ_1, ..., λ_d} be the eigenvalues of S in descending order, and let δ ∈ ℝ (δ ≈ 0). S forms a λ-dimensional linear correlation set w.r.t. δ if at least d − λ eigenvalues of S are close to zero. Let S ⊆ X be a linear correlation set w.r.t. δ ∈ ℝ. The number of eigenvalues with λ_i > δ is called the correlation dimension of S. Direct correlation-reachability (DirReach(P_2, P_1)): Let ε, δ ∈ ℝ and μ, λ ∈ ℕ. A point P_1 ∈ D is direct correlation-reachable from a point P_2 ∈ D w.r.t. ε, μ, δ and λ if P_2 is a correlation core object, the correlation dimension of the ε-neighbourhood of P_1, N_ε(P_1), is at most λ, and P_1 ∈ N_ε^{C_{P_2}}(P_2).


Here C_{P_2} is the correlation similarity matrix of P_2, and N_ε^{C_P}(P) = {X_i ∈ D | max{dist_P(P, X_i), dist_{X_i}(X_i, P)} ≤ ε}, where dist_P(P, X_i) = √((P − X_i) C_P (P − X_i)^T).

Input: X, pattern (object) set; ε, neighbourhood radius; μ, minimum number of points in an ε-neighbourhood; λ, upper bound for the correlation dimension; δ, threshold to select the correlation dimension.
Output: Every object is assigned a cluster-id or marked as noise.
Assumption: Each pattern (object) in X is initially marked as unclassified.
Method: For each unclassified pattern (object) X_i ∈ X, repeat steps 1-3.

1. Test whether X_i is a correlation core object as follows:
   1.1. Compute the ε-neighbourhood of X_i, N_ε(X_i).
   1.2. If the number of elements in N_ε(X_i) ≥ μ, then
      1.2.1. Compute the covariance matrix M_{X_i} of the patterns in N_ε(X_i).
      1.2.2. If the correlation dimension of N_ε(X_i) ≤ λ, then
         1.2.2.1. Compute the correlation similarity matrix M̂_{X_i} and N_ε^{M̂_{X_i}}(X_i).
         1.2.2.2. Test if the number of patterns in N_ε^{M̂_{X_i}}(X_i) ≥ μ.
2. If X_i is a correlation core object, expand a new cluster as follows:
   2.1. Generate a new cluster-id.
   2.2. Insert all X_j ∈ N_ε^{M̂_{X_i}}(X_i) into a queue Q.
   2.3. While Q is not empty, repeat steps 2.3.1-2.3.4.
      2.3.1. E = first object in Q.
      2.3.2. Compute R = {X_j ∈ X | DirReach(E, X_j)}, where DirReach(E, X_j) indicates that X_j is direct correlation-reachable from the correlation core object E.
      2.3.3. For each X_j ∈ R, repeat steps 2.3.3.1-2.3.3.2.
         2.3.3.1. If X_j is unclassified or noise, assign the current cluster-id to X_j.
         2.3.3.2. If X_j is unclassified, insert X_j into the queue Q.
      2.3.4. Remove E from Q.
3. If X_i is not a correlation core object, mark X_i as noise.
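The core-object test of step 1 hinges on the eigenvalues of the neighbourhood covariance matrix. A minimal sketch of the correlation-dimension computation (Python; names are ours):

```python
import numpy as np

def correlation_dimension(neighbourhood, delta):
    # neighbourhood: |N_eps(X_i)| x d array of the points in N_eps(X_i)
    C = np.cov(neighbourhood, rowvar=False)   # covariance matrix M_Xi
    eigvals = np.linalg.eigvalsh(C)           # eigenvalues, ascending
    return int((eigvals > delta).sum())       # number of lambda_i > delta

# X_i passes the test of step 1.2.2 if
# correlation_dimension(neigh, delta) <= lambda_max
```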

The 4C method was found to be very useful for finding correlations in local subsets of data, e.g., in microbiology and e-commerce. 4C makes use of PCA to find correlations in the data set, which is not suitable for high dimensional data; hence 4C consumes a large amount of time for such data sets. To counter this problem, we make use of SubXPCA, a computationally more efficient variation of PCA, which is described in the following section.

3. Cross-sub-pattern Correlation based Principal Component Analysis (SubXPCA)

We review briefly SubXPCA here and a detailed discussion may be found in the work of Negi and Kadappa.4

SubXPCA Algorithm: In the following steps we use the indices 1 ≤ i ≤ N, 1 ≤ j ≤ k, 1 ≤ p ≤ d_i and 1 ≤ s ≤ kr.

1. Apply classical PCA to each sub-pattern set as follows:5
   1.1. Partition each datum X_i suitably into k (≥ 2) equally-sized sub-patterns, each of size d_i (= d/k); SP_j is the set of the jth sub-patterns of the X_i, for all i.
   1.2. For every SP_j, repeat steps 1.2.1 to 1.2.4:
      1.2.1. Compute the covariance matrix (C_j)_{d_i×d_i}.
      1.2.2. Compute its eigenvalues (λ_p^j) and corresponding eigenvectors (e_p^j).
      1.2.3. Select the r (< d_i) eigenvectors corresponding to the r largest eigenvalues obtained in step 1.2.2. Let E_j be the set of r eigenvectors (column vectors) selected in this step.
      1.2.4. Extract r local features (PCs) from SP_j by projecting SP_j onto E_j. Let Y_j be the reduced data of this step, given by (Y_j)_{N×r} = (SP_j)_{N×d_i} (E_j)_{d_i×r}.
   1.3. Collate the Y_j for all j. Let Z denote the combined data, whose ith row (corresponding to X_i) is Z_i = (y_1(i,1), y_1(i,2), ..., y_1(i,r), ..., y_k(i,1), ..., y_k(i,r)), where (y_j(i,1), y_j(i,2), ..., y_j(i,r)) is the ith row of Y_j.
2. Apply classical PCA on the Z obtained in step 1.3:
   2.1. Compute the final covariance matrix (C^F)_{(kr)×(kr)} for the data Z.
   2.2. Compute its eigenvalues (λ^F) and corresponding eigenvectors (e^F).
   2.3. Select the r^F (< kr) eigenvectors corresponding to the r^F largest eigenvalues obtained in step 2.2. Let E^F be the set of r^F eigenvectors selected in this step.
   2.4. Extract r^F global features (PCs) by projecting Z onto E^F. Let Z^F be the data obtained after projection, given by (Z^F)_{N×r^F} = (Z)_{N×kr} (E^F)_{kr×r^F}.
   2.5. Z^F is the final reduced pattern data (corresponding to the original data X), which can be used for subsequent classification, recognition, etc.
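A compact sketch of the above algorithm (Python; names are ours, and the partition is taken as a plain left-to-right split of the attributes):

```python
import numpy as np

def subxpca(X, k, r, r_final):
    # X: N x d data matrix with d divisible by k
    N, d = X.shape
    di = d // k
    local_features = []
    for j in range(k):                       # step 1: PCA per sub-pattern set
        SP = X[:, j * di:(j + 1) * di]
        w, V = np.linalg.eigh(np.cov(SP, rowvar=False))
        E = V[:, np.argsort(w)[::-1][:r]]    # r leading eigenvectors E_j
        local_features.append(SP @ E)        # Y_j = SP_j E_j  (step 1.2.4)
    Z = np.hstack(local_features)            # step 1.3: collated data, N x kr
    w, V = np.linalg.eigh(np.cov(Z, rowvar=False))  # step 2: global PCA on Z
    EF = V[:, np.argsort(w)[::-1][:r_final]]
    return Z @ EF                            # Z^F, the final reduced data
```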

SubXPCA was proved to be computationally more efficient than PCA, and successful classification results on 5 databases from the UCI Machine Learning repository were also presented by Negi and Kadappa.4

4. An Attribute Partitioning approach to Correlation Connected Clusters (AP-4C)

In this section we present our clustering method, AP-4C, which is based on DBSCAN and SubXPCA.

4.1. AP-4C Algorithm:

To find the correlation dimension of the ε-neighbourhood of a core object (see step 1 of Sec. 2), the 4C method uses classical PCA, which is computationally expensive. To ease this problem, the AP-4C method uses SubXPCA to find the eigenvalues and eigenvectors from which the correlation dimension is computed. The algorithm is as follows.

Input: X, pattern (object) set; ε, neighbourhood radius; μ, minimum number of points in an ε-neighbourhood; λ, upper bound for the correlation dimension; δ, threshold to select the correlation dimension.

Output: Every object is assigned a cluster-id or marked as noise

Assumption: Each pattern (object) in X is marked as unclassified initially

Method: For each unclassified pattern (object) X_i ∈ X, repeat steps 1-3.

1. Test whether X_i is a correlation core object as follows:
   1.1. Compute the ε-neighbourhood of X_i, N_ε(X_i).
   1.2. If the number of elements in N_ε(X_i) ≥ μ, then
      1.2.1. Find the correlation dimension of N_ε(X_i) using SubXPCA4 as follows:
         1.2.1.1. Partition each pattern in N_ε(X_i) into k sub-patterns, find the sub-covariance matrices, project the sub-patterns on them and find the final covariance matrix (C_{X_i}) of the same neighbourhood, as given in steps 1.2.1 and 2.1 of Sec. 3.
         1.2.1.2. Compute the eigenvalues and eigenvectors of the C_{X_i} obtained in step 1.2.1.1.
         1.2.1.3. Count the number of eigenvalues greater than δ. If the count ≤ λ, then
            1.2.1.3.1. Compute the correlation similarity matrix Ĉ_{X_i} and N_ε^{Ĉ_{X_i}}(X_i).
            1.2.1.3.2. Test if the number of patterns in N_ε^{Ĉ_{X_i}}(X_i) ≥ μ.

Steps 2 and 3 are the same as in the 4C method (see Sec. 2), except that M is replaced with C; hence, for brevity, we do not reproduce them here.

To give a better conceptual comprehension of our method, we summarize it in Fig. 1.

4.2. Time complexities of 4C and AP-4C

In PCA techniques where the covariance matrix is computed explicitly, a large amount of time is spent in computing the covariance matrix and an insignificant amount of time on other tasks such as finding eigenvalues. Hence, we focus on the time complexity of the covariance matrices of PCA and SubXPCA as used in the 4C and AP-4C methods respectively. Consider X = {X_1, ..., X_N}, the set of N patterns of size d. Time complexity to calculate the covariance matrix (C) in PCA, T_C: it involves the computation of d(d + 1)/2 parameters, a total of Nd(d + 1)/2 operations, so T_C = O(Nd²). The time complexity to calculate the k sub-covariance matrices (C_j) in SubXPCA, T_F1, is given as follows:

T_F1 = O(kN d_i²)    (1)

where k is the number of sub-patterns per pattern and O(N d_i²) is the time complexity of calculating one sub-pattern covariance matrix C_j. The time complexity to calculate the final covariance matrix (C^F) in SubXPCA, T_F2, is given by

T_F2 = O(N k² r²)    (2)



Fig. 1. The flow chart of the proposed AP-4C algorithm.

where r < d_i and d_i is the sub-pattern size. From Eqs. (1) and (2), the time complexity to calculate all the covariance matrices of SubXPCA, T_F, is given by

T_F = O(kN d_i²) + O(N k² r²)    (3)

The time complexity of the 4C method, T_4C, is discussed in Ref. 1 and is reproduced here:

T_4C = O(N T_C + N d³) = O(N² d² + N d³)    (4)


On the same lines, the time complexity of the AP-4C method, T_AP, is given by

T_AP = O(N T_F + N k d_i³) = O(N(kN d_i² + N k² r²) + N k d_i³)    (5)

where d_i is the sub-pattern size and k is the number of sub-patterns per pattern (see Sec. 3).

Theorem 4.1. T_F < T_C for all r < d_i √((k−1)/k), where k (2 ≤ k ≤ d/2) is the number of sub-patterns per pattern, r is the number of chosen projection vectors (eigenvectors) per SP_j, and d_i is the sub-pattern size (see step 1 in Sec. 3).

Proof. We must show that T_F < T_C. From Eq. (3), dropping the O-notation,

T_F = kN d_i² + N(kr)².

Since d_i = d/k, the first term equals (1/k) N d², and, since k²/d² = 1/d_i², the second term equals (r²/d_i²) N d². Hence

T_F = (1/k + r²/d_i²) [N d²]    (6)

Obviously T_F < N d² = T_C if (1/k + r²/d_i²) < 1, i.e., if r²/d_i² < 1 − 1/k. Hence T_F < T_C for all r < d_i √((k−1)/k), and the theorem follows. □

Lemma 4.1. lim_{r→1, k→2} T_F ≈ (1/k) T_C.

Proof of Lemma. Since r < d_i and r ≥ 1 (from step 1.2.3 in Sec. 3), to obtain T_F ≈ (1/k) T_C we have to minimize the term r²/d_i² in Eq. (6). This ratio reaches its minimum possible value as r tends to its minimum possible value and d_i tends to its maximum possible value, i.e., as r → 1 and d_i → d/2 (since r ≥ 1 and by Theorem 4.1). Since d_i = d/k, d_i → d/2 implies k → 2. Hence r²/d_i² attains its minimum possible value as r → 1 and k → 2, and the Lemma follows. □

By Lemma 4.1, T_F ≈ (1/k) T_C holds for smaller values of k and r. However, in practice, r may not be chosen as 1 (i.e., the smallest possible value), especially when k is small, since the classification rate may degrade when too few projection vectors (r) are retained. Hence some tradeoff between r and k is required to achieve a good classification rate together with time efficiency.
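A quick numeric illustration of this tradeoff, using the operation counts behind Eqs. (3) and (6) (the parameter values here are ours):

```python
def t_c(N, d):                      # plain PCA covariance cost, O(N d^2)
    return N * d * d

def t_f(N, d, k, r):                # SubXPCA cost from Eq. (3)
    di = d // k
    return k * N * di * di + N * (k * r) ** 2

# d = 64, k = 4 (so d_i = 16) and r = 4 satisfy r < d_i * sqrt((k-1)/k)
N, d, k, r = 1000, 64, 4, 4
print(t_f(N, d, k, r) / t_c(N, d))  # 0.3125 = 1/k + r^2/d_i^2, cf. Eq. (6)
```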

Lemma 4.2. T_AP < T_4C.

Proof of Lemma. From Theorem 4.1,

T_F < T_C    (7)

Consider the second term of Eq. (5), N k d_i³. Since d_i = d/k,

N k d_i³ = (1/k²) N d³ < N d³ (since k ≥ 2)    (8)

Therefore, from Eqs. (4), (5), (7) and (8) the Lemma follows. □

4.3. Why is AP-4C more efficient than 4C?

AP-4C uses SubXPCA to compute the eigenvalues which are used to find the correlation dimension of the ε-neighbourhood of a core object (see step 1.2.1 of Sec. 4.1), whereas 4C uses classical PCA for the same purpose. In PCA methods where the covariance matrix is computed explicitly, most of the time is consumed by the computation of the covariance matrix alone. In contrast to PCA (where a single large covariance matrix C is computed), SubXPCA computes k (k ≥ 2) smaller sub-pattern covariance matrices (C_j), one for each sub-pattern set SP_j, and one final covariance matrix (C^F). By Theorem 4.1, T_F < T_C for all r < d_i √((k−1)/k), where r is the number of chosen projection vectors per SP_j, d_i is the sub-pattern size and k is the number of sub-patterns per pattern. The upper bound for r (i.e., d_i √((k−1)/k)) is reasonably large, and in practice we choose only the first few salient features (i.e., r is small in general); therefore, the computation of the final covariance matrix C^F becomes trivial. Finally, SubXPCA is nearly k times faster than PCA (see Lemma 4.1). The concept of partitioning is the basic reason for the smaller time complexity of SubXPCA. Since we use SubXPCA in AP-4C for finding the correlation dimension of the ε-neighbourhood, AP-4C is faster than 4C, which uses PCA, as proved in Lemma 4.2. It was found that the classification results of SubXPCA and PCA were essentially the same.4

5. Conclusions and Future work

In this paper, we have proposed a new and efficient variant of the 4C method, AP-4C, suitable for high dimensional data. Theoretical proofs show that AP-4C is more efficient than 4C. 4C becomes a special case of AP-4C if (i) the number of sub-patterns per pattern in SubXPCA is taken as 1 and (ii) step 2 of SubXPCA is omitted.


AP-4C may be used extensively in high dimensional data mining applications.

In the near future, we shall attempt to improve our method by using ideas from 'clustering objects on subsets of attributes',3 where the relevant attribute subsets for each individual cluster can be different and can partially (or completely) overlap with those of other clusters.

References

1. C. Böhm, K. Kailing, P. Kröger and A. Zimek, Computing clusters of correlation connected objects, in Proc. of SIGMOD, ACM (Paris, France, 2004).

2. A. K. Pujari, Data Mining Techniques (Universities Press, India, 2002).

3. J. Friedman and J. Meulman, J. of the Royal Statistical Society 66, 815 (2004).

4. A. Negi and V. K. Kadappa, An experimental study of sub-pattern based principal component analysis and cross-sub-pattern correlation based principal component analysis, in Proc. Image and Vision Computing International Conference (Univ. of Otago, New Zealand, 2005).

5. S. Chen and Y. Zhu, Pattern Recognition 37, 1081 (2004).


PART E

Document Analysis


A Hybrid Scheme for Recognition of Handwritten Bangla Basic Characters Based on HMM and MLP Classifiers

U. Bhattacharya*, S. K. Parui and B. Shaw

Computer Vision and Pattern Recognition Unit Indian Statistical Institute

203, B. T. Road, Kolkata-700108, INDIA E-mail: ujjwal@isical.ac.in*

This paper presents a hybrid approach to recognition of handwritten basic characters of Bangla, the official script of Bangladesh and second most popular script of India. This is a 50 class recognition problem and the proposed recognition approach consists of two stages. In the first stage, a shape feature vector computed from two-directional-view-based strokes of an input character image is used by a hidden Markov model (HMM). This HMM defines its states in a data-driven or adaptive approach. The statistical distribution of the shapes of strokes present in the available training database is modelled as a mixture distribution and each component is a state of the present HMM. The confusion matrix of the training set provided by the HMM classifier of the first stage is analyzed and eight smaller groups of Bangla basic characters are identified within which misclassifications are significant. The second stage of the present recognition approach implements a combination of three multilayer perceptron (MLP) classifiers within each of the above groups of characters. Representations of a character image at multiple resolutions based on a wavelet transform are used as inputs to these MLPs. This two stage recognition scheme has been trained and tested on a recently developed large database of representative samples of handwritten Bangla basic characters and obtained 93.19% and 90.42% average recognition accuracies on its training and test sets respectively.

Keywords: Handwritten character recognition; Bangla character recognition; stroke feature; wavelet based feature; HMM; MLP.

1. Introduction

Although significant development has already been achieved on recognition of handwriting in the scripts of developed nations, not much work has been reported on Indic scripts. However, in the recent past, significant research progress has been achieved towards recognition of printed characters of Indian scripts including Bangla.1 Unfortunately, the methodology for printed character recognition cannot be extended directly to recognition of handwritten characters. The development of a handwritten character recognition engine for any script is always a challenging problem, mainly because of the enormous variability of handwriting styles.

Many diverse algorithms/schemes for handwritten character recognition2,3 exist and each of these has its own merits and demerits. Possibly the most important aspect of a handwriting recognition scheme is the selection of an appropriate feature set which is reasonably invariant with respect to shape variations caused by various writing styles. A large number of feature extraction methods are available in the literature.4

India is a multilingual country with 18 constitutional languages and 10 different scripts. In the literature, there exist a few studies on recognition of handwritten characters of different Indian scripts. These include Ref. 5 for Devanagari, Ref. 6 for Telugu, Ref. 7 for Tamil and Ref. 8 for Oriya, among others. Bangla is the second most popular Indian script. It is also the script of two other Indian languages, viz., Assamese and Manipuri. On the other hand, Bangla is the official language and script of Bangladesh, a neighbour of India. Several off-line recognition strategies for handwritten Bangla characters can be found in Refs. 9-11. A few other works dealing with off-line recognition of handwritten Bangla numerals include Refs. 12, 13.

Many of the existing works on handwritten Bangla character recognition are based on small databases collected in laboratory environments. On the other hand, it is now well established that a scheme for recognition of handwriting must be trained and tested on a reasonably large number of samples. In the present work, a recently developed large and representative database14 of handwritten Bangla basic characters has been considered.

In the present article, a novel hybrid scheme for recognition of Bangla basic characters is proposed. In the first stage, a shape vector representing all the directional view-based strokes of an input character is fed to our HMM classifier.


The set of posterior probabilities provided by the HMM determines the smaller group of the input character among N such groups of confusing character shapes. In the second stage, three MLP classifiers for that particular group independently recognize the character using its representations at three fine-to-coarse resolutions based on a wavelet transform. Final recognition is obtained by applying the sum rule15 to combine the output vectors of the three MLPs of the particular group.

An HMM is capable of making use of both the statistical and structural information present in handwritten shapes.16 A useful property of the present HMM is that its states are determined in a data-driven or adaptive manner. The shapes of the strokes present in the training set of our image database of handwritten Bangla characters are studied and their statistical distribution is modelled as a multivariate mixture distribution. Each component of the mixture is a state of the HMM. This model is robust in the sense that it is independent of several aspects of the input character sample, such as its thickness, size, etc. In the proposed approach, the above HMM is used to simplify the original fifty class recognition problem into several smaller class problems.

Wavelets have been studied thoroughly during the last decade17 and their applicability to various image processing problems is being explored. In the present work, the Daubechies18 wavelet transform has been used to obtain multiresolution representations of an input character image. Distinct MLPs are employed to recognize the character at each of these resolution levels, and the final recognition results are obtained by combining this ensemble of MLPs. Such a multiresolution recognition approach was studied before for the numeral recognition problem12 and was observed to be robust with respect to moderate noise, discontinuity and small rotations.

2. Stage I of the Recognition Scheme

The first stage of the proposed recognition scheme consists of a preprocessing module, extraction of directional view based strokes from the character image, computation of stroke features and an HMM-based classifier.

2.1. Preprocessing

For cleaning of possible noise in an input character image, it is first binarized by Otsu's thresholding technique19 and then smoothed using a median filter of window size 5. No size normalization is done at the image level, since it is taken care of during feature extraction. A sample image from the present database and the effect of smoothing on the strokes extracted by the subsequent module are shown in Figs. 1(a)-1(l).

"3* <T* "eft ({:\ z^ T-(«) (6) <B <A V) """GO

ZJI *3* "c^ ^ ^ s - ^ « ) (H) CO 0') <# CO

Fig. 1. (a) Image of a sample character; (b) image in (a) after binarization; (c) vertical strokes obtained from (b); (d) vertical strokes after pruning (c); (e) horizontal strokes obtained from (b); (f) horizontal strokes after pruning (e); (g) image of the sample character after smoothing; (h) image of the smoothed character sample after binarization; (i) vertical strokes obtained from (h); (j) vertical strokes after pruning (i); (k) horizontal strokes obtained from (h); (l) horizontal strokes after pruning (k).

2.2. Stroke Extraction

Horizontal and vertical strokes present in the input character shape are obtained in the form of digital curves, each of which is one pixel thick and in which all the pixels except the terminal ones have exactly two 8-neighbours. A two-directional-view based approach is used for extraction of these strokes. In this approach, a binarized image E consisting of only the vertical strokes present in the input character shape is obtained by removing all the object pixels of the binarized character image other than those whose right (east) 4-neighbour is a background pixel, that is, where the pen movement is upward or downward. In other words, the object pixels of the input sample image that are visible from the east form the vertical strokes, as shown in Fig. 2(a). Similarly, another binarized image S consisting of only the horizontal strokes is generated: the object pixels of the binarized character image whose bottom (south) 4-neighbour is a background pixel, that is, where the pen movement is side-wise, form the horizontal strokes, as shown in Fig. 2(b). Thus, here,


the connected components of images E and S are respectively the vertical and horizontal strokes. Vertical strokes shorter than 20% of the height and horizontal strokes shorter than 20% of the width of the input image are ignored during further processing. Such vertical and horizontal strokes for the character shape in Fig. 1(a) are shown in Figs. 1(j) and 1(l) respectively.


Fig. 2. Strokes in a character sample are shown, (a) Dark pixels indicate E image; (b) dark pixels indicate S image.

2.3. Computation of Feature Components

On each stroke, six points P_i, i = 0, ..., 5, are located such that the curve lengths P_{i−1}P_i, i = 1, ..., 5, are equal, with P_0 and P_5 being respectively the lowest and highest pixels for a vertical curve, or the leftmost and rightmost pixels for a horizontal curve. The feature components corresponding to each stroke of an input character are {α_1, α_2, ..., α_5, X, Y, L}, where α_i is the angle between the line P_{i−1}P_i and the positive X-axis, (X, Y) is the centroid of the stroke and L is its length. Together they represent the shape, size and position of the stroke with respect to the character image. The quantities α_i are invariant under scaling, and the remaining three quantities X, Y and L are normalized with respect to the character height.

For an input character, all the strokes are arranged from left to right to obtain the observation sequence O = (O_1, ..., O_T), where T is the number of strokes and O_i = {α_i1, α_i2, ..., α_i5, X_i, Y_i, L_i} corresponds to the ith stroke in this arrangement.
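A sketch of the stroke descriptor computation (Python; names are ours, the pixel ordering of the digital curve is assumed to come from the stroke-extraction step, and equal arc lengths are approximated by equal pixel counts):

```python
import numpy as np

def stroke_features(curve, char_height):
    # curve: ordered array of (x, y) pixels of a one-pixel-thick stroke
    idx = np.linspace(0, len(curve) - 1, 6).round().astype(int)
    P = curve[idx].astype(float)                 # six points P_0 .. P_5
    seg = P[1:] - P[:-1]
    # alpha_i: angle of segment P_{i-1}P_i with the positive X-axis
    # (assumes y grows upward; flip the sign of y for image coordinates)
    alphas = np.arctan2(seg[:, 1], seg[:, 0])
    centroid = curve.mean(axis=0) / char_height  # (X, Y), height-normalised
    length = len(curve) / char_height            # L, height-normalised
    return np.concatenate([alphas, centroid, [length]])
```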

2.4. HMM Based Classifier

The HMM designed for the present classification task is a non-homogeneous one, and the classifier used in the first stage of the proposed recognition scheme consists of 50 such HMMs, denoted by γ_j, j = 1, 2, ..., 50, one for each of the underlying 50 classes.

For an input pattern X, the probability values P(O|γ_j) are computed for all j = 1, 2, ..., 50 using the well known forward and backward algorithms.20 Finally, X is assigned to the class c ∈ {1, 2, ..., 50} such that c = arg max_{1≤j≤50} P(O|γ_j). Computational details are omitted due to space constraints.

It has been assumed above that the feature values {α_1, α_2, ..., α_5, X, Y, L} corresponding to the strokes obtained from the training samples of each class follow a mixture of K 8-dimensional Gaussian distributions. The unknown parameters of this mixture distribution for different choices of K are obtained using the well known EM algorithm.21 The optimum value of K for each class is determined using the Bayesian information criterion (BIC).22 The mean vectors of these K component distributions are called shape primitives for the corresponding character class, and they form the state space of the associated HMM. Thus, here, the states of an HMM are not determined a priori but are constructed adaptively on the basis of the training set of character samples of the respective class. The parameters of the underlying HMMs (such as the initial and transition probability matrices) are computed on the basis of the above states.
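This data-driven state construction can be sketched as follows (Python, using scikit-learn's EM-based GaussianMixture and its BIC score as stand-ins for the procedures of Refs. 21 and 22; the function name and the candidate range of K are ours):

```python
from sklearn.mixture import GaussianMixture

def shape_primitives(stroke_vectors, k_candidates=range(2, 21)):
    # stroke_vectors: array of 8-D stroke descriptors of one character class
    best = min((GaussianMixture(n_components=k, covariance_type='full',
                                random_state=0).fit(stroke_vectors)
                for k in k_candidates),
               key=lambda g: g.bic(stroke_vectors))
    return best.means_    # the K mean vectors become the HMM states
```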

3. Final Classification

An analysis of the recognition performance of the above HMM based classifier on both the training and test sets shows that a significant percentage of misclassifications occurred within several smaller groups of character classes. This pattern of misclassifications is shown in Table 1.

In the second stage, a fresh classification is performed within each of the above sub-groups, which significantly improves the overall recognition accuracy of the first stage. In this context, it is obvious that if an input sample is misclassified in the first stage into a group other than its own (as shown in Table 1), then the second stage of the present classification scheme cannot correctly classify the sample; an input sample misclassified in the first stage as another character of its own group gets a second chance of being properly classified in the second stage. Brief details of the final classification scheme are given below. A similar scheme23 was used before for recognition of hand printed Bangla numerals.


Table 1. Groups of confusing character shapes provided by the HMM classifier. The groups are numbered 1-8; the Bangla characters constituting each group are omitted here.

            No. of samples          Misclassification percentage
            in the group       Within group    Between groups       Total
Group No.   Training   Test    Training  Test  Training   Test   Training  Test
    1          600      400     22.50   24.00     3.83     4.50    26.33   28.50
    2         1200      400     12.00   12.25     3.92     4.25    15.92   16.50
    3         1200      707     20.50   21.22     5.17     5.94    25.67   27.16
    4         1500     1000     10.27   11.90     5.60     7.50    15.87   19.40
    5         1800     1200     30.17   32.83     3.67     3.92    33.84   36.75
    6         3000     1974     33.00   37.13     6.77     9.52    39.77   46.65
    7         2700     1800     15.85   17.06     4.96     6.78    20.81   23.84
    8         3000     2000     27.23   29.30     6.23     7.65    33.46   36.95
  Total      15000     9481     23.05   25.67     5.37     6.98    28.42   32.65

3.1. Multi-resolution Representation of a Character Image

The bounding box (the minimum rectangle enclosing the shape) of a character image is first normalized to the size 64 × 64. The wavelet decomposition algorithm24 is applied to this normalized image recursively twice, giving a 16 × 16 smooth approximation of the original image as the final decomposition; a 32 × 32 smooth approximation of the original image is also obtained at the intermediate stage. Images of a sample character at multiple resolutions are shown in Fig. 3. The images at these three fine-to-coarse resolutions are gray-valued, and as before Otsu's thresholding technique is used for their binarization.


Fig. 3. Image of a character sample at different resolutions: (a) the original image; (b) size normalized (64 × 64) image; (c) wavelet transformed image at resolution 32 × 32; (d) wavelet transformed image at resolution 16 × 16.
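A sketch of the two-level decomposition (Python, using PyWavelets; the order-2 Daubechies filter and the periodized boundary mode are our assumptions, since the text specifies only a Daubechies wavelet):

```python
import pywt

def multiresolution_views(img64):
    # img64: 64 x 64 normalised character image (a 2-D array)
    smooth32, _ = pywt.dwt2(img64, 'db2', mode='periodization')    # 32 x 32
    smooth16, _ = pywt.dwt2(smooth32, 'db2', mode='periodization')  # 16 x 16
    return img64, smooth32, smooth16   # the three fine-to-coarse views
```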

3.2. Classification at Multiple Resolutions

For an input character, the images at each of the above three fine-to-coarse resolutions are fed to the input layers of three different MLP networks. The optimal size of the hidden layer in each case has been determined through near-exhaustive simulation runs. The size of the output layer equals the number of classes involved in the recognition task, which varies between the groups of Table 1. These MLPs are trained using the well known backpropagation algorithm. For the same sample of any group, the values at the output layers of the three corresponding MLP classifiers are usually different, and this is why we combine the output values of these MLPs to obtain better recognition accuracy.

3.3. Combination of MLP Classifiers at Multiple Resolutions

It is now very popular in the OCR community to combine multiple classifiers for better recognition performance than that of a single classifier. Various approaches exist for combining the outputs of multiple classifiers; these have recently been studied in Ref. 25 and include the product rule, sum rule, median rule, majority voting, etc.

It has been observed that the product rule provides the worst results, worse than any of the individual classifiers, while the performance of the sum and median rules is good; the performance of the majority voting rule is also reasonably good. The recognition results shown in the next section have been obtained by combining the three MLP classifiers using the sum rule.
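A minimal sketch of the sum-rule combination (Python; names are ours):

```python
import numpy as np

def sum_rule(mlp_outputs):
    # mlp_outputs: 3 x n_classes array of the three MLP output vectors
    totals = np.asarray(mlp_outputs).sum(axis=0)   # add per-class scores
    return int(np.argmax(totals))                  # class with largest total
```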


4. Experimental Results

Most of the existing off-line recognition studies on handwritten characters of Indian scripts are based on different databases collected either in laboratory environments or from small groups of the concerned population. However, recently a large and representative database of handwritten Bangla basic characters was developed, and the simulation results of the proposed recognition approach have been obtained on the training and test samples of this database.


Fig. 4. Samples from the present database of handwritten Bangla characters; three samples in each category are shown.

4.1. Database of Handwritten Bangla Basic Characters

Samples of the present database were collected by distributing several standard application forms among different groups of the population of the state of West Bengal in India; the purpose was not disclosed. Subjects were asked to fill up these forms in Bangla by putting one character per box. Such handwritten samples were collected over a span of more than two years.

Sample characters of the present database are stored as grayscale images at 300 d.p.i. resolution. These are TIF files with 1 byte per pixel. A few sample images from the present database are shown in Fig. 4.

The present database consists of 24481 images of isolated Bangla basic characters explicitly divided into training and test sets. The training set consists of 15000 samples with 300 samples for each category, and there are 9481 samples in the test set.

4.2. Recognition Results of the First Stage

In the first stage of the present hybrid recognition scheme, based on an HMM classifier, horizontal and vertical strokes have been extracted as described in Section 2.2. The parameters of the HMM corresponding to a character class are estimated using the shape vectors computed from its set of strokes, as described in Section 2.3. For example, the minimum value of K is 7, obtained for the 42nd character of the Bangla alphabet, while the maximum value of K is 18, obtained for the 24th character of the alphabet.

The first stage of the present scheme provided only 71.58% and 67.35% average recognition accuracies respectively on the training and test sets of the present database. However, from Table 1, it is seen that 23.05% and 25.67% of the misclassifications occurred within 8 smaller groups consisting of 2 to 10 different character classes.

4.3. Recognition Results of the Second Stage

Combination of three MLP classifiers within each of the above smaller groups of Bangla characters provided 98.48% and 97.21% average recognition accuracies respectively for the training and test samples. Individual recognition accuracies at the second stage for each of the above groups are shown in Table 2.

Thus, at the end of the second stage, the overall final recognition accuracies are respectively 93.19% and 90.42% on the training and test sets of the present database of handwritten Bangla basic characters.

5. Conclusions

In the present work, we considered a hybrid approach to the recognition of handwritten Bangla basic characters, using both HMM and MLP classifiers. Shape features are used by the HMM classifier of the first stage, while pixel images at multiple resolutions are used by the MLP classifiers of the second stage. The approach of the second stage had been studied before and provides acceptable recognition accuracies on smaller problems such as the recognition of handwritten Bangla numerals.


Table 2. Recognition performance of the 2nd stage of the proposed scheme

Group   Samples correctly classified in the    Correct recognition (%) within
No.     respective group during the 1st stage  the group by the 2nd stage
        Training        Test                   Training        Test
1         577            382                   100             98.95
2        1153            383                    99.57          98.17
3        1138            665                    99.29          98.35
4        1416            925                    99.08          97.95
5        1734           1153                    99.13          98.01
6        2797           1786                    97.57          96.75
7        2566           1678                    98.17          97.26
8        2813           1847                    97.87          95.78
Total   14194           8819                    98.48          97.21

However, when it is a larger class problem, the performance of that scheme alone is not equally satisfactory. This is the reason for our choice of shape features with an HMM in the first stage and an MLP-based multi-resolution recognition approach in the latter stage.

References

1. B. B. Chaudhuri, U. Pal, Pattern Recognition, 31, 531-549 (1998).
2. R. Plamondon, S. N. Srihari, IEEE Trans. Patt. Anal. and Mach. Intell., 22, 63-84 (2000).
3. N. Arica, F. Yarman-Vural, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 31, 216-233 (2001).
4. O. D. Trier, A. K. Jain, and T. Taxt, Pattern Recognition, 29, 641-662 (1996).
5. K. R. Ramakrishnan, S. H. Srinivasan, S. Bhagavathy, Proc. of the 5th ICDAR, 414-417 (1999).
6. M. B. Sukhaswami, P. Seetharamulu, A. K. Pujari, Int. J. Neural Syst., 6, 317-357 (1995).
7. R. M. Suresh, L. Ganesan, Proc. of Sixth ICCIMA'05, 286-291 (2005).
8. S. Mohanti, IJPRAI, 12, 1007-1015 (1998).
9. U. Bhattacharya, S. K. Parui, M. Sridhar, F. Kimura, CD Proc. IICAI, 1357-1376 (2005).
10. T. K. Bhowmik, U. Bhattacharya, S. K. Parui, Proc. ICONIP, 814-819 (2004).
11. F. R. Rahman, R. Rahman, M. C. Fairhurst, Pattern Recognition, 35, 997-1006 (2002).
12. U. Bhattacharya, B. B. Chaudhuri, Proc. of ICDAR, Seoul, 322-326 (2005).
13. U. Bhattacharya, T. K. Das, A. Datta, S. K. Parui, B. B. Chaudhuri, International Journal of Pattern Recognition and Artificial Intelligence, 16, 845-864 (2002).
14. www.isical.ac.in/~ujjwal/download/database.html, "OFF-LINE HANDWRITTEN CHARACTER DATABASE".
15. J. Kittler, M. Hatef, R. P. W. Duin and J. Matas, IEEE Trans. on Patt. Anal. and Mach. Intell., 20, 226-239 (1998).
16. H. Park and S. Lee, Pattern Recognition, 29, 231-244 (1996).
17. A. Graps, IEEE Computational Science and Engineering, 2(2), (1995).
18. I. Daubechies, IEEE Trans. on Information Theory, 36, 961-1005 (1990).
19. N. Otsu, IEEE Trans. Systems, Man, and Cybernetics, 9, 377-393 (1979).
20. L. R. Rabiner, Proc. of the IEEE, 77(2), 257-285 (1989).
21. K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, San Diego, 2nd Ed., 1990.
22. J. Bernardo and A. Smith, Bayesian Theory, John Wiley & Sons, 1994.
23. U. Bhattacharya and B. B. Chaudhuri, Proc. of the 7th Int. Conf. on Document Analysis and Recognition, vol. I, 16-20 (2003).
24. S. G. Mallat, IEEE Trans. on Pattern Anal. and Machine Int., 11(7), 674-693 (1989).
25. I. K. Ludmila, IEEE Trans. on Patt. Anal. and Mach. Intell., 24, 281-286 (2002).

Page 126: 01.AdvancesinPatternRecognition


An Efficient Method for Graphics Segmentation from Document Images

S. Mandal, S. P. Chowdhury and A. K. Das

CST Department, Bengal Engineering and Science University
Howrah - 711 103, India
E-mail: {sekhar,shyama,amit}@cs.bees.ac.in

B. Chanda

Electronics and Communication Sciences Unit Indian Statistical Institute Kolkata - 700 108, India

E-mail: [email protected]

Major constituents of any document are text, graphics and half-tones. While half-tone can be characterised by its inherent intensity variation, text and graphics share common characteristics except for a difference in spatial distribution. The success of document image analysis systems depends on the proper segmentation of text and graphics, as text is further subdivided into other classes like headings, tables and math-zones. Segmentation of graphics is essential for better OCR performance and for vectorization in computer vision applications. Graphics segmentation from text is particularly difficult in the context of graphics made of small components (dashed or dotted lines etc.) which have many features similar to text. Here we propose a very efficient technique for segmenting all sorts of graphics from document pages.

Keywords: Document Image Analysis (DIA), Graphics Segmentation

1. Introduction

Due to its immense potential for commercial applications, research in Document Image Analysis (DIA) supports a rapidly growing industry including OCR, vectorization of engineering drawings and vision systems. Commercial document analysis systems are now available for storing business forms, performing OCR on typewritten/handwritten text, and compressing engineering drawings. Graphics detection is one of the first application areas of document processing systems. However, to date there is no efficient method for detecting all types of graphics appearing in frequently used real-life documents. Here, we focus mainly on the segmentation of graphics from a document page which is already half-tone segmented and may not even be fully skew corrected.

This paper is organised as follows. Section 2 describes past research. The proposed method is detailed in Section 3. The concluding section (Section 4) contains experimental results and remarks.

2. Past Work

Graphics segmentation has been attempted and reported by many researchers.1-11 Quite a few works are in the domain of text-graphics separation, and in many cases text strings are separated out, thus indirectly segmenting graphics as left-overs. Texture based identification is proposed in Ref. 4, exploiting the assumption that the texture of regular text differs from that of graphics, by using Gabor filters. We have come across many references where graphics are identified in engineering drawings for the purpose of vectorization.10-13 However, engineering drawings are special cases of document images containing a predominantly graphics portion.

In a nutshell, we are aware of three different approaches for separating out graphics from text; they are:

(1) Directional morphological filtering.

The technique is applied to locate all linear shapes, which are considered to be text, effectively leaving other objects as graphics. This works well for simple maps14 but may have inherent problems in dealing with more complex scenes.

(2) Extraction of lines and arcs etc.

Relying on transforms15 or vectorization,16 many have tried to isolate the graphical objects from text. This approach works well for engineering drawings.


(3) Connected component analysis.

A set of rules (based usually on the spatial characteristics of the text and graphics components) is used to analyse connected components and filter out the graphics. Such algorithms can handle complex scenarios and can be tuned to deal with increasingly complex documents. One of the best examples of this class is the work done by Fletcher and Kasturi.1

Here we elaborate the well known and frequently cited endeavour by Fletcher and Kasturi.1 This is a deliberate choice, as our approach in many ways resembles their approach based on connected component analysis. They performed text-string separation from mixed graphics, thereby indirectly segmenting the graphics part. We, on the other hand, separate graphics from a document containing both text and graphics using simple spatial properties.

The major steps of Fletcher and Kasturi's algorithm are as follows:

1) Connected component generation: This is done by a component labeling process which also computes the maximum and minimum co-ordinates of the bounding rectangles.

2) Filtering using area/ratio: This is used to separate large graphical components. The filtering threshold is determined according to the most populated area and the average area of the connected components. The working set for the next step is thus reduced to text strings and small graphical components.

3) Co-linear component grouping: Hough transform is used on the centroids of the rectilinear bounding boxes of each component. This operation consolidates grouping of co-linear components.

4) Logical-grouping: Co-linear components are grouped to words by using information like position of each component, inter character and inter word gap threshold.

5) Text string separation: The words along a text line are segmented from the rest of the image.

The algorithm is robust to changes in font size and style (it is assumed that the maximum font size in the page is less than 5 times the minimum font size used in the same page). It is by and large skew independent but sensitive to touching and fragmented characters. The inherent connectivity assumption weakens with images of degraded/worn out documents due to a large number of touching and fragmented characters. As the aspect ratios of graphics and text vary widely, the dependence on the area/ratio filter is unwieldy. The algorithm is also computationally expensive, like any other Hough transform based approach.

Next we present our technique for separating graphics; it is based on similar principles but is computationally cheaper and more effective in segmenting line art made of dotted or dashed lines or very short line segments, irrespective of their orientation.

3. Proposed Method

We start with the gray image of the document, from which half-tones are removed using a texture based technique.17 Next, it is converted to a binary image by the well known technique proposed by Otsu;18 graphics segmentation is thus performed on binary images.
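As a concrete illustration of this preprocessing chain, a minimal sketch using OpenCV (the texture based half-tone removal of Ref. 17 is assumed already applied; the file name is illustrative):

```python
import cv2

# Load the half-tone-removed page as a grayscale image (path is illustrative).
gray = cv2.imread("page_without_halftones.png", cv2.IMREAD_GRAYSCALE)

# Otsu's global threshold; THRESH_BINARY_INV makes the ink foreground white,
# which is what the connected component labelling below expects.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Connected components with bounding boxes and areas for the later analysis.
n_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)
```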

Graphics segmentation using connected component analysis yields good results if the lines and arcs forming the graphical component are all connected. A pair of examples is shown in Fig. 1. Note that the graphics (a single big connected component) is totally removed in both cases.

However, graphics made of dotted (or dashed) lines or short line segments are difficult to detect, as the sizes of the individual components are similar to text characters. Therefore, an individual connected component does not signify anything; rather, a sequence (or group) of such components together represents graphics. The presence of such graphics, which are difficult to detect, is shown in Fig. 2.

Special care is taken in our approach to segment out graphics made of dotted and dashed lines in any orientation.

Our approach to detecting graphics made of small components is based on grouping small disconnected components that supposedly form a big (spatial) component. In order to segment these cases we group small components, starting from any particular point, by observing the (1) adjacency, (2) overlapping, (3) pseudo continuity and (4) stroke width of nearby components.

Page 128: 01.AdvancesinPatternRecognition

S. Mandal, S. P. Chowdhury, A. K. Das and B. Chanda 109

Fig. 1. Segmentation using connected component analysis; first row shows two pages with graphics made of connected components and the second row shows the result after graphics removal.


Fig. 2. Graphics made of small components (a) Dotted lines along with big connected components; (b) Graphics with dotted and dashed lines.

It may be observed that text characters possess similar properties, but there are subtle differences. For example, two characters are adjacent to each other mostly in the horizontal direction, and they share overlapping of their extended bounding boxes, again in the horizontal direction. As the orientation of the document is known and the text lines of a skew free document can be detected, there should not be any confusion in identifying two nearby characters belonging to two adjacent text lines. Thus, grouping of lines made of dots or dashes and small arcs is possible, as elaborated next.

We start with the characteristics of graphics made with dashed or dotted lines:

(1) Number of foreground to background transitions is 1 in vertical or horizontal directions.

(2) Ratio of height to width is within a range of 0.5 to 2.

(3) Ratio of foreground to background pixels within the bounding box encompassing a component is more than 0.5. This rule is exclusively for dots.

(4) Two components are treated as adjacent if their extended (5 times) bounding boxes have spatial overlap.

Applying the above rules we could group small components forming part of a graphics object. However, we have introduced further checks to rule out the possibility of grouping small components occurring in normal text. We form a valid group of (small) components only if the count goes above 5. This is a check against the possibility of grouping 5 consecutive dots (the dots of i's and j's) together; consider the word "divisibility", which has got 5 i's.
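A minimal sketch of these rules, with each connected component represented as a dictionary holding a bounding box (x, y, w, h), a foreground-to-background transition count and an ink-pixel count (this representation and the names are ours):

```python
def is_dot_like(comp):
    """Rules (1)-(3): one ink run per scan direction, near-square, dense."""
    x, y, w, h = comp["bbox"]
    if comp["transitions"] != 1:                 # rule (1)
        return False
    if not (0.5 <= h / w <= 2.0):                # rule (2)
        return False
    return comp["ink_pixels"] / (w * h) > 0.5    # rule (3), dots only

def are_adjacent(a, b, factor=5):
    """Rule (4): the 5x-extended bounding boxes must overlap spatially."""
    def extend(bbox):
        x, y, w, h = bbox
        cx, cy = x + w / 2, y + h / 2
        return cx - factor * w / 2, cy - factor * h / 2, factor * w, factor * h
    ax, ay, aw, ah = extend(a["bbox"])
    bx, by, bw, bh = extend(b["bbox"])
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

# A grown group of such components is accepted as graphics only if it
# contains more than 5 members, per the check described above.
```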

For grouping of solid short arc segments we use the following rules.

(1) Number of foreground to background transitions is 1 in vertical or horizontal directions.

(2) The ratio of the pen widths of the two components lies in the range 0.5 to 2.0.

(3) Extended skeleton of the components will have spatial overlaps.

At this juncture, the terms pen width and extended skeleton need to be explained.

Pen width is a measure of the stroke width of the arcs forming a graphics. It is computed by drawing four lines (with 0, 45, 90 and 135 degree slopes) from every point of the skeleton to the object boundary and computing the average of the minimum of those radial lines. It is defined as:

P_w = Average over all skeleton points of min(l_0, l_90, (l_45 + l_135)/2)

where l_θ denotes the length of the radial line with slope θ degrees.



Fig. 4. Results of graphics segmentation. Original images are in the left column and segmented images in the right column.

It may be noted that the minimum is effectively taken over three radial lines, since in the actual calculation the average of the lines with slopes of 45 and 135 degrees is considered. The pen width is used to verify that adjacent components belong to a group only if the pen width shows a permitted variation. Note that for the text portion the pen width variation is limited, and the same is true of graphics made up of short lines and arcs, where abrupt changes are unlikely.

Extension of the skeleton is done by dividing the skeleton of the component into four parts along the principal axis, giving three control points (points of division) on the skeleton. We make copies of the lower half and upper half of the skeleton, take their mirror images, and join them to the lower and upper end points (trying to maintain the slope). This is shown in Fig. 3.

Fig. 3. Expansion of the skeleton taking the mirrored and flipped lower and upper halves of the skeleton: (a) original component; (b) its skeleton; (c) lower part of the skeleton; (d) upper part of the skeleton; (e) extended skeleton.

This effectively extends the skeleton, as shown in the figure, roughly maintaining the original slope at both ends. In one sense it is a pseudo window creation strategy in which the components, in their extended form, come closer to each other. Thereafter, the adjacency conditions are checked to form a group. The adjacency conditions need to be fine tuned to accommodate adjacent components at bends, crossings and corners. Without such a measure, a single curved graphics component may be segmented as multiple ones, as the components at the corners or bends would be the missing links. To accommodate them we accept a number of vertical or horizontal transitions greater than 1, provided the pen width ratio remains within the range of 0.5 to 2.0.

The results of graphics separation are shown in Fig. 4 for a number of cases.


4. Experimental Results

We have carried out the experiments using around 200 samples taken from the UW-I and UW-II databases as well as our own collection of scanned images from a variety of sources, e.g., books, reports, magazines and articles. The summary of the results is presented in Table 1 for the graphics zones whose ground-truthed information is available in the UW-I and UW-II databases. Note that the ground-truthing of our own collection was done manually. All experiments were carried out on a P4 based high end PC, all the programs are written in C, and the average segmentation time excluding half-tone removal is around 3.4 seconds.

Table 1. Segmentation Performance (in %)

                                 Classified as
Actual                      (BG)     (SG)     (OC)
Big Graphics (BG)             98        0        2
Small Graphics (SG)            0       92        8
Other Components (OC)          2       10       88

The table shows a near perfect result for graphics with big connected components. The result for small graphics is also very impressive; however, we fail to cluster some of them as they are dispersed within the graphical portion. It may be noted that this segmentation algorithm is fully automatic, and the parameters used work satisfactorily for a wide variety of fonts and styles.

References

1. L. Fletcher and R. Kasturi, IEEE Transactions on Pattern Analysis and Machine Intelligence 10(6), 910 (1988).
2. O. T. Akindele and A. Belaid, Page segmentation by segment tracing, in ICDAR'93, (Tsukuba, Japan, 1993).
3. K. C. Fan, C. H. Liu and Y. K. Wang, Pattern Recognition Letters 15, 1201 (1994).
4. A. K. Jain and S. Bhattacharjee, Machine Vision and Application 5, 169 (1992).
5. A. K. Jain and B. Yu, Page segmentation using document model, in Proc. ICDAR'97, Ulm, Germany, August 1997.
6. A. K. Das and B. Chanda, Segmentation of text and graphics from document image: A morphological approach, in International Conf. on Computational Linguistics, Speech and Document Processing (ICCLSDP'98), Feb. 18-20, Calcutta, India, 1998.
7. T. Pavlidis and J. Zhou, Computer Vision Graphics and Image Processing 54, 484 (1992).
8. F. M. Wahl, K. Y. Wong and R. G. Casey, CGIP 20, 375 (1982).
9. W. Liu and D. Dori, Computer Vision and Image Understanding 70(3), 420 (1998).
10. C.-C. Han and K.-C. Fan, Pattern Recognition 27(2), 261 (1994).
11. T. Pavlidis, CVGIP 35, 111 (1986).
12. J. Song, F. Su, C. Tai, J. Chen and S. Cai, Line net global vectorization: An algorithm and its performance evaluation, in Proc. CVPR, Nov. 2000.
13. J. Chiang, S. Tue and Y. Leu, Pattern Recognition 12, 1541 (1998).
14. H. Luo and R. Kasturi, Improved directional morphological operations for separation of characters from maps/graphics, in K. Tombre and A. K. Chhabra, editors, Graphics Recognition - Algorithms and Systems, LNCS, Volume 1389, (Springer-Verlag, 1998), pp. 35-47.
15. A. Kacem, A. Belaid and M. B. Ahmed, IJDAR 4(2), 97 (2001).
16. D. Dori and L. Wenyin, Vector based segmentation of text connected to graphics in engineering drawings, in Advances in Structural and Syntactical Pattern Recognition, P. Perner, P. Wang and A. Rosenfeld, editors, volume 1121 of LNCS, (Springer-Verlag, August 1996), pp. 322-331.
17. A. K. Das and B. Chanda, Extraction of half-tones from document images: A morphological approach, in Proc. Int. Conf. on Advances in Computing, Calicut, India, Apr 6-8, 1998.
18. N. Otsu, IEEE Trans. SMC 9(1), 62 (1979).


Identification of Indian Languages in Romanized Form

Pratibha Yadav*, Girish Mishra and P. K. Saxena

Scientific Analysis Group, Defence Research and Development Organization
Ministry of Defence, Metcalfe House, Delhi-110054, India

E-mail: [email protected]*

This paper deals with the identification of romanized plaintexts of five Indian languages - Hindi, Bengali, Manipuri, Urdu and Kashmiri. A Fuzzy Pattern Recognition technique has been adopted for identification. Suitable features/characteristics are extracted from training samples of each of these five languages and represented through fuzzy sets. Prototypes in the form of fuzzy sets are constructed for each of these five languages. The identification is based on computation of dissimilarity with the prototypes of each of these languages, using a dissimilarity measure extracted through fuzzy relational matrices. The identifier proposed is independent of any dictionary of these languages and can even identify plaintext without word break-ups. The identification can be used for automatic segregation of plaintexts of these languages while analysing intercepted multiplexed interleaved Speech/Data/Fax communication on an RF channel, in a computer network or on the Internet.

Keywords: Fuzzy sets; Fuzzy relations; Linguistic characteristics; Fuzzy Pattern Recognition (FPR); Fuzzy distance measures

1. Introduction

In digital communications, be it terrestrial or satellite based, many channels are used to carry multiplexed Speech/Data/Fax, with suitable modulation and following the required protocols. In networks such as the Internet, where the TCP/IP protocol is followed, communications likewise take place in the form of packets of Speech/Data/Fax. Since English has been a language of common use for many, most text communication has been in the English language. With the emerging need for information flow and free exchange of ideas among various communities, application software is being developed for languages other than English. Regional languages are emerging as a viable medium for both written and spoken communication. When it comes to secure communication, languages also provide a natural barrier apart from security through encryption. Thus, while monitoring/intercepting such plain traffic, though the protocol followed helps in segregating text from voice and fax, the problem still remains to segregate text messages into different regional languages without expert domain knowledge. In case the communication is protected through encryption, the problem becomes more complex, as one needs decryption first before going for text/language identification. It is this problem that needs to be addressed. In this paper a solution has been proposed towards

this issue using a Fuzzy Pattern Recognition approach. While using regional languages for text communication, the most common way is to romanize such texts using the 26 Roman alphabets and apply the existing computer and communication tools, which are based on English. This romanization can be done either following certain standards or in some non-standard natural way based on phonetics. Of these two, the second is more common and natural but involves more vagueness and uncertainty; that is the reason fuzzy logic was found suitable to address the issue of identification of various romanized regional languages.

Fuzzy logic, introduced by Zadeh in 1965,1 provides a handy tool to deal with uncertainty (vagueness).8 Fuzzy Pattern Recognition2-5 has been one of the main application oriented research directions pursued by many researchers.2 Most of the work on language identification9-11 has been based on dictionaries.7 For the first time, Fuzzy Pattern Recognition based techniques were applied to the identification of three European languages, namely English, German and French, even when the word-break-up was not known.6

In this paper, the problem of language identification for non-standard romanized plaintexts of five Indian languages - Hindi, Bengali, Manipuri, Urdu and Kashmiri - has been tackled using Fuzzy Pattern Recognition (FPR) when the texts are continuous


(without word break-up) and no dictionary is available. The problem is quite challenging as all the five languages are quite similar phonetically and moreover romanization is non-standard.

A set of 12 feature fuzzy sets has been selected for the purpose of classification and a classification algorithm has been designed based on fuzzy dissimilarity as described in the following sections.

2. Features for Classification

For the problem of identification of these five Indian languages, which are phonetic in nature, the linguistic characteristics of these languages are exploited. These characteristics are based on the occurrences of various alphabets and their affinity to combine with certain alphabets. After a thorough and careful study of these languages, a set of fuzzy features has been selected, based on the following set of linguistic characteristics:

1. Frequencies of alphabets
2. Variety of left contact letters of an alphabet
3. Variety of right contact letters of an alphabet
4. Variety of two-sided contact letters of an alphabet
5. Frequencies of doublets
6. Occurrences of the highest digraph starting with a letter
7. Occurrences of the highest digraph ending with a letter
8. High-medium-low-very low frequency categorization
9. Frequency of the alphabet with which the highest digraph starting with a letter is formed
10. Frequency of the alphabet with which the highest digraph ending with a letter is formed
11. Frequency of the alphabet with which the specified alphabet makes the most frequent reversal
12. Frequency of the alphabet with which the specified alphabet makes the least frequent reversal

Fuzzy sets corresponding to each of these characteristics have been constructed with the set of 26 alphabets as the basic (universal) set, by defining characteristic values in the interval [0,1]. The characteristic values of the fuzzy set μ1 corresponding to the first characteristic are obtained by dividing the frequencies of alphabets by the maximum frequency of alphabets. The characteristic values of the fuzzy sets μ2, μ3 and μ4 corresponding to characteristics 2, 3 and 4 respectively are obtained by normalization of the entries by 23, as the maximum number of different letters contacting a given letter does not exceed 23. For the construction of the fuzzy sets μ5, μ6 and μ7 corresponding to characteristics 5, 6 and 7, the corresponding scores are taken out of a text length of 10. For constructing the fuzzy set for the frequency categorization into very high, high, medium, low and very low frequent letters, the characteristic value μ8 is taken as

μ8(x) = 1.0 if μ1(x) > 0.7
      = 0.9 if 0.5 < μ1(x) ≤ 0.7
      = 0.7 if 0.3 < μ1(x) ≤ 0.5
      = 0.5 if 0.1 < μ1(x) ≤ 0.3
      = 0.1 otherwise

Finally, for the construction of the fuzzy sets μ9, μ10, μ11 and μ12 corresponding to characteristics 9, 10, 11 and 12, normalization is done by dividing the values by the maximum frequency to bring the membership grades into the interval [0,1].

Thus, for each of the five languages considered, such fuzzy sets are constructed from a large number of texts of each language (each text of length 400 characters). Finally, for each of these languages, standard feature fuzzy sets (prototypes) were constructed by taking averages. Thus five sets of prototypes, say μ_i^H, μ_i^B, μ_i^M, μ_i^U and μ_i^K (i = 1,...,12), are extracted for the languages Hindi, Bengali, Manipuri, Urdu and Kashmiri respectively.

After the construction of prototypes for each of these languages, the next problem is to develop a classification criterion so that a given unknown text can be identified and classified to one of these five classes.

3. Classification Criteria

For classification of patterns, one has to use some distance or similarity measure to decide the closeness of the unknown pattern to the various prototypes. There are various distance measures5 like Hamming distance, Euclidean distance, Minkowski's distance etc. which can be used for comparing two feature fuzzy sets μ and ν.

All of these distance measures were tested, but the classification score was not very satisfactory. Hence, in this paper, a new dissimilarity measure defined in Refs. 6 and 11 has been used. It is defined through the following fuzzy relation μ_R between μ and ν:


μ_R(x_i, x_j) = e^(−k |μ(x_i) − ν(x_j)|^l)    (1)

Here k and l are parameters, which can be fixed empirically or through experimentation. The matrix R with entries coming from fuzzy relations of the form (1) is called a fuzzy relational matrix, denoted by μ_R. The dissimilarity between the fuzzy sets μ and ν is then defined using the fuzzy relational matrix μ_R as

d_f(μ, ν) = α (26 − Trace μ_R) + β Σ_i Σ_j |μ_R(x_i, x_j) − μ_R(x_j, x_i)|    (2)

After a lot of experimentation and learning, the parameters k and l were chosen to be 1.0 and 2.0 respectively. The values of α and β were fixed as 2.0 and 0.25 respectively. For any given unknown text, the 12 feature fuzzy sets μ_1,...,μ_12 are calculated. For calculating the association of this unknown pattern with a class, say Hindi, the following process is followed. For each i, μ_i is compared with μ_i^H and the dissimilarity measure d_f is calculated using (1) and (2) as

d_i^H = d_f(μ_i, μ_i^H)    (3)

The final dissimilarity value D_H of the unknown text with the Hindi language is calculated as follows:

D_H = Σ_{i=1}^{12} w_i d_i^H    (4)

where the w_i's are weights. These weights are chosen according to the significance of the characteristics. In our case the weights were selected as

w_1 = 0.140, w_2 = 0.135, w_3 = 0.135, w_4 = 0.135, w_5 = 0.059, w_6 = 0.100,
w_7 = 0.100, w_8 = 0.056, w_9 = 0.035, w_10 = 0.035, w_11 = 0.035, w_12 = 0.035.

Similarly, the other dissimilarity values D_B, D_M, D_U and D_K (with Bengali, Manipuri, Urdu and Kashmiri respectively) were computed. The unknown text is classified to the class with which the value of the dissimilarity measure is minimum.

4. Results

The algorithm has been tested on a number of texts from each of the five languages. The text length has been taken as 400 characters (chars). The program has been tested on a PC and takes a few seconds of CPU time.

Dissimilarity values of twenty-five test samples with each of the five classes are shown in Table 1. Out of these, text 1 to text 5 are Hindi texts, text 6 to text 10 are Bengali texts, text 11 to text 15 are Manipuri texts, text 16 to text 20 are Urdu texts and text 21 to text 25 are Kashmiri texts.

Table 1. Values of the Dissimilarity Measures with Prototypes of Five Languages.

Texts      D_H     D_B     D_M     D_U     D_K
Text 1     03.85   05.91   05.63   04.21   05.23
Text 2     03.86   05.83   04.83   04.06   04.52
Text 3     03.79   05.99   05.12   04.00   04.78
Text 4     03.75   06.52   05.34   04.11   04.98
Text 5     03.62   06.00   05.01   03.77   04.92
Text 6     04.76   02.83   04.68   04.81   05.33
Text 7     04.04   03.67   04.75   04.23   04.86
Text 8     04.67   02.88   05.33   05.12   05.51
Text 9     04.57   02.91   05.11   04.86   05.21
Text 10    05.43   02.73   05.57   05.69   06.00
Text 11    04.45   05.25   02.92   04.70   04.71
Text 12    04.90   06.05   03.43   05.43   04.82
Text 13    04.55   05.52   02.99   04.88   04.51
Text 14    04.49   05.77   02.96   04.64   04.68
Text 15    04.47   05.42   03.49   04.98   04.77
Text 16    04.55   05.83   05.14   03.83   05.27
Text 17    04.43   06.47   05.24   03.79   05.44
Text 18    04.18   05.92   04.84   03.14   05.12
Text 19    04.41   06.02   05.37   03.34   05.15
Text 20    04.14   05.91   05.16   03.34   05.15
Text 21    04.57   05.80   04.87   04.90   03.35
Text 22    04.07   05.94   04.72   04.63   02.91
Text 23    04.63   06.21   04.99   05.19   03.28
Text 24    04.64   06.44   05.03   05.07   05.23
Text 25    04.70   06.11   04.90   05.01   03.35

Table 2 gives a summary of a very large number of experiments and tests, reflecting an overall success of the identifier of almost 100% (depicted through the bar-chart in Fig. 1). After trying the algorithm with text length 400, efforts were made to find the optimal length required by the algorithm for a good success rate. For this purpose the algorithm was also tried with text lengths of 200 and 150 chars. The summary of results for text lengths 200 and 150 chars is shown in Tables 3 and 4 respectively (depicted through the bar-charts in Fig. 2 and Fig. 4 respectively). Even in these cases the success rate achieved is very good (88% to 98%).


Fig. 1. % Success for Language Identification (for texts with 400 chars).

Fig. 2. % Success for Language Identification (for texts with 200 chars).

5. Conclusions

A five-class classification problem has been addressed using FPR. The advantage of the approach is that the prototypes of each of the five languages are constructed once, after which any given unknown text is identified if it belongs to any of these five languages. The identifier is independent of dictionaries and processes texts even when the word-break-up is not known. It works well, with a very high success rate (above 99%) for plaintexts of these five languages at a text length of 400 chars. The success rate is above 95% for a text length of 200 chars and slightly less (above 88%) for a text length of 150 chars. These results are depicted through the bar-chart in Fig. 3.

Fig. 3. Comparison of success (%) for the different text lengths considered.

Fig. 4. % Success for Language Identification (for texts with 150 chars).

References

1. L. A. Zadeh; Fuzzy Sets, Information and Control 8, pp. 338-353, 1965.
2. C. J. Bezdek and S. K. Pal; Fuzzy Models for Pattern Recognition, IEEE Press, 1992.
3. C. J. Bezdek; Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum, New York, 1981.


Table 2. Success rate of Language Identification (Text length = 400 chars)

                        Classified as
Language   #Texts    Hindi   Bengali   Manipuri   Urdu   Kashmiri   Success Rate (%)
Hindi       8139      8102        10          0     27          0        99.54
Bengali     7502        22      7457          2     21          0        99.40
Manipuri    7584         0        13       7570      1          0        99.82
Urdu         976         1         0          0    975          0        99.90
Kashmiri     396         0         0          1      0        395        99.75

Table 3. Success rate of Language Identification (Text length = 200 chars)

                        Classified as
Language   #Texts    Hindi   Bengali   Manipuri   Urdu   Kashmiri   Success Rate (%)
Hindi      16278     15449       108         26    769         20        94.91
Bengali    15004       398     14402         32    183          1        95.99
Manipuri   15167         3       107      15049     11          0        99.22
Urdu        1952        25        42         12   1872          0        95.99
Kashmiri     791         8         0          0      0        783        98.99

Table 4. Success rate of Language Identification (Text length = 150 chars)

                        Classified as
Language   #Texts    Hindi   Bengali   Manipuri   Urdu   Kashmiri   Success Rate (%)
Hindi      21704     19279       288        105   2183         95        88.83
Bengali    20006       733     18715         97    496          9        93.55
Manipuri   20223        49       197      19906     78          0        98.43
Urdu        2602        74        63         35   2429          0        93.35
Kashmiri    1054        10         5          1      0       1038        98.48

4. H. J. Zimmermann; Fuzzy Set Theory and its Applications, 4th Ed., Kluwer Academic Publishers, Boston/Dordrecht/London, 2001.
5. G. J. Klir and B. Yuan; Fuzzy Sets and Fuzzy Logic: Theory and Applications, Prentice Hall, 1997.
6. P. K. Saxena and Uma Gupta; Fuzzy Language Identifier, Proceedings of the National Seminar on Cryptology (NSCR), Delhi, pp. D-11 to D-20, 1998.
7. Kenneth R. Beesley; Language Identifier: A Computer Program for Automatic Natural-Language Identification of On-line Text, Language at Crossroads: Proceedings of the 29th Annual Conference of the American Translators Association, pp. 47-54, 12-16 Oct, 1988.
8. P. K. Saxena, P. Yadav and K. Sarvjeet; Fuzzy Sets in Cryptology, Proceedings INFOSEC 1994, Bangalore, pp. 13 to 33, 29-30 Oct, 1994.
9. Navneet Gaba, Sarvjeet Kaur and P. K. Saxena; Identification of Encryption Schemes for Romanized Indian Languages, Proceedings of ICAPR 2003, pp. 164 to 168, 2003.
10. N. Verma, S. S. Khan and Shrikant; Statistical Feature Extraction to Discriminate Various Languages: Plain and Crypt, Proceedings of the National Conference on Information Security (NCIS), pp. 1 to 10, Jan, 2003.
11. P. Yadav and P. K. Saxena; Identification of Regional Languages - A Fuzzy Theoretic Approach, Proceedings of NCIS, New Delhi, pp. 11 to 19, Jan, 2003.


Online Bangla Handwriting Recognition System

K. Roy

Dept. of Computer Science West Bengal University of Technology

BF 142, Saltlake, Kolkata-700 064, India

N. Sharma, T. Pal and U. Pal

Computer Vision and Pattern Recognition Unit Indian Statistical Institute

Kolkata- 700 108, India

Handwriting recognition is a difficult task because of the variability involved in the writing styles of different individuals. This paper presents a scheme for the online handwriting recognition of Bangla script. Online handwriting recognition refers to the problem of interpretation of handwriting input captured as a stream of pen positions using a digitizer or other pen position sensor. The sequential and dynamical information obtained from the pen movements on the writing pad is used as features in our proposed scheme. These features are then fed to a quadratic classifier for recognition. We tested our system on 2500 Bangla numeral samples and 12500 Bangla character samples, obtaining 98.42% accuracy on numeral data and 91.13% accuracy on character data.

Keywords: Online Recognition, Indian Script, Bangla, Modified quadratic discriminant function

1. Introduction

Data entry using pen-based devices is gaining popularity in recent times. This is so because machines are getting smaller in size and keyboards are becoming more difficult to use in these smaller devices. Also, data entry for scripts having large alphabet size is difficult using keyboard. Moreover, there is an attempt to mimic the pen and paper metaphor by automatic processing of online characters. However, wide variation of human writing style makes online handwriting recognition a challenging pattern recognition problem.

Work on online character recognition started gaining momentum about forty years ago. Numerous approaches have been proposed in the literature [1-5] and the existing approaches can be grouped into three classes namely, (i) structural analysis methods where each character is classified by its stroke structures, (ii) statistical approaches where various features extracted from character strokes are matched against a set of templates using statistical tools and (iii) motor function models that explicitly use trajectory information where the time evaluation of the pen co-ordinates plays an important role.

Many techniques are available for online recognition of English, Arabic, Japanese and Chinese [1-4, 6-9] characters, but there are only a few pieces of work [10-13] towards Indian characters, although India is a multi-lingual and multi-script country. Connell et al. [11] presented a preliminary study on online Devnagari character recognition. Joshi et al. [12] also proposed a work on Devnagari online character recognition. Later, Joshi et al. [13] proposed an elastic matching based scheme for online recognition of Tamil characters. Although there is some work towards online recognition of the Devnagari and Tamil scripts, online recognition work for other Indian languages is very scarce. In this paper we propose a system for the online recognition of Bangla characters. Recognition of Indian characters is very difficult compared to English because of the shape variability of the characters as well as the larger number of character classes. See Figure 1, where samples of four Bangla characters are shown to give an idea of handwriting variability.

There are twelve scripts in India and in most of these scripts the number of alphabets (basic and compound characters) is more than 250, which makes keyboard design and subsequent data entry a difficult job. Hence, online recognition of such scripts has a commercial demand. Although a number of studies [14-16] have been done for offline recognition of a few printed Indian scripts like Devnagari, Bangla, Gurumukhi, Oriya, etc. with


commercial level accuracy, to the best of our knowledge no system is commercially available for online recognition of any Indian script. In this paper we propose a scheme for online Bangla handwritten character recognition; the scheme is robust against stroke connections as well as shape variation while maintaining reasonable robustness against stroke order variations. A quadratic classifier is used here for recognition.

The rest of the paper is organized as follows. In Section 2 we discuss the Bangla language and data collection. The feature extraction process is presented in Section 3. Section 4 details the classifier used for recognition. The experimental results are discussed in Section 5. Finally, the conclusion of the paper is given in Section 6.

Fig. 1. Examples of some Bangla online characters. First three columns show samples of handwritten characters and the last column shows samples of a numeral.

2. Bangla Language and online data collection

Bangla, the second most popular language in India and the fifth most popular language in the world, is an ancient Indo-Aryan language. About 200 million people in the eastern part of the Indian subcontinent speak this language. Bangla script alphabets are used in texts of the Bangla, Assamese and Manipuri languages. Also, Bangla is the national language of Bangladesh.

The alphabet of the modern Bangla script consists of 11 vowels and 40 consonants. These characters are called basic characters. The writing style in Bangla is from left to right, and the concept of upper/lower case is absent in this script. Most of the characters of Bangla have a horizontal line (Matra) at the upper part. From a statistical analysis we note that the probability that a Bangla word has a horizontal line is 0.994 [14].

In Bangla script a vowel following a consonant takes a modified shape. Depending on the vowel, its modified shape is placed at the left, right, both left and right, or bottom of the consonant. These modified shapes are called modified characters. A consonant or a vowel following a consonant sometimes takes a compound orthographic shape, which we call a compound character. Compound characters can be combinations of two consonants as well as of a consonant and a vowel. Compounding of three or four characters also exists in Bangla. There are about 280 compound characters in Bangla [15]. In this paper we consider the recognition of Bangla basic characters.

To get an idea of Bangla basic characters and their variability in handwriting, a set of handwritten Bangla basic characters is shown in Figure 2. The main difficulties of Bangla character recognition are shape similarity, stroke size and the order variation of different strokes. By a stroke we mean the set of points obtained between a pen down and a pen up. From the statistical analysis on our dataset we found that the minimum (maximum) number of strokes used to write a Bangla character is 1 (4). The average number of strokes per character is 2.2. We have also seen that a few characters are mostly written with a single stroke, one particular character is written by almost all writers with 4 strokes, and several other characters are always written with 2 strokes.

Online recognition of these characters poses several problems like stroke-number, stroke-connection and shape variations. Most characters are composed of multiple strokes. Another difficult problem involves stroke-order variations: the stroke sequence used to write a character is not the same for all persons. During handwriting, some people draw the upper stroke of a character before its lower stroke, whereas others, writing the same character, draw the lower stroke before the upper stroke. These stroke order variations complicate the development of an online recognition system.



Fig. 2. Examples of handwritten Bangla basic characters.

To illustrate this stroke order variation in Bangla script, Figure 3 shows a Bangla character that contains four different strokes. The left-most column shows the first stroke, which is the same for all three samples from three different writers. The stroke order varies from the second column onwards, and the final (complete) character is shown in the right-most column. From the 2nd column of Figure 3, it can be noted that the three samples have different shapes. This is because of the stroke order variation of the writers. For the upper sample of the second column of Figure 3, the writer has drawn the upper stroke as the second stroke. For the middle sample, the writer has drawn the lower stroke as the second stroke. For the lower sample, the writer has drawn the middle stroke as the second stroke. A similar situation also occurs in the 3rd column of Figure 3. The Matra is a very common feature in Bangla, and its length varies from writer to writer. It is seen that the presence or absence of a Matra can be the main difference between two characters; for example, one of the basic characters has a Matra, while a similar-looking Bangla numeral does not have one.

For online data collection, the sampling rate of the signal is kept fixed for all the samples of all the character classes. Thus the number of points M in the series of co-ordinates for a particular sample is not fixed and depends on the time taken to write the sample on the pad. As the number of points in the actual trace of a character is generally large and varies greatly due to high variation in writing speed, a fixed smaller number of points, regularly spaced in time, is selected for further processing.

The digitizer output is represented in the format p_i ∈ R² × {0,1}, i = 1..M, where p_i is the pen position having x-coordinate x_i and y-coordinate y_i, and M is the total number of sample points. Let p_i and p_j be two consecutive pen points. We retain both of these consecutive pen points if the following condition is satisfied:

x² + y² > m²    (1)

where x = x_i − x_j and y = y_i − y_j. The parameter m is empirically chosen. We have set m equal to zero in Equation 1 to remove all consecutive repeated points.

Fig. 3. Example of different stroke orders for a character having four strokes.

Analyzing a total of 15000 Bangla characters, we found that the number of sample points (M) used to write a Bangla character varies from 14 to 176. The average number of sample points in a Bangla character is 72. We also computed the average number of sample points in each character class: the largest class average is 113 and the smallest is 46.

3. Feature extraction

Any online feature is very sensitive to writing stroke sequence and size variation. Also, in Bangla, the Matra creates a lot of problems in online recognition. To overcome this, we detect and remove the Matra present in the characters. As the Matra of Bangla script


is a digital straight line lying on the upper part of a character, we detect it as follows. A stroke is called a Matra if: (a) the ratio of the sum of the individual distances between constituent points of the stroke to the distance between its 1st and last points is less than 1.5; and (b) the ratio of the stroke height to the character height [max(y_i) − min(y_i) ∀i] is less than 0.35 and all the points of the stroke lie on the upper side of the character (we consider the top 40% of the character height as the upper side).

These thresholds were decided based on a statistical analysis of our dataset. If more than one stroke satisfies the above criteria, the stroke with the higher value under criterion (a) is selected.
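A sketch of this Matra test on a single stroke, with the thresholds (1.5, 0.35, 40%) taken from the text; note that criterion (a) is implemented here as a straightness ratio (traced length over end-to-end distance), which is our reading of the stated ratio:

```python
import math

def is_matra(stroke, char_top, char_height):
    """Test the Matra criteria for one stroke, given as a list of (x, y)
    pen points in image co-ordinates (y grows downward)."""
    # Criterion (a), read as a straightness test: total traced length close
    # to the end-to-end distance for a digital straight line.
    arc = sum(math.dist(p, q) for p, q in zip(stroke, stroke[1:]))
    chord = math.dist(stroke[0], stroke[-1])
    if chord == 0 or arc / chord >= 1.5:
        return False
    # Criterion (b): short relative to the character, and confined to the
    # top 40% band of the character height.
    ys = [y for _, y in stroke]
    if (max(ys) - min(ys)) / char_height >= 0.35:
        return False
    return all(y <= char_top + 0.4 * char_height for _, y in stroke)
```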

The features calculated based on the Matra are: (1) the ratio of the average value of the x coordinates of the selected stroke to the length of the character, (2) the ratio of the average value of the y coordinates of the selected stroke to the width of the character, (3) the ratio of the length of the stroke (L = Σ l_i, i = 1,...,M, where l_i = √(x² + y²), x = x_i − x_{i+1} and y = y_i − y_{i+1}) to the length of the character, (4) the ratio of the area of the stroke to that of the character and (5) the ratio of the aspect ratio of the stroke to that of the character.

A total of 5 features as discussed above are calculated based on the Matra. After feature extraction from the Matra, it is removed from the character and the remaining points of the character are normalized. The normalization is done in two stages: first the points are re-sampled to a fixed number of points (N), and then they are converted from equal-time samples to equidistant points. For an example see Figure 4. We have studied several local features, which include a normalized representation of the co-ordinates, a representation of the tangent slope angle, a normalized curvature, the ratio of tangents, etc.

The processed character is transformed into a sequence t = [t_1,...,t_N] of feature vectors t_i = (t_i1, t_i2, t_i3)^T. Here (1) t_i1 = (x_i − μ_x)/σ_y and t_i2 = (y_i − μ_y)/σ_y are the pen co-ordinates normalized by the sample mean μ = (1/M) Σ_{i=1}^{M} p_i and the standard deviation σ_y = √((1/(M−1)) Σ_i (μ_y − y_i)²) of the character's sample points, and (2) t_i3 = arg((x_{i+1} − x_{i−1}) + j (y_{i+1} − y_{i−1})), with j² = −1 and "arg" the phase of the complex number above, is an approximation of the tangent slope angle at point i.

Thus, finally, a feature vector sequence t = [t_1,...,t_N] is obtained, each vector of which is t_i = (t_i1, t_i2, t_i3)^T. Here we consider N = 50. So a total of 155 (50 × 3 [3 for each point] + 5 [features based on the Matra]) features are used in our experiment.
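A sketch of the per-point feature computation following these definitions (resampling to N equidistant points is assumed already done; the end-point handling for the tangent angle is our choice):

```python
import numpy as np

def point_features(points):
    """points: (N, 2) array of equidistantly resampled pen positions.
    Returns the 3N per-point features (t_i1, t_i2, t_i3 for each point)."""
    mu = points.mean(axis=0)                  # sample mean (mu_x, mu_y)
    sigma_y = points[:, 1].std(ddof=1)        # std. deviation of the y values
    norm = (points - mu) / sigma_y            # t_i1, t_i2: normalized co-ordinates
    # t_i3: tangent slope angle as the phase of a complex difference of the
    # neighbouring points; the two end points reuse their neighbour's angle.
    dz = (points[2:, 0] - points[:-2, 0]) + 1j * (points[2:, 1] - points[:-2, 1])
    theta = np.angle(dz)
    theta = np.concatenate(([theta[0]], theta, [theta[-1]]))
    return np.column_stack([norm, theta]).ravel()
```

With N = 50 this yields 150 values, which together with the 5 Matra features give the 155 dimensional vector used below.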


Fig. 4. Feature extraction from a sample character: (a) original image; (b) its normalized points used as features (mapped into 50 points); (c) the normalized character.

4. The Classifier

Based on these 155 dimensional features, recognition of characters by the quadratic classifier [17] is carried out using the following discriminant function:

g(X) = (N + N_0 − n − 1) ln[ 1 + (1/(N σ²)) { ‖X − M‖² − Σ_{i=1}^{k} (N λ_i)/(N λ_i + N_0 σ²) {φ_i^T (X − M)}² } ] + Σ_{i=1}^{k} ln(λ_i + (N_0 σ²)/N)    (2)

where X is the feature vector of an input character; M is the mean vector of the samples; φ_i is the ith eigenvector of the sample covariance matrix; λ_i is the ith eigenvalue of the sample covariance matrix; n is the feature size; σ² is the initial estimate of the variance; N is the number of learning samples; and N_0 is a confidence constant for σ², considered as 3N/7 for the 155 dimensional feature. We do not use all eigenvalues and their respective eigenvectors for the classification. We sort the eigenvalues in descending order and take the first 70 eigenvalues and their respective eigenvectors for classification. A character is rejected if the difference between the 1st and 2nd values of g(X) is smaller than a threshold.


5. Results and Discussion

The experimental evaluation of the above techniques was carried out using isolated Bangla characters and numerals (digits). The data was collected from people of different backgrounds using both a mouse and a Wacom tablet. The mouse is used as an input medium to provide the user with the flexibility of using this common input device. The data obtained by mouse is in general poorer than that from the Wacom tablet. Another aim of using the mouse as one of the input devices is to test our system on this poorer data: if we get encouraging results from the data captured by mouse, then there is no need for any additional hardware like a Wacom tablet for data capture. A total of 15,000 characters (2500 digits, the rest characters) were collected for the experiment. Out of them, 66.7% of the characters (digits) are used for training the classifier and the rest for testing. The recognition accuracy obtained from our classifier is shown in Table 1.

Table 1. Recognition results on Bangla characters and numerals (rejection is not considered).

Data         Recognition rate    Error rate
Character    91.13%              8.87%
Numeral      98.42%              1.58%

From the experiments we note that the overall recognition accuracy of the proposed scheme is 91.13% for Bangla characters and 98.42% for Bangla numerals when rejection is not considered. Accuracies of 96.23% and 99.58% are obtained for Bangla characters and numerals respectively if we consider the first two top choices of the recognition results. The detailed recognition results are given in Table 2.

Table 2. Recognition results based on different choices from the top.

Top choices    Accuracy (Character)    Accuracy (Numeral)
1              91.13%                  98.42%
2              96.23%                  99.58%
3              97.68%                  99.72%
4              98.26%                  99.72%
5              98.39%                  99.86%

We also analyzed the error versus rejection rate of the classifier; the results are presented in Table 3. Table 3 shows that for Bangla numerals, 1.57% error occurs at a rejection of 0.13%, and only 0.85% error occurs when the rejection is 1.10%. It also shows that for Bangla characters, 6.56% error occurs at a rejection of 3.79%, and only 2.64% error occurs when the rejection is 17.27%.

Table 3. Rejection versus error rate obtained for the characters and numerals.

Characters:   Rejection    3.79%    7.44%    12.28%    17.27%
              Error        6.56%    5.12%     3.69%     2.64%
Numerals:     Rejection    0.13%    0.61%     0.85%     1.10%
              Error        1.57%    1.10%     0.96%     0.85%

Recognition accuracy of about 96% (99.6%) is obtained for character (numeral) recognition if the first two choices are considered. These different choices in recognition accuracy will be very helpful in designing a complete recognition system at the word/sentence level, where we will be able to take the help of a dictionary or predict alternate results. From the experiment we note that most of the errors occurred because of similarly shaped characters. The maximum error occurred between one particular character pair: in about 1.067% of cases they were mis-recognized one as the other. The main difference between these two characters is a small loop on the left bottom side of one of them; sometimes during handwriting people do not draw this loop, and hence mis-recognition occurs. The next most erroneous character pair in Bangla was mis-recognized in 0.93% of cases. The main difference between these characters is also the presence of a loop in the lower half of the character, which people sometimes omit during handwriting. The five most erroneous character pairs obtained from our experiment are shown in Figure 5. The erroneous Bangla character pairs are shown in the first row; the percentage of error as well as the number of times these character pairs were mis-recognized between them are given in the second row of the figure. For example, one character of the first pair is mis-recognized as the other in 19 cases and the converse occurs in 11 cases, so they are mis-recognized between them in 30 (19+11) cases and their percentage of error is 1.067%.


[Fig. 5 content: the five confused character pairs with error rates and counts 1.067%: 30 (19+11); 0.93%: 26 (16+10); 0.32%: 9 (8+1); 0.14%: 4 (3+1); 0.14%: 4 (3+1). The character glyphs are not recoverable from the scan.]

Fig. 5. Examples of some erroneous characters.


We compared our work with that of Garain et al. [10]. For numeral recognition they used a dataset of 400 numerals, captured with a Wacom tablet only; they did not use a mouse for data collection. For feature extraction they used the angular variation in 8-directional code and the Euclidean distance of each sample point, and a nearest neighbour classifier was used for recognition. A comparative result on numeral recognition is shown in Table 4. Garain et al. obtained 97.43% accuracy on training data of size 400 and did not report any test result. We obtained 99.86% accuracy on training data and 98.42% on test data.

Table 4. Comparison of numeral recognition results.

Method               Data size    Train      Test
Garain et al. [10]      400       97.43%     Not reported
Proposed               2500       99.86%     98.42%

Comparative results of character recognition are shown in Table 5. Garain et al. [10] used a dataset of 2440 characters collected using a Wacom tablet and obtained 96.34% accuracy on training data. We obtained 98.73% accuracy on training data and 91.13% on test data.

Table 5. Comparison of character recognition results.

Method               Data size    Train      Test
Garain et al. [10]     2440       96.34%     Not reported
Proposed              12500       98.73%     91.13%

6. Conclusion

This paper presents a quadratic classifier based system for the recognition of online Bangla handwriting. A modified and robust feature extraction technique is proposed which can be used for scripts with a Matra/Shirorekha at the upper part. We tested the proposed system on 15,000 samples and obtained encouraging results. Not much work has been done towards the online recognition of Indian scripts in general and Bangla in particular, so this work will be helpful for research towards online recognition of other Indian scripts.

References

[1] R. Plamondon and S. N. Srihari, IEEE PAMI 22, 63 (2000).
[2] C. C. Tappert, C. Y. Suen and T. Wakahara, IEEE PAMI 12, 179 (1990).
[3] E. J. Bellagarda, J. R. Bellagarda, D. Nahamoo and N. S. Nathan, A probabilistic framework for online handwriting recognition, in Proc. of 3rd IWFHR, 1993.
[4] I. Guyon, M. Schenkel and J. Denker, Overview and synthesis of on-line cursive handwriting recognition techniques, in Handbook of Character Recognition and Document Image Analysis, 1997.
[5] CEDAR, Penman: Handwritten Text Recognition Project Description.
[6] C. Bahlmann and H. Burkhardt, IEEE PAMI 26, 1 (2004).
[7] S. D. Connell and A. K. Jain, PR 34, 1 (2001).
[8] S. Jaeger, C. L. Liu and M. Nakagawa, IJDAR 6, 75 (2003).
[9] C. L. Liu, S. Jaeger and M. Nakagawa, IEEE PAMI 26, 198 (2004).
[10] U. Garain, B. B. Chaudhuri and T. Pal, Online handwritten Indian script recognition: A human motor function based framework, in Proc. of 16th ICPR, 2002.
[11] S. D. Connell, R. M. K. Sinha and A. K. Jain, Recognition of unconstrained online Devnagari characters, in Proc. of 15th ICPR, 2000.
[12] N. Joshi, G. Sita, A. G. Ramakrishnan, V. Deepu and S. Madhvanath, Machine recognition of online handwritten Devanagari characters, in Proc. of 8th ICDAR, 2005.
[13] N. Joshi, G. Sita, A. G. Ramakrishnan and S. Madhvanath, Elastic matching algorithms for online Tamil character recognition, in Proc. ICONIP, 2004.
[14] B. B. Chaudhuri and U. Pal, PR 31, 531 (1998).
[15] B. B. Chaudhuri and U. Pal, An OCR system to read two Indian language scripts: Bangla and Devnagari, in Proc. of 4th ICDAR, 1997.
[16] V. Bansal and R. M. K. Sinha, On how to describe shapes of Devanagari characters and use them for recognition, in Proc. 5th ICDAR, 1999.
[17] T. Wakabayashi, S. Tsuruoka, F. Kimura and Y. Miyake, Systems and Computers in Japan 26, 35 (1995).


Oriya Off-Line Handwritten Character Recognition

U. Pal, N. Sharma

Computer Vision and Pattern Recognition Unit Indian Statistical Institute

Kolkata-108, INDIA, E-mail: umapada@isical.ac.in

F. Kimura

Graduate School of Engineering Mie University

1577 Kurimamachiya-cho, TSU, Mie 514-8504, Japan

Recognition of handwritten characters is difficult because of the variability in the writing styles of different individuals. This paper deals with the recognition of off-line Oriya handwritten characters using a quadratic classifier based on features obtained mainly from directional chain code information. Here, at first, the bounding box of a character is segmented into blocks and directional chain code features are computed in each of these blocks. Next, these blocks are down sampled using a Gaussian filter, and the chain code features obtained from the down sampled blocks are fed to the quadratic classifier for recognition. We used two sets of feature vectors (one of 64 dimensions and the other of 400 dimensions) and report the corresponding results obtained from the classifier. We tested our system on 9556 Oriya off-line handwritten characters and obtained 91.11% accuracy from the proposed recognition system with the 400-dimensional feature vector. We used five-fold cross-validation for result computation.

Keywords: Oriya script, Handwritten character recognition, Indian script, Document analysis.

1. Introduction

Recognition of handwritten characters has been a popular research area for many years because of its various application potentials, such as postal automation, bank cheque processing and automatic data entry. There are many pieces of work towards handwritten recognition of Roman, Japanese, Chinese and Arabic scripts, and various approaches have been proposed towards handwritten character recognition.1,2 One widely used approach is based on neural networks,1 where the network architecture is first trained on a set of training data and the trained network then classifies the input. Some researchers used a structural approach, where each pattern class is defined by a structural description and recognition is performed according to structural similarities.3 The statistical approach has also been applied to character recognition.4 It is insensitive to pattern noise and distortion, but modeling the statistical information is a tedious task. Combinations of structural and statistical methods have also been used.4 Among others, support vector machines,5 Fourier and wavelet descriptors,6 fuzzy rules,7 and tolerant rough sets8 are reported in the literature.


In this paper, we propose a system for the recognition of unconstrained off-line Oriya handwritten characters. Although some work has been done towards the recognition of Oriya printed characters9,10 and handwritten Oriya numerals,11 to the best of our knowledge there is no work on Oriya handwritten characters, and this is the first work of its kind.

In this paper a quadratic classifier based scheme is proposed for unconstrained off-line Oriya handwritten character recognition. In the proposed scheme, at first, the bounding box of a character is segmented into blocks and directional features are computed in each of these blocks. Next, these blocks are down sampled using Gaussian filter for recognition. Finally, chain code features obtained from down sampled blocks are fed to the quadratic classifier for recognition.

The rest of the paper is organized as follows. In Section 2, properties of the Oriya language and the data collection for the present work are discussed. The feature extraction procedure is reported in Section 3. In Section 4, we briefly explain the classifier used for recognition. Experimental results are discussed in Section 5. Finally, the conclusion is given in Section 6.


2. Oriya Language and data collection

India is a multi-lingual, multi-script country with about 22 official languages. Oriya is one of these official languages and is mainly used in the Indian state of Orissa; more than 31 million people in the eastern part of the Indian subcontinent speak this language. The Oriya script, in which the Oriya language is written, developed from the Kalinga script, one of the many descendants of the Brahmi script of ancient India.

The alphabet of the modern Oriya script consists of 11 vowels and 41 consonants. These are called basic characters and are shown in Fig. 1. Out of these 52 basic characters, two are equal in shape; for recognition we consider these two characters as one class. The writing style of Oriya is from left to right and, as in other Indian scripts, the concept of upper/lower case is absent. From Fig. 1 it can be noted that most Oriya characters are circular in nature and that there is no horizontal line (like the Matra/Shirorekha of Devanagari) in the characters of this script. In Oriya, a vowel following a consonant takes a modified shape which, depending on the vowel, is placed at the left, right, both left and right, or bottom of the consonant. These modified shapes are called modifiers or matra. A consonant or vowel following a consonant sometimes takes a compound orthographic shape, which we call a compound character. Compound characters can be combinations of two consonants as well as of a consonant and a vowel. There are more than 200 compound characters in Oriya script,10 and in this paper we consider the recognition of off-line handwritten Oriya basic characters.

The main difficulty for any recognition system is shape similarity. In Oriya there are many characters that are similar in shape; examples of some groups of similar-shaped characters are shown in Fig. 2. From the figure it can be seen that the shapes of two or more characters of a group are very similar, and such shape similarity makes it harder for a recognition system to reach a high recognition rate. Data collection for the present work was done from individuals of various professions. We collected 9556 samples (at least 180 samples of each character class) for the experiments of the proposed work. We used a flatbed scanner for digitization; digitized images are in gray tone at 300 dpi and stored in TIF format. For binarization we used Otsu's method12 and converted the data into two-tone (0 and 1) images, where '1' represents an object point and '0' a background point.
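For illustration, binarization of this kind can be reproduced with standard tools. The following is a minimal sketch using OpenCV's implementation of Otsu's method; the file name is a hypothetical placeholder.

import cv2

# Hypothetical input path; any gray-tone scan at 300 dpi would do.
gray = cv2.imread("oriya_sample.tif", cv2.IMREAD_GRAYSCALE)

# Otsu's method selects the threshold automatically; THRESH_BINARY_INV
# maps dark ink to 1 (object point) and light paper to 0 (background).
_, two_tone = cv2.threshold(gray, 0, 1,
                            cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)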

Fig. 1. Basic characters of Oriya script: (a) vowels and (b) consonants.

Fig. 2. Examples of some similar-shaped characters.

3. Feature selection

Histograms of the direction chain code of the contour points of a character are used as features for recognition.13,14 We use two sets of features for Oriya character recognition: 64-dimensional features for high-speed recognition and 400-dimensional features for high-accuracy recognition. The feature extraction procedure is described below.

Fig. 3. (a) A point P and its four neighboring points, shown as X. (b) The direction codes of the eight neighboring points of a point P.


Fig. 4. Pictorial representation of the 64-dimensional feature extraction process for an Oriya character. (a) Two-tone image of the Oriya character 'KA'. (b) Bounding box of the character. (c) Contour of the character shown in black, with the bounding box segmented into 7 x 7 blocks. (d) Chain code of one block shown zoomed, with its chain code counts in the neighboring table. (e) 196-dimensional chain code features obtained from the 7 x 7 blocks. (f) 64-dimensional features obtained after down sampling by a Gaussian filter.

Given a two-tone image, we first find the contour points of the image by the following algorithm: for each object point in the image, consider the 3 x 3 window centered on it; if any one of its four neighboring points (as shown in Fig. 3(a)) is a background point, then this object point P is considered a contour point, otherwise it is a non-contour point.

3.1. 64-dimensional feature extraction

At first we compute the bounding box (the minimum rectangle containing the character) of an input character. This bounding box is then divided into 7 x 7 blocks (as shown in Fig. 4(c)). In each of these blocks the direction chain code of each contour point is noted and the frequency of the direction codes is computed. Here we use chain codes of four directions only [directions 0 (horizontal), 1 (45 degrees slanted), 2 (vertical) and 3 (135 degrees slanted)]; see Fig. 3(b) for an illustration of the chain code directions. We treat directions 0 and 4 as the same, and likewise directions 1 and 5, 2 and 6, and 3 and 7. Thus, in each block we get an array of four integer values representing the frequencies of the chain code in these four directions, and these frequencies are used as features. The histogram of the four direction codes in each block of an Oriya handwritten character ('KA') is shown in Fig. 4(e). For 7 x 7 blocks we therefore get 7 x 7 x 4 = 196 features. To reduce the dimension of the feature vector, after the histogram calculation the 7 x 7 blocks are down sampled into 4 x 4 blocks by a Gaussian filter, giving 4 x 4 x 4 = 64 features for recognition; the corresponding histograms are shown in Fig. 4(f). In this feature calculation we perform a height normalization, done simply by multiplying each feature vector component by the ratio of the standard height to the actual height of the character; we take 76 as the standard height.
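As a concrete illustration, the following is a minimal sketch of this 64-dimensional feature for a 0/1 numpy image. The per-pixel direction counting and the Gaussian smoothing followed by interpolation are our own simplifications of the paper's chain-code histogram and Gaussian down sampling; the height normalization step is omitted.

import numpy as np
from scipy.ndimage import gaussian_filter, zoom

# Offsets for the four merged chain-code directions: 0 horizontal,
# 1 (45 degrees), 2 vertical, 3 (135 degrees).
DIRS = {0: (0, 1), 1: (-1, 1), 2: (-1, 0), 3: (-1, -1)}

def contour_mask(img):
    # An object pixel is a contour point if any 4-neighbour is background.
    pad = np.pad(img, 1)
    bg4 = ((pad[:-2, 1:-1] == 0) | (pad[2:, 1:-1] == 0) |
           (pad[1:-1, :-2] == 0) | (pad[1:-1, 2:] == 0))
    return (img == 1) & bg4

def chain_code_features_64(img, grid=7, out=4):
    c = contour_mask(img)
    ys, xs = np.nonzero(c)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    hist = np.zeros((grid, grid, 4))
    for y, x in zip(ys, xs):
        by = min(grid - 1, (y - y0) * grid // (y1 - y0))   # block row
        bx = min(grid - 1, (x - x0) * grid // (x1 - x0))   # block column
        for d, (dy, dx) in DIRS.items():
            ny, nx = y + dy, x + dx
            if 0 <= ny < img.shape[0] and 0 <= nx < img.shape[1] and c[ny, nx]:
                hist[by, bx, d] += 1                        # direction count
    hist = gaussian_filter(hist, sigma=(1.0, 1.0, 0.0))     # smooth blocks
    feat = zoom(hist, (out / grid, out / grid, 1), order=1) # 7x7 -> 4x4
    return feat.ravel()                                     # 64 features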

3.2. 400-dimensional feature extraction

To obtain the 400-dimensional features we apply the following steps.

Step 1: At first, size normalization of the input binary image is done; we normalize the image to 126 x 126 pixels.

Step 2: The input binary image is then converted into a gray-scale image by applying a 2 x 2 mean filter 5 times.

Step 3: The gray-scale image is normalized so that the mean gray value becomes zero with maximum value 1.

Step 4: The normalized image is then segmented into 9 x 9 blocks.

Step 5: A Roberts filter is then applied to the image to obtain the gradient image. The arc tangent of the gradient (the direction of the gradient) is quantized into 16 directions and the strength of the gradient is accumulated for each quantized direction. The strength of the gradient is defined as
f(x, y) = \sqrt{(\Delta u)^2 + (\Delta v)^2}
and the direction of the gradient as
\theta(x, y) = \tan^{-1} \frac{\Delta v}{\Delta u},
where \Delta u = g(x+1, y+1) - g(x, y), \Delta v = g(x+1, y) - g(x, y+1), and g(x, y) is the gray value at point (x, y).

Step 6: Histograms of the values of the 16 quantized directions are computed in each of the 9 x 9 blocks.

Step 7: The 9 x 9 blocks are down sampled into 5 x 5 blocks by a Gaussian filter. Thus we get 5 x 5 x 16 = 400 dimensional features.
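A minimal sketch of Steps 1-7 follows, assuming a 0/1 numpy image; the resizing, the mean filter, and the interpolation standing in for the Gaussian down sampling are approximations of implementation details the paper does not fix.

import numpy as np
from scipy.ndimage import uniform_filter, zoom

def gradient_features_400(binary_img):
    # Step 1: size-normalize to 126 x 126 pixels.
    fy, fx = 126 / binary_img.shape[0], 126 / binary_img.shape[1]
    img = zoom(binary_img.astype(float), (fy, fx), order=0)
    # Step 2: 2 x 2 mean filtering applied 5 times gives a gray-scale image.
    for _ in range(5):
        img = uniform_filter(img, size=2)
    # Step 3: zero mean, maximum value 1.
    img -= img.mean()
    if img.max() > 0:
        img /= img.max()
    # Step 5: Roberts-type gradient as defined in the text.
    du = img[1:, 1:] - img[:-1, :-1]     # g(x+1, y+1) - g(x, y)
    dv = img[1:, :-1] - img[:-1, 1:]     # g(x+1, y) - g(x, y+1)
    strength = np.hypot(du, dv)
    qdir = (np.floor((np.arctan2(dv, du) + np.pi) / (2 * np.pi) * 16)
            .astype(int) % 16)           # 16 quantized directions
    # Steps 4 and 6: accumulate gradient strength per direction in 9 x 9 blocks.
    hist = np.zeros((9, 9, 16))
    h, w = strength.shape
    for y in range(h):
        for x in range(w):
            hist[y * 9 // h, x * 9 // w, qdir[y, x]] += strength[y, x]
    # Step 7: down-sample 9 x 9 -> 5 x 5 blocks (a Gaussian filter in the paper).
    return zoom(hist, (5 / 9, 5 / 9, 1), order=1).ravel()   # 400 features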

4. Character Recognition Classifier

Recognition of characters with the quadratic classifier13 is carried out using the following discriminant function:

g(X) = (N + N_0 - n - 1) \ln \left[ 1 + \frac{1}{N\sigma^2} \left\{ \|X - M\|^2 - \sum_{i=1}^{k} \frac{N\lambda_i}{N\lambda_i + N_0\sigma^2} \left\{ \phi_i^T (X - M) \right\}^2 \right\} \right] + \sum_{i=1}^{k} \ln \left( \lambda_i + \frac{N_0\sigma^2}{N} \right). \qquad (1)

Here X is the feature vector of an input character; M is the mean vector of the samples; \phi_i is the ith eigenvector of the sample covariance matrix; \lambda_i is the ith eigenvalue of the sample covariance matrix; n is the feature size; \sigma^2 is the initial estimate of the variance; N is the number of learning samples; and N_0 is a confidence constant for \sigma^2, taken as N_0 = 3N/7 for the 64-dimensional features and N_0 = N/9 for the 400-dimensional features. We do not use all the eigenvalues and their eigenvectors for classification: for the 64-dimensional case we sort the eigenvalues in descending order and take the first k = 20 eigenvalues and their respective eigenvectors, and for the 400-dimensional case we take the first k = 40. Here, at first, for high-speed recognition we use the 64-dimensional features in the quadratic classifier; next, to get higher accuracy, we use the 400-dimensional features. The rejection criterion of the proposed system depends on the difference between the 1st and 2nd values of g(X).
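The classifier can be sketched as follows, training one model per class and labeling an input with the class of minimum g(X); the eigendecomposition details and the choice of sigma^2 are assumptions not fixed by the paper.

import numpy as np

class QuadraticClassifier:
    def __init__(self, k=20, n0_frac=3 / 7, sigma2=1.0):
        self.k, self.n0_frac, self.sigma2 = k, n0_frac, sigma2
        self.models = {}

    def fit(self, X, y):
        for c in np.unique(y):
            Xc = X[y == c]
            N, M = len(Xc), Xc.mean(axis=0)
            lam, phi = np.linalg.eigh(np.cov(Xc, rowvar=False))
            keep = np.argsort(lam)[::-1][:self.k]   # leading eigenpairs only
            self.models[c] = (M, lam[keep], phi[:, keep], N, self.n0_frac * N)

    def g(self, x, c):
        # Discriminant of Eq. (1); a smaller value means a better match.
        M, lam, phi, N, N0 = self.models[c]
        s2, n = self.sigma2, len(x)
        d = x - M
        proj = phi.T @ d                             # phi_i^T (X - M)
        shrink = N * lam / (N * lam + N0 * s2)
        mahal = d @ d - np.sum(shrink * proj ** 2)
        return ((N + N0 - n - 1) * np.log(1 + mahal / (N * s2))
                + np.sum(np.log(lam + N0 * s2 / N)))

    def predict(self, x):
        return min(self.models, key=lambda c: self.g(x, c))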

5. Results and discussion

We applied the proposed scheme to 9556 characters obtained from individuals of different sections of the population (students, teachers, government employees, businessmen, etc.) of Orissa state. With the 64-dimensional features, the overall recognition accuracy of the proposed scheme is 84.58% at zero rejection; with the 400-dimensional features it is 91.11% at zero rejection. We also notice that 92.61% (96.52%) accuracy is obtained if we consider the first two top choices of the recognition results with the 64-dimensional (400-dimensional) features. Detailed recognition results with different choices from the top are given in Table 1. For result computation we used a five-fold cross-validation scheme: we divided the database into 5 subsets, tested on each subset using the remaining subsets for learning, and averaged the recognition rates of the five subsets to get the accuracy.
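The evaluation protocol amounts to the following sketch; the classifier factory and data containers are placeholders.

import numpy as np

def five_fold_accuracy(X, y, make_classifier, seed=0):
    # Split into 5 subsets; test on each while training on the other 4.
    idx = np.random.RandomState(seed).permutation(len(X))
    folds = np.array_split(idx, 5)
    rates = []
    for i in range(5):
        train = np.concatenate([folds[j] for j in range(5) if j != i])
        clf = make_classifier()
        clf.fit(X[train], y[train])
        preds = np.array([clf.predict(x) for x in X[folds[i]]])
        rates.append(np.mean(preds == y[folds[i]]))
    return np.mean(rates)          # average of the five recognition rates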

Table 1. Recognition results of Oriya handwritten characters based on different choices from the top (at 0% rejection).

No. of top choices           1        2        3        4        5
64-dimensional feature    84.58%   92.61%   95.98%   97.29%   98.13%
400-dimensional feature   91.11%   96.52%   98.19%   98.92%   99.15%


From the experiment we noted that with both the 64- and 400-dimensional feature vectors the character 'THA' got the highest recognition accuracy (99.21% in 64 dimensions and 100% in 400 dimensions), while the character 'BHA' got the lowest (47.37% in 64 dimensions and 61.05% in 400 dimensions). This lower accuracy is due to the similarity of the character 'BHA' with the characters 'U' and 'UU'. Although some differences in the shapes of these three characters can be seen in their printed versions, during writing people do not pay much attention to them, and the handwritten characters end up looking very similar; hence the lower accuracy. To give an idea of the shapes, a few handwritten samples of these three characters are shown in Fig. 5; many samples of the three characters look very similar.

Table 2. Main confusion pairs obtained from the experiment (the glyphs of pairs 3-7 are not recoverable from the scan).

Character class -> Classified as    400-dim. features    64-dim. features
'BHA' -> 'U'                             29.47%               29.54%
'U' -> 'BHA'                             23.15%               23.64%
Pair 3                                   17.46%               20.10%
Pair 4                                   11.64%                8.46%
Pair 5                                    9.13%               13.44%
Pair 6                                    7.98%                8.46%
Pair 7                                    7.38%               13.30%

From the experiment we noted the main confusing character pairs; the results are given in Table 2. From the table it may be noted that in both the 64- and 400-dimensional cases the maximum confusion is between 'BHA' and 'U': about 29% of the samples of the character class 'BHA' are mis-recognized as samples of the class 'U', and about 23% of the samples of the class 'U' are mis-recognized as samples of the class 'BHA'.

Fig. 5. Handwritten samples of the three Oriya characters 'BHA', 'U' and 'UU', with their printed forms.

We also analyzed the error versus rejection rate of the classifier; the detailed results are shown in Table 3. From the table it can be noticed that with the 64-dimensional (400-dimensional) features, 11.32% (5.61%) error occurs when we reject 7.02% (6.92%) of the characters, and only 3.42% (0.99%) error occurs when 28.09% (28.00%) of the characters are rejected.

There is no existing work on Oriya handwritten characters and hence we cannot compare our results.

Table 3. Rejection versus error rate obtained from the proposed classifier.

64 dimensions    Rejection   0.00%    7.02%   12.22%   17.82%   28.09%
                 Error      15.42%   11.32%   10.00%    6.36%    3.42%
400 dimensions   Rejection   0.00%    6.92%   12.42%   17.96%   28.00%
                 Error       8.89%    5.61%    3.77%    2.45%    0.99%

6. Conclusion

This paper deals with a scheme for the recognition of unconstrained off-line Oriya handwritten characters. To take care of the variability involved in the writing styles of different individuals, the features are mainly computed from the contour of the characters. We tested our scheme on 9556 samples and obtained 91.11% recognition accuracy. To the best of our knowledge there is no previous work on Oriya handwritten characters, and this is the first work on Oriya handwritten character recognition.



References

1. R. Plamondon and S. N. Srihari, IEEE Trans. on PAMI 22, 62 (2000).
2. Y. Suen, M. Berthod and S. Mori, Proceedings of the IEEE 68, 469 (1980).
3. U. Pal and B. B. Chaudhuri, Pattern Recognition 37, 1887 (2004).
4. J. Cai and Z. Q. Liu, IEEE Trans. on PAMI 21, 263 (1999).
5. H. Byan and S. W. Lee, IJPRAI 17, 459 (2003).
6. P. Wunsch and A. F. Laine, Pattern Recognition 28, 1237 (1995).
7. Z. Chi and H. Yan, Pattern Recognition 28, 56 (1995).
8. K. Kim and S. Y. Bang, IEEE Trans. on PAMI 22, 923 (2000).
9. S. Mohanty, IJPRAI 12, 1007 (1998).
10. B. B. Chaudhuri, U. Pal and M. Mitra, Sadhana 27, 23 (2002).
11. K. Roy, T. Pal, U. Pal and F. Kimura, Oriya handwritten numeral recognition system, in Proc. 8th International Conference on Document Analysis and Recognition, 2005.
12. N. Otsu, IEEE Trans. on SMC 9, 62 (1979).
13. F. Kimura, K. Takashina, S. Tsuruoka and Y. Miyake, IEEE Trans. on PAMI 9, 149 (1987).
14. F. Kimura, T. Wakabayashi, S. Tsuruoka and Y. Miyake, Pattern Recognition 30, 1329 (1997).


Recognition of Handwritten Bangla Vowel Modifiers

S. K. Parui, U. Bhattacharya and S. K. Ghosh

Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata-700108, India E-mail: [email protected], [email protected], [email protected]

There have been a few studies on recognition of handwritten basic characters of Indian scripts including Bangla. However, for real-life applications, a proper study of its vowel modifiers and their handwritten variations is essential. The only work reported so far on the recognition of handwritten vowel modifiers of an Indian script1 deals with vowel modifiers that appear in the lower zone of a handwritten Bangla character. In the present study, we propose a scheme for detection and recognition of vowel modifiers that appear in the middle zone. The study is based on shape, size and positional features extracted from the skeleton of a handwritten character image. Recognition is done on the basis of Gaussian and Dirichlet mixture distributions. The scheme has been tested on a recently developed large database of handwritten Bangla vowel modifiers and the results are satisfactory.

Keywords: Handwritten character recognition; Handwritten Bangla character recognition; Bangla vowel modifiers; Dirichlet distribution

1. Introduction

India has a large number of languages and scripts, Bangla being the second most popular Indian script after Devanagari. Bangla is also the official script of Bangladesh, a neighbouring country of India, and is used by more than 200 million people worldwide.

Diverse schemes for handwritten character recognition are discussed in Ref. 2, including works on English, Chinese, Korean, Arabic and Kanji scripts. Such recognition works for Indian scripts include Refs. 3 and 4 for Devnagari characters and Refs. 5 and 6 for Bangla characters. All these recognition studies on handwritten characters of Indian scripts were done on the basis of small databases collected in laboratory environments. Moreover, research on the recognition of off-line handwritten characters of Indian scripts has been limited to basic characters. Significant research in this area requires the availability of standard tools and data resources; progress in handwritten character recognition research for Indian scripts, particularly Bangla, has been slow due to the non-availability of standard databases of handwritten numerals, characters and words. However, the situation has started to change in recent years and a few such databases have been reported.7

2. Background

There are 11 vowels (Fig. 1(a)) and 39 consonants (Fig. 1(b)) in the modern Bangla alphabet; they are called basic characters. Unlike in English, the concept of upper/lower case is absent in Indian scripts. However, a vowel in conjunction with a consonant or a cluster of consonants often forms a different shape, and thus the total size of the alphabet of an Indian script may be as large as 500. In Bangla, a vowel (other than the first vowel) following a consonant may take a modified shape; these are called vowel modifiers (VMs). The modified shapes of the other 10 vowels are shown in Figs. 2(a)-(j); henceforth (in the rest of this article) they will be called A, I, II, U, UU, RI, E, AI, O and AU respectively.

Fig. 1. Bangla basic characters: (a) vowels; (b) consonants.

A large database of handwritten Bangla basic characters and a scheme for its recognition were reported recently.8 A similar database of Bangla VMs has also been developed.1 A major problem encountered in Bangla handwriting recognition is the presence of VMs along with basic characters.

Page 149: 01.AdvancesinPatternRecognition

130 Recognition of Handwritten Bangla Vowel Modifiers

Fig. 2. Vowel modifiers of Bangla: (a) A; (b) I; (c) II; (d) U; (e) UU; (f) RI; (g) E; (h) AI; (i) O; (j) AU.

These VMs may appear in the lower zone, in the middle zone, or in both the upper and middle zones of a basic character; the three zones are explained in Fig. 3. The only entities that appear entirely in the lower zone are the three VMs U, UU and RI. In an earlier work,8 handwritten shapes of these three VMs were studied for their detection and recognition.

Fig. 3. The three zones (upper, middle and lower) found in Bangla characters.

The skeletal representation of the first pattern may have an end pixel in the lower half (Figs. 5(a), (c), (e), (k), (l), (m) and (n)) or it may contain a loop that is vertically elongated (Figs. 5(b), (d), (f), (o), (p), (q) and (r)). The skeleton of the second pattern may also have an end pixel in the lower half (Figs. 5(i), (j), (k), (l), (o) and (p)) or it may have a loop that is not vertically elongated (Figs. 5(g), (h), (m), (n), (q) and (r)).

Fig. 5. Variations in the handwritten forms of the 7 VMs of Bangla, the whole or part of which occur in the middle zone.

Of the remaining 7 vowel modifiers, 3 appear only in the middle zone while the other 4 appear in both the upper and middle zones in their printed forms. However, the shapes of the parts of the latter four vowel modifiers that appear in the middle zone are exactly the same as those of the three VMs that appear entirely in the middle zone. In the present paper, handwritten forms of only the VMs, or parts of VMs, occurring in the middle zone are studied for detection and classification. These 7 VMs, occurring along with a basic character, are shown in Fig. 4. From this figure it can be seen that the parts of these VMs that appear in the middle zone have only two distinct patterns, a vertical line and a curved mark. The first pattern can appear both on the left and on the right of a basic character, while the second occurs only on the left. Detection and recognition of the above are easier in the case of their printed forms.

Fig. 4. Shapes of the 7 VMs under consideration: in (a), (d) and (f) the VM appears only in the middle zone; in (b), (c), (e) and (g) the VM appears in both the upper and middle zones.

The shapes of the above two patterns occurring in the middle zone vary widely in their handwritten forms. Skeletal shapes of several handwritten samples covering the major varieties are shown in Fig. 5.

3. Description of the methodology

3.1. Training and test samples

It is an accepted fact that any effective work on handwriting recognition requires a database of representative samples. Recently we developed such a database of Bangla handwritten vowel modifiers.1 This database consists of a training set of 34,000 samples and a test set of 6,000 samples, with equal numbers of training/test samples in each of the 10 classes. The present work is based on this database of training and test sets.

3.2. Characteristics of the shapes of the VMs

A handwritten VM image is first binarized using Otsu's algorithm9 and cleaned by median filtering. The skeleton of the input image is obtained by applying the thinning algorithm of Datta and Parui.10 Simple heuristics are applied to prune possible hairs in the skeleton. Three features are initially computed from the skeletal image of the input sample: the presence or absence of (i) a skeletal end pixel, (ii) a skeletal loop that is not vertically elongated, and (iii) a skeletal loop that is vertically elongated. Let us now define the four events described in Table 1.


Table 1. Four events characterizing various shapes of vowel modifiers.

Event   Characteristics
L       A skeletal end pixel in the bottom right quadrant
M       A skeletal end pixel in the bottom left quadrant
N       A nearly circular loop in the bottom half
V       A vertically elongated loop

An extensive study of the skeletal shapes of the VM samples in our database shows that the said 7 VMs have 18 handwritten variations, shown in Fig. 5, and each of these can be characterized in terms of the above 4 events. For example, each of Figs. 5(a), (e) and (j) is characterized by the event L, Figs. 5(b), (f), (o), (p) and (r) by the event V, and Fig. 5(g) by the event NV.

VM images satisfying event L have either a vertical line on the right, as in Figs. 5(a), (e), (k) and (l), or a curved mark, as in Figs. 5(j), (l) and (p). We prepare two training sets, T_L1 and T_L2, of images having these two symbols respectively, and also a training set T_L3 of images having no VM but satisfying condition L. Similarly, sample images of VMs satisfying condition M have either a vertical line on the left (Fig. 5(c)) or a curved mark (Fig. 5(i)). We prepare two training sets T_M1 and T_M2 of images having these two symbols respectively, and also a training set T_M3 of images having no VM but satisfying condition M. In total we prepare the 12 training sets of images given in Table 2.

3.3. Feature Vectors and Their Distributions

If there is a skeletal end pixel (Q) in the lower half of the image, we trace the skeleton from Q until a junction pixel or another end pixel (R) is encountered.

In case a loop is detected anywhere in the image, its junction pixel (J) is found and the skeleton is traced in the north or west direction until another junction pixel or an end pixel (R) is encountered. The curve thus traced is divided into 5 segments of equal length.1 That is, four points P_1, P_2, P_3, P_4 on the curve are found such that the curve distances between P_{i-1} and P_i (i = 1, 2, ..., 5) are equal, where P_0 = Q and P_5 = R. For this we use the algorithm of Parui et al.11 Let \theta_i, i = 1, 2, ..., 5, be the angles that the line segments P_{i-1}P_i make with the x-axis. Note that the angles \theta_i are invariant under scaling of the curve and represent only its shape. Finally, the feature vector of the curve is defined as \Theta = (\theta_1, \theta_2, \theta_3, \theta_4, \theta_5, L, X, Y), where L is the length of the curve and (X, Y) are the coordinates of its centre of gravity. We assume that \Theta follows a mixture of 8-dimensional Gaussian distributions,12 defined as

f(\Theta) = \sum_{i=1}^{K_1} p_i u_i(\Theta), \quad u_i(\Theta) = \frac{\exp\{-\frac{1}{2}(\Theta - \mu_i)^T \Sigma_i^{-1} (\Theta - \mu_i)\}}{(2\pi)^{8/2} |\Sigma_i|^{1/2}}.

Here the p_i are prior probabilities and K_1 is the number of components in the mixture.
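A minimal sketch of this 8-dimensional curve feature follows, assuming the traced skeleton curve is given as an ordered list of (x, y) pixels from Q to R; any normalization of L, X and Y is left out.

import numpy as np

def curve_feature(points):
    pts = np.asarray(points, dtype=float)
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg)])   # cumulative curve length
    L = arc[-1]
    # Division points P0..P5 at equal curve distances (P0 = Q, P5 = R).
    P = []
    for t in np.linspace(0.0, L, 6):
        i = min(int(np.searchsorted(arc, t)), len(pts) - 1)
        P.append(pts[i])
    d = np.diff(np.asarray(P), axis=0)
    thetas = np.arctan2(d[:, 1], d[:, 0])   # angles of the five chords
    X, Y = pts.mean(axis=0)                 # centre of gravity of the curve
    return np.concatenate([thetas, [L, X, Y]])

A mixture f(\Theta) can then be fitted to such vectors by EM, e.g. with sklearn.mixture.GaussianMixture, selecting K_1 by BIC as described in Section 3.4.1.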

To distinguish between vertically elongated loops and loops of other shapes, the feature vector \alpha = (U_1, U_2, X, Y) is used, where U_1 is the perimeter of the loop normalized with respect to the image height, U_2 is a measure of the vertical elongatedness of the loop, defined as the ratio of the width and height of the bounding box of the loop, and (X, Y) are the coordinates of its centre of gravity. \alpha is also assumed to follow a mixture of 4-dimensional

Table 2. Training sets and their characteristics.

Set     Description
T_L1    Images satisfying L and having a vertical line on the right
T_L2    Images satisfying L and having a curved mark
T_L3    Images satisfying L but having no VM
T_M1    Images satisfying M and having a vertical line on the left
T_M2    Images satisfying M and having a curved mark
T_M3    Images satisfying M but having no VM
T_N1    Images satisfying N and having a curved mark
T_N2    Images satisfying N but having no VM
T_O1    Images satisfying V and having a vertical line
T_O2    Images satisfying V but having no VM
T_P1    Images obtained after removing the curved mark, if present, and the vertical line from the images in T_L1 U T_O1
T_P2    Images obtained after removing the vertical line on the right from the images of the 4 basic characters


Gaussian distributions, denoted as

g(\alpha) = \sum_{i=1}^{K_2} q_i v_i(\alpha), \quad v_i(\alpha) = \frac{\exp\{-\frac{1}{2}(\alpha - \mu_i)^T \Sigma_i^{-1} (\alpha - \mu_i)\}}{(2\pi)^{4/2} |\Sigma_i|^{1/2}}.

Here the q_i are prior probabilities and K_2 is the number of components in the mixture.

When there is a vertical line on the right, it may or may not be a VM (Figs. 6(a) and 6(b)). To distinguish between a VM and a non-VM in such cases, we use a mixture of Dirichlet distributions,13 defined as

h(\beta) = \sum_{i=1}^{K_3} r_i w_i(\beta),

where the r_i are prior probabilities, K_3 is the number of components in the mixture, and

w_i(\beta) = w_i(\beta_1, \ldots, \beta_4) = \frac{\Gamma(m_i)}{\prod_{j=1}^{4} \Gamma(m_{ij})} \prod_{j=1}^{4} \beta_j^{m_{ij} - 1}, \quad \text{with} \quad \sum_{j=1}^{4} \beta_j = 1, \quad 0 < \beta_j < 1, \quad m_i = \sum_{j=1}^{4} m_{ij}, \quad m_{ij} > 0.

The feature vector \beta = (\beta_1, \ldots, \beta_4) used above is computed from the skeleton image as follows. The bounding box (minimum enclosing rectangle) of the skeleton, excluding the vertical line on the right, is divided into four horizontal strips of equal height, as shown in Figs. 6(a) and 6(b). \beta_i (i = 1, 2, ..., 4) is computed as the proportion of skeletal pixels present in the ith strip, counted from top to bottom.
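A minimal sketch of this computation, assuming the skeleton is a 0/1 numpy image and the vertical line on the right has already been located as a set of column indices (a hypothetical argument):

import numpy as np

def strip_proportions(skel, right_line_cols=()):
    sk = skel.copy()
    if len(right_line_cols):
        sk[:, list(right_line_cols)] = 0      # exclude the vertical line
    ys = np.nonzero(sk)[0]
    y0, y1 = ys.min(), ys.max() + 1           # vertical extent of bounding box
    # Strip index 0..3 from top to bottom, four strips of equal height.
    strip = np.minimum((ys - y0) * 4 // max(1, y1 - y0), 3)
    beta = np.bincount(strip, minlength=4).astype(float)
    return beta / beta.sum()                  # proportions sum to 1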

Fig. 6. (a) A VM is present. (b), (c) No VM is present.

3.4. Classification

3.4.1. Training

From the training set T_L2 we learn the parameters of f_1(\Theta) = \sum_i p_{1i} u_{1i}(\Theta). From the combined training set T_L1 U T_L3 we learn the parameters of f_2(\Theta) = \sum_i p_{2i} u_{2i}(\Theta). Similarly, from the training sets T_M1, T_M2 and T_M3 the distributions f_3(\Theta), f_4(\Theta) and f_5(\Theta) respectively are determined, and from the training sets T_N1 and T_N2 the distributions f_6(\Theta) and f_7(\Theta) respectively are determined. We learn the parameters of g_1(\alpha) = \sum_i q_{1i} v_{1i}(\alpha) from the combined training set T_N1 U T_N2, and the parameters of g_2(\alpha) = \sum_i q_{2i} v_{2i}(\alpha) from the combined training set T_O1 U T_O2.

Note that the distributions u_{2i}(\Theta) represent shapes of the types present on the right side of the images shown in Fig. 6. From the mean vectors of these distributions, we identify the u_{2i}(\Theta) that represent a vertical shape (Figs. 6(a) and 6(b)); the other u_{2i}(\Theta) represent a shape that is not vertical (Fig. 6(c)).

Consider the two images in Figs. 6(a) and 6(b), both of which have a vertical shape on the right side. This shape is a VM in Fig. 6(a) but not in Fig. 6(b). Two distributions h_1(\beta) and h_2(\beta) are estimated from T_P1 and T_P2 respectively.

The expectation maximization (EM) algorithm is used to learn the parameters of the above mixture distributions. To find the optimal number of components K, the Bayesian information criterion (BIC)14 is used.

3.4.2. Testing

The classification of VMs is made on the basis of the algorithms described below. ALGO-1 outputs a decision regarding VMs on the basis of skeletal end pixels in the lower half of the image, while ALGO-2 outputs such a decision on the basis of skeletal loops present in the image. After the outputs of ALGO-1 and ALGO-2 are obtained, ALGO-3 is employed.

ALGO-1:
Step 1. Set SW = 0. If there is no skeletal end pixel, go to STOP-1.
Step 2. If the skeletal end pixel is in the left lower half, compute the feature vector \Theta of the traced curve and then compute f_3(\Theta), f_4(\Theta) and f_5(\Theta).
Step 3. If f_3(\Theta) is the largest, VM 'I' is present (Event M). Go to STOP-3.
Step 4. If f_4(\Theta) is the largest, VM 'E' is present (Event M). Go to STOP-3.
Step 5. If f_5(\Theta) is the largest, no VM is present so far. If there is another end pixel, go to Step 2; else go to STOP-3.
Step 6. If the end pixel is in the right lower half, compute the feature vector \Theta of the traced curve and then compute f_1(\Theta) and f_2(\Theta).
Step 7. If f_1(\Theta) is larger, VM 'E' is present (Event L). Go to STOP-3.
Step 8. If f_2(\Theta) is larger, compute the u_{2i}(\Theta) and find the largest among them, u_{2r}(\Theta). If u_{2r}(\Theta) does not represent a vertical shape, no VM is present so far; if there is another end pixel, go to Step 2. If u_{2r}(\Theta) represents a vertical shape, set SW = SW + 1. If SW = 1 and there is another end pixel, go to Step 2. If SW = 2, VM 'A' is present; go to STOP-3.
STOP-1.

ALGO-2:
Step 1. If there is no loop, go to STOP-2.
Step 2. Compute the feature vector \alpha of the loop and then compute g_1(\alpha) and g_2(\alpha).
Step 3. If g_1(\alpha) is larger (Event N), first compute the feature vector \Theta and then compute f_6(\Theta) and f_7(\Theta).
Step 4. If f_6(\Theta) is larger, VM 'E' is present. Go to STOP-3.
Step 5. If f_7(\Theta) is larger, no VM is present so far. If there is another loop, go to Step 2; else go to ALGO-3.
Step 6. If g_2(\alpha) is larger (Event V), compute the centre of gravity of the loop.
Step 7. If the loop is on the left, VM 'I' is present. Go to STOP-3.
Step 8. If the loop is on the right, set SW = SW + 1. If SW = 1 and there is another loop, go to Step 2. If SW = 2, VM 'A' is present. Go to STOP-3.
STOP-2.

ALGO-3:
Step 1. If SW = 0, no VM is present. Go to STOP-3.
Step 2. If SW = 1 (a vertical shape is present on the right side), all the skeletal pixels falling on the vertical shape are removed and the feature vector \beta of the modified image is computed. Then h_1(\beta) and h_2(\beta) are computed; if the former is greater, the vertical-line VM is present, else no VM is present.
STOP-3.

4. Experimental Results

Simulation results of the proposed recognition scheme have been obtained using the database described in Section 3.1. The basic characters used for training and testing of the present scheme have been taken from a similar database7 also developed by us.

Experimental results corresponding to the events described in Table 1 are given below.

Table 3. Confusion matrix of the test set for event L.

       L1      L2      L3
L1    85.84    8.75    9.70
L2     6.06   83.25    7.96
L3     8.10    8.00   82.34

Event L: A skeletal end pixel is detected in the bottom right quadrant. In this case, the sample may have (i) a vertical line on the right (L1), (ii) a curved mark (L2), or (iii) no VM (L3). In these situations the recognition accuracies are respectively 86.25%, 85.25% and 83.20% on the training sets, and the corresponding figures on the test sets are 85.84%, 83.25% and 82.34%. The relevant confusion matrix for the test set is shown in Table 3.

Event M: A skeletal end pixel is detected in the bottom left quadrant. In this case, the sample may have (i) a vertical line on the left (M1), (ii) a curved mark (M2), or (iii) no VM (M3). In these situations the recognition accuracies are respectively 98.2%, 98.12% and 95.60% on the training sets, and the corresponding figures on the test sets are 97.3%, 96.4% and 95.6%. The relevant confusion matrix for the test set is shown in Table 4.

Event N or V: A nearly circular loop in the bottom half or a vertically elongated loop is detected. In this case, the sample may have (i) a curved mark (N1), (ii) no VM (N2 or O2), or (iii) a vertical line forming part or the whole of a VM (O1). In these situations the recognition accuracies for N1, N2, O1 and O2 are respectively 94.72%, 97.80%, 85.00% and 89.00% on the training sets, and the corresponding figures on the test sets are 93.66%, 96.60%, 84.00% and 88.00%. The relevant confusion matrix for the test set is shown in Table 5.

Page 153: 01.AdvancesinPatternRecognition

134 Recognition of Handwritten Bangla Vowel Modifiers

Table 4. Confusion matrix of the test set for event M.

       M1      M2      M3
M1    97.30    1.77    2.30
M2     1.50   96.40    2.10
M3     1.20    1.83   95.60


Table 5. Confusion matrix of the test set for events N and O.

       N1      N2      O1      O2
N1    93.66    1.60    4.05    4.60
N2     3.00   96.60    5.60    3.30
O1     1.30    1.00   84.00    4.10
O2     2.00    0.80    6.35   88.00

5. Conclusions

In the present article, we have described a scheme for detection and recognition of handwritten vowel modifiers of Bangla. The proposed scheme has been used to identify the whole or part of a vowel modifier that appears in the middle zone of a Bangla character. This is a pioneering work since, to the best of our knowledge, no similar work is available in the literature. Although the recognition accuracies are not high, they are encouraging in view of this fact. In future, we shall take care of the part of handwritten Bangla vowel modifiers that appears in the upper zone.

References

1. S. K. Parui, U. Bhattacharya, A. K. Datta and B. Shaw, Proc. 3rd Workshop on Computer Vision, Graphics and Image Processing (WCVGIP 2006), Hyderabad, 204 (2006).
2. R. Plamondon and S. N. Srihari, IEEE Trans. Patt. Anal. and Mach. Intell. 22(1), 63 (2000).
3. S. D. Connell, R. M. K. Sinha and A. K. Jain, Proc. of the 15th ICPR, Barcelona, Spain, 368 (2000).
4. I. K. Sethi and B. Chatterjee, Pattern Recognition 9, 69 (1977).
5. A. K. Dutta and S. Chaudhury, Pattern Recognition 26, 1757 (1993).
6. F. R. Rahman, R. Rahman and M. C. Fairhurst, Pattern Recognition 35, 997 (2002).
7. U. Bhattacharya and B. B. Chaudhuri, in Proc. of the 8th Int. Conf. on Document Analysis and Recognition, Seoul, 2, 789 (2005).
8. U. Bhattacharya, S. K. Parui, M. Sridhar and F. Kimura, Proc. of IICAI-05, Pune, India, 1357 (2005).
9. N. Otsu, IEEE Trans. SMC 9, 377 (1979).
10. A. Datta and S. K. Parui, Pattern Recognition 27, 1181 (1994).
11. S. K. Parui and D. Dutta Majumder, Pattern Recognition Letters 1(3), 129 (1983).
12. K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, San Diego, 2nd edition, 1990.
13. S. Kotz, N. Balakrishnan and N. L. Johnson, Continuous Multivariate Distributions, Wiley, New York, 2000.
14. J. Bernardo and A. Smith, Bayesian Theory, John Wiley & Sons, 1994.


Template-Free Word Spotting in Low-Quality Manuscripts

Huaigu Cao* and Venu Govindaraju

Center for Unified Biometrics and Sensors (CUBS) Dept. of Computer Science and Engineering

University at Buffalo, Amherst, NY E-mail: {hcao3*, govind} ©buffalo, edu

As OCR techniques are not yet adequate for handwritten scripts with large lexicons, word spotting has been introduced as an alternative to OCR. This paper proposes a novel approach to word spotting that, instead of matching features of the word image to features extracted from predefined templates, uses the estimated posterior probability output by a well-trained classifier for spotting. Gabor features are extracted from the grayscale image in order to yield higher performance on degraded, low-quality document images.

Keywords: Keyword spotting; OCR.

1. Introduction

Nowadays, Optical Character Recognition (OCR) has advanced enough for certain applications such as printed documents, on-line handwritten documents, and off-line handwritten documents with good image quality and relatively small lexicons. However, OCR is not adequate for low-quality handwritten documents with large lexicons. Word spotting1 has been proposed as an alternative to OCR for indexing and retrieving keywords in low-quality handwritten scripts.

The idea of word spotting is to find all the occurrences of a given word in the document images. Here the input is a word, usually provided as a user query, and the output is the set of coordinates of the corresponding word images. Most existing approaches to word spotting choose a certain number of word images containing the user query as templates and find the best matches in the dataset; the most important step is thus to determine the similarity between two word images. Among the existing approaches, similarity can be defined either on the intensity of the raw images or on features extracted from them. Several similarities between raw image data using different definitions of distance, such as XOR, SSD and EDM, are discussed in Ref. 2. There are also similarities defined over feature spaces, such as SLH2 (using the Scott and Longuet-Higgins algorithm), SC3 (shape context matching), DTW4 (dynamic time warping) and CORR5 (recovering the correspondences between points of interest in two images). The performance of all the above similarities is compared in Ref. 4, which shows that the DTW and CORR approaches perform best. B. Zhang et al.6 proposed a method based on word shape and claimed better performance than the DTW approach of Ref. 4.

All of the above approaches assume that, for any query, a few word images are stored as templates in a training set. However, since it is not easy to obtain word images of all possible user queries in advance, this assumption limits the application of word spotting to a small set of keywords. To solve this problem, the approach proposed in this paper performs matching at the character image level rather than the word image level. Since the number of possible characters is very limited, i.e., 26 letters for English text, it is much easier to build a training set of character images.

Another novelty of our approach is that we use Gabor features extracted from grayscale images. For document images of degraded quality, e.g., the noisy carbon-copy medical form shown in Figure 1, so much information is lost after binarization that the binarized version is not even readable by human beings, whereas the grayscale version is still readable. Although there is some literature8,9 on grayscale feature extraction for OCR, all existing feature-matching based word spotting methods still extract features from binary images.

The rest of this paper is organized as follows: in section 2 we describe how to extract features from grayscale word images; in section 3, two similarities are discussed; in section 4, experimental results are presented.


Fig. 1. An example of a carbon copy of a medical form: (a) a patch from the medical form; (b) the binarized image of the patch.

2. Extracting Gabor features from word images

2.1. Gabor filter and Gabor wavelet

The two-dimensional spatial function of the Gabor filter and its two-dimensional Fourier transform can be written as follows:7

g(x, y; F, \sigma_x, \sigma_y) = \frac{1}{2\pi\sigma_x\sigma_y} \exp\left[ -\frac{1}{2}\left( \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} \right) + 2\pi j F x \right] \qquad (1)

G(u, v) = \exp\left\{ -\frac{1}{2}\left[ \frac{(u - F)^2}{\sigma_u^2} + \frac{v^2}{\sigma_v^2} \right] \right\} \qquad (2)

where \sigma_u = 1/(2\pi\sigma_x), \sigma_v = 1/(2\pi\sigma_y), and the parameter F specifies the central frequency of interest. The Gabor wavelet derived from the Gabor filter in equation (1) is defined as

g_{mn}(x, y) = a^{-m} g(x', y'; F, \sigma_x, \sigma_y), \qquad (3)

where

x' = a^{-m}(x \cos\theta + y \sin\theta), \quad y' = a^{-m}(-x \sin\theta + y \cos\theta).

In equation (3), m = 0, 1, ..., S-1, where S is the total number of scales, \theta = n\pi/K, and n = 0, 1, ..., K-1, where K is the total number of orientations. By varying m and n, we can apply the filter g_{mn}(x, y) to the input image (using 2-D convolution) to get features at different scales and orientations. The Gabor wavelet is non-orthogonal, so there is redundant information in the filtered images; the following strategy is used to reduce the redundancy.7 Let U_l and U_h denote the lower and upper center frequencies of interest. Then the filter parameters a, \sigma_u and \sigma_v are determined by

a = (U_h/U_l)^{\frac{1}{S-1}}, \quad \sigma_u = \frac{(a - 1) U_h}{(a + 1)\sqrt{2\ln 2}}, \quad \sigma_v = \tan\left(\frac{\pi}{2K}\right) \left[ U_h - \frac{2\ln 2 \, \sigma_u^2}{U_h} \right] \left[ 2\ln 2 - \frac{(2\ln 2)^2 \sigma_u^2}{U_h^2} \right]^{-1/2}. \qquad (4)

In order to make the Gabor filter sensitive to the strokes of the characters, the central frequency of interest should be set to 1/(2W), where W is the stroke width. This is because the stroke width is the half-period of the signal of interest, so the period is 2W and the frequency is 1/(2W). In the PCR medical forms, W varies from 5 to 8 pixels, corresponding to frequencies 0.1 and 0.0625 respectively. Our experiment differs from related work8,9 in that, instead of applying a Gabor filter of only one scale, we apply a Gabor wavelet of two different scales. The upper and lower frequencies of interest are U_h = 0.1 and U_l = 0.05, so the range of stroke widths is completely covered. This seems more reasonable given that the stroke width is not fixed, and a comparison experiment indicates better performance for the 2-scale Gabor wavelet over a single-scale Gabor filter.
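A minimal sketch of this filter design follows, implementing equations (1)-(4) with S = 2, K = 4, U_h = 0.1 and U_l = 0.05; the kernel size is our own assumption.

import numpy as np

def gabor_bank(S=2, K=4, Uh=0.1, Ul=0.05, size=33):
    a = (Uh / Ul) ** (1.0 / (S - 1)) if S > 1 else 1.0
    ln2 = np.log(2.0)
    su = (a - 1) * Uh / ((a + 1) * np.sqrt(2 * ln2))            # sigma_u
    sv = (np.tan(np.pi / (2 * K)) * (Uh - 2 * ln2 * su ** 2 / Uh)
          / np.sqrt(2 * ln2 - (2 * ln2) ** 2 * su ** 2 / Uh ** 2))  # sigma_v
    sx, sy = 1 / (2 * np.pi * su), 1 / (2 * np.pi * sv)
    r = np.arange(size) - size // 2
    xx, yy = np.meshgrid(r, r)
    bank = []
    for m in range(S):                     # scales
        for n in range(K):                 # orientations, theta = n*pi/K
            th = n * np.pi / K
            xr = a ** (-m) * (xx * np.cos(th) + yy * np.sin(th))
            yr = a ** (-m) * (-xx * np.sin(th) + yy * np.cos(th))
            g = (a ** (-m) / (2 * np.pi * sx * sy)
                 * np.exp(-0.5 * (xr ** 2 / sx ** 2 + yr ** 2 / sy ** 2)
                          + 2j * np.pi * Uh * xr))
            bank.append(g)
    return bank

Each of the S x K = 8 kernels is then convolved with the character image (e.g. with scipy.signal.convolve2d and mode='same') to give the responses used below.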

2.2. Feature extraction from character image

In our dataset, the width and height of character images are between 30 and 40 pixels. We therefore take a 64 x 64 pixel character image with the character at the center and apply a Gabor wavelet of two scales (U_h = 0.1 and U_l = 0.05) and four orientations to the image. Then a set of histogram features proposed in Refs. 8 and 9 is taken from the 48 x 48 sub-image at the center


of the 64 x 64 transformed image. Specifically, the 48 by 48 output of the Gabor filter is divided into N by N blocks, each block r(x, y) being of size M by M, where M = 48/N. In each block, histogram features are calculated separately from the positive and negative real-part outputs, weighted by a Gaussian function:8

F^{+}_{x,y} = \sum_{(m,n) \in r(x,y)} G(m - x, n - y) \cdot \max(0, F_R(m, n)) \qquad (5)

F^{-}_{x,y} = \sum_{(m,n) \in r(x,y)} G(m - x, n - y) \cdot \min(0, F_R(m, n)) \qquad (6)

where G(x, y) = \exp\{-(x^2 + y^2)/(2\tau^2)\}/(2\pi) and F_R is the real-part output of the Gabor filter. For the scale m = 0, the parameters N, M, \tau are set to 4, 12, 6 respectively; for the scale m = 1, they are set to 2, 24, 12. As a result, a total of 2 x 4 x (4 x 4 + 2 x 2) = 160 features is extracted.
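A minimal sketch of equations (5)-(6) for one Gabor response follows; centering the Gaussian weight on each block is our own assumption.

import numpy as np

def block_histogram_features(FR, N, tau):
    # FR: 48 x 48 real-part Gabor response; N x N blocks of size M = 48/N.
    M = 48 // N
    feats = []
    for by in range(N):
        for bx in range(N):
            block = FR[by * M:(by + 1) * M, bx * M:(bx + 1) * M]
            yy, xx = np.mgrid[0:M, 0:M] - (M - 1) / 2.0   # offsets from centre
            G = np.exp(-(xx ** 2 + yy ** 2) / (2 * tau ** 2)) / (2 * np.pi)
            feats.append(np.sum(G * np.maximum(0, block)))  # F+ of Eq. (5)
            feats.append(np.sum(G * np.minimum(0, block)))  # F- of Eq. (6)
    return np.array(feats)

Collecting F+ and F- over the 4 orientations, with N = 4 at scale m = 0 and N = 2 at scale m = 1, yields the 160-dimensional character feature vector.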

2.3. Feature extraction from word image

Fig. 2. Diagram indicating how a word image is split into four character images for feature extraction.

Let I_w[1:64, 1:L] denote a 64 by L (L > 64) word image, where the word is centered vertically in I_w, and suppose I_w contains n characters. A simplified definition of the feature vector of I_w is V_w = [V_1^T V_2^T ... V_n^T]^T, where V_i (1 <= i <= n) is the 160-dimensional Gabor feature vector of the 64 by 64 sub-image I_w[x_i : x_i + 63, 1:64], with

x_i = \frac{(n - i) + (i - 1)(L - 63)}{n - 1}.

In other words, the word image is divided evenly into n square blocks of character images, allowing overlap; Figure 2 illustrates this feature mapping process. However, characters within a word image are seldom distributed evenly. The solution is to generate three candidate feature vectors from the sub-images

I_w[x_i : x_i + 63, 1:64],
I_w[(x_i - 8) : (x_i - 8) + 63, 1:64], and
I_w[(x_i + 8) : (x_i + 8) + 63, 1:64]

for the ith character, and choose the one with minimum cost using the criteria discussed in section 3.
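A minimal sketch of this windowing (0-based indexing, with clamping at the image borders added for safety):

import numpy as np

def char_windows(Iw, n):
    # Iw: 64 x L word image; yields 3 candidate 64 x 64 windows per character.
    L = Iw.shape[1]
    for i in range(1, n + 1):
        # 0-based version of x_i = ((n - i) + (i - 1)(L - 63)) / (n - 1).
        x = (L - 64) // 2 if n == 1 else (i - 1) * (L - 64) // (n - 1)
        for dx in (0, -8, 8):              # centre and +/- 8 pixel shifts
            s = min(max(x + dx, 0), L - 64)
            yield i, Iw[:, s:s + 64]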

3. Similarity between a word image and the query word

Suppose the query w = c_1 c_2 ... c_n is a word consisting of n characters, and the n-character feature vector of the word image I_w is V_w = [V_1^T V_2^T ... V_n^T]^T. We need to measure the similarity between w and V_w. In our experiments, we tested two different similarities, based on Euclidian distance and on posterior probability, respectively.

3.1. Euclidian distance similarity

We can define a Euclidian distance based similarity between a word image and the query word. First we extract the feature vectors of all the character images in the training set and calculate the mean feature vector \mu(c) of each character c. Then the square root of the sum of squared Euclidian distances between each V_i (1 <= i <= n) and \mu(c_i) is calculated as a cost function, i.e.,

C_E(w, V_w) = \left( \sum_{i=1}^{n} \| V_i - \mu(c_i) \|^2 \right)^{1/2}. \qquad (7)

The smaller the cost, the more similar the word image is to the query. In our experiment, since the training set is not very large, we reduce the feature vector of a character image to 40 dimensions by PCA and obtain better performance than with the original 160-dimensional feature vectors. All the 64 by 64 sub-images are taken from the word images, and their distances to each character are calculated and stored in the dataset, so that the execution of a query skips the feature extraction step and is extremely fast.
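A minimal sketch of this cost, using scikit-learn's PCA as one possible implementation of the 160-to-40 reduction:

import numpy as np
from sklearn.decomposition import PCA

def build_means(train_feats, train_labels, dim=40):
    # Reduce 160-dim Gabor vectors to 40 dims, then average per character.
    pca = PCA(n_components=dim).fit(train_feats)
    Z = pca.transform(train_feats)
    labels = np.asarray(train_labels)
    means = {c: Z[labels == c].mean(axis=0) for c in np.unique(labels)}
    return pca, means

def cost_E(word, char_feats, pca, means):
    # Eq. (7): sqrt of the summed squared distances ||V_i - mu(c_i)||^2.
    Z = pca.transform(np.asarray(char_feats))
    return np.sqrt(sum(np.sum((Z[i] - means[c]) ** 2)
                       for i, c in enumerate(word)))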

3.2. Probabilistic similarity

Suppose that the posterior probabilities for character classification, i.e., P(c_i | V_i) (1 <= i <= n), are known and that all the characters are independent; then the posterior probability of the word is P(w | V_w) = \prod_{i=1}^{n} P(c_i | V_i). We can then normalize the probability for different values of n by calculating the cost

C_P(w, V_w) = -\frac{1}{n} \ln P(w | V_w). \qquad (8)

The smaller the cost, the more similar the word image is to the query. In our experiment, the posterior probabilities of character classification are estimated using the method proposed in Ref. 10 from the output of LIBSVM11 (an implementation of the SVM classifier) with an RBF kernel. All estimated probabilities are stored in our database to speed up the execution of queries.
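With per-character posteriors in hand (e.g. from LIBSVM probability estimates), the cost of equation (8) reduces to a few lines; the epsilon floor is our own guard against zero probabilities.

import numpy as np

def cost_P(word, char_posteriors, eps=1e-12):
    # char_posteriors[i][c] approximates P(c | V_i) for the i-th window.
    n = len(word)
    return -sum(np.log(max(char_posteriors[i].get(c, 0.0), eps))
                for i, c in enumerate(word)) / n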

3.3. WMR similarity

For the purpose of comparison, we also performed spotting tests using the Word Model Recognizer (WMR),12 a word recognizer that uses chain-code features from binarized images and performs word recognition under an over-segmentation and recognition scheme. In the spotting test with WMR, the word recognition distance produced by WMR is taken as the cost function.

4. Results and discussions

Our experiments are done on 12 medical form images. We take 5295 character images from the first ten document images of our dataset as a training set and perform spotting tests on the last two images, using as queries all the 101 different words that occur 127 times in the tested document images. First, all the word images are manually segmented (this step could also be automated later). Then spotting tests using the cost functions C_E and C_P and the WMR distance are carried out. The binarization and line removal algorithms proposed in Ref. 13 are applied before running WMR.

To evaluate the performance of our algorithm, recall-precision curves under all three cost functions are drawn in Figure 3. For each cost function, a series of threshold values is taken and the precision and recall are calculated for each threshold. We can see that the method using cost C_P outperforms those using cost C_E and WMR, with gains in the equal recall-precision rate of 26% and 15%, respectively. The test results show that our approach achieves high performance and, unlike existing approaches,4 does not require any word templates.

ftecaHAtciaon

Fig. 3. Recall-precision curves of methods using the WMR word recognizer (Precision = Recall = 54%), Euclidean distance similarity (Precision = Recall = 46%), and probabilistic similarity (Precision = Recall = 69%), respectively.

References

1. R. Manmatha, C. Han, E. M. Riseman, and W. B. Croft, Indexing handwriting using word matching, in 1st ACM International Conference on Digital Libraries, Bethesda, MD, March 20-23, pp. 151-159 (1996).

2. S. Kane, A. Lehman, and E. Partridge, Indexing George Washington's handwritten manuscripts, Technical Report MM-34, Center for Intelligent Information Retrieval, University of Massachusetts Amherst (2001).

3. S. Belongie, J. Malik, and J. Puzicha, Shape matching and object recognition using shape contexts, IEEE Trans. on PAMI 24(4): 509-522 (2002).

4. R. Manmatha and T. M. Rath, Indexing of handwritten historical documents - recent progress, in Symposium on Document Image Understanding Technology (SDIUT), pp. 77-85 (2003).


5. J. L. Rothfeder, S. Feng, and T. M. Rath, Using corner feature correspondences to rank word images by similarity, CIIR Technical Report MM-44 (2003).

6. B. Zhang, S. N. Srihari, and C. Huang, Word image retrieval using binary features, in Document Recognition and Retrieval XI, SPIE vol. 5296, pp. 45-53 (2004).

7. B. S. Manjunath and W. Y. Ma, Texture features for browsing and retrieval of image data, IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8): 837-842 (1996).

8. Xuewen Wang, Xiaoqing Ding, Changsong Liu, Optimized Gabor Filter Based Feature Extraction for Character Recognition, ICPR (4) 2002: 223-226 (2002).

9. Xuewen Wang, Xiaoqing Ding, Changsong Liu, Gabor filters-based feature extraction for character recognition, Pattern Recognition 38(3): 369-379 (2005).

10. T.-F. Wu, C.-J. Lin, and R. C. Weng, Probability estimates for multi-class classification by pairwise coupling, Journal of Machine Learning Research, 5:975-1005 (2004). URL: http://www.csie.ntu.edu.tw/~cjlin/papers/svmprob/svmprob.pdf.

11. Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines, (2001). Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

12. G. Kim and V. Govindaraju, A lexicon driven approach to handwritten word recognition for real-time applications, IEEE Transactions on Pattern Analysis and Machine Intelligence, 19: 366-379 (1997).

13. Robert Milewski, Venu Govindaraju, Extraction of Handwritten Text from Carbon Copy Medical Form Images, Document Analysis Systems pp. 106-116 (2006).


Unconstrained Handwritten Digit Recognition: Experimentation on MNIST Database

V. N. Manjunath Aradhya, G. Hemantha Kumar, S. Noushath

Department of Studies in Computer Science University of Mysore

Mysore - 570006, INDIA. E-mail: [email protected]

The performance of a character recognition system depends heavily on what features are used. Though many kinds of features have been developed and their test performances on standard databases have been reported, there is still room to improve the recognition rate by developing improved features. In this paper, we propose Two-dimensional Principal Component Analysis (2D-PCA) for efficient handwritten digit recognition. 2D-PCA is based on 2D image matrices rather than 1D vectors, so the image matrix does not need to be transformed into a vector prior to feature extraction as is done in PCA.1 For subsequent classification we use a Generalized Regression Neural Network (GRNN). A test performed on the MNIST handwritten numeral database showed a recognition rate that is among the best on the MNIST database.

Keywords: 2D-PCA, GRNN, Handwritten Digit Recognition, MNIST database

1. Introduction

Recognition of characters from document images is at the heart of any document image understanding system. Character recognition has been the subject of intensive research during the last two decades. This is not only because it is a very challenging scientific problem, but also because it provides a solution for processing large volumes of data automatically. Research in character recognition is popular for its various application potentials in banks, post offices, defense organizations, reading aids for the blind, library automation, language processing and multimedia design.

Over the last two decades, Neural Networks have been widely used to solve complex classification problems.2 On the other hand, there is a consensus in the machine learning community that Support Vector Machines (SVMs) are among the most promising classifiers, due to their excellent generalization performance.3 However, SVMs for multi-class classification problems are relatively slow, and their training on large data sets is still a bottleneck. In,4 an improved method of handwritten digit recognition based on the neural classifier Limited Receptive Area (LIRA) was proposed; the classifier LIRA contains three neuron layers: sensor, associative and output layers. An efficient three-stage classifier for handwritten digit recognition, based on Neural Network and Support Vector Machine classifiers, is proposed in.5 Combining multiple classifiers based on third-order dependency for handwritten numeral recognition is presented in:6 a new approximation scheme is proposed to optimally approximate the probability distribution by the third-order dependency, and multiple classifiers are then combined using this approximation scheme. Eigen-deformations for elastic matching based handwritten character recognition are presented in.7

Selection of a feature extraction method is probably the single most important factor in achieving high recognition performance. Principal Component Analysis (PCA)1 and Fisher Linear Discriminant analysis (FLD),8 respectively known as the eigenface and fisherface methods, are two state-of-the-art methods in face recognition. Using these techniques, a character image is efficiently represented as a feature vector of low dimensionality. The features in such a subspace provide more salient and richer information for recognition than the raw image. It is this success which has made face recognition (FR) based on PCA and FLD very active, although they have been investigated for decades. In the PCA-based character recognition technique, the 2D character image matrices must first be transformed into 1D image vectors. The resulting image vectors lead to a high dimensional image vector space, in which it is difficult to compute the covariance matrix accurately due to its large size and the relatively small number of training samples. Hence, in this paper, an image projection technique called two-dimensional principal component analysis (2D-PCA) is used for image feature extraction.9 In contrast to conventional PCA, 2D-PCA is based on 2D matrices rather than 1D vectors, so the image covariance matrix can be constructed directly from the original image matrices. As a result, 2D-PCA has two advantages over PCA: the covariance matrix is easier to compute accurately, and less time is required to determine the corresponding eigenvectors.

The remainder of this paper is organized as follows. In section 2, the 2D-PCA method is presented. In section 3, the GRNN is briefly described. In section 4, experimental results are presented for the MNIST handwritten database. Finally, conclusions are drawn at the end.

2. 2DPCA Method

Let M denote an n-dimensional unitary column vector. An image A is projected onto M by the following transformation:

Y = AM (1)

where Y is called the projected feature vector of image A. The total scatter of the projected samples can be characterized by the trace of the covariance matrix of the projected feature vectors,

J(M) = tr(C_x)    (2)

where C_x denotes the covariance matrix of the projected feature vectors of the training samples and tr(C_x) denotes its trace. The trace can be written as

tr(C_x) = M^T [E((A - EA)^T (A - EA))] M    (3)

Let us define the matrix

I_t = E[(A - EA)^T (A - EA)]    (4)

The matrix I_t is called the image covariance matrix, and can be computed directly from the training image samples. If N is the number of training image samples, the j-th training image is denoted by a matrix A_j (j = 1, 2, ..., N), and the average image is denoted by Ā, then I_t can be evaluated as

I_t = (1/N) Σ_{j=1}^{N} (A_j - Ā)^T (A_j - Ā)    (5)

Alternatively, (2) can be expressed by

J(M) = M^T I_t M    (6)

The unitary vector M that maximizes this criterion is called the optimal projection axis; M_opt is the unitary vector that maximizes J(M). The optimal projection vectors of 2D-PCA, M_1, ..., M_d, are used for feature extraction. For a given image sample A, let

Y_k = A M_k,  k = 1, 2, ..., d    (7)

The vectors Y_k are called the principal component (vectors) of the sample image A. It should be noted that a principal component of 2D-PCA is a vector, whereas a principal component of PCA is a scalar. For classification, the GRNN classifier is used.
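A minimal sketch of 2D-PCA feature extraction following equations (4)-(7) might look as follows; this is an illustration under our own naming, not the authors' implementation.

```python
import numpy as np

def twod_pca(images, d):
    """2D-PCA feature extraction (equations (4)-(7)).

    images : (N, m, n) array of training image matrices A_j.
    d      : number of projection vectors to keep.
    Returns the projection matrix M of shape (n, d) and the
    projected features Y_j = A_j M of shape (N, m, d).
    """
    A = np.asarray(images, dtype=float)
    A_bar = A.mean(axis=0)                        # average image
    centered = A - A_bar
    # Image covariance matrix I_t = (1/N) sum_j (A_j - A_bar)^T (A_j - A_bar)
    I_t = np.einsum('jki,jkl->il', centered, centered) / len(A)
    # Eigenvectors of the symmetric n x n matrix I_t, largest eigenvalues first
    eigvals, eigvecs = np.linalg.eigh(I_t)
    M = eigvecs[:, ::-1][:, :d]                   # optimal axes M_1..M_d
    Y = A @ M                                     # principal component vectors
    return M, Y
```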

3. Generalized Regression Neural Network (GRNN)

Work on artificial neural networks, commonly referred to as neural networks, has been motivated from its inception by the recognition that the human brain computes in an entirely different way from the conventional digital computer.10 The generalized regression neural network was introduced by Nadaraya11 and Watson12 and rediscovered by Specht13 to perform general regressions. Generalized regression neural networks are paradigms of the Radial Basis Function (RBF) network used for functional approximation. To apply GRNN to classification, an input vector x (the 2D-PCA projection feature matrix F_t) is formed and weight vectors W are calculated. The architecture of the GRNN is shown in Fig. 1. A detailed review of this method is given in.10
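For illustration, a minimal sketch of GRNN prediction in its Nadaraya-Watson form is given below; the Gaussian kernel width sigma and the one-hot label encoding are assumptions, not details from the paper.

```python
import numpy as np

def grnn_predict(X_train, y_train, x, sigma=1.0):
    """Generalized regression neural network prediction: a
    Gaussian-kernel weighted average of the training targets
    (Nadaraya-Watson regression). For classification, y_train can
    hold one-hot label vectors.
    """
    d2 = np.sum((X_train - x) ** 2, axis=1)   # squared distances to x
    w = np.exp(-d2 / (2.0 * sigma ** 2))      # RBF pattern-layer outputs
    return w @ y_train / np.sum(w)            # normalised weighted sum
```

With one-hot rows in y_train, the arg-max of the returned vector gives the predicted digit class.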

Fig. 1. Scheme of GRNN


4. The Recognition Results

Our experiments were performed on the well-known MNIST14 database of handwritten digits. The MNIST database consists of a total of 60,000 samples. In this experiment we used 50,000 patterns of the MNIST training set for training, and the remaining 10,000 for testing. All the digits have been size-normalized and centered in a 28 × 28 box. Samples of handwritten digits are shown in Figure 2. Each experiment is repeated 36 times by varying the number of projection vectors t (where t = 1, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, and 100). Since t has a considerable impact on recognition accuracy, we chose the value that corresponds to the best classification result on the image set. All of our experiments were carried out on a PC with a P4 3 GHz CPU and 512 MB RAM, under the Matlab 7.0 platform and Windows XP.


Fig. 2. Samples of handwritten digits of MNIST database

The proposed 2D-PCA method is used for feature extraction, while GRNN is used for subsequent classification. The proposed method achieved a 99.7% recognition rate with 10,000 testing patterns and 50,000 training samples. In Figure 3, a comparison of the performance of different algorithms tested on the MNIST database is given. The method (2D-PCA + GRNN) was used on the original MNIST database and provided an error rate of 0.3%, which is better than the error rates of most of the classifiers. The best results, provided by Boosted LeNet and Virtual SVMs,15 are the state-of-the-art results, but the corresponding classifiers were trained on a perturbed MNIST database.


Fig. 3. Comparison of the error rates of different methods on the test set of MNIST database

5. Conclusion

In this paper, we addressed the problem of handwritten digit recognition using 2D-PCA for feature extraction and a generalized regression neural network for subsequent classification. As opposed to conventional PCA, 2D-PCA is based on 2D matrices rather than 1D vectors, and is very effective and more straightforward to use for recognition. On the MNIST database the method showed a recognition rate (99.7%) that is among the best of the classifiers evaluated on this database. As a future enhancement, we plan to extend this method to handwritten alphanumeric characters.

References

1. M. Turk and A. Pentland, Journal of Cognitive Neuroscience 3, 71 (1991).

2. C. M. Bishop, Neural Networks for Pattern Recognition (Clarendon Press, Oxford, 1995).

3. C. Burges, Knowledge Discovery and Data Mining 2, 1 (1998).

4. E. Kussul and T. Baidyk, Image and Vision Computing 22, 971 (2004).

5. D. Gorgevik and D. Cakmakov, An efficient three-stage classifier for handwritten digit recognition, in 17th Intl. Conf. on Pattern Recognition (ICPR), 2004.

6. H. Kwang, Pattern Recognition Letters 24, 3027 (2003).

7. S. Uchida and H. Sakoe, Pattern Recognition 36, 2031 (2003).

8. P. N. Belhumeur, J. P. Hespanha and D. J. Kriegman, IEEE Trans. Pattern Anal. Machine Intell 19, 711 (1997).

9. J. Yang and D. Zhang, IEEE Trans. Pattern Anal. Machine Intell 26, 131 (2004).


10. S. Haykin, Neural Networks (Pearson Education, 2003).

11. E. A. Nadaraya, Theory of Probability and its Applications 9, 141 (1964).

12. G. Watson, Sankhya Series A 26, 359 (1964).

13. D. F. Specht, IEEE Trans. Neural Networks 2, 568 (1991).

14. http://yann.lecun.com/exdb/mnist.

15. D. DeCoste and B. Scholkopf, Machine Learning 46, 161 (2002).


PART F

Image Registration and Transmission



An Adaptive Background Model for Camshift Tracking with a Moving Camera

R. Stolkin*, I. Florescu**, G. Kamberov***

* Center for Maritime Systems, ** Dept. of Mathematical Sciences, *** Dept. of Computer Science Stevens Institute of Technology,

Castle Point on Hudson, Hoboken, NJ 07030, USA, E-mail: {*[email protected]; **[email protected]; ***[email protected]}

Continuously Adaptive Mean shift (CAMSHIFT) is a popular algorithm for visual tracking, providing speed and robustness with minimal training and computational cost. While it performs well with a fixed camera and static background scene, it can fail rapidly when the camera moves or the background changes since it relies on static models of both the background and the tracked object. Furthermore it is unable to track objects passing in front of backgrounds with which they share significant colours. We describe a new algorithm, the Adaptive Background CAMSHIFT (ABCshift), which addresses both of these problems by using a background model which can be continuously relearned for every frame with minimal additional computational expense. Further, we show how adaptive background relearning can occasionally lead to a particular mode of instability which we resolve by comparing background and tracked object distributions using a metric based on the Bhattacharyya coefficient.

Keywords: CAMSHIFT; mean shift; ABCshift; tracking; adaptive; background model; robot vision

1. Introduction

Popular and effective approaches to colour based tracking include the CAMSHIFT,1,2 Mean Shift3 and particle filtering4,5 algorithms. Of these, CAMSHIFT stands out as the fastest and simplest. CAMSHIFT was designed for close range face tracking from a stationary camera but has since been modified for a variety of other tracking situations.6,7

Robust and flexible tracking algorithms, requiring minimal training and computational resources, are highly desirable for applications such as robot vision and wide area surveillance, both of which necessitate moving cameras. Unfortunately CAMSHIFT often fails with camera motion, figure 3, since it relies on a static background model which is unable to adequately represent changing scenery.

We address these difficulties by modifying the algorithm to include a flexible background representation which can be continuously relearned. The resulting algorithm tracks robustly in two situations where CAMSHIFT fails; firstly with scenery change due to camera motion and secondly when the tracked object moves across regions of background with which it shares significant colours, figure 1.

It is observed that the adaptability of this approach makes it occasionally vulnerable to a special mode of instability, in which a shrinking search window can lead to the background being relearned as object. We detect and correct this error by comparing object and background distributions to check for convergence.

2. Bayesian mean shift tracking with colour models

For each frame of an image sequence, the CAMSHIFT algorithm looks at pixels which lie within a subset of the image defined by a search window. Each pixel in this window is assigned a probability that it belongs to the tracked object, creating a 2D distribution of object location over a local area of the image. The tracking problem is solved by mean shifting3,8 towards the centroid of this distribution to find an improved estimate of the object location. The search window is now repositioned at the new location and the process is iterated until convergence.

By summing the probabilities of all the pixels in the search window, it is also possible to estimate the size of the tracked object region (in pixels). The search window can now be resized so that its area is always in a fixed ratio to this estimated object area.

The tracked object is modelled as a class conditional colour distribution, P(C|O). Depending on the application, 1D Hue, 3D normalised RGB, 2D normalised RG, UV or ab histograms may all be appropriate choices of colour model, the important point being that these are all distributions which return a probability for any pixel colour, given that the pixel represents the tracked object. These object distributions can be learned offline from training images, or during initialisation, e.g. from an area which has been user designated as object in the first image of the sequence.

The object location probabilities can now be computed for each pixel using Bayes' law as:

P(O|C) = P(C|O) P(O) / P(C)    (1)

where P(O|C) denotes the probability that the pixel represents the tracked object given its colour, P(C|O) is the colour model learned for the tracked object, and P(O) and P(C) are the prior probabilities that the pixel represents object and possesses the colour C respectively.

The denominator of equation (1) can be expanded as:

P(C) = P(C|O) P(O) + P(C|B) P(B)    (2)

where P(B) denotes the probability that the pixel represents background.

Bradski1,2 recommends values of 0.5 for both P(O) and P(B). We find this choice difficult to justify, since we take these terms to denote the expected fractions of the total search window area containing object and background pixels respectively. Hence we assign values to the priors in proportion to their expected image areas. If the search window is resized to be r times bigger than the estimated tracked object area, then P(O) is assigned the value 1/r and P(B) is assigned the value (r - 1)/r.
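As a sketch (ours, not the authors' code), the per-pixel computation of equation (1) with these priors could be written as:

```python
import numpy as np

def object_probability(p_c_given_o, p_c, r):
    """Per-pixel object probability of equation (1) with the priors
    proposed here: P(O) = 1/r and P(B) = (r - 1)/r, where r is the
    ratio of search window area to estimated object area.

    p_c_given_o : P(C|O) looked up in the object colour histogram
                  for each pixel in the search window.
    p_c         : P(C) looked up in the (relearned) colour histogram.
    """
    p_o = 1.0 / r
    return p_c_given_o * p_o / np.maximum(p_c, 1e-12)  # guard divide-by-zero
```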

Bradski1,2 suggests learning the expression (2) offline (presumably building a static P(C|B) histogram from an initial image). While it is often reasonable to maintain a static distribution for the tracked object (since objects are not expected to change colour), a static background model is unrealistic when the camera moves. The CAMSHIFT algorithm can rapidly fail when the background scenery changes, since colours may exist in the new scene which did not exist in the original distribution, such that the expressions in Bayes' law will no longer hold true and the calculated probabilities no longer add up to unity.

Particular problems arise with CAMSHIFT if the tracked object moves across a region of background with which it shares a significant colour. Now a large region of background may easily become mistaken for the object, figure 1.

3. Incorporating an adaptive background model

We address these problems by using a background model which can be continuously relearned. Rather than using an explicit P(C|B) histogram, we build a P(C) histogram which is recomputed every time the search window is moved, based only on the pixels which lie within the current search window. P(C) values, looked up in this continuously relearned histogram, can now be substituted as the denominator of the Bayes' law expression, equation (1), for any pixel. Since the object distribution, P(C|O), remains static, this process becomes equivalent to implicitly relearning the background distribution, P(C|B), because P(C) is composed of a weighted combination of both these distributions, equation (2). Relearning the whole of P(C), rather than explicitly relearning P(C|B), helps ensure that probabilities add up to unity (e.g. if there are small errors in the static object model).

Adaptively relearning the background distribution helps prevent tracking failure when the background scene changes, particularly useful when tracking from a moving camera, figure 4. Additionally, it enables objects to be tracked, even when they move across regions of background which are the same colour as a significant portion of the object, figure 2. This is because, once P(C) has been relearned, the denominator of Bayes' law, equation (1), ensures that the importance of this colour will be diminished. In other words, the tracker will adaptively learn to ignore object colours which are similar to the background and instead tend to focus on those colours of the object which are most dissimilar to whatever background is currently in view.

It is interesting to note that the continual relearning of the P(C) histogram need not substantially increase computational expense. Once the histogram has been learned for the first image it is only necessary to remove from the histogram those pixels which have left the search window area, and add in those pixels which have newly been encompassed by the search window as it shifts with each iteration. Provided the object motion is reasonably slow relative to the camera frame rate, the search window motion will be small, so that at each iteration only a few lines of pixels need be removed from and added to the P(C) histogram.
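A minimal sketch of this incremental update is shown below (ours, not from the paper), assuming each pixel's colour has been quantized to an integer histogram bin index; normalising the counts then gives P(C).

```python
import numpy as np

def shift_histogram(hist, removed_bins, added_bins):
    """Incrementally update the P(C) colour histogram as the search
    window shifts: subtract the bin counts of pixels that left the
    window and add those of pixels that entered it, instead of
    rebuilding the whole histogram.

    hist         : 1D array of per-bin pixel counts.
    removed_bins : bin indices of pixels that left the window.
    added_bins   : bin indices of pixels that entered the window.
    """
    np.subtract.at(hist, removed_bins, 1)   # pixels leaving the window
    np.add.at(hist, added_bins, 1)          # pixels entering the window
    return hist                             # P(C) = hist / hist.sum()
```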


If the P(C) histogram is relearned only once every frame, the speed should be similar to that of CAMSHIFT. However, if the histogram is relearned at every iteration, some additional computational expense is incurred, since to properly exploit the new information it is necessary to recompute the P(O|C) values for every pixel, including those already analysed in previous iterations. Theoretically, updating at each iteration should produce more reliable tracking, although we have observed good tracking results with both options.

In practice, ABCshift often runs significantly faster than CAMSHIFT. Firstly, the poor background model can cause CAMSHIFT to need more iterations to converge. Secondly, the less accurate tracking of CAMSHIFT causes it to automatically grow a larger search window area, so that far greater numbers of pixels must be handled in each calculation.

4. Dealing with instability

Adaptively relearning the background model enables successful tracking in situations where CAMSHIFT fails; however, it also introduces a new mode of instability which occasionally causes problems. If the search window should shrink (due to the object region being temporarily underestimated in size) to such an extent that the boundaries of the search window approach the boundaries of the true object region, then the background model will learn that background looks like object. This results in a negative feedback cycle with the estimated object region and search window gradually (and unrecoverably) collapsing.

We solve this problem by noting that as the search window shrinks and approaches the size of the object region, the learned background distribution, P(C|B), must become increasingly similar to the static distribution for the tracked object, P(C|O). If this increasing similarity can be detected, then both the object region and search window can be easily resized, figure 5, the correct enlargement factor being r, the desired ratio of search window size to object region size.

Several statistical measures exist for comparing the similarity of two histograms. We utilise a Bhattacharyya metric,9 sometimes referred to as the Jeffreys-Matusita distance,10 which for two histograms p = {p_i} and q = {q_i}, i ∈ {1, 2, ..., K}, is defined as:

d(p, q) = [ Σ_{i=1}^{K} (√p_i - √q_i)² ]^{1/2}    (3)

with 0 ≤ d ≤ √2. Note that this metric can easily be shown to be the same, modulo a factor of √2, as that referred to elsewhere in the literature.3-5,8

At each iteration we evaluate the Bhattacharyya metric between the static object distribution, P(C|O), and the continuously relearned background distribution, P(C|B). If the Bhattacharyya metric approaches zero, we infer that the search window is approaching the true object region size while the estimated object region is collapsing. We therefore resize both by the factor r. In practice we resize when the Bhattacharyya metric drops below a preset threshold. Useful threshold values typically lie between 0.2 and 0.7.
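A direct transcription of equation (3) is straightforward; the resize test in the trailing comment reflects the procedure described above, with hypothetical names for the two distributions.

```python
import numpy as np

def bhattacharyya_metric(p, q):
    """Metric of equation (3) between two normalised histograms:
    d(p, q) = sqrt(sum_i (sqrt(p_i) - sqrt(q_i))^2), 0 <= d <= sqrt(2).
    """
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Resize when the relearned background distribution converges towards
# the static object distribution (names here are illustrative):
# if bhattacharyya_metric(p_c_given_o, p_c_given_b) < threshold:
#     enlarge object region and search window by the factor r
```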

We believe this is a novel application of the Bhattacharyya metric. It is common in the vision literature3-5,8 to use this metric to evaluate the similarity between a candidate image region and an object distribution for tracking (i.e. comparing potential object with known object). In contrast, we use the metric to compare an object distribution with a background distribution, inferring an error if the two begin to converge.

5. Summary of the ABCshift tracker

The key differences between ABCshift and the conventional CAMSHIFT tracker are as follows:
1. The background is continuously relearned. In contrast, CAMSHIFT uses a static background model which often fails with a moving camera.
2. Values for the prior probabilities, P(O) and P(B), are assigned based on the ratio, r, of search window size to estimated object region size.
3. Object and background distributions are compared using a metric based on the Bhattacharyya coefficient, to check for instability and resize the search window if it is shrinking.

The ABCshift algorithm is summarised as:
1. Identify an object region in the first image and train the object model, P(C|O).
2. Center the search window on the estimated object centroid and resize it to have an area r times greater than the estimated object size.
3. Learn the colour distribution, P(C), by building a histogram of the colours of all pixels within the search window.
4. Use Bayes' law, equation (1), to assign object probabilities, P(O|C), to every pixel in the search window, creating a 2D distribution of object location.
5. Estimate the new object position as the centroid of this distribution and estimate the new object size (in pixels) as the sum of all pixel probabilities within the search window.
6. Compute the Bhattacharyya metric between the distributions P(C|O) and P(C). If this metric is less than a preset threshold, enlarge the estimated object size by a factor r.
7. Repeat steps 2-6 until the object position estimate converges.
8. Return to step 2 for the next image frame.

6. Results

To test the enhanced tracking capabilities of the adaptive background model, we compare the performance of the ABCshift tracker with that of CAMSHIFT for a large number of extended video sequences, available at our website.11 Some sample images are shown here in the figures.

ABCshift has succeeded in conditions of substantial camera motion, rapidly changing scenery, dim and variable lighting and partial occlusion. The strengths of ABCshift are particularly apparent in scenarios where the tracked object passes across regions of background with which it shares significant colours.

7. Conclusions

Adaptive background models, which can be continuously relearned, significantly extend the capabilities and potential applications of simple colour based tracking algorithms to include moving cameras and changing scenery. They also provide robustness in difficult situations where other algorithms fail, such as when the tracked object moves across regions of background with which it shares significant colours.

It is observed that the resulting flexibility can make the tracker prone to a particular mode of instability, however this can be corrected by an innovative application of a metric based on the Bhattacharyya coefficient.

Future work will examine alternative adaptive techniques for relearning models of both the background and the tracked object. We are also exploring automated initialisation of the tracker in surveillance scenarios and applications of these algorithms to visual servoing and robot vision.

References

1. G. R. Bradski, Intel Technology Journal (Q2 1998).

2. G. R. Bradski, Real time face and object tracking as a component of a perceptual user interface, in Proc. 4th IEEE Workshop on Applications of Computer Vision, 1998.

3. D. Comaniciu, V. Ramesh and P. Meer, IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 564 (2003).

4. K. Nummiaro, E. Koller-Meier and L. Van Gool, Object tracking with an adaptive color-based particle filter, in Pattern Recognition: 24th DAGM Symposium, Proceedings, Lecture Notes in Computer Science Vol. 2449 (Zurich, Switzerland, 2002).

5. P. Perez, C. Hue, J. Vermaak and M. Gangnet, Color-based probabilistic tracking, in Proceedings of the 7th European Conference on Computer Vision - Part I, Lecture Notes in Computer Science Vol. 2350 (2002).

6. J. G. Allen, R. Y. D. Xu and J. S. Jin, Object tracking using camshift algorithm and multiple quantized feature spaces, in Proceedings of the Pan-Sydney area workshop on Visual information processing, ACM International Conference Proceeding Series Vol. 100 (Australian Computer Society, Inc., Darlinghurst, Australia, 2004).

7. N. Liu and B. C. Lovell, MMX-accelerated real-time hand tracking system, in IVCNZ 2001, (Dunedin, New Zealand, 2001).

8. D. Comaniciu and P. Meer, Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(5), 603-619 (May 2002).

9. A. Bhattacharyya, On a measure of divergence between two statistical populations defined by their probability distributions, Bulletin of the Calcutta Mathematical Society 35, 99-110 (1943).

10. H. Jeffreys, An invariant form for the prior probability in estimation problems. Proc. Royal Society (A) 186, 453-461 (1946).

11. www.math.Stevens.edu/~ifloresc/ABCshift.htm.


Fig. 1. A simple blue and red checkered object, moving from a region of white background into a region of red background. CAMSHIFT fails as soon as the object moves against a background with which it shares a common colour. Frames 350, 360, 380, 400, and 450 shown. Green and red squares indicate the search window and estimated object size respectively. This movie, RedWhitelCAMSHIFT.avi, can be viewed at our website.11

Fig. 2. ABCshift tracks successfully. Frames 350, 360, 380, 400, and 450 shown. Green and red squares indicate the search window and estimated object size respectively. This movie, RedWhitelABCshift.avi, can be viewed at our website.11

Fig. 3. Person tracking with CAMSHIFT from a moving camera in a cluttered, outdoors environment. Frames 1, 176, 735, 1631, and 1862 shown. Since the tracked person wears a red shirt, CAMSHIFT fixates on red regions of background, including brick walls and doors, and repeatedly loses the tracked person. Green and red squares indicate the search window and estimated object size respectively. This movie, PeopleTrackinglCAMSHIFT.avi, can be viewed at our website.11

Fig. 4. ABCshift successfully tracks throughout the sequence and is not distracted by red regions of background, despite being initialised in image 1 which contains no red background. Frames 1, 176, 735, 1631, and 1862 shown. Green and red squares indicate the search window and estimated object size respectively. This movie, PeopleTrackinglABCshift.avi, can be viewed at our website.11

Fig. 5. Bhattacharyya resizing. A simple red and blue checkered object is tracked across red, white and blue background regions. Frames 180, 200, 205, 206 shown. Due to rapid, jerky motion from frames 180 to 205, the search window has shrunk until it falls within the object region, risking relearning that background looks like object. ABCshift has detected this instability using the Bhattacharyya metric, and automatically corrects the estimated object region and search window size in frame 206. Green and red squares indicate the search window and estimated object size respectively. This movie, BhattacharyyaRedWhiteBlue.avi, can be viewed at our website.11


Colour and Feature Based Mult iple Object Tracking Under Heavy Occlusions

Pabboju Sateesh Kumar, Prithwijit Guha and Amitabha Mukerjee

Computer Vision Group, IIT Kanpur Kanpur - 208016, UP, India

E-mail: {psateesh, pguha, amit}@iitk.ac.in

Tracking multiple objects in surveillance scenarios involves considerable difficulty because of occlusions. We report a composite tracker - based on feature tracking and colour based tracking - that demonstrates superior performance under high degrees of occlusion. Disjoint foreground blobs are extracted by using change masks obtained by combining an online-updated background model and flow information. The state of occlusion/isolation is identified by associating foreground blobs with object regions predicted using motion initialized mean-shift tracker (colour cue). The feature tracker is invoked in occluded situations to localize these with higher accuracy. We present results from dense traffic data with 5-15 objects in the scene at any instant. Overall tracking accuracy improves to 94.7% from 85.3% achieved by the colour only tracker.

Keywords: Tracking, Occlusions, Feature Correspondence


Fig. 1. Results of multi-object tracking using (a) only the colour cue and (b) both colour match and feature correspondences. The scene contains a total of 16 objects, of which 10 appear in crowds. (a) The colour based tracker properly localizes only 5 objects; (b) the composite colour-feature based tracker successfully localizes 14 objects in the scene.

1. Introduction

An algorithm for tracking multiple agents in a monocular surveillance setup is reported. An early approach to this problem deals with tracking blobs obtained from the process of background subtraction.1 However, multiple objects may form a group and get detected as a single blob, or an agent can be detected as multiple blobs, due to occlusions. The W4 system2 differentiates people from other objects by shape and motion cues and tracks them under occlusions by constructing appearance models and detecting body parts. Several researchers3 have employed particle filtering along with prior shape and motion models for multi-person tracking in cluttered scenes. Recently, Zhao et al.4 proposed a Bayesian approach for tracking multiple persons under occlusions by computing MCMC based MAP estimates with prior information about the camera model and human appearance, along with a ground plane assumption. McKenna et al.,5 on the other hand, present a colour based tracking algorithm that performs in relatively unconstrained environments and works at three levels of abstraction, viz. regions, people and groups.

In the recent past, a number of approaches have distinguished the types of occlusion, treating these as a source of additional information to be used in the visual analysis.6 This work builds on this approach, and uses the occlusion type to guide the tracking. We present a hybrid approach based on feature and colour based tracking, and contrast this with other approaches involving only colour.

The algorithm works by learning a background model as a pixel-wise mixture of Gaussians; change masks computed on this model, along with inter-frame motion information, segment the objects as foreground blobs.6

The object is characterized by its supporting region, weighted colour distribution, trajectory and a planar graph constructed by Delaunay triangulation of the feature point set extracted in its supporting region. The system maintains a set of objects to (from) which objects are added (removed) as they enter (exit) the scene. We identify the objects as either isolated or occluded by associating the object regions predicted by motion initialized mean shift trackers with the foreground blobs. Moreover, the dissociated object/foreground regions are detected to identify disappearances/reappearances. The occluded objects are further tracked with higher accuracy by feature correspondences constrained by the feature-point graph structure. The object features are selectively updated based on their occlusion states.

Some salient strengths of the proposed scheme are the following. First, the ability to identify occlusion states and to use them in selective feature updates. Second, the inherent ability to recognize failure situations and automatically restore tracks. Finally, a relatively unconstrained approach that does not assume any priors on object shape, motion models or the ground plane. Figure 1 shows the performance improvement achieved by the hybrid colour-feature based tracking algorithm as compared to a colour only tracker.

The colour based multi-agent tracking algorithm and its extension to invoke feature correspondences is presented in section 2. Experimental results on a dense traffic video with ground-truth validation are reported in section 3.

2. Multiple Object Tracking

The object regions are segmented as a set of disjoint foreground blobs extracted by combining the cues derived from the change masks over the learned background models (pixel-wise mixture of Gaussians) and the inter-frame motion information. The object regions predicted by motion initialized mean shift trackers7 are associated with the extracted foreground blobs6 to detect the objects either in the state of isolation from other objects and background elements, or in the state of occlusion arising due to crowding and partial occlusions. Additionally, the cases of entry/exit and disappearance/reappearance are also identified. These occlusion cases guide the tracking algorithm in selective object feature updates and track restoration. The system maintains a set S(t) of objects, to (from) which objects are added (removed) as they enter (exit) the scene. The individual object features are updated as they are tracked across frames. When an unmatched foreground blob is detected, it is matched with disappeared objects based on colour and position; a search region around each disappeared object is taken into consideration during this matching. In the following sub-sections, we detail the limitations of the colour based multi-object tracking algorithm6 and the proposed extension that combines colour and feature based tracking.

2.1. Tracking With Colour Cue

The j-th object A_j(t) is characterized by the set of pixels a_j(t) it occupies, the colour distribution h_j(t) weighted by the Epanechnikov kernel7 supported over the minimum bounding ellipse of a_j(t), and the finite-length position history {c_j(t - t')}_{t'=0}^{T} of the centers of the minimum bounding ellipse of a_j(t). The object features are initially learned from the foreground blob extracted at its very first appearance and are updated throughout the sequence whenever it is in isolation.

An estimate c_j'(t) of the center of the minimum bounding ellipse of a_j(t) is obtained by extrapolating from the trajectory {c_j(t - t')}_{t'=1}^{T}. The mean-shift iterations,7 initialized at an elliptic region centered at c_j'(t), further localize the center of the minimum bounding ellipse of the object region at c_j(t).

The object region and foreground blob associations are computed to identify the various occlusion states. The supporting pixel set, weighted colour distribution and trajectory information are updated for isolated objects. For the occluded ones (where the same foreground pixel is claimed by different object regions), we only update the trajectory. Moreover, we identify the dissociated objects (disappearance) and blobs (entry/reappearance), followed by a re-computation of the object-blob associations to restore tracks of existing objects and log new objects in S(t).

The colour only tracker employs the mean-shift algorithm for object localization and is thus prone to erroneous drifts in the mean-shift iterations. The mean shift algorithm models the target as a weighted colour distribution learned over an elliptical domain. Thus, convex, near-elliptic, compact objects are successfully tracked with this algorithm. However, several real world objects have non-convex shapes with holes - e.g. a rickshaw, a cycle, a man on a motorbike, etc. In such cases, the mean-shift tracker learns the background colour distribution in the target model and hence drifts away in the object localization iterations. Moreover, mean-shift trackers are also found to fail under severe occlusions, as they model the colour distribution of the whole target region and not by parts. To avoid these limitations, we extend the object characterization to include feature points as well, which we describe in the following sub-section.


2.2. Combining Colour and Feature Cues

Feature correspondence was proposed in the context of image registration,8 and later extended9 for the selection of good feature points. Consider the consecutive images Ω_t and Ω_{t+1}, such that Ω_t(U) = Ω_{t+1}(U + d_U). Tomasi et al.9 have shown that the displacement vector d_U is sufficient for tracking feature points between successive frames, approximating the deformation as zero. Feature points are tracked using the Kanade-Lucas-Tomasi (KLT) tracker.10

The sum of squared differences between consecutive images is minimized to find the displacement vector. Unlike earlier approaches, tracking is based on a symmetric definition of the dissimilarity between the two images, Ω_t(U - d_U/2) = Ω_{t+1}(U + d_U/2). The displacement vector can be computed by solving the equation G(U) d_U = e(U), where the 2 × 2 symmetric matrix G and the residue vector e are obtained from

G(U) = ∫_{W(U)} g_t(X) g_t(X)^T w(X) dX

e(U) = 2 ∫_{W(U)} (Ω_t - Ω_{t+1})(X) g_t(X) w(X) dX

where the integration is performed over a window W(U) centered at U, g_t(X) = ∇Ω_t(X) is the image gradient and w(X) is a weighting function defined over W(U). The pixel position U is considered to host a good feature point if both eigenvalues (λ_1, λ_2) of G(U) are sufficiently high, i.e. min(λ_1, λ_2) > λ, where λ is a predefined threshold.
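A minimal sketch of this good-feature test is given below (ours, not the authors' code); discrete sums over the window stand in for the integrals over W(U).

```python
import numpy as np

def is_good_feature(grad_x, grad_y, weights, lam):
    """Good-feature test for one candidate window W(U): build the
    2 x 2 matrix G(U) from the image gradients and accept the point
    if min(lambda_1, lambda_2) > lam.

    grad_x, grad_y : image gradients over the window (same shape).
    weights        : weighting function w(X) over the window.
    """
    gxx = np.sum(weights * grad_x * grad_x)
    gxy = np.sum(weights * grad_x * grad_y)
    gyy = np.sum(weights * grad_y * grad_y)
    G = np.array([[gxx, gxy], [gxy, gyy]])
    return np.linalg.eigvalsh(G).min() > lam   # smaller eigenvalue test
```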

Fig. 2. Delaunay triangulation based graph models for a car, a cycle and a rickshaw.

The feature tracker is invoked for objects under occlusion. We extend the object characterization of the colour only tracker to include the set of feature points in the object region. We perform a Delaunay triangulation (figure 2) over the feature point set, forming a planar graph that represents the geometrical structure of the object. Isolated feature tracking can erroneously correspond to (a) points in the background or (b) points on other objects. In case (a), foreground segmentation can be used to eliminate the correspondence. In case (b), the faulty feature correspondence is detected using the Motion Consistency Hypothesis: feature points on the same rigid body exhibit consistent motion. Even where the bodies exhibit large deformation (e.g. human motion), some branches of the graph exhibit relatively stable deformations. Object feature points are tracked in consecutive images constrained by the feature-point planar graph structure to improve tracking performance. However, in cases of (near) complete occlusion, where neither colour match nor feature correspondences can be established, we continue tracking with the motion predicted object position. An object is said to disappear either when it is found to have disappeared using the predicates described in6 or when it loses a minimum threshold number of features. When an object disappears it is tracked using only motion information, which becomes more error-prone as the number of frames increases. With the hybrid tracker the duration of disappearance is shorter than with the colour only tracker, since objects are tracked until they completely disappear; hence the matching accuracy of the hybrid tracker is higher.
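For illustration, the planar graph could be built with an off-the-shelf Delaunay triangulation, as in the following sketch (ours, not the authors' implementation):

```python
import numpy as np
from scipy.spatial import Delaunay

def feature_graph_edges(points):
    """Planar graph over the tracked feature points: Delaunay
    triangulation of the (x, y) feature locations, returned as a set
    of undirected edges whose deformations can be monitored under the
    motion consistency hypothesis.
    """
    tri = Delaunay(np.asarray(points))
    edges = set()
    for simplex in tri.simplices:               # each triangle (i, j, k)
        for a in range(3):
            i, j = simplex[a], simplex[(a + 1) % 3]
            edges.add((min(i, j), max(i, j)))   # store edge once
    return edges
```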

3. Results and Ground-truth Validation

We report an experiment on a traffic surveillance video involving a wide variety of vehicles such as motorbikes, bicycles, rickshaws, cars, buses, trucks and tractors, as well as people and animals (cows). Next, we compare our results in multi-object tracking based on colour only and on both colour and features (figure 3).

The proposed algorithm is tested on different data sets. Figure 4 shows the results obtained on a human data set. In this data set, we do not discard feature points based on the Motion Consistency Hypothesis described in section 2.2, because the tracked objects (humans) in this video are not rigid. Tracking accuracy in this case is lower for both the colour only tracker and the hybrid tracker compared to their respective performances on the traffic video (figure 3).



Fig. 3. Re-identification errors. In sequence (c,d) (Mean shift tracker), a silver SUV occludes the rickshaw at upper center; after the occlusion the rickshaw is misidentified as another object that disappeared earlier, and the SUV is wrongly matched as rickshaw. The man in the blue bounding box is matched with a car that left the scene. All these errors are overcome by the hybrid feature tracker (e,f), it is particularly good at re-identification.


Fig. 4. Comparison of Mean shift and Modified trackers on Human data set

We observe, in figures 4(a) and (c), that the mean shift tracker clearly fails to track the person under severe occlusion behind another person. In figures 4(b) and (d), we observe that the hybrid tracker is able to track the person under severe occlusion (blue bounding box).

Minimum distance (MinDist): The distance between point features can be varied. While selecting point features, if the feature about to be selected is near already selected features, it is not selected. This is based on the assumption that neighbouring pixels generally have similar goodness values; a point feature is considered near another if the distance between them is less than the minimum distance. This can be used to speed up the process. For the results in figure 4, MinDist is fixed at 0, while it is fixed at 5 in figure 3.

We validate the results against ground-truth data over 700 frames. We compute the following measures:

• Total Tracking Accuracy: accuracy_t = Σ_i b_i / Σ_i (b_i + c_i), where b_i denotes the number of well-tracked objects and c_i the number of track losses in the i-th frame.

• Re-identification Accuracy (re-ident_t): Let s_i be the number of disappeared objects in the i-th frame, e_i the number of erroneously tagged reappearances (for example, agents a and b disappear in frames x and y respectively, and in a later frame i, i > x, y, the tracker identifies agent b as reappeared when it was actually agent a), g_i the number of successfully registered object entries into the scene, f_i the number of reappearances erroneously detected as scene entries, and h_i the number of scene entries erroneously detected as reappearances. Then re-ident_t = Σ_i (s_i + g_i - e_i - f_i) / Σ_i (s_i + g_i + h_i).

• Tracking Accuracy in Crowds: crowd_t = Σ_i p_i / Σ_i (p_i + q_i), where p_i is the number of well-tracked objects in the crowd and q_i the number of track losses in crowds, as observed in the i-th frame.

• Approximation of Object Localization Accuracy: localiz_t = Σ_i (b_i - n_i) / Σ_i b_i, where n_i denotes the number of objects tracked with an ill-sized or misplaced bounding box in the i-th frame.

As seen in table 1, the results show significant improvements in all categories, with strikingly improved re-identification.
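A small sketch accumulating these measures over per-frame counts is given below; the formulas follow the reconstructions above, which reproduce the values in Table 1.

```python
def tracking_metrics(frames):
    """Accumulate the four ground-truth measures over all frames.
    `frames` is a sequence of dicts with the per-frame counts
    b, c, s, e, f, g, h, p, q, n defined above.
    """
    t = {k: sum(fr[k] for fr in frames) for k in 'bcsefghpqn'}
    accuracy = t['b'] / (t['b'] + t['c'])
    re_ident = (t['s'] + t['g'] - t['e'] - t['f']) / (t['s'] + t['g'] + t['h'])
    crowd = t['p'] / (t['p'] + t['q'])
    localiz = (t['b'] - t['n']) / t['b']
    return accuracy, re_ident, crowd, localiz
```

Plugging in the colour tracker totals from Table 1 (b = 2904, c = 500, s = 29, e = 17, f = 2, g = 5, h = 30, p = 425, q = 357, n = 213) reproduces the reported 85.31%, 23.44%, 54.35% and 92.67%.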

4. Conclusion

We have proposed an algorithm for multi-object tracking under occlusion by combining multiple cues (colour, motion, features) based on their importance in a particular situation. The proposed scheme is not restricted by any prior object shape/motion models or ground plane assumptions and thus performs satisfactorily in relatively unconstrained environments; more importantly, since no camera calibration is needed, it can be placed anywhere and immediately put to work. The remaining limitations are significantly more difficult - e.g. when an object is nearly fully occluded (motion projection is the only option), or differentiating between multiple objects entering the scene together (before they split). A cue towards the latter may be to look for differences in deformations of the feature graph - resulting in several clusters of motion of point features on a single blob, an approach that may work when the objects deform differently, or move with different speeds. Eventually, it would be important to extend these ideas to work in more general situations, e.g. cameras that move (initially with pan-tilt motions), and for dynamic backgrounds (trees, fountains).

References

1. C. R. Wren, A. Azarbayejani, T. Darrell and A. Pentland, Pattern Analysis and Machine Intelligence 19, 780 (July 1997).

2. I. Haritaoglu, D. Harwood and L. Davis, IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 809 (August 2000).

3. C. Needham and R. Boyle, Tracking multiple sports players through occlusion, congestion and scale, in Proceedings of the 12th British Machine Vision Conference, 2001.

4. T. Zhao and R. Nevatia, Tracking multiple humans in crowded environments, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, July 2004.

5. S. McKenna, S. Jabri, Z. Duric and A. Rosen-feld, Computer Vision and Image Understanding 80 (2000).

6. P. Guha, A. Biswas, A. Mukerjee and K. Venkatesh, Occlusion sequence mining for complex multi-agent activity discovery, in The Sixth IEEE International Workshop on Visual Surveillance, May 2006.

7. D. Comaniciu, V. Ramesh and P. Meer, Real-time tracking of non-rigid objects using mean shift, in Computer Vision and Pattern Recognition, 2000.

8. B. D. Lucas and T. Kanade, An iterative image registration technique with an application to stereo vision, in Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI 1981), (Vancouver, British Columbia, 1981).

9. C. Tomasi and T. Kanade, Detection and Tracking of Point Features, Technical Report CMU-CS-91-132, Carnegie Mellon University (April 1991).

10. S. Birchfield, KLT: An implementation of the Kanade-Lucas-Tomasi feature tracker, www.ces.clemson.edu/~stb/klt/.

Table 1. Performance comparison of the colour only and colour-feature based hybrid trackers.

Measure        Colour tracker   Hybrid tracker
b_t            2904             3246
c_t            500              182
accuracy_t     85.31%           94.69%
s_t            29               23
e_t            17               1
f_t            2                2
g_t            5                20
h_t            30               0
re-ident_t     23.44%           93.02%
p_t            425              592
q_t            357              193
crowd_t        54.35%           75.41%
n_t            213              15
localiz_t      92.67%           99.54%


DCT Properties as Handle for Image Compression and Cryptanalysis

Anil Kr. Yekkala

Philips Electronics India Ltd. No 1, Murphy Road, Bangalore 560008, INDIA

E-mail: [email protected]

C. E. Veni Madhavan

Indian Institute of Science Bangalore 560012, INDIA

E-mail: [email protected]

Narendranath Udupa

Philips Electronics India Ltd. No 1, Murphy Road, Bangalore 560008, INDIA

E-mail: [email protected]

The DCT transformation has traditionally been used to exploit spatial redundancy within images and video for the purpose of compression. The JPEG standard commonly used for image compression is based on the DCT transformation. In this paper we analyze the linear and orthogonal properties of the DCT transformation to achieve an improvement over the compression provided by JPEG, and we also use the DCT properties to demonstrate the vulnerability of lightweight encryption schemes based on encrypting the DC values.

Keywords: DCT transformation; JPEG; DC prediction; DC encryption

1. Introduction

Both lossy and lossless compression schemes are based on exploiting the spatial redundancy within an image. In lossy compression schemes the spatial redundancy is exploited by representing the image in its frequency domain using a transform coding technique, and then doing the further processing in the frequency domain. The DCT transformation is one of the most popular transformation techniques for image compression. For instance, in JPEG compression1 the entire image is divided into blocks of size 8×8, and each block is represented in its frequency domain by applying the 2-dimensional DCT transformation. In the frequency domain the component corresponding to zero frequency is known as the DC value and the non-zero frequency components are known as the AC values. It has been generally observed7 that most of the energy of an image is contained in the DC values. Hence, in order to reduce the number of bits needed to represent the DC values, the spatial redundancy is further exploited by encoding the DC values using differential DC encoding. It is to be noted that even after differential DC encoding, the DC values generally occupy about 8% of the image space. Using the properties of DCT we propose a replacement for differential DC encoding, hence showing a further improvement in the compression achieved by JPEG. Again, owing to the fact that most of the energy is present in the DC values, several lightweight encryption schemes3 are based on encrypting the DC values to reduce the intelligibility of the image. Even though techniques4-6 for reconstructing images are available in the case when a few of the DC values are corrupted, we currently do not see many methods in the literature which focus on reconstructing an image in the event of corruption of all DC values. Again using the properties of the DCT transformation, we show a method to reduce the amount of noise introduced in the image due to encryption of DC values, hence showing the vulnerability of lightweight encryption schemes based on encrypting only the DC values.

1.1. Properties of DCT Transformation

In this paper we use the Linear Property and the Orthogonal Property of the DCT transformation to achieve better compression compared to JPEG, and also to reduce the amount of noise introduced due to encryption of DC values.

Page 177: 01.AdvancesinPatternRecognition

158 DCT Properties as Handle for Image Compression and Cryptanalysis

1.1.1. Linear Property

If X is a square matrix of dimension n × n, and if A is a constant square matrix of the same dimension with constant value a, then

DCT(X + A) = DCT(X) + K

where K is a matrix whose first element, namely the DC value, equals n × a, and whose remaining elements, i.e. the AC values, are zero. In the case of images the value of n is 8. Hence it can be concluded that modifying or corrupting the DC value of an image block results in a constant shift in each pixel of the block, and vice-versa.
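The linear property is easy to verify numerically. The following sketch (ours) uses an orthonormal 2D DCT, for which a constant shift of a per pixel moves the 8×8 block's DC coefficient by 8a and leaves the AC coefficients untouched.

```python
import numpy as np
from scipy.fftpack import dct

def dct2(block):
    """Orthonormal 2D DCT of an 8 x 8 block, as used in JPEG."""
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

# Linear property: adding a constant a to every pixel shifts only the
# DC coefficient, by n * a (n = 8), leaving all AC coefficients unchanged.
X = np.random.rand(8, 8)
a = 5.0
D1, D2 = dct2(X), dct2(X + a)
assert np.allclose(D2[0, 0] - D1[0, 0], 8 * a)        # DC shifted by 8a
assert np.allclose(np.delete(D2.ravel(), 0),          # AC values equal
                   np.delete(D1.ravel(), 0))
```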

1.1.2. Orthogonal Property

The 2-dimensional DCT is an orthogonal transformation. Hence, in the case of JPEG, the output of the DCT on each block of 8x8 pixels can be considered a representation of 64 orthogonal equations on a vector of 64 elements. Similarly, the IDCT operation can be considered equivalent to solving for 64 unknown values using 64 independent equations. Hence, based on the Orthogonal Property of the DCT, it can be observed that decoding a block without knowing the DC value is equivalent to solving for 64 unknown pixel values based on the 63 orthogonal equations representing the AC values. The problem is therefore equivalent to replacing the equation for the DC value, namely

X1 + X2 + ... + X64 = 8 * DC,

by an alternative equation that is orthogonal to the remaining 63 known equations. In this paper we propose to replace it by the equation representing the top-left pixel, X1 = p, or equivalently by an equation representing any other pixel. Hence, it can be concluded that even if the DC value of the block is not known, it is possible to decode all the pixel values of the block if at least one pixel value is known along with the 63 AC values. It is to be noted, however, that solving for the pixel values based on the AC values and the top-left pixel of the block is computationally inefficient compared to solving for the pixel values using the IDCT (i.e. using the DC value and the AC values). Hence, in order to use the fast computational techniques available for the IDCT, the top-left pixel is used to predict the DC value approximately (8 times the top-left pixel value). Then, using the IDCT, the pixel values of the block are obtained approximately. The error (δ) due to the approximation of the DC value is obtained by taking the difference in the top-left pixel before and after the IDCT operation. Using the Linear Property, the error in each pixel due to the usage of the approximate DC is adjusted by adding the error δ to each pixel value of the block.
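This recovery can be exercised end-to-end; a minimal sketch of reconstructing an 8x8 block from its 63 AC values plus one known pixel (orthonormal DCT assumed, for which the DC value equals 8 times the average pixel value):

import numpy as np
from scipy.fft import dctn, idctn

block = np.random.randint(0, 256, (8, 8)).astype(float)
coeffs = dctn(block, norm='ortho')
coeffs[0, 0] = 0.0                 # DC value assumed lost or encrypted
p = block[0, 0]                    # one known pixel (top-left)

coeffs[0, 0] = 8.0 * p             # predict DC as 8 x top-left pixel
approx = idctn(coeffs, norm='ortho')
delta = p - approx[0, 0]           # error due to the DC approximation
recovered = approx + delta         # Linear Property: constant-shift fix

print(np.allclose(recovered, block))   # True: exact recovery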

2. Application for Image Compression

In an image it can be observed that the magnitude of the difference between two consecutive DC values is relatively high compared to the magnitude of the difference between two consecutive pixels. This phenomenon can be clearly observed from Table 1: the second column shows the average of the absolute difference of the DCs of two consecutive blocks, the third column shows the average of the absolute difference of the top-left pixel of a block from the pixel to its left in the previous block, and the fourth column shows the ratio of the second column to the third. It can be observed that the ratio for all images except baboon is above 8. The ratio for the baboon image is 7.25, which is quite low. This is due to the texture of the baboon image, which can be seen in Figure 1. Due to the sharp texture of the baboon image the pixel-to-pixel difference is quite high, but since the same type of texture is present over most of the image, the DC difference between two consecutive blocks is comparatively moderate. Hence the ratio for the baboon image is comparatively low. Again, it can be observed from Table 1 that the ratio is quite high for the pills image; this is due to the smooth texture and sharp edges of the image, resulting in small pixel-to-pixel differences but high block-to-block differences.

Fig. 1. Baboon and Pills.

Hence it can be concluded that the effect of the differential encoding scheme can be improved further by taking the difference between two consecutive pixels, i.e., with reference to Figure 2, the difference between the top-left pixel (P) of a block (Block2) and the pixel (Q) to its left from the previous block (Block1). Differential pixel encoding can be applied to all blocks except the first block in each row.


Table 1. Comparison of DC differences with pixel differences.

Standard      Mean of Absolute     Mean of Absolute        Ratio
Images        DC Differences (a)   Pixel Differences (b)   (a/b)
Lena          137.1                6.6                     20.77
Peppers       136.5                6.2                     22.02
Baboon        106.6                14.7                    7.25
Brandyrose    94.1                 4.1                     22.95
Bandon        34.8                 2.5                     13.92
Pills         184.1                4.9                     37.57
Opera         70.4                 6.4                     11.00


Fig. 2. Pictorial view of two consecutive blocks of an image: P is the top-left pixel of Block2 and Q is the pixel to its left in Block1.

The image compression scheme based on differential pixel encoding for gray images is explained in the following subsections.

2.1. Mathematical Notations

For explaining the encoding and decoding procedures the following notations will be used:

I : denotes the entire image.
I_{i,j} : represents an 8x8 pixel block of the image, namely the block in the i-th row and j-th column of blocks.
I_{i,j}(m,n) : represents a pixel within the block, namely the pixel in the m-th row and n-th column of the block I_{i,j}.

2.2. Encoding procedure for Gray Images

The encoding procedure is similar to JPEG encoding, except that differential pixel encoding is used instead of differential DC encoding. The blocks are encoded in sequential order, starting from the leftmost block in the first row, followed by the second block in the first row, and ending with the rightmost block in the last row. Differential pixel encoding is not applied to the first block in each row; for these blocks the entire DC value is encoded along with the AC values. For the remaining blocks, where differential pixel encoding is applied, the procedure (for a block I_{i,j} with j > 1) is as follows:

(1) DCT transformation is applied on the block I_{i,j}.
(2) The DC value of the block I_{i,j} is replaced by the top-left pixel value of the block, namely I_{i,j}(1,1).
(3) Differential pixel encoding is applied to the top-left pixel of the block I_{i,j} by taking the difference between the top-left pixel I_{i,j}(1,1) of the block and the pixel I_{i,j-1}(1,8) to its left in the previous block:

D_{i,j} = I_{i,j}(1,1) - I_{i,j-1}(1,8)

(4) The pixel difference D_{i,j}, along with the AC values of the block, is then quantized and encoded using run-length encoding.
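A minimal sketch of this per-block encoding step (quantization and run-length coding are omitted; array indices are 0-based, so pixel (1,8) of the previous block is prev_decoded[0, 7]):

import numpy as np
from scipy.fft import dctn

def encode_block(block, prev_decoded):
    # step (1): 2-D DCT of the 8x8 block
    coeffs = dctn(block.astype(float), norm='ortho')
    # step (2): the DC value is dropped; the top-left pixel stands in for it
    coeffs[0, 0] = 0.0
    # step (3): difference of the top-left pixel from pixel (1,8) of the
    # previously decoded block (cf. Sec. 2.3)
    d = block[0, 0] - prev_decoded[0, 7]
    # step (4) would quantize and run-length encode d and the AC values
    return d, coeffs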

2.3. Pixel Difference and quantization

In order to reduce the amount of error due to quantization in differential pixel encoding, the difference D_{i,j} is computed by taking the difference of I_{i,j}(1,1) from the modified I_{i,j-1}(1,8); i.e. the previously encoded block I_{i,j-1} is decoded, and the modified pixel value I_{i,j-1}(1,8) is used. Moreover, the quantization value used for quantizing the pixel difference is 8 times smaller than the quantization value used for quantizing the DC difference in JPEG; this is due to the fact that the DC value is equal to 8 times the average pixel value of the block.

2.4. Decoding procedure for Gray Images

The encoded image data is segregated into blocks by identifying the End of Block (EOB) indicators, and de-quantization is applied on each block. Similar to encoding, the image is decoded in sequential manner starting from the first row, and within each row the decoding starts from the leftmost block. For the first block in each row, the DC and AC values are obtained after de-quantization, and the IDCT operation is applied on these blocks to obtain the pixel values. For the remaining blocks (a block I_{i,j} with j > 1) the pixel


difference along with the AC values are obtained, and the decoding procedure is as follows:

(1) The top-left pixel of the block I_{i,j} is obtained by taking the sum of the pixel difference D_{i,j} and the top-right pixel of the previous block, which is already decoded:

I_{i,j}(1,1) = D_{i,j} + I_{i,j-1}(1,8)

(2) The DC value of the block is approximately computed as DC_{i,j} = 8 * I_{i,j}(1,1).
(3) The IDCT is applied on the approximate DC value and the AC values to compute the approximate pixel values of the block, denoted by I'_{i,j}.
(4) The error in the pixels due to the approximation of the DC value is calculated by taking the difference in the top-left pixel before and after the IDCT:

δ_{i,j} = I_{i,j}(1,1) - I'_{i,j}(1,1)

(5) The correction is then applied by adding δ_{i,j} to all pixels of the block I_{i,j}.
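A matching sketch of the decoder side, following steps (1)-(5) above (0-based indexing, quantization again omitted):

import numpy as np
from scipy.fft import dctn, idctn

def decode_block(d, coeffs, prev_decoded):
    top_left = d + prev_decoded[0, 7]     # step (1)
    c = coeffs.copy()
    c[0, 0] = 8.0 * top_left              # step (2): approximate DC
    approx = idctn(c, norm='ortho')       # step (3)
    delta = top_left - approx[0, 0]       # step (4)
    return approx + delta                 # step (5)

# round-trip check (no quantization): encode a block directly, then decode
block = np.random.randint(0, 256, (8, 8)).astype(float)
prev = np.random.randint(0, 256, (8, 8)).astype(float)
coeffs = dctn(block, norm='ortho')
coeffs[0, 0] = 0.0                        # DC dropped by the encoder
d = block[0, 0] - prev[0, 7]
print(np.allclose(decode_block(d, coeffs, prev), block))   # True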

2.5. Extension for Color Images

The encoding/decoding scheme for gray images is extended to color images by applying the same procedure independently to all three color components of the image, namely the Y, Cb, and Cr components.

3. Vulnerability of Lightweight DC Encryption Scheme

In this section we propose a scheme for reducing the amount of noise introduced in an image due to encryption (or corruption) of all DC values. It is assumed that at least one pixel value of the entire image is known, in addition to all AC values. The scheme first decodes a block whose pixel value is known, using the available AC values of the block. Then, taking advantage of the spatial redundancy within an image, a pixel value of a neighborhood block is predicted, and using the predicted pixel along with the available AC values the neighborhood block is decoded. Proceeding in a similar manner, all blocks of the image are decoded.

3.1. Decoding procedure for gray images

Inputs:

(1) Quantized AC values.
(2) One pixel of the image is assumed to be known; for the sake of simplicity it is assumed to be the top-left pixel p of the image.

Procedure:

(1) The AC values are de-quantized using the quantization tables.
(2) The DC value of the block I_{1,1} is approximately computed as DC_{1,1} = 8 * p.
(3) Using the approximate DC value DC_{1,1} along with the computed AC values, the approximate pixel values of the block I_{1,1} are obtained using the IDCT.
(4) The error in the pixel values of the block I_{1,1} due to the approximation of DC_{1,1} is adjusted by adding the delta error δ_{1,1} = p - I'_{1,1}(1,1), where I'_{1,1} is the approximately decoded block, to all the pixels of the block.
(5) The top-left pixel of the block I_{1,2} is predicted using linear extrapolation of its neighboring pixels available from the already decoded block I_{1,1}:

I_{1,2}(1,1) = 2 * I_{1,1}(1,8) - I_{1,1}(1,7)

(6) If the predicted pixel value I_{1,2}(1,1) is less than 0 it is approximated to 0, and if it is greater than 255 it is approximated to 255.
(7) Using the predicted top-left pixel I_{1,2}(1,1) along with the AC values, the entire block I_{1,2} is decoded similarly to I_{1,1}.
(8) Similar to the decoding of I_{1,2}, all blocks in the first row of the image are decoded sequentially.
(9) The top-left pixel of the block I_{2,1} is predicted using linear extrapolation of its neighboring pixels available from the already decoded block I_{1,1}:

I_{2,1}(1,1) = 2 * I_{1,1}(8,1) - I_{1,1}(7,1)

(10) If the predicted pixel value I_{2,1}(1,1) is less than 0 it is approximated to 0, and if it is greater than 255 it is approximated to 255.
(11) Using the predicted value of the top-left pixel I_{2,1}(1,1) along with the AC values, the entire block I_{2,1} is decoded similarly to I_{1,1}.
(12) Similar to the decoding of I_{2,1}, all blocks in the first column of the image are decoded sequentially.
(13) The top-left pixel I_{2,2}(1,1) of the block I_{2,2} is computed using a linear combination of extrapolation and interpolation based on pixels available from the blocks I_{1,2} and I_{2,1}:

I_{2,2}(1,1) = { [2 * I_{1,2}(8,1) - I_{1,2}(7,1)] + [2 * I_{2,1}(1,8) - I_{2,1}(1,7)] + [I_{1,2}(8,2) + I_{2,1}(2,8)] } / 4

(14) If the predicted pixel value I_{2,2}(1,1) is less than 0 it is approximated to 0, and if it is greater than 255 it is approximated to 255.
(15) Using the predicted value of the top-left pixel I_{2,2}(1,1) along with the AC values, the entire block I_{2,2} is decoded similarly to I_{1,1}.
(16) Similar to the decoding of I_{2,2}, all blocks in the second row and second column of the image are decoded sequentially, followed by the third row and third column, and so on for all blocks of the image.
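A minimal sketch of the prediction used in steps (5), (9) and (13), averaging whichever extrapolations are available (the interior case of step (13) additionally averages in the pixels (8,2) and (2,8), which this simplified helper omits):

import numpy as np

def predict_top_left(left=None, above=None):
    # linear extrapolation from whichever decoded neighbours exist
    estimates = []
    if left is not None:                       # block to the left, step (5)
        estimates.append(2.0 * left[0, 7] - left[0, 6])
    if above is not None:                      # block above, step (9)
        estimates.append(2.0 * above[7, 0] - above[6, 0])
    p = float(np.mean(estimates))
    return min(max(p, 0.0), 255.0)             # clamp to [0, 255]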

3.2. Controlling error propagation due to pixel prediction

Error propagation due to the usage of pixel prediction is controlled to a large extent by checking whether all the pixel values of each predicted block after decoding lie between 0 and 255. If a pixel within the block is less than 0, then all pixels in the block are increased by a constant factor such that the minimum pixel value within the block becomes 0. Similarly, if a pixel within a predicted block is above 255, each of the pixels within the block is reduced by a constant factor such that the maximum pixel value within the block is 255. From the Linear Property of the DCT it can be observed that a situation where the minimum pixel value is less than 0 and the maximum pixel value is simultaneously greater than 255 cannot arise.

3.3. Extension of decoding procedure for color images

The reconstruction algorithm is extended to colour images by applying the decoding scheme for grey images independently to all three colour components of the colour image. Since the pixel changes are smoother in all three planes of the RGB representation compared to other pixel representations (e.g. YCbCr), the pixel prediction technique gives better results for colour images in RGB format. This is confirmed by experiments carried out on the approach of predicting images from AC values in different colour representations.

4. Results

4.1. Image compression

Compression using differential pixel encoding has been tested on more than 100 images. The results in Table 2 show a 2% to 8% improvement over JPEG compression while maintaining the same noise level measured in terms of MAD (Mean Absolute Difference). The results have been verified at various bitrates and found to be consistent. The results also show that the gain in compression achieved for a smoothly textured image like pills (7.17%) is relatively higher than the gain achieved for a highly textured image like baboon (2.22%). Hence, it can be concluded that the proposed approach gives better results for smoothly textured images.

Table 2. Compression results compared to JPEG.

                      Size                          MAD
Image (res.)          JPEG     Prop.      Gain %    JPEG   Prop.
                               Method                      Method
Baboon (512x512)      68,940   67,406     2.22      7.31   7.32
Bandon (610x403)      24,635   23,544     4.44      2.02   2.04
Lena (512x512)        32,572   30,460     6.48      2.97   2.99
Opera (695x586)       60,170   57,871     3.82      3.85   3.85
Peppers (512x512)     33,754   31,586     6.43      3.46   3.46
Pills (800x519)       54,762   50,833     7.17      2.36   2.37

4.2. Cryptanalysis of DC encryption schemes

Experiments have been carried out on more than 100 images. The results of the experiments on a few of the standard images are shown in Table 3. It can be clearly observed that the reconstruction scheme improves the image quality measured using MAD.


Fig. 3. Cryptanalysis of the DC based encryption scheme (Lena and Peppers): original image; image with DC values encrypted using a stream cipher; reconstructed image based on the proposed scheme.

Table 3. Cryptanalysis results of the DC encryption scheme.

Image        MAD for DC          MAD for
             Encrypted Image     Reconstructed Image
Lena         53.7                25.5
Peppers      51.2                14.7
Baboon       56.1                34.9
Brandyrose   59.3                30.5
Bandon       32.4                11.8
Pills        52.7                22.3
Opera        59.7                29.8

For example, in the case of Lena the MAD decreases from 53.7 to 25.5. The reduction in noise can also be observed from Figure 3, which clearly shows that the intelligibility of the image is improved to a large extent using the proposed reconstruction scheme. It is also to be noted that here the reconstruction of the image was started from the top-left pixel, but the reconstruction can be started from arbitrary locations and the outputs combined to further improve the intelligibility of the reconstructed image.

5. Conclusions

In this paper we used the linear and orthogonal properties of the DCT to achieve an improvement over the compression provided by JPEG. In addition, we used the same properties to demonstrate the vulnerability of lightweight encryption schemes based on encrypting DC values, by considerably reducing the noise introduced by the encryption.

References

1. Gregory K. Wallace, "The JPEG still picture compression standard", IEEE Transactions on Consumer Electronics, 1991.

2. Subhasis Saha, "Image Compression - from DCT to Wavelets: A Review", ACM Crossroads student magazine.

3. Salah Aly, "Multimedia Security: Survey and Analysis", Multimedia and Networking Research Lab, CTI, DePaul University, Chicago, http://www.mnlab.cs.depaul.edu

4. Shuiming Ye, Xinggang Lin, Qibin Sun, "Content Based Error Detection and Concealment for Image Transmission over Wireless Channel", IEEE, 2003.

5. Yefeng Zheng, "Performance Evaluation of Spatial Domain Error Concealment of Image Recovery", Project Report of ENEE739M, Spring 2002.

6. Yao Wang, Qinfan Zhu, "Error control and concealment for video communication: A review", Proceedings of the IEEE, vol. 86, no. 5, pp. 974-997 (May 1998).

7. Nick Kingsbury, "Image Characteristics", Version 2.9, Jun 8, 2005, source: http://cnx.rice.edu/content/m11085/latest/


Genetic Algorithm for Improvement in Detection of Hidden Data in Digital Images

Santi P. Maity* and Prasanta K. Nandi

Bengal Engineering and Science University, Shibpur, Howrah, West Bengal, India
E-mail: [email protected]*, [email protected]

Malay K. Kundu

Center for Soft Computing Research, Indian Statistical Institute,

Kolkata, West Bengal, India

E-mail: [email protected]

The paper proposes a data hiding scheme for digital images to serve the purposes of authentication and covert communication. The algorithm is implemented in three stages. In the first stage, a region for data embedding is selected in such a way that the maximum number of pixel values of the cover image differ from the message pixel values by less than a predefined threshold, and a difference signal (D) is formed. The difference signal, modulated by a proper embedding strength, is then added to the cover to obtain the stego image. In the second stage, a Genetic Algorithm (GA) is used to obtain a set of parameter values (used as a Key) to represent this difference signal optimally. In the third stage, the parameter values at the decoder end form an approximate version of the difference signal using a linear interpolation scheme, and subsequently the embedded data is decoded. The novelty of the proposed scheme lies in its higher payload capacity without compromising much on the visual and statistical invisibility of the hidden data. The latter property is verified, as the proposed embedding process, unlike many other reported algorithms, causes a very small change in the higher-order statistics of the wavelet coefficients of the stego data. Experimental results also show that the algorithm is robust to non-malicious operations like mean and median filtering, additive noise and moderate image compression.

Keywords: Digital data hiding; Difference signal; GA; Statistical invisibility.

1. Introduction

Data hiding is the art of invisible communication that embeds a secret message into innocuous-looking cover objects such as simple text, images, audio or video recordings. The technique may serve the purposes of unobtrusive military and intelligence communication, covert criminal communication, or the protection of civilian speech against repressive governments. Classical steganography deals with methods of embedding a secret message, a copyright mark or a serial number in a cover message by slightly modifying the bits in the data. After embedding, the cover object becomes known as the stego-object. The embedding is characterized by a Key, without the knowledge of which it would be extremely difficult to detect or remove the embedded material. The essential requirements of image data hiding are visual imperceptibility of the hidden data, security against statistical analysis, and robustness to the non-malicious operations that a communication channel imposes. Such processing may include compression for efficient storage and transmission, and mean/median filtering for the purpose of noise cleaning. However, the degree or depth of these signal processing operations should be restricted to a level at which the stego-object preserves its commercial value. A data hiding method is called secure if the stego-objects do not contain any detectable artifacts due to message embedding. In other words, the set of stego-objects should have the same statistical properties as the set of cover-objects. If there exists an algorithm that can guess whether or not a given object contains a secret message with a success rate better than random guessing, the data hiding system is considered broken.1

The literature on digital image steganography is quite rich. Most of the techniques utilize LSB (least significant bit) embedding, applied either directly to pixel values, to the indices in palette images (EZ Stego), or to the quantized DCT coefficients in the JPEG format (J-Steg, JPHide & Seek, OutGuess).2 Although messages embedded into an image are often imperceptible to the human eye, they often disturb the statistical nature of the image. Farid


studied a broad range of natural images and reported that there exist strong higher-order statistical regularities among the wavelet coefficients.3 If a message is embedded within an image, these statistics are significantly altered. So the design of image steganography algorithms, for a given auxiliary message, may be directed towards the efficient selection of the embedding region within the cover and the subsequent modulation process, so that visual and statistical invisibility of the hidden data is well maintained.

Payload becomes an important requirement when information hiding is used for covert communication. However, higher embedding capacity obviously affects the visual and statistical invisibility of the hidden information, along with the robustness performance. These requirements form a multidimensional nonlinear problem of conflicting nature. The genetic algorithm (GA), an efficient search for optimal solutions in many image processing and pattern recognition problems, can also be used in this topic of research, but in reality the usage of the tool has been explored very little. Maity et al.4 propose an algorithm for data hiding in digital images where GA is used to find parameter values, namely the reference amplitude and the modulation index. The performance of the algorithm is compared for linear, parabolic and power-law functions used for modulating the auxiliary message.

In this work, a Genetic Algorithm is used to obtain a set of parameter values (used as a Key) to represent optimally the difference signal (D) that is obtained by subtracting the pixel values of the auxiliary image (message) from the pixel values of the cover image. The difference signal, with a proper embedding strength, is added to the respective cover data. The parameter values at the decoder end are used to form the difference signal using the linear interpolation method. The message is recovered by performing the inverse operation using this difference signal. Experimental results show that the hidden data is secure against statistical analysis and robust to various non-malicious operations.

The rest of the paper is organized as follows: Section 2 presents the problem definition and the scope of the work. Section 3 briefly describes a steganalysis method based on higher order statistics, and Section 4 presents the proposed data embedding method. Section 5 describes the performance evaluation, while conclusions are drawn in Section 6.

2. Problem definition & scope of the work

Different data hiding applications require different payloads, typically varying from a few bits in access control up to at most a hundred bits in authentication and fingerprinting problems, while information-hiding applications may demand a much higher payload capacity. The last class of applications can be used for covert communication, where the objective is to embed data without compromising much on image fidelity and decoding reliability. In an additive embedding process, data hiding is accomplished by adding to the host data a scaled version of the auxiliary message. Image fidelity degrades with increasing payload capacity and embedding strength, while a higher embedding strength, at the cost of greater visual distortion, increases the reliability of decoding.

One possible solution to cope with this trade-off is to map the auxiliary message signal to a difference signal. This is formed by selecting the embedding region within the cover signal such that the auxiliary message yields a low-distance difference signal. The difference signal can then be added with a higher embedding strength, so that the visual distortion of the cover can be set to an acceptable value. Decoding of the message needs the regeneration of the difference signal, and the inverse process then extracts the message signal. The reliability of the decoding process depends on how faithfully the difference signal is regenerated. The best decoding is possible if complete knowledge of the difference signal is available at the decoder end, but this may be treated as an overhead problem. To reiterate the problem: the higher the payload capacity, the more signal points are needed for regeneration of the difference signal. Thus the important point arises of how to select those N signal points that regenerate the M-point difference signal faithfully, where N << M. This can be treated as an optimization problem, and GA finds application in yielding optimal solutions.

The above problem can be stated mathematically as follows: given an M-point difference signal D, how can an approximate signal D' be generated using N signal points, where N << M and D' closely resembles D? One way to regenerate a better approximation signal is to use higher order interpolation, but in that case the computation cost increases


exponentially with the order of interpolation. Linear interpolation is a good compromise between the computation cost and a good approximation of the regenerated signal. In that case, the important points are which N points would generate the better approximation and how the N values affect this approximation. This is an optimization problem, and GA finds application in yielding optimal solutions.

3. Steganalysis based on higher order statistics

Farid proposed a universal blind steganalytic detection method3 based on higher order statistics of natural images. The detection scheme can be separated into two parts: (1) extraction of a set of statistics, called the feature vector, for each investigated image; (2) formation of a classification algorithm to separate untouched images from stego-images on the basis of their feature vectors.

Formation of feature vectors

The feature vector for a given image is formed using the first four normalized moments, namely the mean, variance, skewness and kurtosis, of the wavelet coefficients of the vertical (v_i[x,y]), horizontal (h_i[x,y]) and diagonal (d_i[x,y]) subbands at scales i = 1, 2, ..., n. Thus we have a total of 4*3*(n-1) such elements of the feature vector f for the test image.

The remaining elements of f are derived from the error statistics of an optimal linear predictor. The prediction for a specific subband coefficient is performed considering the 4 neighboring coefficients, the corresponding coefficient in the coarser scale of the same orientation, and coefficients of subbands of other orientations (and scales). The predicted value for the vertical subband coefficient v_i[x,y] is given by

v'_i[x,y] = w1 v_i[x-1,y] + w2 v_i[x+1,y] + w3 v_i[x,y-1] + w4 v_i[x,y+1] + w5 v_{i+1}[x/2,y/2] + w6 d_i[x,y] + w7 d_{i+1}[x/2,y/2]   (1)

where w_k denotes the predictor coefficients. The other two subbands h_i[x,y] and d_i[x,y] can be predicted in the same way. The optimal predictor coefficients w_{k,opt} are determined for each subband so that the mean squared error within it is minimised. The log error of the linear predictor is then given by:

e_{v,log}[x,y] = log2(v_i[x,y]) - log2(v'_i[x,y])   (2)

where v'_i[x,y] is obtained by inserting w_{k,opt} into Eq. (1). The mean, variance, skewness and kurtosis of the log error of each subband form the remaining 4*3*(n-1) elements of f.
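A minimal sketch of obtaining w_{k,opt} by least squares for one vertical subband and forming the log error of Eq. (2) (coefficient magnitudes are used, and the small epsilon guard is an implementation assumption):

import numpy as np

def predictor_log_error(v, v_coarse, d, d_coarse):
    # build the 7-column design matrix of Eq. (1) from interior pixels
    h, w = v.shape
    ys, xs = np.mgrid[1:h-1, 1:w-1]
    neighbours = [v[ys-1, xs], v[ys+1, xs], v[ys, xs-1], v[ys, xs+1],
                  v_coarse[ys//2, xs//2], d[ys, xs], d_coarse[ys//2, xs//2]]
    Q = np.stack([np.abs(n).ravel() for n in neighbours], axis=1)
    target = np.abs(v[ys, xs]).ravel()
    w_opt, *_ = np.linalg.lstsq(Q, target, rcond=None)   # w_{k,opt}
    eps = 1e-9                                           # numerical guard
    return np.log2(target + eps) - np.log2(np.abs(Q @ w_opt) + eps)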

Classification Algorithm

The classification algorithm, Fisher Linear Discriminant (FLD) analysis, is used to classify a new image by means of its feature vector. The FLD algorithm is first trained with feature vectors from untouched and stego-images. The algorithm determines a projection axis by means of this training set to project the 24(n-1)-dimensional space of feature vectors onto a one dimensional subspace. The projected feature vector f is referred to as the detection variable d. New feature vectors obtained from new images are classified by thresholding d: if d is greater than a certain value, the image is classified as stego; if not, it is classified as untouched.

4. Proposed method

The total process of data embedding and decoding consists of three stages: Stage 1, selection of the data embedding regions and formation of the difference matrix, followed by data embedding; Stage 2, generation of a set of points using GA to optimally represent the difference matrix; Stage 3, message retrieval. Fig. 1 represents the flowchart for the main module of the proposed work.

A. Stage 1

The choice of cover images is important and influences the security in a major way. Images with a low number of colors, computer art, and images with a unique semantic content, such as fonts, should be avoided. Some data hiding experts recommend grayscale images as the best cover images. We consider a gray scale image as the cover and a similar type of image, such as text information, as the message signal, since it preserves contextual information even after various signal processing operations. The steps for the selection of the embedding region are as follows.

Step 1: Input gray scale images as cover and message signal.
Step 2: Set an appreciable percentage for the matching criterion (82 percent).
Step 3: Select a region from the cover equal in size to that of the message signal.


Fig. 1. Flowchart for the main module of the proposed work: read the cover file and the message file, enter the maximum generation number, perform window selection, initialize the population, then repeatedly evaluate the fitness of each individual, select mates, perform crossover and mutation, and store the new population until the maximum generation number is reached.

Step 4: Compare the variation of the pixel values between the cover and the message signal.
Step 5: Repeat the above process by dynamically selecting windows all over the cover image.
Step 6: Once the percentage matching criterion is satisfied, the process terminates; otherwise it continues till the end of the cover image.
Step 7: Output: (1) if the percentage matching criterion is satisfied, return the difference matrix, ensuring a smooth image with little variation; (2) if no matching region is found, return a null matrix denoting failure to find a region with the specified percentage matching.

The difference matrix (D) is then multiplied by a proper embedding strength (K) and added to the respective pixel values of the cover image (C). The


stego image can be obtained as follows:

S = C + K.D (3)
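As a small illustration of Eq. (3) (the embedding strength K = 0.1 is an arbitrary value; D is the difference matrix produced in Stage 1):

import numpy as np

K = 0.1   # embedding strength; illustrative value

def embed(cover_region, message):
    # difference signal D = cover - message, then S = C + K.D (Eq. (3))
    d = cover_region.astype(float) - message.astype(float)
    return cover_region.astype(float) + K * d, d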

B. Stage 2

The main objective is to find an optimal set of points using GA5 so that an approximate version of the difference signal can be generated.

1. Initialization of population: chromosomal representation of the parameter values. The initial population is formed by taking almost equi-spaced x-y data points with small perturbations.

2. Select mates. Objective: to select, most of the time, the best fitted pair of individuals for crossover.
Step 1: Input: population.
Step 2: The best fitted pairs of individuals are chosen by the roulette-wheel selection process, adding up the fitness values of the individuals to get the Sumfitness.
Step 3: Individuals are then selected randomly, accumulating fitness in a cumulative way until 50% of the Sumfitness value is crossed.
Step 4: The particular individual which crosses the 50% criterion in the cumulative process is chosen as one member of the mating pool pair.
Step 5: This process is carried out again to find the other individual of the mating pool.
Step 6: Output: a pair of individuals, i.e. the mating pool.

3. Crossover. Objective: to find the crossover site and to perform crossover between the mating pool pair to get a new pair of more fitted individuals.
Step 1: Input: mating pool pair.
Step 2: Find the crossover site in a random manner.
Step 3: Exchange the portions lying on one side of the crossover site between the mating pool pair.
Step 4: Output: new pair of individuals.

4. Mutation. Objective: to mutate or change a particular bit or allele in a chromosome with a very small probability.
Step 1: Choose a very small mutation probability.
Step 2: Depending upon that probability value, change a bit from '1' to '0' or from '0' to '1'.
Step 3: The bit position selected for mutation is the crossover site.

5. Objective function. Objective: to estimate the fitness value of an individual.
Step 1: Input: population.
Step 2: On each individual of the population, apply the 2-D interpolation technique to approximate the original matrix.
Step 3: The absolute mean error is evaluated by subtracting the interpolated matrix from the original matrix.
Step 4: The inverse of that absolute mean error is taken as the fitness value of that particular individual.

C. Stage 3

The final stage of the algorithm is the retrieval process of the message. An approximate version (D_app) of the difference signal (D) is obtained using the linear interpolation technique among the N grayscale point values calculated by the Genetic Algorithm. The approximate cover image matrix C_app is then calculated using the stego-image S and D_app as follows:

C_app = S - K.D_app   (4)

The message can then be retrieved from the relation:

M = C_app - D_app   (5)

where D_app is the regenerated approximation of D.
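A matching sketch of the retrieval relations (4)-(5); with a perfect regeneration of D the message is recovered exactly:

import numpy as np

def retrieve(stego_region, d_app, K=0.1):
    c_app = stego_region - K * d_app    # Eq. (4)
    return c_app - d_app                # Eq. (5)

cover = np.array([[120., 130.], [125., 135.]])
message = np.array([[100., 110.], [105., 115.]])
d = cover - message
stego = cover + 0.1 * d                 # Eq. (3)
print(np.allclose(retrieve(stego, d), message))   # True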

5. Performance evaluation

The efficiency of the proposed algorithm is tested by embedding the difference signal in several cover images. The visual quality of the watermarked image is represented by the peak signal-to-noise ratio (PSNR). The watermarked image is shown in Fig. 2(a), and the PSNR value for the watermarked image is 39.33 dB. As the number of generations is increased from 1000 and 2000 to 4000, observation of Figs. 2(c), (d) and (e) reveals that the retrieved watermark images become closer and closer to the original. We emphasize the subjective quality of the recognizability of the extracted message rather than any objective measure. The improvement in decoding is borne out by the property of the Genetic Algorithm of producing better solutions as the number of generations is increased. The number of parameter values considered here is (20 x 20). Various linear and nonlinear filters are sometimes used to remove noise from images. Median filtering is one nonlinear filter that removes noise while preserving edge information. Fig. 3 shows the watermark messages retrieved from the median filtered version of the watermarked images. The stego image is filtered using a (3 x 3) window. The figures show how the quality of the retrieved messages improves with the number of


Fig. 2. (a) Watermarked image; (b) watermark image; (c)-(e) retrieved messages after 1000, 2000 and 4000 iterations respectively

iterations, although the number of parameter values remains the same.


Fig. 3. Robustness performance under median filtering; (a), (b) and (c) indicate retrieved watermark messages after 1000, 2000 and 4000 iterations respectively.

Fig. 4. Stego-test: black cross, stego test image using the proposed algorithm; colour circles, sample stego images after various operations; red square, untouched image.

Fig. 4 shows the stego-test results. As we observe, this stego-test does not place the test image within the cluster of sample stego-images. Therefore, the test cannot decide with certainty that the test image is actually a stego-image. This establishes the security of the proposed algorithm against Farid's steganalytic technique involving higher-order statistics.

6. Conclusions

A data hiding algorithm with improved payload capacity is proposed for digital images. GA is used to obtain a set of parameter values so that faithful decoding of the message is possible. Decoding reliability improves with an increasing number of GA iterations when the set of parameter values is fixed. The algorithm is shown to be secure against a stego-test based on higher order statistics.

References

1. R. J. Anderson and F. A. Petitcolas, On the limits of steganography, IEEE Journal on Selected Areas in Communications, 16 (4) (1998).

2. R. Anderson, Information hiding, in Proc. of the 1st Workshop on Information Hiding, LNCS-1174, Springer Verlag (New York, 1996).

3. H. Farid, Detecting Steganographic Messages in Digital Images, Dartmouth College, Computer Science (TR2001).

4. S. P. Maity, M. K. Kundu and P. K. Nandi, Genetic algorithm for optimal imperceptibility in image communication, in Proc. 11th Int. Conference on Neural Information Processing (Kolkata, India, 22-25 Nov. 2004).

5. D. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning (Addison-Wesley, Reading, MA, 1989).


High Resolution Image Reconstruction from Multiple UAV Imagery

Jharna Majumdar, B. Vanathy and Lekshmi S.

Aerial Image Exploitation Division, Aeronautical Development Establishment (DRDO),

C. V. Raman Nagar, Bangalore - 560 093, India. E-mail: [email protected], [email protected]

The objective of this work is to study and develop a suitable algorithm for High Resolution (HR) reconstruction of images from a sequence of Low Resolution (LR) images obtained from an aerial reconnaissance platform. The aerial images are motion blurred, translated and rotated with respect to each other, and have perspective distortion due to the forward-looking sensor. The paper proposes a frequency domain approach and a feature based approach for image registration, followed by a Bayesian MAP estimate for the reconstruction of the high resolution image.

1. Introduction

High Resolution reconstruction is a major area of research in allied branches of science like Artificial Intelligence, Machine Vision and Medical Imaging. With the advancement in the field of Computer Science, it has become possible to explore new algorithms for restoration, using multiple image sequences.

Image resolution depends on the physical characteristics of the imaging system, viz. the optics and the density and spatial response of the detector elements. Commonly available imaging systems do not sample the scene according to the Nyquist criterion. As a result, high frequency contents of the image are destroyed and the image appears to be of lower resolution. Image features may be lost considerably if the aliasing is severe, i.e. if the image is of very low resolution.

To improve the resolution of the image, an attempt must be made to recover the lost frequency components. Interpolation is a process used to estimate signal values between the sampled values without any further knowledge of the signal. Any number of functions can be generated to approximate the actual function representing the signal. With only the sampled points given, the accuracy of the interpolated pixels depends on how close the estimated signal is to the actual signal.1

Interpolation techniques such as bilinear, bicubic and bicubic splines are quite common and provide better resolution. However, the constraints these methods impose upon the solution inherently limit all of them. High Resolution techniques are superior to interpolation techniques because of the additional constraints imposed on the HR image after it has been simulated through the imaging process, and because of the use of temporally correlated frames.

2. Operational Scenario

In the present scenario, a camera mounted on the aircraft captures a continuous scene of the terrain during flight. The images suffer from perspective distortion due to the forward-looking sensor, and from blurring due to ego motion and the PSF. The consecutive image frames are also spatially translated and rotated with respect to one another. Random noise is also added to the down-sampled image during acquisition. The aerial images obtained can thus be expressed by the following equation:

g_k(m, n) = α_k{ h[f(x, y)] + η_k(x, y) }   (1)

where g_k is the k-th observed image frame, f represents the original scene, h is the blurring operator, η_k is the additive noise term, and α_k represents a nonlinear function that digitizes the image into pixels and quantizes the resulting pixel values from intensities into gray levels.
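A minimal sketch of synthesizing one LR observation according to this model (the shift, rotation, PSF width, down-sampling factor and noise level are illustrative parameters):

import numpy as np
from scipy.ndimage import gaussian_filter, shift, rotate

def simulate_lr(hr, dx=1.5, dy=-0.5, angle=2.0, sigma=1.0,
                factor=2, noise_std=2.0):
    # warp: translate by (dx, dy) and rotate (the inter-frame motion)
    warped = rotate(shift(hr.astype(float), (dy, dx)), angle, reshape=False)
    blurred = gaussian_filter(warped, sigma)    # h[f(x, y)]: Gaussian PSF
    lr = blurred[::factor, ::factor]            # down-sampling by the sensor
    lr = lr + np.random.normal(0.0, noise_std, lr.shape)   # eta_k
    return np.clip(np.round(lr), 0, 255)        # alpha_k: quantization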

3. Modeling of the problem

The HR reconstruction algorithm is described in the following steps:

(1) Estimating a HR image. (2) Simulating the HR image through the imaging

process to yield multiple LR images. (3) Developing a cost function from the estimation

and the measurements. (4) Minimizing this cost function to get an appro

priate HR image.


A good estimate of the HR image helps in minimizing the run time and brings about convergence in fewer iterations. For the present work, bicubic interpolation is used as the initial estimate of the HR image.

Simulation of the imaging process forms the backbone of the proposed algorithm. The proximity between the simulation process and what occurs in real life determines the error in the final result, and is a crucial factor in establishing good quality HR images. The three steps involved in the simulation of the imaging process are (a) warping the HR image, (b) blurring the HR image, and (c) down-sampling the HR image to get the LR images. Since the process is iterative, the HR image in the above-mentioned steps changes in every iteration and is the best estimate obtained until that stage. In this paper, an algorithm using a Bayesian MAP approach is proposed.

3.1. Registration

For this algorithm, it is necessary to estimate the sub-pixel motion very accurately by registering the images. This paper uses a frequency domain approach2 and a feature based approach3,4 to accomplish the task.

3.1.1. Registration by frequency domain approach

If F1 and F2 are the frequency domain representations of two images that are translated by (x0, y0) and rotated by θ0 with respect to each other, then F1 and F2 are related by the following equation:

F2(u, v) = exp(-j2π(u x0 + v y0)) F1(u cosθ0 + v sinθ0, -u sinθ0 + v cosθ0)   (2)

If the magnitudes in the (u, v) domain are expressed in polar coordinates (ρ, θ), equation (2) takes the following form:

M2(ρ, θ) = M1(ρ, θ - θ0)   (3)

θ0 can be separated from the above equation using the phase correlation method. The cross power spectrum (i.e. the phase correlation) of two Fourier transforms F and F' is defined as:

F(u, v) F'*(u, v) / |F(u, v) F'(u, v)| = exp(j2π(u x0 + v y0))   (4)

The inverse Fourier transform of the right-hand side of the above equation, applied to the polar magnitude spectra of Eq. (3), gives a peak at θ0, the rotation angle. The image is warped by the rotation angle so that only the translation component remains between the images, as given in the equation below:

F2(u, v) = exp(-j2π(u x0 + v y0)) F1(u, v)   (5)

Again, by taking the phase correlation and then the inverse Fourier transform, the translation parameters x0 and y0 can be obtained. To get the motion parameters to sub-pixel accuracy, the image is first interpolated using bicubic interpolation and then the parameters are found.
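A minimal sketch of the translation step (integer-pixel phase correlation; the sub-pixel refinement by bicubic interpolation and the polar-magnitude rotation step described above are omitted):

import numpy as np

def phase_correlation(img1, img2):
    # peak of the inverse FFT of the cross power spectrum, Eq. (4)
    F1 = np.fft.fft2(img1)
    F2 = np.fft.fft2(img2)
    cross = F1 * np.conj(F2)
    cross /= np.abs(cross) + 1e-12          # normalized cross power spectrum
    corr = np.real(np.fft.ifft2(cross))
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # wrap-around: peaks past the midpoint correspond to negative shifts
    return tuple(p if p <= s // 2 else p - s
                 for p, s in zip(peak, corr.shape))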

3.1.2. Registration by feature based approach

Feature extraction is an important step in feature based image registration methods. Feature based methods rely on accurate detection of features. A feature is the result of an interpretation of n pixels in a window of (p x p), where p is the size of the window.

The feature based registration scheme involves extracting feature points in the pair of images to be registered, reducing the number of points to retain dominant features and matching the pair using suitable matching techniques. Usual features are edges, corners, junctions, close connected regions or segmented texture regions of the image.

Feature Extraction

In this paper, we have used the Harris corner detector as the feature extractor. The Harris corner detector is based on the underlying assumption that corners are associated with maxima of the local autocorrelation function. It is less sensitive to noise in the image, since the computations are based entirely on first derivatives.

The Harris corner detector computes a cornerness value C(x,y) for each pixel in the image I(x,y). A pixel is declared a corner if the value of C(x,y) exceeds a certain threshold. The value of C(x,y) is computed from the intensity gradients in the x and y directions as follows:

C(x,y) = det(C_str) - α (trace(C_str))^2   (6)

The local structure matrix C_str is computed as follows:

C_str = w_G(r; σ) * [ f_x^2    f_x f_y
                      f_x f_y  f_y^2  ]   (7)


where f_x and f_y denote the first derivatives along the x and y directions respectively of the point f(x,y) in the image, and w_G(r; σ) is a Gaussian window of selected size σ.
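A minimal sketch of Eqs. (6)-(7) (the window width sigma and the constant alpha = 0.04 are illustrative choices):

import numpy as np
from scipy.ndimage import gaussian_filter

def harris_response(img, sigma=1.5, alpha=0.04):
    # first derivatives of the image
    fy, fx = np.gradient(img.astype(float))
    # Gaussian-weighted entries of the local structure matrix C_str
    Axx = gaussian_filter(fx * fx, sigma)
    Axy = gaussian_filter(fx * fy, sigma)
    Ayy = gaussian_filter(fy * fy, sigma)
    det = Axx * Ayy - Axy * Axy
    trace = Axx + Ayy
    return det - alpha * trace ** 2       # Eq. (6)

# corners are pixels whose response exceeds a threshold:
# mask = harris_response(image) > threshold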

Feature Reduction

After feature extraction, feature reduction is done to retain only the dominant features suitable for registration. A large number of feature reduction methods exist, out of which the cornerness similarity measure is the one chosen in this paper. Cornerness is the characteristic property of any interest point (feature point) P and is defined as C_P = λ1^2 + λ2^2, where λ1 and λ2 are the eigenvalues of the local structure matrix. To measure the correspondence between two points P and Q in two images, the similarity measure S(P,Q) is defined as S(P,Q) = min(C_P, C_Q)/max(C_P, C_Q). A point is considered a good feature if S(P,Q) > T_c, where T_c is a variable threshold.

Estimation of Homography and computation of transformation parameters

Homography is the transformation function that relates corresponding features and maps between the coordinate systems of the two images. In this method of matching the feature points and determining the homography, two pairs of feature points are chosen randomly from the image pair at a time. The homography between the selected feature points is computed, and the correctness of the computed homography is checked using a scoring module.

This process is iterated until a good score is obtained, or for a fixed number of iterations. The homography or transformation involving the two feature point sets is a combination of a translation in the x-direction (Tx), a translation in the y-direction (Ty), a scaling factor (s) and an angle of rotation (θ), as shown below:

H = [ s cosθ   -s sinθ   Tx
      s sinθ    s cosθ   Ty
      0         0        1  ]   (8)

For a pair of corresponding matched points, the homography computation is done to calculate the transformation parameters.
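Eq. (8) has four free parameters, so two point correspondences determine it; a minimal sketch of that computation (the recovery formulas and function name are illustrative, not the paper's scoring module):

import numpy as np

def similarity_from_two_pairs(p1, p2, q1, q2):
    # scale and rotation from the segment p1->p2 versus q1->q2
    dp, dq = p2 - p1, q2 - q1
    s = np.linalg.norm(dq) / np.linalg.norm(dp)
    theta = np.arctan2(dq[1], dq[0]) - np.arctan2(dp[1], dp[0])
    R = s * np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
    t = q1 - R @ p1                      # translation (Tx, Ty)
    H = np.eye(3)
    H[:2, :2], H[:2, 2] = R, t           # assemble Eq. (8)
    return H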

3.2. Reconstruction

3.2.1. Restoration

Restoration is the first step of reconstruction. It is an estimation process that attempts to recover an ideal high quality image from a degraded image.5,6 Restoration is applied to remove:

• Degradation such as blurring due to optical systems aberrations, atmospheric turbulence, motion and diffraction.

• Statistical degradation such as noise and measurement errors.

These two types of degradation lead to conflicting requirements on the restoration filter. To simplify the process, additive noise has been neglected in the present case and only the multiplicative motion blur has been considered. The PSF has been considered to be square Gaussian.

The Optical Transfer Function (OTF) for de-blurring can be written as:

H(u, v) = 1 / ( sqrt(2π) σ_x σ_y exp( -(u^2 σ_x^2 + v^2 σ_y^2)/2 ) + 1 )   (9)

where σ_x and σ_y are tunable parameters chosen judiciously according to the image. At this stage, we have an initial estimate of the HR image and the optical flow parameters between the LR images. The remainder of the process consists of developing a cost function using the MAP estimate and then minimizing it.7,8

3.2.2. Cost Function

Given that we observe y (the LR images), a MAP estimate of the high resolution image Z is formed. The estimate can be computed as shown below:

Z = argmax_z Pr(z | y, s)   (10)

Here, we are trying to maximize the probability of occurrence of the high resolution pixels given the measured low resolution pixel values and the accurately determined optical flow parameters. The value of the Z pixels which maximizes this probability forms the high resolution image. Using Bayes' theorem this can be expressed as follows:

Z = argmax_z [ Pr(y, s | z) Pr(z) / Pr(y, s) ]   (11)



The Probability Distribution Function (PDF) is assumed to be Gaussian because of its desirable property of giving a unique global minimum. The PDF of the high resolution pixels is therefore:

Pr(z) = K exp( - Σ_i z_i^2 / (2 σ_i^2) )   (12)

where K is a normalizing constant and the σ_i^2 are the covariance factors, which can be used as tunable parameters for the cost function. The cost function is as follows:

C(z) = (1/2) Σ_{m=1}^{pM} ( y_m - Σ_{r=1}^{N} w_{m,r} z_r )^2 + Σ_{r=1}^{N} z_r^2 / (2 σ_r^2)   (13)

where y_m are the LR pixel measurements and w_{m,r} models the contribution of the HR pixel z_r to the measurement y_m through the simulated imaging process.
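A minimal sketch of minimizing a quadratic cost of this form by gradient descent (W stands for the combined warp-blur-downsample operator written as a matrix; a single prior variance sigma_z is assumed in place of the per-pixel values, and the fixed step size is illustrative):

import numpy as np

def map_estimate(y, W, sigma_eta=1.0, sigma_z=10.0, iters=500, step=0.01):
    # gradient descent on C(z) = ||y - W z||^2/(2 s_eta^2) + ||z||^2/(2 s_z^2)
    z = W.T @ y                              # crude initial estimate
    for _ in range(iters):
        grad = W.T @ (W @ z - y) / sigma_eta**2 + z / sigma_z**2
        z = z - step * grad
    return z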

4. Results

The algorithm described above is implemented on a Pentium IV PC with 512 MB RAM. The test images are aerial images obtained during an actual flight of the aircraft. The registration algorithm uses both the Fourier and the feature based approach. The reconstruction algorithm is applied to selected target regions of interest using both approaches. The test set and reconstructed images are shown in Figure 1, and the time chart is given in Table 1.

Fig. 1. Input image (60x60); input image pair after scaling (120x120); and reconstructed images after iterations 1-3 obtained by the Fourier based method (row shift = 4.00, column shift = -12.00) and the feature based method (row shift = 4.8, column shift = -15.04, rotation = 1.14).

Table 1. Time chart for generation of outputs.

Iteration     Frequency based         Feature based
number        approach time (secs)    approach time (secs)
Iteration 1   337                     57
Iteration 2   307                     58
Iteration 3   307                     56


5. Conclusion

The algorithm proposed in this paper shows that aliasing reduction and resolution enhancement can be achieved by using multiple frames that are rotated and/or translated with respect to each other.

One application of this work deals with the real time processing of digital video, in which the primary objective is to use multiple frames to produce a video sequence of high-resolution frames. Another application of this research is the post-flight analysis of video scenes acquired for surveillance purposes, where detailed knowledge of the structure of the scene is required.

6. Acknowledgement

The authors wish to express their heartfelt thanks to Director ADE for his kind permission to publish the paper.

References

1. Michael Irani and Shmuel Peleg, CVGIP: Graphical Models and Image Processing 53 (1991).

2. B. Srinivas Reddy and B. N. Chatterji, IEEE Trans. on Image Processing 5, 8 (1996).

3. Jharna Majumdar et al., SPCOM (2004).

4. J. Shi and C. Tomasi, IEEE Conference on Computer Vision and Pattern Recognition (1994).

5. R. Tsai and T. Huang, Advances in Computer Vision and Image Processing 1 (1984).

6. Hardie et al., Optical Engineering (1998).

7. Hardie et al., IEEE Trans. on Image Processing, 6 (1997).

8. Hu He and L. P. Kondi, Journal of Electronic Imaging 13, 3 (2004).


Image Registration and Object Tracking via Affine Combination

Nilanjan Ray Department of Computing Science

University of Alberta 2-21 Athabasca Hall, Edmonton, Alberta T6G 2E8, Canada

Email: [email protected]

Dipti Prasad Mukherjee Electronics and Communication Sciences Unit

Indian Statistical Institute 203 B T Road Kolkata, West Bengal 700108, India

Email: [email protected]

In this paper we illustrate image registration and object tracking through the use of affine combinations of pixel locations. To resolve pixel location correspondences, we interpolate the image intensity at locations produced by affine combinations of a set of ordered pixels. We then show that establishing pixel correspondence between images undergoing an affine transformation is a matter of point-to-point comparison of the interpolated image intensities over the affine combinations. Through the use of the affine combination we illustrate how an optical flow based technique can be employed for tracking targets whose motion is assumed to be affine. When the target motion model is that of a rigid body, we illustrate how the rigid body constraint can be easily accommodated within the same optical flow based computational framework. The contribution of this work is to rigorously establish the affine combination as a registration and tracking tool.

1. Introduction

Resolving point correspondence is one of the most fundamental problems arising in a gamut of image analysis applications, such as image registration, stereo pair matching, structure from motion, object tracking and so on. Often the transformation model between the images (at least locally) is taken to be an affine transformation, especially when the distance between the camera plane and the object is much greater than the dimensions of the objects or structures themselves. Thus affine transformation invariant descriptors1 emerge as important tools for finding point correspondences.

Typically an affine descriptor is computed over a window around an interest point (such as a detected corner point, a local minimum point within a region, and so on). Examples include the maximally stable extremal region,2 edge-based and intensity extrema-based regions,3 and the Harris-affine detector,4 among many others.1

In this paper we concentrate on analyzing a type of affine descriptor that Tell and Carlsson describe as the image intensity profile between two interest points.5,6 We point out here that this descriptor belongs to a class of affine descriptors characterized by the affine combination7 of interest points. The purpose of this paper is to formally establish the affine combination as an affine invariant descriptor that holds certain desirable properties qualifying it as a tool for establishing pixel or point correspondence. To this end we show that affine combinations need not be limited to the intensity profile between two interest points; rather, more general descriptors can be formed from more than two interest points. One important property of the affine combination descriptors proposed here (as well as in refs. 5, 6) is that, unlike the other aforementioned affine descriptors, they allow point-to-point image intensity comparison for matching interest points. Point-to-point image intensity comparison leaves less possibility for false positive matches.

We illustrate object tracking as another application of affine combination here. The proposed tracking framework with affine combinations makes use of object templates that incorporate both object shape and texture in dense optical flow computation.

2. Affine Combination and Pixel Correspondence

Let $X$ (a 2-by-1 column vector) be a pixel location. The 2D affine transformation of $X$ is given by $X' = AX + f$, where $A$ is a 2-by-2 matrix and $f$ is a 2-by-1 column vector; together $A$ and $f$ represent a 2D affine transformation. For a set of points $X_1, X_2, X_3, \ldots, X_n$, a linear combination $\sum_{i=1}^{n} a_i X_i$ is called an affine combination when the scalar coefficients sum to unity:7 $\sum_{i=1}^{n} a_i = 1$.

Below we prove certain properties of the affine combination with regard to affine transformation that will be useful in resolving pixel correspondence.

Let $I$ be an image, and let $I'$ be the image formed after applying a 2D affine transformation to $I$. Let $X_1$ and $X_2$ be two pixels (2-by-1 column vectors) on image $I$, and let the pixels corresponding to $X_1$ and $X_2$ on $I'$ after the affine transformation be $X'_1$ and $X'_2$ respectively. Then

$$X'_1 = AX_1 + f \quad \text{and} \quad X'_2 = AX_2 + f,$$

where $A$ and $f$ characterize the 2D affine transformation between the two images. The affine combination in this case is $X_1 + t(X_2 - X_1)$, where $t \in \mathbb{R}$. This affine combination defines a straight line passing through $X_1$ and $X_2$. The following proposition establishes the invariance property of this affine combination.

Prop 1a. The corresponding point of $X_1 + t(X_2 - X_1)$ after affine transformation is $X'_1 + t(X'_2 - X'_1)$ for any given value of the parameter $t$, where the corresponding points of $X_1$ and $X_2$ are given by $X'_1$ and $X'_2$ respectively.

Proof. To prove the proposition, consider the identity:

$$X'_1 + t(X'_2 - X'_1) = AX_1 + f + t(AX_2 + f - AX_1 - f) = A\big(X_1 + t(X_2 - X_1)\big) + f. \quad \square$$

The next proposition proves the converse of Prop 1a.

Prop 1b. If the ordered point pairs $(X_1, X_2)$ on the first image and $(X^*_1, X^*_2)$ on the second image are not corresponding pairs under an affine transformation, then $X_1 + t(X_2 - X_1)$ and $X^*_1 + t(X^*_2 - X^*_1)$ are not corresponding points for almost all $t \in \mathbb{R}$.

Proof. To prove Prop 1b, let us assume that the corresponding points of $X_1$ and $X_2$ are given respectively by $X'_1$ and $X'_2$ on the second image. By Prop 1a, the corresponding point of $X_1 + t(X_2 - X_1)$ is $X'_1 + t(X'_2 - X'_1)$. Thus we need to prove that $X'_1 + t(X'_2 - X'_1) \neq X^*_1 + t(X^*_2 - X^*_1)$ for almost all values of $t$. So let us examine the solution set of $t$ for which the equality holds:

$$X'_1 + t(X'_2 - X'_1) = X^*_1 + t(X^*_2 - X^*_1).$$

This equation can be rearranged as $t(X'_2 - X'_1 - X^*_2 + X^*_1) = X^*_1 - X'_1$. By the premise of Prop 1b, $(X'_1, X'_2) \neq (X^*_1, X^*_2)$, so $(X'_2 - X'_1 - X^*_2 + X^*_1)$ and $X^*_1 - X'_1$ cannot both be zero vectors. Therefore the solution set of $t(X'_2 - X'_1 - X^*_2 + X^*_1) = X^*_1 - X'_1$ for $t$ is either empty or a singleton set (i.e., a unique solution). Thus the corresponding point of $X_1 + t(X_2 - X_1)$ equals $X^*_1 + t(X^*_2 - X^*_1)$ for at most a single value of $t$. From the measure-theoretic point of view a singleton set is of length zero. Therefore $(1 - t)X_1 + tX_2$ and $(1 - t)X^*_1 + tX^*_2$ are not corresponding points for almost all $t \in \mathbb{R}$. $\square$

Let us now extend the two propositions to ordered point triplets. As before, we assume that after the affine transformation the corresponding points of $X_1$, $X_2$ and $X_3$ are $X'_1$, $X'_2$ and $X'_3$ respectively. We consider the affine combination of the three points with two scalar parameters $t$ and $s$, and prove the following propositions Prop 2a and Prop 2b.

Prop 2a. The corresponding point of $X_1 + t(X_2 - X_1) + s(X_3 - X_1)$ after affine transformation is $X'_1 + t(X'_2 - X'_1) + s(X'_3 - X'_1)$ for every $[t\ s]^T \in \mathbb{R}^2$.

Proof. Similar to the proof of Prop 1a. $\square$

The converse proposition is as follows.

Prop 2b. If the ordered point triplets $(X_1, X_2, X_3)$ on the first image and $(X^*_1, X^*_2, X^*_3)$ on the second image are not corresponding triplets, then $X_1 + t(X_2 - X_1) + s(X_3 - X_1)$ and $X^*_1 + t(X^*_2 - X^*_1) + s(X^*_3 - X^*_1)$ are not corresponding points for almost all $[t\ s]^T \in \mathbb{R}^2$.

Proof. Let us assume as before that the corresponding points of $X_1$, $X_2$ and $X_3$ are given by $X'_1$, $X'_2$ and $X'_3$ respectively on the second image. Arguing as in the proof of Prop 1b, we need to examine the solution set of the following equation:

$$X'_1 + t(X'_2 - X'_1) + s(X'_3 - X'_1) = X^*_1 + t(X^*_2 - X^*_1) + s(X^*_3 - X^*_1),$$

which can be rearranged as:

$$\big[\, (X'_2 - X'_1 - X^*_2 + X^*_1) \;\; (X'_3 - X'_1 - X^*_3 + X^*_1) \,\big] \begin{bmatrix} t \\ s \end{bmatrix} = X^*_1 - X'_1.$$

Let us denote $C = [\,(X'_2 - X'_1 - X^*_2 + X^*_1) \;\; (X'_3 - X'_1 - X^*_3 + X^*_1)\,]$ and $b = X^*_1 - X'_1$. The equation will be consistent (i.e., will have at least one solution) if and only if the rank of the coefficient matrix $C$ equals that of the augmented matrix $[C\ b]$. Let us assume that the equation is consistent; since $C$ is of size 2-by-2, let us examine the following three cases:

Case I: Rank of $C$ is 2. In this case there is a unique solution.

Case II: Rank of $C$ is 1. In this case the general solution set can be written as $\{[t_0\ s_0]^T + v : v \in \mathrm{Null}(C)\}$, where $[t_0\ s_0]^T$ is a particular solution of the equation and $\mathrm{Null}(C)$ is the null space of the matrix $C$, i.e., the set of solutions of the homogeneous equation. We know that the dimension of $\mathrm{Null}(C)$ is the number of columns of $C$ minus the rank of $C$. Thus the dimension of $\mathrm{Null}(C)$ is 1, implying that the solution set in this case is a straight line in $\mathbb{R}^2$.

Case III: Rank of $C$ is 0. Here both the augmented matrix $[C\ b]$ and the coefficient matrix $C$ must be zero matrices. This is possible if and only if $X^*_1 = X'_1$, $X^*_2 = X'_2$ and $X^*_3 = X'_3$. However, this possibility is ruled out by the premise of the proposition.

Examining all three cases, we conclude that the solution set of $[t\ s]^T$ is at most a straight line. Once again the Lebesgue measure (in this case area) of the solution set is 0 and clearly insignificant compared to the joint parameter space $\mathbb{R}^2$ or a bounded plane in $\mathbb{R}^2$. $\square$

3. Applications

We describe two applications of the affine combination in image analysis here: registration and tracking.

3.1. Image Registration

We assume as before that images $I$ and $I'$ are related via an unknown 2D affine transformation, i.e., if $X$ and $X'$ are corresponding pixel locations on the two images respectively, then the intensity invariance holds: $I(X) = I'(X')$. Let us assume that a set of pixel locations $(X_1, X_2, \ldots, X_N)$ on image $I$ and another set of pixel locations $(Y_1, Y_2, \ldots, Y_M)$ on image $I'$ are given. We want to find the correspondence between the two sets under the unknown affine transformation.

Given the propositions about the affine combination discussed in Section 2, we attack this problem with a hierarchical matching technique. First, ordered pixel pairs from one set are matched against the ordered pixel pairs from the other set. In the next stage of matching, ordered point triplets are formed from a few best matching point pairs and matched. Note that to find the unknown 2D affine transformation (six parameters), one needs three pixel correspondences.

To match the ordered pixel pair $(X_1, X_2)$ on image $I$ with the ordered pixel pair $(Y_1, Y_2)$ on image $I'$, we compare the intensity signals point-to-point over the affine combination via the score:

$$\frac{1}{K_t} \sum_{t} \Big| I\big(X_1 + t(X_2 - X_1)\big) - I'\big(Y_1 + t(Y_2 - Y_1)\big) \Big|,$$

where $K_t$ is the number of discrete values of the parameter $t$. Alternatively, we may also consider normalized cross-correlation between $I\big(X_1 + t(X_2 - X_1)\big)$ and $I'\big(Y_1 + t(Y_2 - Y_1)\big)$. More general match measures such as mutual information can also be considered when the intensity invariance assumption does not hold between $I$ and $I'$.

In the second phase of matching, the ordered triplets are matched. In this phase, according to the matching scores obtained in the first phase, we choose only a few feature points from both images. Note that the first phase of matching helps reduce the computational complexity of the second phase. Next, between ordered triplets $(X_1, X_2, X_3)$ and $(X^*_1, X^*_2, X^*_3)$, we obtain the MAD score:

$$\frac{1}{K_t K_s} \sum_{t,s} \Big| I\big(X_1 + t(X_2 - X_1) + s(X_3 - X_1)\big) - I'\big(X^*_1 + t(X^*_2 - X^*_1) + s(X^*_3 - X^*_1)\big) \Big|,$$

where $K_s$ is the number of discrete values of the parameter $s$. The best matching ordered triplets are then chosen according to these scores (an example result is provided in Figure 1). If more than three point correspondences are needed, the second best matching triplets can also be chosen.

Figure 1. Three point correspondences found by hierarchical matching to compute the prevailing affine transformation between the images.
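For concreteness, the point-to-point comparison can be realized as follows. This is a minimal sketch rather than the authors' implementation: it samples $t$ at $K_t$ equispaced values in [0, 1], interpolates both images bilinearly (the text only states that intensities are interpolated, so the bilinear choice and all function names are our assumptions), and returns the MAD score for one ordered pair.

import numpy as np

def bilinear(img, pts):
    # Interpolate image intensities at fractional (x, y) = (col, row) points.
    x, y = pts[:, 0], pts[:, 1]
    x0 = np.clip(np.floor(x).astype(int), 0, img.shape[1] - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, img.shape[0] - 2)
    dx, dy = x - x0, y - y0
    return (img[y0, x0] * (1 - dx) * (1 - dy) + img[y0, x0 + 1] * dx * (1 - dy)
            + img[y0 + 1, x0] * (1 - dx) * dy + img[y0 + 1, x0 + 1] * dx * dy)

def pair_mad(img_a, img_b, X1, X2, Y1, Y2, Kt=16):
    # MAD between intensity profiles sampled along the affine combinations
    # X1 + t (X2 - X1) on img_a and Y1 + t (Y2 - Y1) on img_b.
    t = np.linspace(0.0, 1.0, Kt)[:, None]
    profile_a = bilinear(img_a, X1[None, :] + t * (X2 - X1)[None, :])
    profile_b = bilinear(img_b, Y1[None, :] + t * (Y2 - Y1)[None, :])
    return np.mean(np.abs(profile_a - profile_b))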

3.2. Object Tracking

The properties of the affine combination can be utilized in tracking objects within the optical flow framework. Let us assume that $S_{\mathrm{obj}}$ is the set of pixels (say, obtained via segmentation on the initial video frame) belonging to an object that we want to track. Let $X_1$, $X_2$ and $X_3$ be three corner pixel locations of the bounding box of $S_{\mathrm{obj}}$ (the choice of $X_1$, $X_2$ and $X_3$ can be arbitrary so long as they are non-collinear). Let $S$ be the set of parameters associated with the affine combination of $X_1$, $X_2$ and $X_3$:

$$S = \{(t, s) : X_1 + t(X_2 - X_1) + s(X_3 - X_1) \in S_{\mathrm{obj}}\}.$$

If the motion of the object can be approximated by a 2D affine transformation, then tracking $S_{\mathrm{obj}}$ is merely keeping track of the pixel locations $X_1$, $X_2$ and $X_3$ over the image frames, because the entire object is always designated via the affine combinations over the known set $S$ of parameter values.

To track the vectors $X_1$, $X_2$ and $X_3$ over an image sequence, we employ the optical flow constraint equation between two consecutive images $I$ (current frame) and $I_p$ (previous frame):8

$$[I_x(x, y)\ \ I_y(x, y)]\,[\dot{x}\ \ \dot{y}]^T = -I_t(x, y),$$

where $(x, y)$ is a pixel location, $(I_x, I_y)$ is the spatial image gradient at $(x, y)$, $I_t$ is the difference image between $I$ and $I_p$, and $(\dot{x}, \dot{y})$ is the velocity of the pixel $(x, y)$. When we express the optical flow constraint equation via the corner pixel vectors, it takes the following form:

$$[(1 - t - s)I_x\ \ tI_x\ \ sI_x\ \ (1 - t - s)I_y\ \ tI_y\ \ sI_y]\,[\dot{x}_1\ \ \dot{x}_2\ \ \dot{x}_3\ \ \dot{y}_1\ \ \dot{y}_2\ \ \dot{y}_3]^T = -I_t,$$

where the dot above a symbol denotes its first derivative with respect to time (note that $X_1$ is explicitly defined as $(x_1, y_1)$; $X_2$ and $X_3$ are defined similarly). We solve this equation for the six unknowns $[\dot{x}_1\ \ \dot{x}_2\ \ \dot{x}_3\ \ \dot{y}_1\ \ \dot{y}_2\ \ \dot{y}_3]^T$ by the least squares method, since the set of $(t, s)$ for which the equation holds is the set $S$, and the cardinality of $S$ is typically much larger than 6.

Once $[\dot{x}_1\ \ \dot{x}_2\ \ \dot{x}_3\ \ \dot{y}_1\ \ \dot{y}_2\ \ \dot{y}_3]^T$ is solved, $x_1$, etc. are updated as follows:

$$[x_1(b)\ \ x_2(b)\ \ x_3(b)\ \ y_1(b)\ \ y_2(b)\ \ y_3(b)]^T = [x_1\ \ x_2\ \ x_3\ \ y_1\ \ y_2\ \ y_3]^T + b\,[\dot{x}_1\ \ \dot{x}_2\ \ \dot{x}_3\ \ \dot{y}_1\ \ \dot{y}_2\ \ \dot{y}_3]^T,$$

where $b$ is a non-negative scalar parameter whose value, say $a$, is chosen by minimizing the energy functional:

$$a = \arg\min_b \sum_{(t,s) \in S} \Big[ I\big(x_1(b) + t(x_2(b) - x_1(b)) + s(x_3(b) - x_1(b)),\ y_1(b) + t(y_2(b) - y_1(b)) + s(y_3(b) - y_1(b))\big) - g(t, s) \Big]^2,$$

where $g(t, s)$ is a template function for the object; it can, for example, be the image intensity of the initial frame:

$$g(t, s) = I_0\big(x^0_1 + t(x^0_2 - x^0_1) + s(x^0_3 - x^0_1),\ y^0_1 + t(y^0_2 - y^0_1) + s(y^0_3 - y^0_1)\big).$$

The superscript 0 designates the vectors $(x_1, y_1)$ etc. for the target on the initial frame $I_0$. The update of the vectors can also be done in an iterative way for better convergence. Note that a multi-resolution approach can also be used in this minimization process for quick convergence. This proposed tracking framework using the affine combination integrates both object shape and its texture/intensity within a single template, and the computation reduces to solving linear equations.
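Assuming precomputed gradient images Ix, Iy and a temporal difference image It (e.g., from finite differences), one tracking update reduces to a small linear least-squares problem. The sketch below stacks one optical-flow constraint row per (t, s) in S; nearest-pixel sampling of the gradients and the function name are our simplifications.

import numpy as np

def corner_velocities(Ix, Iy, It, S, X):
    # Solve the stacked optical-flow constraints for the six unknowns
    # [dx1, dx2, dx3, dy1, dy2, dy3] by linear least squares.
    # S: iterable of (t, s) parameter pairs covering the object.
    # X: 3x2 array of corner points (x_i, y_i) on the current frame.
    A, rhs = [], []
    for t, s in S:
        p = X[0] + t * (X[1] - X[0]) + s * (X[2] - X[0])
        col, row = int(round(p[0])), int(round(p[1]))
        ix, iy = Ix[row, col], Iy[row, col]
        A.append([(1 - t - s) * ix, t * ix, s * ix,
                  (1 - t - s) * iy, t * iy, s * iy])
        rhs.append(-It[row, col])
    v, *_ = np.linalg.lstsq(np.asarray(A), np.asarray(rhs), rcond=None)
    return v  # corner-point velocities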

3.2.1. Rigid Body Constraint

Sometimes the frame-to-frame motion of an object can be approximated by a rigid body motion. In this case we can impose the constraints that the lengths between any two corner points of the bounding box of the object remain constant, i.e.,

$$(x_1 - x_2)^2 + (y_1 - y_2)^2 = C_1,$$
$$(x_2 - x_3)^2 + (y_2 - y_3)^2 = C_2,$$
$$(x_3 - x_1)^2 + (y_3 - y_1)^2 = C_3.$$

When we differentiate these three equations with respect to time and collect them in matrix-vector form, they become:

$$\begin{bmatrix} x_1 - x_2 & x_2 - x_1 & 0 & y_1 - y_2 & y_2 - y_1 & 0 \\ 0 & x_2 - x_3 & x_3 - x_2 & 0 & y_2 - y_3 & y_3 - y_2 \\ x_1 - x_3 & 0 & x_3 - x_1 & y_1 - y_3 & 0 & y_3 - y_1 \end{bmatrix} \begin{bmatrix} \dot{x}_1 \\ \dot{x}_2 \\ \dot{x}_3 \\ \dot{y}_1 \\ \dot{y}_2 \\ \dot{y}_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}.$$

This rigid body constraint equation can now be concatenated with the optical flow constraint equations and solved by the least squares method (an example of tracking is provided in Figure 2).

Figure 2. Tracking objects: four frames are shown. Affine and rigid body models are assumed respectively for the near and the far boat. To handle occlusion, we assume there is a front-to-back order for the objects along the viewing direction.
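Appending the rigid body constraint then amounts to three extra rows with zero right-hand sides; a minimal sketch under the same assumptions as before:

import numpy as np

def rigid_rows(X):
    # Rows of the differentiated length constraints; the unknown vector is
    # [dx1, dx2, dx3, dy1, dy2, dy3] and the right-hand sides are zero.
    (x1, y1), (x2, y2), (x3, y3) = X
    return np.array([
        [x1 - x2, x2 - x1, 0.0,     y1 - y2, y2 - y1, 0.0    ],
        [0.0,     x2 - x3, x3 - x2, 0.0,     y2 - y3, y3 - y2],
        [x1 - x3, 0.0,     x3 - x1, y1 - y3, 0.0,     y3 - y1],
    ])

# Concatenate with the optical-flow system (A, rhs) and solve as before:
# A_full = np.vstack([A, rigid_rows(X)])
# rhs_full = np.concatenate([rhs, np.zeros(3)])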


Summary and Future Work


This paper establishes the use of the affine combination in point correspondence resolution. We show that point-to-point matching of the intensity profile can compare ordered point pairs or triplets to establish correspondence. A hierarchical computational strategy is used to reduce the search. It is also illustrated that object tracking under an affine motion assumption can be performed by the use of the affine combination.

In future work we would like to use the affine combination (point-to-point intensity profile) matching in planar graph isomorphism. The planar graph is formed by suitably connecting the interest points (e.g., corner points). Although isomorphism/graph matching is known to be a hard problem, heuristic techniques may yield robust results with the use of the affine combination metric.

References

1. K. Mikolajczyk et al., International Journal of Computer Vision 65, 43 (2005).

2. J. Matas et al., Image and Vision Computing 22, 761 (2004).

3. T. Tuytelaars, International Journal of Computer Vision 59, 61 (2004).

4. K. Mikolajczyk and C. Schmid, in Proc. of 7th ECCV, Denmark (2002).

5. D. Tell and S. Carlsson, in European Conference on Computer Vision 1, 814 (2000).

6. D. Tell and S. Carlsson, in European Conference on Computer Vision 1, 68 (2002).

7. A. R. Rao and P. Bhimasankaram, Linear Algebra (1992).

8. B. K. P. Horn and B. G. Schunck, Artificial Intelligence 17, 185 (1981).



Progressive Transmission Scheme for Color Images Using BTC-PF Method

Bibhas Chandra Dhara

Department of Information Technology Jadavpur University, Salt Lake Campus

Kolkata 700098, India E-mail: [email protected]

Bhabatosh Chanda

Electronics and Communication Sciences Unit Indian Statistical Institute

Kolkata 700108, India E-mail: [email protected]

In this paper, a new color image progressive transmission scheme is proposed. In this method, a color image (RGB) is first transformed into the (O1O2O3) domain using a reversible transformation. Each image plane Oi (1 <= i <= 3) is then divided into a number of non-overlapping macroblocks. The size of the macroblocks depends on the visual sensitivity of the image planes (Oi). Each block is then decomposed in quadtree fashion based on a smoothness criterion. The leaf blocks of size 4 x 4 of each plane are then coded by the BTC-PF method. The transmissions of the different planes are interleaved. Finally, residual planes are encoded and transmitted accordingly. The experimental results show that the proposed method gives the least bit rate with reasonably good quality.

Keywords: color image progressive transmission, block truncation coding, pattern fitting, full-search progressive transmission tree, reversible color transformation.

1. Introduction

In the present age of information sharing and communication, the Internet has become ever more popular, but its main restriction is bandwidth. Especially during search, transmission of large images by the raster scan method takes a long time. Moreover, if after the completion of transmission the image is found not to be the desired one, both time and bandwidth are wasted. Progressive image transmission (PIT) provides a solution for this. In PIT, images are transmitted stage by stage. In the first stage the receiver reconstructs an approximation of the original image and accordingly can decide either to continue or to abort the transmission. Tzou in "Ref. 1" presented a thorough review and comparison of some PIT methods. The PIT methods can be classified into transform domain methods, pyramid-structure methods and spatial domain methods.

Transform domain PIT methods are mainly based on the DCT2 and wavelets.3,4 Different pyramid structure based PIT methods are reviewed in "Ref. 5". A method based on the Laplacian pyramid is presented in "Ref. 6".

The bit plane method (BPM)1 is the simplest and easiest spatial domain PIT method, but its initial image quality is very low; an improved BPM (IBPM)7 has been proposed to enhance the image quality in the initial stage at a higher data rate. The bit rate of PIT can be reduced by using vector quantization (VQ) based methods.8-10 Quadtree based PIT methods are proposed in "Ref. 11" to reduce the bit rate. The Guessing by Neighbor (GBN) method12 exploits the correlations among neighboring pixels. In "Ref. 13", a PIT method using block truncation coding (PBTC) has been proposed.

Almost all the above mentioned methods are PIT methods for grayscale images. These methods can also be used for color image progressive transmission. In the straightforward approach, a grayscale PIT technique is applied to each color plane (R, G, B) separately and information is then transmitted from each plane with equal importance. However, these methods do not use the correlation between the color planes. To achieve good results, the image is first transformed from RGB to some other domain XYZ; certain PIT methods are then used on these planes, and more data is transmitted from the visually sensitive plane compared to the other planes. These transmissions of data must be interleaved according to importance. The BPM and PBTC methods can be used in the first approach; let us call these methods CBPM and CPBTC, respectively. The expected bit rate of these methods, however, becomes very high. To reduce the bit rate of PBTC for color images, a method using a common bit map for each of the three color blocks (CBMPBTC) is presented in "Ref. 14". The GBN method addresses the second approach for color image transmission.

In this paper, a new color image progressive transmission method is proposed. In this method a color image (RGB) is first transformed into the (O1O2O3) domain using a reversible transformation.15 The organization of the paper is as follows. BTC-PF based PIT is described in Section 2. The proposed PIT method for color images is given in Section 3. Experimental results are reported in Section 4. Finally, conclusions are drawn in Section 5.

2. Progressive transmission of BTC-PF coded block

In the BTC-PF16 method, an n x n block is fitted with one of a set of codewords (patterns) having Q levels. At the time of reconstruction of the block, the index of the selected codeword and Q different mean values μt (1 <= t <= Q) are required. In this experiment image blocks of size 4 x 4 are coded by the BTC-PF method using two-level patterns. To reconstruct a block, gray values μ1, μ2 and the index I of the selected pattern are required. Instead of μ1 and μ2 (μ2 > μ1), two values A and d defined as A = (μ1 + μ2)/2 and d = (μ2 - μ1)/2 are used. To transmit and reconstruct the blocks a full search progressive transmission tree (FSPTT) is used. A partial structure of the FSPTT is shown in Fig. 1. The first level has 8 nodes, and the second and third levels have 32 and 128 nodes, respectively. In this FSPTT, the intermediate nodes (except the root node) are also present at the leaf level. The dominating power of the intermediate nodes ensures their occurrence as leaf nodes. (A, d, I) is transmitted progressively to the receiver. The transmission scheme is described below with an example (see Fig. 2).

Fig. 2. Illustration of progressive transmission of a block coded by BTC-PF: (a) original image block, (b) reconstructed image by BTC-PF with A = 171, d = 9 and I = 14 of Fig. 1, (c)-(h) step-by-step refinement of the block using F = 16 and F1 = 8.

Algorithm-I

Step 1. Transmission of gray level A. From A, two values Aq = A/F and Ar = A%F are obtained for some fixed value F (= 2^k).

1.a Transmit Aq to the receiver and then reconstruct the block with the value Aq*F and the root node of the FSPTT as the pattern (see Fig. 2c).

1.b Transmit Ar, and refine the reconstructed block with the previous pattern and A as the gray level (see Fig. 2d).

Step 2. Transmit three bits to represent the intermediate pattern (P1) at level 1 (assuming the root at level 0), along the path from the root to the selected leaf node. Along with the pattern index, transmit the d value. Like the A value, from d calculate dq = d/F1 and dr = d%F1 for some fixed F1.

2.a Transmit dr to the receiver and then reconstruct the block using P1, A, and dr (see Fig. 2e).

2.b Transmit dq, and refine the reconstructed block with P1, A, and d (see Fig. 2f).

Step 3. Two bits are transmitted to represent the pattern (P2) at level 2, along the same path. Reconstruct the block with P2, A and d (see Fig. 2g).

Fig. 1. A partial structure of FSPTT used in BTC-PF.


Step 4. Finally, the remaining two bits are transmitted to represent the actual pattern (Pk) at the leaf level, selected by BTC-PF, and the block is refined using Pk, A and d (see Fig. 2h).
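The numeric example of Fig. 2 can be reproduced directly. The sketch below assumes the two-level case with F = 16 and F1 = 8; the level-1 pattern P1 shown is a hypothetical stand-in for the pattern actually selected from the FSPTT.

import numpy as np

def reconstruct(pattern, A, d):
    # Two-level reconstruction: A - d on 0-pixels, A + d on 1-pixels.
    return np.where(pattern == 1, A + d, A - d)

A, d, F, F1 = 171, 9, 16, 8
Aq, Ar = A // F, A % F            # Step 1 splits A
dq, dr = d // F1, d % F1          # Step 2 splits d

root = np.zeros((4, 4), dtype=int)           # root node: flat pattern
stage1 = reconstruct(root, Aq * F, 0)        # Step 1.a: all pixels 160
stage2 = reconstruct(root, A, 0)             # Step 1.b: all pixels 171
P1 = np.array([[0] * 4] * 2 + [[1] * 4] * 2) # hypothetical level-1 pattern
stage3 = reconstruct(P1, A, dr)              # Step 2.a: levels 170 / 172
stage4 = reconstruct(P1, A, d)               # Step 2.b: levels 162 / 180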

3. Proposed progressive transmission scheme of a color image

In a color image high correlation exists between the R, G, B planes. So RGB is transformed into the less correlated reversible triplet (O1O2O3).15 The proposed method consists of two phases. In the first phase, the image planes Oi are coded in a lossy manner and are transmitted progressively in an interleaved format. In the second phase, the residual planes (Ōi) obtained from the Oi and the corresponding reconstructed planes are coded and transmitted.

3.1. RGB to O1O2O3 conversion

There are standard color conversions like RGB to YIQ, YUV, YCbCr etc., where YIQ etc. have less inter-correlation among the triplet than RGB. However, since the data in RGB as well as in the transformed domain are considered to be integers, the aforementioned transformations are, in a sense, lossy. In this work, the RGB to O1O2O3 transformation15 is used, which is known to be lossless. The RGB to O1O2O3 conversion is given by

O1 = ⌊(R + G + B)/3 + 0.5⌋
O2 = ⌊(R - B)/2 + 0.5⌋
O3 = B - 2G + R

and the corresponding inverse transformation is

B = O1 - O2 + ⌊O3/2 + 0.5⌋ - ⌊O3/3 + 0.5⌋
G = O1 - ⌊O3/3 + 0.5⌋
R = O1 + O2 + O3 - ⌊O3/2 + 0.5⌋ - ⌊O3/3 + 0.5⌋

where ⌊x⌋ stands for the largest integer not exceeding x. O1 represents the intensity or luminance component, while O2 and O3 together represent the chrominance at each pixel.
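The transform pair as reconstructed above can be verified in integer arithmetic, using the identities ⌊x/2 + 0.5⌋ = (x + 1) div 2 and ⌊x/3 + 0.5⌋ = (2x + 3) div 6 with floor division; the round trip is then exact. A minimal sketch:

import numpy as np

def rgb_to_o(rgb):
    # Forward reversible transform (integer arithmetic only).
    R, G, B = (rgb[..., i].astype(np.int64) for i in range(3))
    O1 = (2 * (R + G + B) + 3) // 6   # floor((R+G+B)/3 + 0.5)
    O2 = (R - B + 1) // 2             # floor((R-B)/2 + 0.5)
    O3 = B - 2 * G + R
    return O1, O2, O3

def o_to_rgb(O1, O2, O3):
    # Inverse transform; recovers R, G, B exactly.
    h = (O3 + 1) // 2                 # floor(O3/2 + 0.5)
    t = (2 * O3 + 3) // 6             # floor(O3/3 + 0.5)
    B = O1 - O2 + h - t
    G = O1 - t
    R = O1 + O2 + O3 - h - t
    return R, G, B

rgb = np.random.randint(0, 256, size=(8, 8, 3))
assert np.array_equal(np.stack(o_to_rgb(*rgb_to_o(rgb)), axis=-1), rgb)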

3.2. Phase I coding

The first phase consists of Algorithm-I (described in Section 2) preceded by quadtree partitioning of each macroblock.

3.2.1. Coding Scheme: O1 plane

The O1 plane represents the luminance component of the color image. First this plane is divided into a number of non-overlapping macroblocks of size 16 x 16. Based on a smoothness criterion (say, block variance) each block is decomposed in quadtree manner till we reach blocks of size 4 x 4. The decomposition of a block can be represented either by a single bit or by five bits. An example is shown in Fig. 3. In this coding technique, the leaf nodes of the quadtree are actually transmitted. The blocks of size 16 x 16 and 8 x 8 are encoded by only the block mean, while 4 x 4 blocks are encoded by the BTC-PF method.

Fig. 3. Example of the quadtree partition: (a) 16 x 16 block (B16), (b) corresponding quadtree; the encoding string is 11010.
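A sketch of the one-or-five-bit quadtree code of Fig. 3 follows; the variance threshold is an assumed instance of the smoothness criterion mentioned above.

import numpy as np

def partition_16x16(mb, var_thresh=100.0):
    # One 16x16 macroblock: the code is '0' (no split) or '1' plus one bit
    # per 8x8 child; a split 8x8 child always yields four 4x4 leaves, so the
    # code is either one or five bits long, e.g. '11010' as in Fig. 3.
    if mb.var() <= var_thresh:
        return '0', [(0, 0, 16)]
    bits, leaves = '1', []
    for dr in (0, 8):
        for dc in (0, 8):
            if mb[dr:dr + 8, dc:dc + 8].var() <= var_thresh:
                bits += '0'
                leaves.append((dr, dc, 8))
            else:
                bits += '1'
                leaves += [(dr + a, dc + b, 4) for a in (0, 4) for b in (0, 4)]
    return bits, leaves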

3.2.2. Coding Scheme: O2 and O3 plane

Both O2 and O3 represent the chrominance components and contain less information compared to the O1 plane. Like the O1 plane, these two planes are also first divided into macroblocks, here of size 32 x 32. Then each of them is decomposed until 8 x 8 blocks are reached. Blocks of all sizes are initially coded only by the block mean. To achieve better quality, from each 8 x 8 block one 4 x 4 block is generated by averaging over 2 x 2 windows. These 4 x 4 blocks are then encoded by BTC-PF. Here the reconstruction levels are μ1 and μ2, where μ2 is greater than the mean of the 8 x 8 block and μ1 is less than it. Let the difference between μ1 and the mean of the 8 x 8 block be δ1, and that for μ2 be δ2.


3.2.3. Progressive transmission in phase I

The positions of blocks of different sizes are transmitted by the bit streams representing the quadtree partition. Then the block means of the 16 x 16 and 8 x 8 blocks of the O1 plane, and those of the 32 x 32, 16 x 16 and 8 x 8 blocks of the O2 and O3 planes, are transmitted along with the A value of the 4 x 4 blocks of the O1 plane. The transmission of the bit stream up to step 1.a of Algorithm-I is stage 1 of the proposed method. Similarly, the transmissions up to steps 1.b, 2.a, 2.b, 3 and 4 are taken as stages 2, 3, 4, 5 and 6, respectively. Then the bit streams consisting of δ1, δ2, and seven bits for the selected patterns for the O2 and O3 planes are transmitted. This is considered as stage 7 of the proposed method.

3.3. Phase II coding

The second phase is used to achieve even better quality. In this phase, the residual planes Ōi are obtained from the Oi and their corresponding reconstructed planes. All three residual planes are encoded by the BTC-PF method considering the eight patterns at level 1 (assuming the root at level 0) of the FSPTT shown in Fig. 1.

3.3.1. Progressive transmission in phase II

The residual planes Ō1, Ō2 and Ō3 are transmitted in order. These steps are considered as stages 8, 9 and 10 of the proposed method.

4. Experimental Results

In our experiments a number of 512 x 512 color images are used. The performance of the proposed method is evaluated in terms of bit rate and peak signal-to-noise ratio (PSNR). The visual quality of the proposed method at different stages is shown in Figure 4. The stage by stage performance of the proposed PIT method is reported in Table 1. In phase I (up to stage 7) all bit rates are calculated directly on raw data, but in phase II entropy coding is used. Average results of different spatial domain PIT methods and the proposed method are reported in Table 2 for comparison. The present method is compared with the CBPM, CPBTC, CBMPBTC, and GBN methods. Table 2 reveals that the average quality of the reconstructed images at the very first stage is good enough while consuming the least number of bits compared to the other methods. At the first stage the bit rates of the CBPM, CPBTC and GBN methods are greater than or equal to 3 bpp and the qualities of the reconstructed images are 18.106 dB, 27.828 dB, and 20.83 dB, respectively. The results of the CBMPBTC method are 1.19 bpp and 24.99 dB, whereas the proposed method gives quality 22.051 dB at a bit rate of only 0.157 bpp. Hence, if after the first sketch of the image it is found that the image being transmitted is not the desired one and the transmission is terminated, the proposed method saves the most time and bandwidth. The table also shows that both PSNR and Cbpp of the other methods increase in large steps, whereas in the proposed method they increase in small steps. So if the receiver wants to terminate the transmission between two stages, it has to wait long for the current stage to be completed; this termination is least time consuming for the proposed method.

5. Conclusions

In this paper a PIT scheme for color images using BTC-PF is proposed. This method gives a low bit rate with good quality initial images. In this method color images (RGB) are first transformed to the (O1O2O3) domain by a reversible transformation, each image plane is then decomposed in quadtree fashion, and the blocks are encoded by the BTC-PF method. Finally, to improve the quality of the output images the residual planes are transmitted. The results show that the proposed method gives the least bit rate compared to other spatial domain PIT methods while achieving reasonably good quality.


Table 1. The stage by stage performance of the proposed PIT method in terms of PSNR and cumulative bit rate (Cbpp); each cell is PSNR / Cbpp, and * indicates results with entropy coding.

Stage | Airplane | Lena | Peppers | Splash | Tiffany | Average
1   | 21.314 / 0.164 | 22.597 / 0.159 | 21.538 / 0.200 | 22.521 / 0.126 | 22.285 / 0.139 | 22.051 / 0.157
2   | 23.865 / 0.311 | 25.774 / 0.301 | 24.009 / 0.379 | 25.671 / 0.238 | 25.532 / 0.261 | 24.970 / 0.298
3   | 24.246 / 0.460 | 26.208 / 0.446 | 24.308 / 0.531 | 25.933 / 0.318 | 25.756 / 0.370 | 25.290 / 0.425
4   | 25.174 / 0.534 | 27.122 / 0.518 | 24.986 / 0.607 | 26.584 / 0.358 | 26.042 / 0.424 | 25.981 / 0.488
5   | 27.217 / 0.583 | 27.745 / 0.566 | 25.913 / 0.657 | 27.973 / 0.384 | 27.567 / 0.460 | 27.283 / 0.530
6   | 27.935 / 0.632 | 28.592 / 0.614 | 26.223 / 0.707 | 28.206 / 0.410 | 27.894 / 0.496 | 27.770 / 0.571
7   | 28.798 / 0.720 | 28.978 / 0.676 | 27.807 / 0.886 | 30.025 / 0.527 | 28.650 / 0.568 | 28.851 / 0.675
8*  | 29.962 / 1.054 | 30.302 / 1.029 | 28.567 / 1.240 | 30.906 / 0.822 | 29.138 / 0.933 | 29.775 / 1.015
9*  | 30.389 / 1.279 | 31.128 / 1.339 | 29.066 / 1.539 | 31.707 / 1.061 | 29.688 / 1.232 | 30.395 / 1.290
10* | 31.384 / 1.521 | 32.185 / 1.593 | 29.854 / 1.801 | 32.647 / 1.298 | 30.818 / 1.544 | 31.377 / 1.551

Table 2. Comparative results of the proposed method with some other methods; each cell is PSNR / Cbpp.

Stage | CBPM | CPBTC-16 | CBMPBTC | GBN | Prop.
1  | 18.106 / 3  | 27.828 / 3.187  | 24.99 / 1.19  | 20.83 / 3.00  | 22.051 / 0.157
2  | 22.387 / 6  | 33.929 / 6.296  | 28.34 / 2.38  | 22.40 / 4.82  | 24.970 / 0.298
3  | 28.466 / 9  | 39.737 / 9.468  | 30.67 / 3.76  | 31.38 / 7.82  | 25.290 / 0.425
4  | 34.475 / 12 | 45.729 / 12.577 | 32.04 / 5.51  | 34.46 / 10.03 | 25.981 / 0.488
5  | 40.490 / 15 | 52.328 / 15.307 | 33.01 / 8.00  | 39.71 / 13.03 | 27.283 / 0.530
6  | 46.277 / 18 | 60.569 / 17.440 | 34.32 / 11.97 | 46.40 / 15.79 | 27.770 / 0.571
7  | 51.253 / 21 | -               | -             | -             | 28.851 / 0.675
8  | -           | -               | -             | -             | 29.775 / 1.015
9  | -           | -               | -             | -             | 30.395 / 1.290
10 | -           | -               | -             | -             | 31.377 / 1.551

References

1. K. H. Tzou, Optical Engineering 26, 581 (1987).

2. T. S. Chen and C. Y. Lin, A new improvement of JPEG progressive image transmission using weight table of quantized DCT coefficient bits, in Proceedings of the Third IEEE Pacific Rim Conference on Multimedia, 2002.

3. J. M. Shapiro, IEEE Transactions on Signal Processing 41, 3445 (1993).

4. P. Y. Tasi, Y. C. Hu and C. C. Chang, Signal Processing: Image Communication 19, 285 (2004).

5. M. Goldberg and L. Wang, IEEE Transactions on Communications 39, 540 (1991).

6. G. Qiu, IEEE Transactions on Image Processing 8, 109 (1999).

7. C. C. Chang, F. C. Shine and T. S. Chen, A new scheme of progressive image transmission based on bit-plane method, in Proceedings of the Fifth Asia Pacific Conference on Communications and Fourth Optoelectronics and Communications Conference, 1999.

8. L. Wang and M. Goldberg, IEE Proceedings 135, 421 (1988).

9. W. J. Hwang and B. Y. Ye, IEEE Transactions on Consumer Electronics 43, 17 (1997).

10. E. A. Riskin, R. Lander, R. Y. Wang and L. E. Atlas, IEEE Transactions on Image Processing 3, 307 (1994).

11. Y. C. Hu and J. H. Jiang, Real Time Imaging 11, 59 (2005).

12. C. C. Chang, T. K. Shih and I. C. Lin, The Visual Computer 19, 342 (2003).

13. C. C. Chang, H. C. Hsia and T. S. Chen, A progressive image transmission scheme based on block truncation coding, in Lecture Notes in Computer Science, Vol. 2105 (Springer-Verlag, Berlin Heidelberg, 2001).

14. C. C. Chen and M. N. Wu, A color image progressive transmission method by common bit map block truncation coding approach, in Proceedings of ICCT2003, 2003.

15. K. Komatsu and K. Sezaki, in Proc. of SPIE Visual Communications and Image Processing 2727, 1094 (1996).

16. B. C. Dhara and B. Chanda, Pattern Recognition 37, 2131 (2004).


Fig. 4. (a) Original image; (b)-(k) reconstructed images by the proposed method at different stages: (b) Stage 1: PSNR=22.597 dB, Cbpp=0.159; (c) Stage 2: PSNR=25.774 dB, Cbpp=0.301; (d) Stage 3: PSNR=26.208 dB, Cbpp=0.446; (e) Stage 4: PSNR=27.122 dB, Cbpp=0.518; (f) Stage 5: PSNR=27.745 dB, Cbpp=0.566; (g) Stage 6: PSNR=28.592 dB, Cbpp=0.614; (h) Stage 7: PSNR=28.978 dB, Cbpp=0.676; (i) Stage 8: PSNR=30.302 dB, Cbpp=1.029; (j) Stage 9: PSNR=31.128 dB, Cbpp=1.339; (k) Stage 10: PSNR=32.185 dB, Cbpp=1.593.



Registration Algorithm for Motion Blurred Images

K. V. Arya and P. Gupta

Department of Computer Science & Engineering, Indian Institute of Technology Kanpur,

Kanpur-208 016, India E-mail: {kvarya,pg}@cse.iitk.ac.in

This paper proposes an algorithm for restoring motion blurred images. The restored images are then used for registering a template image. Restoration and registration of blurred images is a very important problem from an application point of view. For restoration of motion blurred images, a modified Wiener filter based technique has been developed. An algorithm to register the restored image with a given template image has also been proposed. The experimental results demonstrate that templates can successfully be registered in images with substantial amounts of artificial and natural motion blur.

Keywords: Correlation coefficient; Image registration; Image restoration; Motion blur; Wiener filter.

1. Introduction

Due to relative motion between the camera and the scene during image acquisition, captured images get blurred along the direction of relative motion. The noise and blur mixed with the original image data put constraints on the registration of the images. It is often required to recognize human faces or objects in such blurred images, for example in tracking and identification of criminals, where the image of a human face or the number plate of a running vehicle taken in a hit-and-run situation gets blurred due to relative motion between the camera and the object/face. This may be done through image restoration and subsequent registration of the target face or object. Image registration is the process of finding a point to point mapping between two images.1 One of the images is called the reference or template image and the other one is called the target or sensed image. The reference image is usually kept unchanged, while the target image is a newly scanned image whose geometry must be changed.

The restoration of a blurred and noisy image depends on the blurring system model. To restore images from noisy and blurred observations, a number of restoration methods have been developed over the past two decades.2-6 This paper deals with two critical problems. The first is the restoration of the image from the motion blurred image. The second is to register the restored target image with the original template image. The point spread function (PSF) parameters (namely blur length and blur direction) proposed in Refs. 6-11 are used in solving the first problem. A Wiener filter based image restoration algorithm has been developed in Ref. 7. Cole12 has modified the Wiener filter. In this work a modified Wiener filter based restoration algorithm has been developed. The primary task in restoration is to estimate the power spectrum of the blurred image, which is then used to determine the PSF parameters.

To register the restored target image with the original template image, a method based on the selective correlation coefficient (SCC)13 is used. The use of a mask function in the computation of the similarity measure suppresses the pixels of noisy regions from taking part in the computation and hence makes the registration process fast. In SCC, the binary mask function forces the gray scale images to be converted to binary images; this additional step slows down the process of determining the similarity measure. In this paper a real-valued mask function is defined to reduce the number of pixels taking part in the similarity measure computation. The mask function reduces the influence of noisy pixels of the restored image, thereby reducing the number of pixels taking part in the correlation computation, which makes the process fast.

The rest of the paper is organized as follows. Section 2 presents the image degradation model and a brief idea about the determination of the PSF parameters. The image restoration method is presented in Section 3. The proposed image registration method is discussed in Section 4. The experimental results are given in Section 5. Conclusions are given in the last section.


2. Blur Model and Parameters Estimation

Images captured in uncontrolled environments invariably represent a degraded version of an original image due to imperfections in the imaging process. The model shown in Fig. 1 describes the generalized process of image degradation and restoration.14,15 A degradation function $h(x, y)$, together with a noise term $n(x, y)$, operates on an input image $f(x, y)$ to produce a degraded image $g(x, y)$. The observed blurred/noisy image in the spatial domain is given by

$$g(x, y) = h(x, y) \otimes f(x, y) + n(x, y), \qquad (1)$$

where the symbol $\otimes$ indicates spatial convolution. Since convolution in the spatial domain is multiplication in the frequency domain, the degradation model in (1) can be written in the frequency domain as

$$G(m, n) = H(m, n) F(m, n) + N(m, n),$$

where $G(m, n)$, $H(m, n)$, $F(m, n)$ and $N(m, n)$ are the Fourier transforms of $g(x, y)$, $h(x, y)$, $f(x, y)$ and $n(x, y)$ respectively.

Fig. 1. Image Degradation/Restoration Model.

The blur direction is identified by applying the Hough transform on the spectrum of the blurred image and identifying the direction of the line in the spectrum. The blur length is found by rotating the Fourier spectrum of the blurred image in the estimated direction and then observing the negative value in the inverse Fourier transform. The detailed procedure to determine these PSF parameters is given in Ref. 7.

3. Blurred Image Restoration

The first task in restoring the image from the motion blurred image is to identify the PSF parameters, blur length and blur direction. In this work these parameters are identified using the algorithms given in Ref. 7. The values of these parameters are then used to determine the power spectrum $H(m, n)$ of the degradation function. Many approaches to image restoration have been reported in the literature14 but they perform poorly in the presence of noise. Several variations of the Wiener filter7,15,16 are used to restore the blurred image, where image and noise are both considered as random processes.17 The filter then finds an estimate $\hat{f}(x, y)$ of the ideal image $f(x, y)$ such that the mean square error between them is minimized.

In this work the Wiener filter has been modified as suggested by Cole.12 The modified restoration filter is represented by the following transfer function,

$$|R(m, n)|^2 = \frac{F(m, n)}{|H(m, n)|^2 F(m, n) + N(m, n)}, \qquad (2)$$

where $R(m, n)$ is the Fourier transform of $r(x, y)$. The Fourier transform of the restoration filter output is then given by

$$\hat{F}(m, n) = |R(m, n)|^2 G(m, n) = \frac{F(m, n)}{|H(m, n)|^2 F(m, n) + N(m, n)}\, G(m, n), \qquad (3)$$

where $G(m, n)$ represents the Fourier transform of the observed degraded image $g(x, y)$ and is given by the following expression,

$$G(m, n) = |H(m, n)|^2 F(m, n) + N(m, n).$$

In (3) the knowledge of the power spectrum of the original image is required, which is rarely known for motion blurred images. Therefore, (3) can be approximated by the following expression,

$$\hat{F}(m, n) = \frac{1}{|H(m, n)|^2 + K}\, G(m, n),$$

where $K$ is a constant that can be determined through experiments.

The restored image in the spatial domain is given by the inverse Fourier transform of the frequency domain estimate $\hat{F}(m, n)$. Therefore, the restored image $\hat{f}(x, y)$ is given by $\hat{f}(x, y) = \mathfrak{F}^{-1}\{\hat{F}(m, n)\}$.
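A compact frequency-domain rendering of the approximated filter above is given below; the linear motion PSF construction and the default value of K are our assumptions, K normally being tuned experimentally as noted.

import numpy as np

def motion_psf(length, angle_deg, shape):
    # Linear motion-blur PSF of a given blur length and direction (degrees),
    # placed at the center of an array of the image shape (assumed helper).
    psf = np.zeros(shape)
    cy, cx = shape[0] // 2, shape[1] // 2
    th = np.deg2rad(angle_deg)
    for i in range(length):
        r = i - (length - 1) / 2.0
        psf[int(round(cy + r * np.sin(th))), int(round(cx + r * np.cos(th)))] = 1.0
    return psf / psf.sum()

def restore(g, psf, K=0.01):
    # F_hat(m, n) = G(m, n) / (|H(m, n)|^2 + K), then inverse FFT.
    G = np.fft.fft2(g)
    H = np.fft.fft2(np.fft.ifftshift(psf))
    F_hat = G / (np.abs(H) ** 2 + K)
    return np.real(np.fft.ifft2(F_hat))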

4. Image Registration

This section describes the process for registering a template image in the restored target image. The schematic block diagram of the registration process is shown in Fig. 2. The registration process outputs the restored template image with a registered window for the input blurred template image and query image. The algorithm given here computes the correlation coefficient as a similarity measure between the target image and sliding template image windows in the restored image.

Fig. 2. Schematic Diagram of Image Registration Process.

The window having the highest correlation value indicates the location of the query image. To compute the correlation coefficient we have modified the scheme proposed by Kaneko et al.,13 in which the selective correlation coefficient (SCC) is computed by generating a selection mask for each pixel before calculating the correlation coefficient. The mask defined there is binary in nature and hence the given image is converted to a mask-image where each pixel has the brightness value either '0' or '1'. Based on the mask-image, the pixels that take part in the correlation computation are selected. The binary mask function defined in SCC computation performs poorly in the case of restored images, as it is not possible to remove the blur and noise completely due to the limitations of deblurring techniques. Therefore, in this work the SCC scheme is modified by taking a real-valued mask function. The mask function used here is defined using the concept of the Geman and Reynolds M-estimator,18,19 which reduces the influence of noisy pixels. Hence, noisy pixels have very little contribution to the computation of the similarity measure. This makes the similarity measure more effective in the presence of noise.

Here a two-dimensional image is represented as a one-dimensional list of pixel brightness values taken in row-major fashion. Let $f = \{f_i\}_{i=1}^{n}$ define a template/query image and $\hat{f} = \{\hat{f}_i\}_{i=1}^{n}$ a restored target image window of the same size $n$ as the query image. The correlation coefficient between the template and target images is defined as follows,

$$\frac{\sum_{i=1}^{n} m_i (\hat{f}_i - \bar{\hat{f}})(f_i - \bar{f})}{\sqrt{\sum_{i=1}^{n} m_i (\hat{f}_i - \bar{\hat{f}})^2}\, \sqrt{\sum_{i=1}^{n} m_i (f_i - \bar{f})^2}},$$

where $\bar{\hat{f}}$ and $\bar{f}$ are the average brightness values of the restored target image and the template image respectively, and the mask function $m_i$ is defined using the concept explained in Ref. 18 and is given in (4). Let $x_i$ represent the difference in brightness values of the $i$th pixel in the two images, i.e., $x_i = (\hat{f}_i - f_i)/255$. Then for all pixels $i = 1, 2, \cdots, n$

$$m_i = \frac{1}{(1 + k\,|x_i|)^2}, \qquad (4)$$

where $k$ is a positive constant.
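For one sliding window, the masked correlation can be computed as follows; the weight follows the form assumed above for Eq. (4), and the constant k is introduced here only for illustration.

import numpy as np

def masked_correlation(target_win, template, k=10.0):
    # Correlation coefficient weighted by a real-valued robust mask that
    # down-weights pixels where the restored target and the template
    # disagree strongly (assumed Geman-Reynolds-style weight).
    f_hat = target_win.ravel().astype(float)
    f = template.ravel().astype(float)
    x = (f_hat - f) / 255.0
    m = 1.0 / (1.0 + k * np.abs(x)) ** 2
    a, b = f_hat - f_hat.mean(), f - f.mean()
    den = np.sqrt(np.sum(m * a * a) * np.sum(m * b * b))
    return np.sum(m * a * b) / den if den > 0 else 0.0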

5. Experimental Results

The experiments are performed on approximately 50 gray scale images having natural as well as artificial motion blur. Four representative images are shown in Figs. 3 to 6. The experimental results with natural blur images are shown in Fig. 3 and Fig. 4, and those with artificial motion blur are given in Fig. 5 and Fig. 6. In all the experimental results, (a), (b) and (c) indicate the query image, blurred target image and matched window on the restored target image respectively.

Fig. 3. Image with natural motion blur: (a) query image, (b) blurred target image and (c) registered window using the restored image obtained by the proposed algorithm.

Fig. 4. Image with natural motion blur: (a) query image, (b) blurred target image and (c) registered window using the restored image obtained by the proposed algorithm.

Fig. 5. Image with artificial motion blur with blur length 20 and blur direction 35: (a) query image, (b) blurred target image and (c) registered window using the restored image obtained by the proposed algorithm.

It is observed that the proposed registration algorithm correctly matched the query image in all the restored images shown. The algorithm performs poorly on images with artificial motion blur having blur length and blur direction of more than 45 and 65 respectively, but it shows robust performance in cases of natural motion blur.

Fig. 6. Image with artificial motion blur with blur length 25 and blur direction 55: (a) query image, (b) blurred target image and (c) registered window using the restored image obtained by the proposed algorithm.

6. Conclusions

Algorithms for automated restoration and registration of motion blurred images are presented here. They are likely to have application in identification tasks in uncontrolled environments. The image restoration algorithm uses the knowledge of the PSF parameters, viz. blur length and blur direction, while the registration algorithm is based on a mask-based correlation computation method. Experimental results demonstrate the effectiveness of the algorithms on a wide range of images with natural and varying degrees of artificially created motion blur.

The algorithm may be extended using more sophisticated registration techniques. The point spread function parameters may also be utilized in the registration process for improved correlation computation.

References

1. L. G. Brown, ACM Computing Surveys 24, 326 (1992).

2. H. C. Andrews and B. R. Hunt, Digital Image Restoration (Prentice-Hall, Englewood Cliffs, NJ, 1977).

3. W. J. Woods and V. K. Ingle, IEEE Trans. Acoustic Speech and Signal Processing 20, 188 (1981).

4. R. L. Lagendijk, J. Biemond and D. E. Boekee, IEEE Trans. Acoustic Speech and Signal Processing 36, 1874 (1988).

5. R. L. Lagendijk and J. Biemond, Basic Methods for Image Restoration and Identification, in Handbook of Image and Video Processing, ed. A. Bovik (Academic Press, 2000), pp. 125-140.

6. Q. Li and Y. Yoshida, IEICE Trans. Fundamentals E80-A, 1 (1997).

7. R. Lokhande, K. V. Arya and P. Gupta, Identification of parameters and restoration of motion blurred images, in Proc. 2006 ACM Symposium on Applied Computing, (Dijon, France, 2006).

8. M. Cannon, IEEE Trans. Acoust. Speech Signal Process. 24, 56 (1976).

9. M. M. Chang, A. M. Tekalp and A. T. Erdem, IEEE Trans. Signal Processing 39, 2323 (1991).

10. R. Fabian and D. Malah, CVGIP: Graphical Models and Image Processing 53, 403 (1991).

11. D. B. Gennery, J. Opt. Soc. Amer. 63, 1571 (1973).

12. E. R. Cole, The removal of unknown image blurs by homomorphic filtering, PhD thesis, Department of Electrical Engineering, University of Utah, Salt Lake City, UT (June 1973).

13. S. Kaneko, Y. Satoh and S. Igarashi, Pattern Recognition 36, 1165 (2003).

14. M. R. Banham and A. K. Katsaggelos, IEEE Signal Processing Magazine 14, 24 (1997).

15. R. C. Gonzalez and R. E. Woods, Digital Image Processing (Pearson Education, 2003).

16. I. Pitas, Digital Image Processing Algorithms and Applications (John Wiley and Sons, Inc., 2000).

17. K. S. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science Applications, 2nd Edition (John Wiley & Sons, 2002).

18. M. Black and A. Rangarajan, Intl. Journal of Computer Vision 19, 57 (1996).

19. D. Geman and G. Reynolds, IEEE Trans. Pattern Anal. Machine Intell. 14, 376 (1992).


PART G

Image Segmentation



Aggregation Pheromone Density Based Change Detection in Remotely Sensed Images

Megha Kothari and Susmita Ghosh

Department of Computer Science and Engineering Jadavpur University

Kolkata 700032, India

Ashish Ghosh

Machine Intelligence Unit and Center for Soft Computing Research Indian Statistical Institute

203 B. T. Road, Kolkata 700108, India E-mail: [email protected]

Ants, bees and other social insects deposit pheromone (a type of chemical) in order to communicate between the members of their community. Pheromone that causes clumping or clustering behavior in a species and brings individuals into closer proximity is called aggregation pheromone. This article presents a novel method for change detection in remotely sensed images considering the aggregation behavior of ants. Change detection is viewed as a segmentation problem where changed and unchanged regions are segmented out via clustering. At the location of each data point, representing a pixel, an ant is placed, and the ants are allowed to move in the search space to find points with higher pheromone density. The movement of an ant is governed by the amount of pheromone deposited at different points of the search space. The more pheromone is deposited, the more the ants aggregate. This leads to the formation of homogeneous groups of data. Evaluation on two multitemporal remote sensing images establishes the effectiveness of the proposed algorithm over an existing thresholding algorithm.

Keywords: Change detection; Remote sensing; Aggregation pheromone system

1. Introduction

Automatic detection of landcover transitions in multitemporal remotely sensed images is of widespread interest due to a large number of real world applications ranging from forestry and agricultural surveys to urban studies and natural disaster management.1-3 In many real world applications of change detection, the objective is to map only one (or a few) landcover transition(s) of interest, like agricultural land to urban area, forest to burned area, forest to agricultural land etc. Landcover change detection using multi-temporal remote-sensing images is a challenging task due to the lack of a priori information about the shape of the changed areas, the absence of a reference background, and differences in light conditions, atmospheric conditions, sensor calibration, ground moisture at the two image acquisition dates, alignment of the multi-temporal images (registration) etc.4,5

These factors restrict the use of most classical multi-temporal image-analysis techniques to a few remote-sensing change analysis problems. Several techniques are available in the literature6-11 where Neural Networks, Markov Random Fields (MRF) and Support Vector Machines (SVM) are used for detecting changes in remotely sensed images. In this article change detection is viewed as a segmentation problem, where changed and unchanged regions are segmented using clustering. Clustering is performed considering the aggregation behavior found in ants and ant-like agents.

Numerous clustering algorithms have been developed inspired by the ability of ants to cluster their corpses into "cemeteries" in an effort to clean up their nests.12 Besides nest cleaning, many functions of aggregation behavior have been observed in ants and ant-like agents.13,14 These include foraging-site marking and mating, finding shelter and defense. For example, after finding safe shelter, cockroaches produce a specific pheromone with their excrement, which attracts other members of their species.14

Tsutsui and Ghosh15 used aggregation pheromone systems for continuous function optimization, where the aggregation pheromone density is defined by a density function in the search space. Inspired by the aforementioned aggregation behavior found in ants and other similar agents, attempts have already been made to solve clustering16 and image segmentation17 problems. In this article this metaphor is used to detect changes in remote sensing images.


2. Aggregation Pheromone Density based Change Detection

As mentioned in the introduction, aggregation pheromone brings individuals into closer proximity. This group-forming nature of aggregation pheromone is used as the basic idea of the proposed technique. Here each ant represents one data point/pixel of an image. The ants move with the aim of creating homogeneous groups of data. The amount of movement of an ant towards a point is governed by the intensity of the aggregation pheromone deposited by all other ants at that point. This gradual movement of the ants will in due course result in the formation of groups or clusters. The proposed technique has two parts. In the first part, clusters are formed based on the ants' property of depositing aggregation pheromone. The number of clusters thus formed might be more than the desired number. So, to obtain the desired number of clusters, in the second part the agglomerative average linkage18 clustering algorithm is applied on these already formed clusters.

2.1. Formation of Clusters

While performing segmentation of a given image, we group similar pixels together to form a set of coherent image regions. Similarity of pixels can be measured based on intensity, color, texture and consistency of location. Individual features or a combination of them can be used to represent a pixel of an image. With each pixel we associate a feature vector x. Clustering is then performed on all the pixels to group them into segments.

Let us consider a data set of n patterns (x1, x2, x3, ..., xn) and a population of n ants (A1, A2, A3, ..., An), where an ant Ai represents the data pattern xi. Each individual ant emits pheromone in its neighborhood. The intensity of the pheromone emitted by an individual A at x decreases with the distance from it. Thus the pheromone intensity at a point closer to x is more than at points farther from it. To achieve this, the pheromone intensity emitted by A is assumed to follow a Gaussian distribution. The pheromone intensity deposited at x' by an ant A (located at x) is given by

$$\Delta\tau(A, \mathbf{x}') = \exp\left(-\frac{d(\mathbf{x}, \mathbf{x}')^2}{2\delta^2}\right), \qquad (1)$$

and the total aggregation pheromone density deposited by the entire population of $n$ ants at $\mathbf{x}'$ is then given by

$$\Delta\tau(\mathbf{x}') = \sum_{i=1}^{n} \exp\left(-\frac{d(\mathbf{x}_i, \mathbf{x}')^2}{2\delta^2}\right), \qquad (2)$$

where $\delta$ denotes the spread of the Gaussian function. In a similar way, we can compute the total aggregation pheromone density $\Delta\tau(\mathbf{x})$ for any point $\mathbf{x}$.

Now, an ant $A'$ which is initially at location $\mathbf{x}'$ moves to a new location $\mathbf{x}''$ (computed using Eq. 3) if the total aggregation pheromone density at $\mathbf{x}''$ is greater than that at $\mathbf{x}'$. The movement of an ant is governed by the amount of pheromone deposited at different points in the search space. It is defined as

$$\mathbf{x}'' = \mathbf{x}' + \eta\,\frac{\mathrm{Next}(A')}{n}, \qquad (3)$$

where

$$\mathrm{Next}(A') = \sum_{i=1}^{n} (\mathbf{x}_i - \mathbf{x}')\,\exp\left(-\frac{d(\mathbf{x}_i, \mathbf{x}')^2}{2\delta^2}\right), \qquad (4)$$

with $\eta$ as a step size. This process of finding a new location continues until an ant finds a location where the total aggregation pheromone density is greater than at its neighboring points. Once the ant Ai finds such a point x'i with greater pheromone density, that point is taken as a new potential cluster center, say zj (j = 1, 2, ..., C, C being the number of clusters), and the data point with which the ant was associated earlier (i.e., xi) is assigned to the cluster so formed with center zj. Also, the data points that are within a distance of δ/2 from zj are assigned to the newly formed cluster. On the other hand, if the distance between x'i and an existing cluster center zj is less than δ/2 and the ratio of their densities is greater than threshold_density (a predefined parameter), then the data point xi is allocated to the already existing cluster centered at zj. A higher value of the density ratio indicates that the two points are of nearly similar density and hence should belong to the same cluster. The proposed aggregation pheromone based clustering (APC) algorithm for formation of clusters is given below.

begin
  Initialize δ, threshold_density, η and C = 0
  for i = 1 : n do
    if the data pattern xi is not already assigned to any cluster
      Compute Δτ(xi) using Eq. 2
      label 1: Compute new location x'i using Eq. 3
      Compute Δτ(x'i)
      if (Δτ(x'i) > Δτ(xi))
        Update the location of ant Ai to x'i and goto label 1
      else continue
      if (C == 0)  // if no cluster exists
        Consider x'i as cluster center z1 and increase C by one
      else
        for j = 1 : C
          if (min(Δτ(x'i), Δτ(zj)) / max(Δτ(x'i), Δτ(zj)) > threshold_density
              and d(x'i, zj) < δ/2)
            Assign x'i to zj
          else
            Assign x'i as a new cluster center, say z(C+1), and increase C by one
            Assign all the data points within a distance of δ/2 from x'i
            to the newly formed cluster z(C+1)
          end of else
        end of for
      end of else
    end of if
  end of for
end
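A compact runnable rendering of the above loop is sketched below; it is not the authors' implementation, and the iteration cap and parameter defaults are our additions.

import numpy as np

def apc_clusters(X, delta=0.5, thr_density=0.9, eta=1.0, max_steps=100):
    # Sketch of the APC loop: each ant hill-climbs the pheromone density
    # field of Eq. (2) via the move of Eq. (3); stable points become centers.
    def density(p):
        d2 = np.sum((X - p) ** 2, axis=1)
        return np.sum(np.exp(-d2 / (2.0 * delta ** 2)))

    def move(p):
        d2 = np.sum((X - p) ** 2, axis=1)
        w = np.exp(-d2 / (2.0 * delta ** 2))
        return p + eta * (w[:, None] * (X - p)).sum(axis=0) / len(X)

    centers, labels = [], -np.ones(len(X), dtype=int)
    for i in range(len(X)):
        if labels[i] >= 0:
            continue                      # already assigned to a cluster
        p = X[i].astype(float).copy()
        for _ in range(max_steps):        # climb until density stops rising
            q = move(p)
            if density(q) <= density(p):
                break
            p = q
        for j, z in enumerate(centers):   # join a similar, nearby center
            ratio = min(density(p), density(z)) / max(density(p), density(z))
            if ratio > thr_density and np.linalg.norm(p - z) < delta / 2:
                labels[i] = j
                break
        else:                             # otherwise open a new cluster
            centers.append(p)
            near = np.linalg.norm(X - p, axis=1) < delta / 2
            labels[near] = len(centers) - 1
            labels[i] = len(centers) - 1
    return np.array(centers), labels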

2.2. Merging of Clusters

In this stage, to obtain the desired number of clusters we apply the average linkage18 algorithm. The algorithm works by merging the two most similar clusters until the desired number of clusters is obtained.

3. Experimental Results

3.1. Description of Data Set

Two real multitemporal image sets, namely Mexico and lake Mulargia, have been used for the experiments. The Mexico image was acquired on 18th April 2000 and 20th May 2002 by the Thematic Mapper Plus (TM+) sensor of the Landsat-7 satellite. From the entire available Landsat scene, a section of 512x512 pixels has been selected as the test site. A significant portion of the vegetation in the aforesaid area was destroyed by wildfire between the two dates. Figs. 1(a) and 1(b) show channel 4 of the April and May images, respectively. Experts generated the reference map [Fig. 1(c)] by visual analysis of the images with the help of available ground truth concerning the location of the wildfire. The reference map is required to assess the change detection errors.

The image of lake Mulargia, on Sardinia Island, was acquired by the Thematic Mapper Plus (TM+) sensor of the Landsat-5 satellite in September 1995 and July 1996. Between the two acquisition dates the water level in the lake increased. Fig. 2 shows a section of 412x300 pixels of channel 4 of the September and July images, along with the reference map.

3.2. Results

From the given multitemporal images we compute the difference image by taking the absolute difference of the two co-registered images of different dates. The difference images corresponding to Mexico and lake Mulargia are shown in Figs. 1(d) and 2(d), respectively. The feature vector corresponding to each pixel of the difference image is generated by considering its gray value and the average gray value of the neighboring pixels (second order), as sketched below. The algorithm described in Section 2 is then applied on this difference image to obtain the corresponding change detection map [Figs. 3(a) and 4(a)]. The values of \eta and threshold-density are kept at 1 and 0.9, respectively, and different values of \delta in the range [0, 1] are considered. We determine a stable range of \delta for which the clusters were compact; results are shown for a \delta value taken from this range. The results obtained by the proposed APC algorithm are compared with those obtained by the ground truth based optimal Manual Trial and Error Thresholding (MTET) technique and are depicted in Table 1.
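The construction of the difference image and of the per-pixel feature vector described above can be sketched as follows (a NumPy illustration under our reading of the "second order" neighbourhood as a 3x3 window; the random inputs merely stand in for the two co-registered images):

import numpy as np
from scipy.ndimage import uniform_filter

rng = np.random.default_rng(0)
im1 = rng.random((512, 512))          # image of date 1 (co-registered)
im2 = rng.random((512, 512))          # image of date 2

diff = np.abs(im1 - im2)              # difference image
avg = uniform_filter(diff, size=3)    # average gray value over the neighbourhood
features = np.stack([diff.ravel(), avg.ravel()], axis=1)  # one 2-D feature vector per pixel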

Table 1. Missed alarms, false alarms and total alarms resulting from the proposed APC algorithm and MTET.

Images used   Technique used    Missed alarms   False alarms   Total alarms
Mexico        APC (\delta=0.23)    2654            760            3414
              MTET                 2404            2187           4591
Lake          APC (\delta=0.57)    1000            599            1599
              MTET                 1015            875            1890

Fig. 1. Mexico image (a) Acquired on April 2000 (b) Acquired on May 2002 (c) Reference map (d) Difference image.

Fig. 2. Lake Mulargia image (a) Acquired on September 1995 (b) Acquired on July 1996 (c) Reference map (d) Difference image.

Fig. 3. Change detection map for Mexico image by (a) APC algorithm (b) MTET.

Fig. 4. Change detection map for lake Mulargia image by (a) APC algorithm (b) MTET.

It is seen from the table that the missed alarms (changed pixels classified as unchanged) obtained by APC and MTET are comparable, but the false alarms (unchanged pixels classified as changed) obtained by APC are significantly fewer than those obtained using MTET. In terms of overall error, APC performs better than MTET. Figs. 3 and 4 show the change detection maps obtained by both techniques for the Mexico and lake Mulargia images, respectively. Comparison of these change detection maps justifies the capability of the proposed APC algorithm to exploit spatial contextual information and extract the changed regions properly. It is worth mentioning that MTET requires ground truth information to find the optimal threshold, whereas APC is completely unsupervised in nature.

4. Conclusions

In this paper we have proposed a new algorithm for change detection based on aggregation pheromone density, which is inspired by the ants' property of accumulating around points with higher pheromone density. To evaluate the performance of the proposed algorithm, experiments were carried out with two multitemporal remote sensing images. Qualitative and quantitative evaluation of the experimental results establishes the superiority of the proposed APC algorithm over the existing MTET. The proposed algorithm needs fewer parameters while producing better results. Moreover, unlike MTET, it does not require any ground truth information.


Acknowledgments

The authors would like to acknowledge the Department of Science and Technology, Government of India and the University of Trento, Italy, the sponsors of the India-Trento Program on Advanced Research (ITPAR), under which a project titled "Advanced Techniques for Remote Sensing Image Processing" is being carried out at the Department of Computer Science and Engineering, Jadavpur University, Kolkata. The authors would also like to acknowledge Prof. Lorenzo Bruzzone, the Italian collaborator of this project, for providing the images.

References

1. J. A. Richards, Remote Sensing Digital Image Analysis, 2nd edn. (Springer-Verlag, 1993).

2. R. J. Radke, S. Andra, O. Al-Kofahi and B. Roysam, IEEE Transactions on Image Processing 14, 294 (2005).

3. D. Lu, P. Mausel, E. Brondizio and E. Moran, International Journal on Remote Sensing 25, 2365 (2004).

4. J. R. G. Townshend, C. O. Justice, C. Gurney and J. McManus, IEEE Transactions on Geoscience and Remote Sensing 30, 1054 (1992).

5. P. Gong, E. F. Ledrew and J. R. Miller, International Journal on Remote Sensing 13, 773 (1992).

6. S. Patra, S. Ghosh and A. Ghosh, Unsupervised change detection in remote-sensing images using one-dimensional self-organizing feature map neural network, in Proceedings of the 9th International Conference on Information Technology (ICIT06), (IEEE Computer Society Press, Bhubaneswar, India, 2006).

7. S. Ghosh, S. Patra, M. Kothari and A. Ghosh, AN-VESA: The Journal of F. M. University 1, 48 (2005).

8. L. Bruzzone and D. F. Prieto, IEEE Transactions on Geoscience and Remote Sensing 38, 1171 (2000).
9. T. Kavzoglu and P. Mather, International Journal on Remote Sensing 24, 4907 (2003).
10. P. A. Brivio, M. Maggi, E. Binagi, I. Gallo and J. M. Gregoire, Exploiting spatial and temporal information for extracting burned areas from time series of SPOT-VGT data, in Proceedings of the 1st International Workshop on the Analysis of Multi-Temporal Remote Sensing Images, 2001.

11. F. Bovolo and L. Bruzzone, A context-sensitive technique based on support vector machines for image classification, in Proceedings of the 1st International Conference on Pattern Recognition and Machine Intelligence, eds. S. K. Pal, S. Bandyopadhyay and S. Biswas (Springer, 2006).

12. A. L. Vizine, L. N. de Castro, E. R. Hruschka and R. R. Gudwin, Informatica 29, 143 (2005).

13. M. Ono, T. Igarashi, E. Ohno and M. Sasaki, Nature 377, 334 (1995).

14. M. Sakuma and H. Fukami, Journal of Chemical Ecology 19, 2521 (1993).

15. S. Tsutsui and A. Ghosh, An extension of ant colony optimization for function optimization, in Proceedings of the 5th Asia Pacific Conference on Simulated Evolution and Learning (SEAL04), Pusan, Korea, 2004.

16. M. Kothari, S. Ghosh and A. Ghosh, Aggregation pheromone density based clustering, in Proceedings of the 9th International Conference on Information Technology (ICIT06), (IEEE Computer Society Press, Bhubaneswar, India, 2006).

17. S. Ghosh, M. Kothari and A. Ghosh, Aggregation pheromone density based image segmentation, in Proceedings of the 5th Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP06), (Springer LNCS, 2006).

18. S. Theodoridis and K. Koutroumbas, Pattern Recognition, 2nd edn. (Elsevier Academic Press, Amsterdam, 2003).


Automatic Brain Tumor Segmentation Using Symmetry Analysis and Deformable Models

Hassan Khotanlou, Olivier Colliot1 and Isabelle Bloch

GET - Ecole Nationale Superieure des Telecommunications, Departement TSI, CNRS UMR 5141 LTCI
46 rue Barrault, 75634 Paris Cedex 13, France
E-mail: {Hassan.Khotanlou, Olivier.Colliot, Isabelle.Bloch}@enst.fr

1 Current address: LENA UPR 640 CNRS, Paris, France

We propose a new general automatic method for segmenting brain tumors in 3D MRI. Our method is applicable to different types of tumors. A first detection process is based on selecting asymmetric areas with respect to the approximate brain symmetry plane. Its result constitutes the initialization of a segmentation method based on a combination of a deformable model and spatial relations, leading to a precise segmentation of the tumors. The results obtained on different types of tumors have been evaluated by comparison with manual segmentations.

Keywords: Segmentation, symmetry plane, deformable Models, spatial relations, brain tumors, MRI.

1. Introduction

The segmentation of brain tumors in magnetic resonance images (MRI) is a challenging and difficult task because of the variety of their possible shapes, locations and image intensities. The aim of this paper is to contribute to this domain by proposing an original method which is automatic and general enough to address these variability issues.

Existing methods are classically divided into region based and contour based methods, and are usually dedicated to fully enhanced tumors or to specific types of tumors. In the first class, Clark et al.1 have proposed a method for tumor segmentation using knowledge based and fuzzy classification, where a learning process prior to segmenting a set of images is necessary. Other methods are based on statistical pattern recognition techniques.2-4 These methods fail in the case of large deformations in the brain. Existing contour based methods are not fully automatic and need some manual operation for initialization. Lefohn et al.5 have proposed a semi-automatic method using level sets. Another segmentation method based on level sets was introduced by Ho et al.,6 which uses T1-weighted images both with and without contrast agent for tumor detection. A method based on a deformable model and a neural network was introduced by Zhu and Yang,7 which processes the image slice by slice and is thus not a real 3D method.

In this paper we introduce a fully automatic method for the segmentation of different types of tumors in 3D MRI, based on a combination of region based and contour based methods. In the first step, described in Section 2, we use the approximate mid-sagittal symmetry plane and detect tumors as an asymmetry with respect to this plane. In the second step, detailed in Section 3, a precise segmentation is obtained using an original combination of deformable models and spatial relations. Results are then presented in Section 4.

2. Tumor detection based on symmetry

In this section we detail the first step of the proposed approach, by explaining our method to compute the symmetry plane of the brain and then the method to detect tumors based on this plane.

2.1. Computation of the approximate symmetry plane

As proposed in,8 the computation of the approximate symmetry plane is expressed as a registration problem. A degree of similarity between the image and its reflection with respect to a plane is computed, and the best plane is then obtained by maximizing this similarity. This optimization is performed using the downhill simplex method and is initialized by the plane obtained from the principal inertia axes, which proves to be close to the global optimum.

Let u be a unit vector in \mathbb{R}^3 and \Pi_{u,d} a plane in \mathbb{R}^3 orthogonal to the vector u and passing at the distance d from the coordinate origin. We denote by e_{u,d}(f) the reflection of image f with respect to the plane \Pi_{u,d}: e_{u,d}(f)(x,y,z) = f(e_{u,d}(x,y,z)). An image f is called reflection symmetrical if there exists a reflection plane \Pi_{u,d} such that e_{u,d}(f) = f.

The idea is to compute a symmetry measure \mu_{u,d}(f) of the image f with respect to an arbitrary reflection plane \Pi_{u,d}, and to find the plane leading to the maximal symmetry degree and the corresponding value of the symmetry measure \mu(f):

\mu(f) = \max_{u \in S^2,\, d \in \mathbb{R}^+} \mu_{u,d}(f).

In this case the symmetry measure \mu_{u,d}(f) can be defined as the similarity between the images f and e_{u,d}(f). In this work we use the following symmetry measure:

\mu_{u,d}(f) = 1 - \frac{\| f - e_{u,d}(f) \|^2}{2 \| f \|^2}.
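The search for the plane can be sketched in Python using SciPy's Nelder-Mead (downhill simplex) routine named above; the parametrisation of the plane by two spherical angles plus an offset, and the trilinear interpolation of the reflected volume, are our own illustrative choices:

import numpy as np
from scipy.ndimage import map_coordinates
from scipy.optimize import minimize

def neg_symmetry(params, vol):
    # params = (theta, phi, d): unit normal u in spherical angles, offset d
    theta, phi, d = params
    u = np.array([np.sin(theta) * np.cos(phi),
                  np.sin(theta) * np.sin(phi),
                  np.cos(theta)])
    grid = np.indices(vol.shape).reshape(3, -1).astype(float)
    s = u @ grid - d                        # signed distance of each voxel to the plane
    refl = grid - 2.0 * np.outer(u, s)      # reflected coordinates e_{u,d}(x)
    vol_r = map_coordinates(vol, refl, order=1, mode='nearest').reshape(vol.shape)
    # negative of the symmetry measure mu_{u,d}(f), to be minimised
    return np.sum((vol - vol_r) ** 2) / (2.0 * np.sum(vol ** 2)) - 1.0

vol = np.zeros((32, 32, 32)); vol[8:24, 8:24, 8:24] = 1.0  # toy binary 'brain' mask
x0 = np.array([np.pi / 2, 0.0, 16.0])   # initialisation, e.g. from the inertia axes
res = minimize(neg_symmetry, x0, args=(vol,), method='Nelder-Mead')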

2.2. Tumor detection

In our previous work9 we used the fuzzy possibilistic C-means (FPCM) classification algorithm for tumor detection, and we obtained good results for detecting hyperintense (fully enhanced) tumors. However this method is difficult to generalize to any type of tumor while keeping it automatic. Therefore we suggest another approach, using the approximate symmetry plane.


Fig. 1. (a) One axial slice of the original 3D image, (b) Brain mask and symmetry plane, (c) Another example, (d) Brain mask of image (c) with symmetry plane.

Pathological brains are usually not symmetric; thus the symmetry plane is computed on the segmented brain. In the normal brain, it has been observed that the symmetry plane of the grey level brain image and that of the segmented brain are approximately equal. The segmentation of the brain is performed as in.10 The algorithm summarized in Section 2.1 is then applied on the binary image of the brain. Applying this method to images containing tumors provides a good approximation of the mid-sagittal plane, despite the asymmetry induced by the tumors, thanks to the preliminary segmentation of the brain. This is illustrated in Figure 1.

Now tumors can be detected by evaluating the asymmetry with respect to the obtained plane. We assume that tumors are localized in only one hemisphere. This hemisphere is found by comparing the grey level characteristics (mean and standard deviation) of grey matter, white matter and CSF computed in the whole image on the one hand, and in each hemisphere on the other hand. Let H_n denote the histogram of grey levels in the normal hemisphere and H_p the histogram in the pathological hemisphere. The histogram difference H_d = H_p - H_n provides useful information about new intensity classes induced by the tumor. In the case of a tumor without edema (as in Figure 1(c)) a positive peak can be observed in H_d that shows the tumor intensity range (see Figure 2(a)), and we can use thresholding and morphological operations to extract the tumor (Figure 2(b)).

In the case of a tumor with edema (as in Figure 1(a)) we observe two peaks in H_d (Figure 3(a)). Because the intensity of edema is always lower than the intensity of the tumor, the first peak corresponds to the edema and the second peak to the tumor. We have considered the peaks with more than 300 voxels, this threshold being based on the analysis of H_d for several normal brains. The negative peaks observed in H_d correspond to normal tissues around the tumor and the edema, since these tissues are less represented in the hemisphere containing the pathology than in the other hemisphere. These tissues can therefore be obtained automatically (Figures 2(c) and 3(c)). They will be used for introducing spatial relations in the next section.
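A minimal sketch of this histogram-difference analysis follows (the 300-voxel criterion is from the text; the hemisphere arrays, the grey-level range and the use of the same threshold for negative peaks are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
normal = rng.integers(0, 256, 100000)   # grey values in the normal hemisphere
patho = rng.integers(0, 256, 100000)    # grey values in the pathological hemisphere

bins = np.arange(257)
h_n, _ = np.histogram(normal, bins=bins)
h_p, _ = np.histogram(patho, bins=bins)
h_d = h_p.astype(int) - h_n.astype(int)   # H_d = H_p - H_n

tumour_levels = np.where(h_d > 300)[0]    # positive peaks: new classes (tumour, edema)
normal_levels = np.where(h_d < -300)[0]   # negative peaks: under-represented normal tissues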

3. Refined segmentation

In this section we detail the second step of the proposed approach: the previous detection of the tumor is used to initialize a deformable model. We propose to constrain this model by spatial relations between the tumor and other tissues, as an adaptation of the method described in11 for normal brains.

Fig. 3. (a) Graph of H_d for image (a) of Figure 1. (b) Extracted tumor after morphological operations. (c) Tissues around the tumor.

3.1. Spatial relations constrained deformable model

The evolution of the deformable surface X is described by the following dynamic force equation:12

\gamma \frac{\partial X}{\partial t} = F_{int}(X) + F_{ext}(X),

where F_{int} is the internal force that specifies the regularity of the surface and F_{ext} is the external force that drives the surface towards image edges. The chosen internal force is

F_{int} = \alpha \nabla^2 X - \beta \nabla^2 (\nabla^2 X),

where \alpha and \beta respectively control the surface tension and rigidity, and \nabla^2 is the Laplacian operator. It is then discretized on the simplex mesh using the finite difference method.12

In our case, the external force is not only derived from image edges but also constrains the deformable model to satisfy spatial relations to the surrounding tissues. The spatial relations are represented by fuzzy sets in the image space,13 from which a new fuzzy force is derived. The external force is then written as11

F_{ext} = \lambda v + \mu F_R,

where v is a classical external force such as gradient or balloon and F_R is the force attached to the spatial relationships.
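As an illustration of this force balance, the discretised evolution can be sketched on a closed 2-D contour with finite differences (a simplification of the simplex-mesh discretisation used in the paper; the external force F_ext is left as a caller-supplied placeholder):

import numpy as np

def evolve(X, f_ext, alpha=0.1, beta=0.01, gamma=1.0, dt=0.1, steps=100):
    # X: (n, 2) closed contour; f_ext: function returning an (n, 2) external force
    for _ in range(steps):
        lap = np.roll(X, -1, axis=0) - 2 * X + np.roll(X, 1, axis=0)          # ~ tension term
        bilap = np.roll(lap, -1, axis=0) - 2 * lap + np.roll(lap, 1, axis=0)  # ~ rigidity term
        f_int = alpha * lap - beta * bilap
        X = X + (dt / gamma) * (f_int + f_ext(X))   # gamma dX/dt = F_int + F_ext
    return X

# example: a circle shrinking under tension with zero external force
t = np.linspace(0, 2 * np.pi, 100, endpoint=False)
X0 = np.stack([np.cos(t), np.sin(t)], axis=1)
X_final = evolve(X0, lambda X: np.zeros_like(X))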

Fig. 2. (a) Graph of H_d for image (c) of Figure 1. (b) Extracted tumor after morphological operations. (c) Tissues around the tumor.


3.2. Constrained deformable model for tumor segmentation

Spatial relations are useful to guide the recognition of objects in images since they provide important information about the spatial organization of these objects. Two main classes of spatial relations can be considered: topological relationships, such as inclusion, exclusion and adjacency, and metric relationships, such as distances and orientations. Here we use a combination of topological and distance information.

The evolution process of the deformable model can be guided by a combination of several relations, via information fusion tools. Here, two types of information are available: the initial detection and the surrounding tissues. Therefore we use (i) the distance from the initially segmented tumor, and (ii) the tissues around the tumor which were obtained in the previous step. The idea is that the contour of the tumor should be situated somewhere in between the boundary of the initial detection and the boundary of the normal tissues (excluding the background). A fuzzy set representing the relation "near the tumor" is defined as an increasing function of the distance. A distance map from the normal tissues to their complement (tumor and background) is computed, and a fuzzy set is again derived using an increasing function. These two relations are represented as fuzzy sets in the image space. They are illustrated in Figure 4.


Fig. 4. Spatial relations used for segmentation on two examples, (a) Near the tumor, (b) Relation provided by the normal tissues, (c) Fusion of the two relations.

These relations are combined using a conjunctive fusion operator (a t-norm), leading to a fuzzy set \mu_R. The resulting fuzzy set provides high values in the region where both relations are satisfied, and lower values elsewhere. The fuzzy force F_R is derived from this fusion result using a distance map d to the kernel of \mu_R (i.e. the points x for which \mu_R(x) = 1). The classical external force is calculated by Generalized Gradient Vector Flow12 based on an edge map obtained from Canny-Deriche edge detection.

4. Results and conclusion

We have applied the method to 10 different real 3D T1-weighted MRI volumes (of size 256 x 256 x 124). These images contain tumors with different sizes, intensities, shapes and locations, which allows us to illustrate the large field of application of our method. The evaluation of the segmentation results was performed through a quantitative comparison with the results of a manual segmentation. Let us denote by A the manually segmented tumor and by B the tumor segmented by our method. We used three measures to evaluate the results, as proposed in,14 which are (a sketch is given after this list):

• overlap: |A ∩ B| / |A ∪ B|;
• Hausdorff distance between A and B, defined as max(h(A,B), h(B,A)), where h(A,B) = max_{a∈A} min_{b∈B} d(a,b) and d(a,b) denotes the Euclidean distance between a and b (a and b being points of A and B respectively);
• the signed distances from the surface of B to the surface of A are computed, and the average of their absolute values is derived.
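For binary masks, these three measures can be sketched as follows (the Hausdorff distance uses SciPy's directed_hausdorff on the surface point sets; the third measure is computed here as an unsigned average, a slight simplification of the signed version described above):

import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt
from scipy.spatial.distance import directed_hausdorff

def evaluate(A, B):
    # A: manual segmentation, B: automatic segmentation (boolean 3-D masks)
    overlap = np.logical_and(A, B).sum() / np.logical_or(A, B).sum()
    surf_a = A & ~binary_erosion(A)          # surface voxels of A
    surf_b = B & ~binary_erosion(B)
    pa, pb = np.argwhere(surf_a), np.argwhere(surf_b)
    hausdorff = max(directed_hausdorff(pa, pb)[0], directed_hausdorff(pb, pa)[0])
    average = distance_transform_edt(~surf_a)[surf_b].mean()  # mean surface-to-surface distance
    return overlap, hausdorff, average

A = np.zeros((20, 20, 20), dtype=bool); A[5:15, 5:15, 5:15] = True
B = np.zeros_like(A); B[6:16, 5:15, 5:15] = True
print(evaluate(A, B))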

The segmentation results for the two cases of Figure 1 are shown in Figure 5. In the first case, the initial detection based on symmetry analysis only provides a part of the tumor. The whole tumor is successfully recovered by the second segmentation step using the deformable model and the spatial relations. Even in the second case, where the initial detection is already quite good, the second step provides a more precise boundary of the lesion.

The quantitative results obtained by comparing the automatic segmentations with the available manual segmentations are provided in Table 1 for the 10 cases. For the overlap, all values are greater than 85% (note that values above 70% are generally considered good results), and most of them are greater than 91%. The distance-based evaluations should be compared to the voxel size, which is typically 1 x 1 x 1.3 mm^3 for all images. The Hausdorff distance in these 10 cases is always less than five voxels, and often even smaller. It should be noted that this measure is particularly severe since it is a maximum distance and gives the error for the worst point in the segmentation. The average distance is less than one voxel, which means that on average the automatic contour is very close to the manual one (less than one voxel away). Due to the partial volume effect, obtaining more precise results would require working at the sub-voxel level.

All these results show the high accuracy of the proposed method, which was also confirmed by a visual evaluation performed by medical experts.

As a conclusion, the proposed hybrid algorithm using region based and contour based methods proves to be efficient for segmenting brain tumors in 3D MR images.


Fig. 5. Final segmentation results on two different cases. (a) Initial detection superimposed on an axial slice, (b) Final segmentation, (c) Result superimposed on a sagittal slice.

Table 1. Evaluation of the segmentation results of tumors on a few 3D MR images for which a manual segmentation was available.

Dataset    Overlap (%)   Hausdorff (mm)   Average (mm)
Tumor 1    95.56         2.66             0.62
Tumor 2    91.32         5.20             1.41
Tumor 3    96.12         1.32             0.41
Tumor 4    88.24         1.51             1.21
Tumor 5    90.08         3.14             1.15
Tumor 6    91.21         2.40             1.01
Tumor 7    95.05         4.02             1.29
Tumor 8    85.86         3.62             1.31
Tumor 9    88.63         4.91             0.92
Tumor 10   91.63         3.83             0.84

Its application to several datasets with different tumor sizes, intensities and locations shows that it can automatically detect and segment very different types of brain tumors with good quality. Our method can be applied as well to T2-weighted or FLAIR images. However it may fail in the case of a tumor that is symmetrical across the mid-sagittal plane, but this case is very rare. Future work aims at combining several modalities, such as T2-weighted and FLAIR, to develop the segmentation of the edema and infiltration around the tumors.

Acknowledgments

Hassan Khotanlou is supported by Bu Ali Sina University, and Olivier Colliot by a grant from ParisTech - Ile de France.

References

1. M. Clark, L. Lawrence, D. Goldgof, R. Velthuizen, F. Murtagh and M. Silbiger, IEEE Transactions on Medical Imaging 17 (April 1998).
2. M. Kaus, S. Warfield, A. Nabavi, E. Chatzidakis, P. Black, F. Jolesz and R. Kikinis, Segmentation of meningiomas and low grade gliomas in MRI, in MICCAI, (Cambridge, UK, 1999).
3. N. Moon, E. Bullitt, K. Leemput and G. Gerig, Model-based brain and tumor segmentation, in ICPR, (Quebec, 2002).
4. M. Prastawa, E. Bullitt, S. Ho and G. Gerig, Medical Image Analysis 18, 217 (2004).
5. A. Lefohn, J. Cates and R. Whitaker, Interactive GPU-Based Level Sets for 3D Brain Tumor Segmentation, tech. rep., University of Utah (April 2003).
6. S. Ho, E. Bullitt and G. Gerig, Level set evolution with region competition: Automatic 3D segmentation of brain tumors, in ICPR, (Quebec, 2002).
7. Y. Zhu and H. Yang, IEEE Transactions on Medical Imaging 16, 55 (1997).

8. A. Tuzikov, O. Colliot and I. Bloch, Pattern Recognition Letters 24, 2219 (October 2003).
9. H. Khotanlou, J. Atif, O. Colliot and I. Bloch, 3D Brain Tumor Segmentation Using Fuzzy Classification and Deformable Models, in WILF, (Crema, Italy, 2005).
10. J.-F. Mangin, O. Coulon and V. Frouin, Robust brain segmentation using histogram scale-space analysis and mathematical morphology, in MICCAI, (Cambridge, USA, 1998).
11. O. Colliot, O. Camara and I. Bloch, Pattern Recognition 39, 1401 (2006).
12. C. Xu and J. Prince, IEEE Transactions on Image Processing 7, 359 (1998).
13. I. Bloch, Image and Vision Computing 23, 89 (2005).
14. G. Gerig, M. Jomier and M. Chakos, Valmet: a new validation tool for assessing and improving 3D object segmentation, in MICCAI, (Utrecht, Netherlands, 2001).


Edge Recognition in MMWave Images by Biorthogonal Wavelet Decomposition and Genetic Algorithm

C. Bhattacharya and V. P. Dutta

DEAL (DRDO) Raipur Road, Dehradun 248001, India

E-mail: [email protected], [email protected]

In this paper, we present a multiresolution approach to recognize the edges of objects in MMWave images by generating multiscale image approximations. Bounds of threshold for the multiscale edge operator are optimized by genetic algorithm (GA). Equivalence of 2-D multiscale discrete wavelet transform to Canny edge operator is exploited here to optimize the bounds of hysteresis threshold. The invariance of object boundaries to dyadic coarser change of scale is shown in the results.

Keywords: Multiresolution; Edge Recognition; Genetic Algorithm; Biorthogonal wavelet

1. Introduction

Passive sensors such as millimeterwave (MMWave) imaging sensors have the advantage that they require no radio transmission and are all-weather, day-night imaging instruments. But the images acquired are of poor resolution and low dynamic range, and hence of poor contrast; the images also remain blurred due to the scanning antenna of the sensor. Recognition of objects of interest in MMWave images purely by human vision is therefore difficult. Secondly, the information content of such images can be enhanced by fusing images from other passive sensors, which requires automatic object recognition methods.

One important and widely accepted low-level processing method is to detect the edges of objects of interest at multiple scales.1-6 We pursue this approach for MMWave images, as object boundaries remain invariant to dyadic changes of image resolution. The multiscale edge detection property of wavelet transforms has been proved to be equivalent to the multiscale nature of the edge detector devised by Canny.3 This algorithm,1,5,6 popularly known as the Mallat-Zhong discrete wavelet transform (M-Z DWT), exploits the inherent edge extraction property of the DWT by evaluating the local maxima of the modulus of the DWT of images. Another way of utilizing the multiscale analysis property of the DWT is to generate progressively lower resolution (higher scale) detail components of the image through the two-dimensional (2-D) DWT and to synthesize the fine resolution edge map from these multiscale detail components.4

In this paper, we exploit the equivalence of the multiscale Canny edge detector to the multiresolution edge map generated by the 2-D DWT instead of using one-dimensional (1-D) edge operators. The advantage of our method over the M-Z DWT implementation is that, by utilizing the 2-D DWT, hierarchical scale image approximations and horizontal, vertical and diagonal edge maps at dyadic coarser resolutions are generated simultaneously. We select a particular biorthogonal wavelet filterbank pair (CDF 3.5) to prove this equivalence of edge detection. For a Gaussian window of normalized width \sigma_0 = 1, smoothing the image by the Canny operator and approximation by the scaling operator in the biorthogonal DWT remain equivalent. It is well known that the final edge extraction in the Canny operator is stable with respect to false edges because of the hysteresis between the upper and lower bounds of the threshold. Still, streaking and discontinuities in edges occur because of non-optimized threshold limits.3 Selection of proper threshold bounds is crucial so that no new edge segments are created at dyadic coarser resolutions. Here, a genetic algorithm (GA) is utilized to optimize the hysteresis threshold bounds by minimizing the difference between the edge map created by the 2-D DWT and the Canny edge operator on the dyadic scale image approximations. Finally, the results of the method for recognizing object boundaries in dyadic scale MMWave image approximations are presented.

2. Edge Detection by Projection of the 2-D DWT

In the M-Z DWT interpretation of the wavelet edge detector, let \theta(u) be the 2-D scaling function for dyadic decomposition, where u = (u_1, u_2) \in \mathbb{R}^2. The wavelet detail basis functions in two dimensions are chosen as

\psi^1(u) = \frac{\partial \theta(u)}{\partial u_1}, \qquad \psi^2(u) = \frac{\partial \theta(u)}{\partial u_2}.

The dilation of \theta(u) over the scales 2^j is \theta_j(u) = 2^{-j} \theta(2^{-j} u). The dilated wavelets are

\psi_j^k(u) = 2^{-j} \psi^k(2^{-j} u), \quad k = \{1, 2\}.   (1)

The wavelet transform (WT) of the image f(u) in each of the two dimensions k = \{1, 2\} is

W_j^k f(x) = \langle f(u), \psi_j^k(x - u) \rangle = (f * \psi_j^k)(x),   (2)

where * is the convolution of two continuous functions. The equivalence of the multiscale Canny edge operator to the M-Z DWT is then proved to be1

(W_j^1 f(u), W_j^2 f(u)) = 2^j \nabla (f * \theta_j)(u).   (3)

The gradient operation over the convolution of f(u) with \theta_j(u) in Eq. (3) is the multiscale Canny edge operator. Therefore, the M-Z DWT represents the horizontal and vertical details in the image, provided the gradient operations on the two sides are proportional. The modulus of the WT is

M f(u, 2^j) = \sqrt{ (W_j^1 f(u))^2 + (W_j^2 f(u))^2 }

and the orientation is given by

\alpha = \arctan \left( \frac{W_j^2 f(u)}{W_j^1 f(u)} \right).

Edges are coordinates where the modulus of the M-Z DWT is maximum in a one-dimensional (1-D) neighborhood along the direction of increasing \alpha. In its implementation, the 1-D DWT in both directions is performed on the scaled, smoothed approximation (f * \theta_j)(u). Since the dyadic decimation in DWT filterbanks makes the edge locations shift-variant over scales, the M-Z DWT is implemented with undecimated DWT filterbanks by inserting 2^j zeros in between the filter coefficients.

In our approach, 2-D projections of the approximation and detail components of the image are generated instead of the 1-D undecimated DWT. Using biorthogonal filterbanks for perfect reconstruction, projections of the approximations and of the details in the horizontal, vertical and diagonal directions are simultaneously available by Mallat's fast 2-D synthesis of the DWT components.2

Analysis and synthesis in the 2-D biorthogonal spaces follow V^2_{j-1} = V^2_j \oplus W^2_j; thus, finer resolution approximation spaces are the sum of the coarser resolution biorthogonal approximation and detail subspaces. The multiresolution tensor product spaces are V^2_{j-1} = V_{j-1} \otimes V_{j-1}, and in the 1-D case, V_{j-1} = V_j \oplus W_j. The multiresolution biorthogonal projection spaces are then given by the distributive property of the summation over the tensor product as

V^2_{j-1} = (V_j \otimes V_j) \oplus (V_j \otimes W_j) \oplus (W_j \otimes V_j) \oplus (W_j \otimes W_j).   (4)

By the closure property of multiresolution spaces, V^2_{j-1} \supset V^2_j \supset V^2_{j+1} \supset \cdots; therefore, the projections of the coarser resolution spaces in Eq. (4) are continued iteratively on V^2_j. V_j is spanned by dilated shifts of the scaling function \theta_j(u) and W_j is spanned by the \psi_j(u) shown in Eq. (1). Since convolutions are commutative, the biorthogonal projections of f(u), that is, the WT components of f(u) in V^2_{j-1}, are

P_{V^2_{j-1}} f(u) = ((f * \theta_j) * \tilde{\theta}_j)(u) \oplus ((f * \psi_j^1) * \tilde{\psi}_j^1)(u) \oplus ((f * \psi_j^2) * \tilde{\psi}_j^2)(u) \oplus ((f * \psi_j^d) * \tilde{\psi}_j^d)(u).   (5)

This shows that at the 2^{j-1} scale the image is the summation of the approximation at the coarser 2^j scale and, respectively, the horizontal, vertical and diagonal edges represented in the projections of the detail components of f(u). The modulus of the detail projection is

M f(u, 2^j) = \sqrt{ \big( (f * \psi_j^1) * \tilde{\psi}_j^1 \big)^2 + \big( (f * \psi_j^2) * \tilde{\psi}_j^2 \big)^2 }.   (6)

Edges are coordinates in the image where this modulus is maximum in the direction of the increasing angle between the two components. The contribution of the diagonal projection is neglected in the reconstruction of the image from its edges.1
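As an illustration of Eq. (6), one dyadic level of the 2-D DWT edge modulus and orientation can be computed with the PyWavelets package (a sketch: the synthesis-projection step of Eq. (5) is approximated here by working directly on the decimated detail coefficients, and the random image is a stand-in):

import numpy as np
import pywt

rng = np.random.default_rng(0)
img = rng.random((256, 256))                    # stand-in for the image f(u)

cA, (cH, cV, cD) = pywt.dwt2(img, 'bior3.5')    # one level of biorthogonal 2-D DWT
modulus = np.hypot(cH, cV)                      # Eq. (6); the diagonal term cD is neglected
angle = np.arctan2(cV, cH)                      # direction for non-maximum suppression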

3. Implementation of the Edge Detector

3.1. Biorthogonal Cohen-Daubechies-Feauveau (CDF 3.5) filterbank pair

The scaling and wavelet functions used here for multiscale edge detection are shown in Fig. 1. \theta_j(u) in Fig. 1(a) provides symmetric Gaussian-like smoothing derived from the cubic spline function; \psi_j(u) in Fig. 1(b) is its antisymmetric derivative, a quadratic spline function. The biorthogonal analysis filters of CDF 3.5, on interpolation, produce these basis functions.


Fig. 1. (a) Interpolated scaling function for CDF3.5; (b) interpolated analysis function for CDF3.5.

The coefficients of the analysis and synthesis low pass filters respectively are2

h = [h_{-1}\ h_0\ h_1\ h_2] = \sqrt{2} \left[ \frac{1}{8}\ \frac{3}{8}\ \frac{3}{8}\ \frac{1}{8} \right],

\tilde{h} = [\tilde{h}_{-5}\ \ldots\ \tilde{h}_6] = \sqrt{2} \left[ -\frac{5}{512}\ \frac{15}{512}\ \frac{19}{512}\ -\frac{97}{512}\ -\frac{26}{512}\ \frac{350}{512}\ \frac{350}{512}\ -\frac{26}{512}\ -\frac{97}{512}\ \frac{19}{512}\ \frac{15}{512}\ -\frac{5}{512} \right].

The highpass analysis and synthesis filters are derived from these by the standard biorthogonal quadrature relations g_n = (-1)^{1-n} \tilde{h}_{1-n} and \tilde{g}_n = (-1)^{1-n} h_{1-n}.2 To derive the multiresolution projection components of Eq. (5) in the pixel domain, the 2-D DWT is performed by the standard row-over-column 1-D dyadic analysis filterbank2 followed by interpolation in the synthesis filterbank.
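These coefficients can be cross-checked against the 'bior3.5' filter bank shipped with PyWavelets (a sketch; note that PyWavelets zero-pads both filters to length 12, and its analysis/synthesis naming may be interchanged relative to the convention used here):

import numpy as np
import pywt

w = pywt.Wavelet('bior3.5')
print(np.round(np.array(w.rec_lo) * 8 / np.sqrt(2)))     # ~ [1, 3, 3, 1]/8, zero-padded
print(np.round(np.array(w.dec_lo) * 512 / np.sqrt(2)))   # ~ [-5, 15, 19, -97, -26, 350, ...]/512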

We show in Fig. 2 the comparative performance in detecting edges of the 2-D DWT edge detector and the Canny edge operator for two scales of successive dyadic coarse resolutions. A relatively simple image of a printed circuit board is taken to demonstrate the detectors' performance for horizontal and vertical edges. The magnitude of the edge intensity for the 2-D DWT edge detector in Figs. 2(b), 2(e) is derived by Eq. (6). The orientations of the edges are determined by Canny's algorithm for the suppression of non-maximal magnitudes in the direction normal to the edges. Smoothing in the Canny edge operator is done by the dyadic scaling function shown in Fig. 1(a). The width \sigma_0 for Gaussian smoothing at the fine resolution is taken to be 1, so that the smoothing operator for the image approximations in Figs. 2(a), 2(d) is dilated by 2^j. The edge maps in Figs. 2(c), 2(f) are therefore obtained by the derivative operation in the Canny edge operator followed by suppression of the non-maximal local magnitudes.

3.2. Threshold Optimization by Genetic Algorithm (GA)

It is well known that the Canny edge operator is stable against false edge detection because of the upper and lower bounds of the hysteresis threshold: a modulus maximum point is declared an edge point if and only if T_{low} < M f(u, 2^j) < T_{high}.

Although there is a close similarity between the edge maps in Figs. 2(c), 2(f) and the 2-D DWT edge maps, streaking occurs at higher scales because of the non-optimal selection of thresholds from scale to scale. We search for the optimum bounds of the threshold using a GA, since the two edge maps are mathematically equivalent at each scale. The application of GAs to image segmentation is reported elsewhere.7 Typically, fixed-length binary strings in the search population are programmed to crossover and mutate in order to locate regions of interest in the search space. Three basic genetic operators, selection, crossover and mutation, guide this search. The lower and upper thresholds are each coded in 16 bits. The probability of crossover is 0.9 and the probability of mutation is 0.01, with a population of 30 in each generation. Here the objective is to minimize |M f(u, 2^j) - \nabla (f * \theta_j)(u)|, evaluating the hysteresis threshold at each iteration until the termination condition of the genetic search is reached; a sketch is given below.
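A sketch of this genetic search follows (16-bit coding of each threshold, crossover probability 0.9, mutation probability 0.01 and a population of 30, as stated above; the decoded threshold ranges and the toy objective are our assumptions, standing in for the actual edge-map difference |Mf(u, 2^j) - \nabla(f * \theta_j)(u)|):

import numpy as np

rng = np.random.default_rng(0)
POP, BITS, PC, PM = 30, 32, 0.9, 0.01   # two 16-bit thresholds per chromosome

def decode(bits):
    lo = int(''.join(map(str, bits[:16])), 2) / 65535.0              # T_low, assumed in [0, 1]
    hi = 1.0 + 2.0 * int(''.join(map(str, bits[16:])), 2) / 65535.0  # T_high, assumed in [1, 3]
    return lo, hi

def fitness(bits):
    lo, hi = decode(bits)
    # toy objective standing in for the summed edge-map difference
    return -((lo - 0.5) ** 2 + (hi - 2.45) ** 2)

pop = rng.integers(0, 2, (POP, BITS))
for gen in range(50):
    fit = np.array([fitness(ind) for ind in pop])
    p = fit - fit.min() + 1e-9
    p /= p.sum()
    parents = pop[rng.choice(POP, POP, p=p)]         # fitness-proportionate selection
    for i in range(0, POP - 1, 2):                   # one-point crossover
        if rng.random() < PC:
            cut = int(rng.integers(1, BITS))
            parents[[i, i + 1], cut:] = parents[[i + 1, i], cut:]
    flip = rng.random((POP, BITS)) < PM              # bit-flip mutation
    pop = np.where(flip, 1 - parents, parents)

best = decode(pop[np.argmax([fitness(ind) for ind in pop])])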

4. Results

Results of the proposed edge detection method on actual multiresolution MMWave scenes are shown in Figs. 3 and 4. The original MMWave ground scene, collected by the W-band imaging sensor developed at DEAL, Dehradun, is shown in Fig. 3(a). The contrast in the original scene is poor because of the low dynamic range of MMWave reflectivity. The contrast-enhanced image is shown in Fig. 3(b). Figs. 4(a)-4(c) show the dyadic coarser resolution approximations of the image in Fig. 3(b) derived by the 2-D DWT operator.

The corresponding edge maps by the Canny edge operator are shown in Figs. 4(d)-4(f), respectively. It is seen in the edge maps of Figs. 4(d)-4(f) that prominent object boundaries are located without streaking. Secondly, the object boundaries remain invariant to scale and no new edge is created as the scale becomes coarser. This is possible because of the optimized bounds of the hysteresis threshold, which are the output of the GA search over the 2-D DWT and the Canny edge operator at each scale. The optimized hysteresis thresholds for the three edge maps in Figs. 4(d)-4(f) are given in Table 1.


Fig. 2. (a),(d) Projection of image approximation at two dyadic coarse resolutions; (b),(e) edge map at two successive coarser resolutions derived from detail projections by 2-D DWT; (c),(f) edge map at two successive coarser resolutions by Canny edge operator over image approximations in (a),(d).

"IT*

Fig. 3. (a) Original MMWave ground scene; (b) contrast-enhanced MMWave ground scene.

Table 1. Hysteresis threshold bounds optimized by GA.

Scale (2^j)   Low threshold T_low   High threshold T_high
2^1           0.4170                2.4691
2^2           0.5847                2.4044
2^3           0.5076                2.4702


5. Conclusion

In this paper, the recognition of the edges of objects of interest is shown by multiresolution image approximations. The analysis of the M-Z DWT is extended to the 2-D DWT, as the edge map and the approximations of the image are available simultaneously. For this, the CDF 3.5 biorthogonal wavelet filterbank is used to derive the multiresolution projections. The hysteresis threshold bounds of the Canny edge operator are optimized by a GA based search that minimizes the distance between the 2-D DWT edge map and the Canny edge map in MMWave images.

Acknowledgments

The authors acknowledge Shri K. Sivakumar and his team for providing the MMWave image from the sensor produced at DEAL, Dehradun.

References

1. S. Mallat and S. Zhong, IEEE Trans. Pattern Analysis and Machine Intelligence 14 (July 1992).
2. S. Mallat, A Wavelet Tour of Signal Processing (Academic Press, 1999).
3. J. F. Canny, IEEE Trans. Pattern Analysis and Machine Intelligence 8, 679 (November 1986).
4. J. I. Siddique and K. E. Barner, Wavelet-based multiresolution edge detection utilizing gray level edge maps, in Proc. IEEE Int. Conf. on Image Processing (ICIP'98), 1998.


Fig. 4. (a),(b),(c) MMWave image approximations at three dyadic resolutions; (d),(e),(f) Corresponding edge map by Canny edge operator with optimized bounds of threshold.

5. Q.-H. Lu and X.-M. Zhang, Multiresolution edge detection in noisy images using wavelet transform, in Proc. IEEE Int. Conf. on Machine Learning and Cybernetics (ICMLC'05), 2005.
6. K. H. Kim and S. J. Kim, IEEE Trans. Biomed. Eng. 50, 999 (2003).
7. B. Bhanu and S. Lee, Genetic Learning for Adaptive Image Segmentation (Kluwer Academic Publishers, 1994).


Extended Markov Random Fields for Predictive Image Segmentation

R. Stolkin

Stevens Institute of Technology Hoboken, NJ 07030, USA.

E-mail: [email protected]

M. Hodgetts

Cambridge Research Systems Ltd. UK.

A. Greig

University College London UK.

J. Gilby

Sira Ltd.

UK

Since the 1970s, there has been increasing interest in the use of Markov Random Fields (MRFs) as models to aid in the segmentation of noisy or degraded digital images. MRFs can make up for deficiencies in observed information by adding a-priori knowledge to the image interpretation process in the form of models of spatial interaction between neighbouring pixels. In data fusion problems, interaction might also be assumed between corresponding pixels in two different kinds of image of the same scene. Alternatively, temporal interaction might be assumed between corresponding pixels in consecutive frames of a video sequence. In object tracking or robotic navigation problems, a similar relationship may exist between pixels of an observed image and those of a predicted image, derived from models of the motion and scene. In all of these cases the MRF model can be extended to incorporate this additional knowledge. This paper explains the theory of Extended-Markov Random Field (E-MRF) segmentation techniques, surveys the research which has been crucial to their development and presents results from new work in this area with an application to robotic vision in conditions of extremely poor visibility.

Keywords: MRF; EM; Markov Random Field; Expectation Maximisation; segmentation; tracking; poor visibility

1. Introduction

Fundamental problems in computer and robot vision include the recognition and tracking of viewed objects against some background. Predominantly these processes are reliant, at some level, on image segmentation,1 dividing observed images into regions of object and background.

The problems of interpreting images under conditions of extremely poor visibility have received comparatively little attention from the computer vision community, with existing research largely motivated by underwater robotics applications. In contrast, the human visual system is often able to robustly interpret images that are of such poor quality that they contain insufficient explicit information to do so. We assert that such a system must function by utilising prior knowledge of the scene in several forms.

Extended-Markov Random Fields (E-MRFs) provide a probabilistic framework for combining observed image data with expectations of that data, based on additional knowledge or prediction, during image segmentation. The following sections briefly introduce Markov Random Field (MRF) segmentation and then survey key literature in the development of both spatio-temporal and spatio-predictive E-MRF techniques. The use of E-MRF models for data fusion is also discussed. Results are presented from recent research, using E-MRF segmentation within an Expectation Maximisation feedback algorithm for robot vision in extremely poor visibility environments.


2. Markov Random Fields

Since the 1970s, there has been increasing interest in the use of MRF models to aid in the restoration and segmentation of digital images,2-4 as they can make up for deficiencies in observed information by adding a-priori knowledge to the image interpretation process in the form of models of spatial interaction between neighbouring pixels. Hence, the classification of a particular pixel is based not only on the intensity of that pixel, but also on the classification of neighbouring pixels. Here, we are concerned with binary segmentation for object tracking, in which pixels can take either of two discrete values, namely object or background. Simplistically, pixels are more likely to belong to the object class if their nearest neighbours are also members of the object class, and similarly for background pixels. Historically, the mathematical concepts originate in the statistical mechanics and mathematics literature.5-7

When segmenting an image containing N pixels, we seek for the i-th pixel a class label, C_i, which maximises the joint probability:

P(C_i) = P(C_1, C_2, \ldots, C_i, \ldots, C_N).   (1)

Unfortunately, this implies that such a probability distribution must explicitly characterise the joint statistics of every pixel. In a binary image, this would consist of 2^N permutations, an impossibly massive space to search every time a pixel needs to be classified. This combinatorial explosion is avoided by treating the image as a Markov random field, the fundamental notion associated with Markovianity being that of conditional independence,8 meaning that the probability distribution that describes a particular pixel can be de-coupled from the classifications of all other pixels in the image, except for those in a small local neighbourhood. For the pixel at image location (i, j):

P(C_{i,j}) = P(C_{i,j} \mid C_{i+m,j+n},\ (m,n) \in k),   (2)

where k denotes a small local neighbourhood around (i, j), typically taken to include the eight nearest neighbour pixels (see figure 1).

To evaluate this expression for specific permutations of neighbourhood class labels, the MRF is characterised by a Gibbs distribution of the form:

P(C_{i,j}) = \frac{e^{-U_{i,j}}}{Z},   (3)

where Z is included as a normalising constant to prevent equation (3) returning probabilities greater than one. The exponential part of this equation is defined as:

U_{i,j} = \sum_{m,n \in k} J(C_{i,j}, C_{i+m,j+n}),   (4)

where J is a function defined as:

J_{a,b} = \begin{cases} -1, & \text{if } a = b \\ 0, & \text{if } a \neq b \end{cases}.   (5)


Fig. 1. Conventional Markovian neighborhood. Grey squares indicate those pixels whose classifications influence the classification of pixel (i,j).

3. Extending Markov dependency

The origins of E-MRF ideas can be traced to the work of Bouthemy,9,10 who is concerned with the interpretation of murky underwater image sequences for robot navigation. Crucially, Bouthemy extends the notion of Markov dependency to include not only contributions from a given pixel's neighbourhood in the observed image, but also a contribution from the corresponding pixel in the previous frame of the image sequence. Thus Markov dependency becomes both spatial and temporal.

E-MRF image models have also been used for data fusion. Jones11,12 uses an E-MRF model to combine a high resolution visible light image with a relatively low resolution thermal image from an infra-red camera, for a surveillance application. Extended-Markov dependency is assumed between pixels in the visible light image and corresponding pixels in the infra-red image.

Fairweather13 and Hodgetts,14 also concerned with underwater robotics, extend Markov dependency such that the local neighbourhood surrounding any particular pixel also includes a contribution from the corresponding pixel in a predicted image (figure 2).

Page 229: 01.AdvancesinPatternRecognition

210 Extended Markov Random Fields for Predictive Image Segmentation


Fig. 2. Extended Markovian neighborhood. Grey squares indicate those pixels whose classifications influence the classification of pixel (i, j). The extended neighbourhood now includes the corresponding pixel from a predicted image.

The predicted image is projected using a 3D model of the object being tracked and an estimate of camera position based on a Kalman filtered model of the robot trajectory. The method is tested on a variety of degraded images, demonstrating superior performance to both conventional2-4,15 and spatio-temporal9,10 MRF segmentation.

Equations (2) to (5) describe a conventional MRF image model in which pixel class labels are considered to be spatially dependent. However, Markov dependency is now extended to include a contribution from a predicted class label, \hat{C}_{i,j}, derived from a previous frame,9,10 a predicted image,13,14 or a corresponding image from a different kind of imaging device.11,12 Now:

P(C_{i,j}) = P(C_{i,j} \mid C_{i+m,j+n\,(m,n \in k)}, \hat{C}_{i,j}) = \frac{e^{-U_{i,j}}}{Z}.   (6)

The exponential part of the Gibbs distribution now consists of weighted components:

U_{i,j} = \sum_{m,n \in k} S_1 J(C_{i,j}, C_{i+m,j+n}) + S_2 J(C_{i,j}, \hat{C}_{i,j}),   (7)

where S_1 and S_2 are weighting constants which adjust the relative significance of information derived from the observed image versus information derived from the predicted image.

Thus, the Extended-Markov Random Field model provides a convenient means of determining the prior probability distribution for any particular pixel class label, while incorporating additional prior knowledge in the form of temporal or predictive dependencies, or fused data from an alternative imaging system.

4. Segmentation via maximum likelihood

We wish to segment each pixel (i,j), of intensity I_{i,j}, by choosing a class label, C_{i,j}, which maximises the a-posteriori probability P(C_{i,j} | I_{i,j}). From Bayes' law we have:

P(C_{i,j} \mid I_{i,j}) \propto P(I_{i,j} \mid C_{i,j}) \times P(C_{i,j}),   (8)

where the prior probability, P(C_{i,j}), is modelled by the Extended-Markov Random Field, equation (6).

Various approaches are possible for determining the class conditional probabilities, P(I_{i,j} | C_{i,j}). In general it is not possible to know the true shape of these distributions over an image without already knowing the true segmentation of that image.

Often class conditional distributions are estimated from historical data, e.g. averaged or best fitted over a training set of images for which the true class labels are known. This approach may not be appropriate if, for example, lighting conditions change significantly with time over the image sequence, as might be expected in an underwater environment where a directional light source is mounted on a moving vehicle. Different distributions may be necessary for different images.

If the class conditional distributions tend to conform to a specific shape or model (often a normal distribution is assumed), then that model can be best fitted directly to the observed image statistics.16

Other authors13,14 assume accurate prior knowledge of the class conditional distributions when computing conventional, spatio-temporal or spatio-predictive MRF probabilities.

Here we describe a novel approach, which makes additional use of prior knowledge, for predicting the shape of the class conditional distributions, P(I_{i,j} | C_{i,j}). The vision system is allowed to re-learn new class-conditional models for each image frame by making the approximation:

P(I_{i,j} \mid C_{i,j}) \approx P(I_{i,j} \mid \hat{C}_{i,j}),   (9)

where \hat{C}_{i,j} denotes the predicted class label of the pixel (i,j). In other words, the predicted image (found by projecting the object model based on the estimated camera co-ordinates) is used to define a set of provisional (predicted) class labels, \hat{C}, for the observed image, from which the means and variances of the pixel intensities for each image region (object and background) can be estimated. The validity of this approximation is obviously dependent on how closely the estimated camera co-ordinates approximate the true camera co-ordinates.

We approximate the class conditional distributions with Normal distributions, which are particularly useful since they are of exponential form. The prior probabilities (equation 6) are also of exponential form, and so it is easy to arrive at a log-likelihood function. The overall likelihood for a particular classification of a particular pixel is now:

P(C_{i,j}) \times P(I_{i,j} \mid C_{i,j}) = \frac{e^{-U_{i,j}}}{Z} \times \frac{1}{\sigma_{C_{i,j}} \sqrt{2\pi}} \exp\left( -\frac{(I_{i,j} - \mu_{C_{i,j}})^2}{2 \sigma_{C_{i,j}}^2} \right),   (10)

where \sigma_{C_{i,j}}^2 and \mu_{C_{i,j}} are the variance and mean of the class conditional distribution of pixel intensities that corresponds to the choice of C_{i,j} currently being considered for pixel (i,j). This results in the negative log-likelihood function:

\sum_{m,n \in k} S_1 J(C_{i,j}, C_{i+m,j+n}) + S_2 J(C_{i,j}, \hat{C}_{i,j}) + \frac{(I_{i,j} - \mu_{C_{i,j}})^2}{2 \sigma_{C_{i,j}}^2} + \ln \sigma_{C_{i,j}}.   (11)

S_1 and S_2 are weights which determine the significance of the class values of the nearest neighbour pixels and of the predicted pixels respectively; they thus affect the relative significance of observed and predicted data. It is not obvious how these values should be determined, and other researchers9,10,13-15 suggest experimenting to find useful values for these constants by trial and error. In good visibility, it is desirable to rely on observed information while taking comparatively little notice of error prone predictions derived from extrapolating the previous camera trajectory; hence S_1 will be large and S_2 comparatively small. Conversely, given the absence of observed information in bad visibility conditions, it is necessary to make greater use of predicted information, and much larger values of S_2 must be used. Future work may investigate methods by which these values can be automatically adjusted in response to varying visibility conditions.

The space of all possible image interpretations contains many variables, since it is necessary to consider all possible class label permutations over all pixels in the image. It is not possible to search this space (of size 2^N, where N is the number of pixels in the image) exhaustively in order to locate its global minimum (minimum negative log-likelihood). A variety of iterative algorithms have been suggested which attempt to optimise the set of pixel class labels with respect to a statistical criterion. For a review and comparison of several of these techniques see Dubes.15

Simulated Annealing4 (SA) belongs to the class of stochastic relaxation algorithms. SA is theoretically guaranteed to find a globally optimal labelling; however, it is relatively computationally expensive and slow. In contrast, the Iterated Conditional Modes (ICM) algorithm2,3 is not guaranteed to find an optimum set of pixel labels, being vulnerable to convergence on local minima. It is, however, several orders of magnitude faster than simulated annealing and therefore more suitable for real time applications. Other approaches include Maximiser of Posterior Marginals2,3,15,17 (MPM), Highest Confidence First18,19 (HCF), and Graph Cut methods.19,20

For simplicity, we use the ICM algorithm in our work for proof of principle, as do the authors of much of the related research; a sketch is given below.
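A sketch of one raster-scan ICM sweep under the E-MRF energy of equation (11) follows (binary labels; the 8-neighbour spatial term, the predictive term and the Gaussian class-conditional term follow the definitions above, while everything else, including the function name, is our own illustration):

import numpy as np

def icm_sweep(labels, I, C_hat, mu, sigma, S1=1.0, S2=1.0):
    # labels, C_hat: (H, W) arrays of 0/1 (current and predicted class labels)
    # I: observed intensities; mu, sigma: per-class means and standard deviations
    H, W = labels.shape
    new = labels.copy()
    for i in range(H):
        for j in range(W):
            best_c, best_e = 0, np.inf
            for c in (0, 1):
                nb = labels[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2]
                n_agree = np.sum(nb == c) - (labels[i, j] == c)  # exclude the centre pixel
                e = -S1 * n_agree                                # spatial term (J = -1 if equal)
                e += -S2 if C_hat[i, j] == c else 0.0            # predictive term
                e += (I[i, j] - mu[c]) ** 2 / (2 * sigma[c] ** 2) + np.log(sigma[c])
                if e < best_e:
                    best_c, best_e = c, e
            new[i, j] = best_c
    return new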

5. Mutual refinement of segmentation and pose estimates via the EM / E-MRF algorithm

We have further developed these ideas in several ways. Firstly, we relearn class models at every frame, equation (9). Secondly, we note that the use of prediction results in an iterative feedback scheme (figure 3), which can be shown to be a variant of the Expectation Maximisation (EM) algorithm.21

An estimated camera position (derived from a motion model) is used to generate a predicted image by projecting a known model of the observed object. Predicted class labels, from the predicted image, help segment the observed image via E-MRF segmentation. The object model is then best fitted to the segmented image to extract an improved camera position estimate, which is then recycled as an input to the next iteration. Iteration can be terminated when no further changes to pixel class labels occur, or when changes in estimated position fall below a threshold. The number of iterations will vary with the accuracy of the initial position estimate.

Page 231: 01.AdvancesinPatternRecognition

212 Extended Markov Random Fields for Predictive Image Segmentation


Fig. 3. An iterative segmentation and pose estimation scheme for robot navigation. An estimated camera position is used to project a predicted image which helps segment an observed image using E-MRF. A model of the observed object is then fitted to the segmented image to extract an improved estimate of camera position relative to the object. The improved camera position is used to project an improved predicted image and the process is iterated until convergence.

6. Results

We demonstrate E-MRF segmentation with an image taken from our data set of real video sequences with known ground-truth.22 These images feature a scale model oil-rig structure viewed in extremely bad visibility. Figure 4 shows the results of four iterations of E-MRF segmentation. The superimposed red wire frames indicate the position estimates and corresponding predicted image segmentations at each iteration. Note that the initial predicted position (first image) is significantly erroneous, and that the algorithm iteratively homes in on the true position and corresponding correct segmentation. Also note that the model fitting process enables the algorithm to ignore the large artefact (bottom right corner of the image) caused by backscattering of spotlight illumination. Projecting a predicted image from the most recent position estimate gives a very clean secondary segmentation (defined by the red wire frame outline in the figures).

Figure 5 shows the results of thresholding the image and segmenting using a conventional MRF model. Both these segmentation methods are inadequate for such poor visibility conditions, although the thresholded image is useful for initialising the E-MRF optimisation.

Additionally, our technique for predicting class conditional distributions, equation (9), enables these distributions to be relearned with each iteration. Figure 6 shows how these distributions change over four iterations: the object distribution sharpens and pulls to the right as the algorithm learns that the object is relatively light coloured, and consistently so, while the background distribution correspondingly pulls to the left as the algorithm learns that the background pixels are dark.

Fig. 4. Both camera position estimate and image interpretation are mutually refined over four iterations of E-MRF segmentation. Left column shows the E-MRF segmentations, which improve with each iteration. Right column shows the resulting position estimates and corresponding predicted images. Incorrect pixel labels (in the secondary projected segmentation) at each iteration are 12%, 7.2%, 1.6% and 0.8%.

7. Conclusion

MRFs have long been used as models for segmenting individual images. More recently, researchers have explored ways of extending these models to incorporate the extra knowledge inherent in video sequences. These extensions include assuming temporal relationships between pixels of successive image frames,


Fig. 5. Thresholded image (left), used to initialise the E-MRF segmentation, and conventional MRF segmentation (right).

Fig. 6. Re-learning of class conditional distributions with each iteration.

and also relationships between pixels in an observed image and corresponding pixels in a predicted image. Similar techniques have also been used to fuse data from two different kinds of imaging device.

E-MRF techniques are particularly useful in conditions of extremely poor visibility, for which conventional MRF models and other segmentation techniques are inadequate.

These ideas have often been presented in the context of underwater robotics. However, it is possible that E-MRF techniques could be applied to many other data fusion tasks which involve discretely partitioned spaces, e.g. 2D and 3D medical imaging (combining information from different kinds of imaging device) and computational ocean modelling (fusing now-casts and previously generated forecasts).

References

1. N. Pal and S. Pal. A review on image segmentation techniques. Pattern Recognition, 26(9): 1277-1294, 1993.

2. J. Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, series B, 36:192-236, 1974.

3. J. Besag. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, series B, 48:259-302, 1986.

4. S. Geman and D. Geman. Stochastic relaxation: Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721-741, 1984.

5. W. Gibbs. Elementary Principles of Statistical Mechanics. Yale University Press, 1902.

6. A. Markov. Extension of the law of large numbers to dependent events (in Russian). Bull. Soc. Phys. Math. Kazan, 2(15):155-156, 1906.

7. E. Ising. Zeitschrift für Physik, 31, 1925.

8. J. Zhang, P. Fieguth, and D. Wang. Random Field Models. Academic Press, 2000.

9. P. Bouthemy and P. Lalande. Determination of apparent mobile areas in an image sequence for underwater robot navigation. Proceedings of IAPR Workshop on Computer Vision: Special Hardware and Industrial Applications, pages 409-412, 1988.

10. P. Bouthemy and P. Lalande. Motion detection in an image sequence using gibbs distributions. Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1651-1654, 1989.

11. G. Jones, M. Hodgetts, R. Allsop, N. Sumpter, and M. Vicencio-Silva. A novel approach for surveillance using visual and thermal images. Proceedings of the DERA/IEE Workshop on Intelligent Sensor Processing, 2001.

12. G. Jones, R. Allsop, and J. Gilby. Bayesian analysis for fusion of data from disparate imaging systems for surveillance. Image and vision computing, 21:843-849, 2003.

13. A. Fairweather. Robust Interpretation of Underwater Image Sequences. PhD Thesis, University of London, 1997.

14. M. Hodgetts, A. Greig, and A. Fairweather. Underwater imaging using markov random fields with feed forward prediction. Journal of the Society for Underwater Technology, 23(4):157-167, 1999.

15. R. Dubes, A. Jain, S. Nadabar, and C. Chen. MRF model-based algorithms for image segmentation. Proceedings IEEE 10th International Conference on Pattern Recognition, pages 808-814, 1990.

16. J. Kittler, J. Illingworth, and J. Foglein. Threshold selection based on a simple image statistic. Computer Vision, Graphics and Image Processing, 30:125-147, 1985.

17. J. Marroquin, S. Mitter, and T. Poggio. Probabilistic solution of ill-posed problems in computational vision. Journal of the American Statistical Association, 82:76-89, 1987.

18. P. Chou and C. Brown. The theory and practice of Bayesian image labelling. International Journal of Computer Vision, 4:185-210, 1990.

19. T. Meier, K. Ngan, and G. Crebbin. A robust Markovian segmentation based on highest confidence first. Proceedings of the international conference on image processing, 1:216-219, 1997.


20. Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. Proceedings of the International Conference on Computer Vision, 1:377-384, 1999.

21. A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, series B, 39:1-38, 1977.

22. R. Stolkin, A. Greig, and J. Gilby. A calibration system for measuring 3d ground truth for validation and error analysis of robot vision algorithms. Journal of Measurement Science and technology, 2006.


External Force Modeling of Snakes Using DWT for Texture Object Segmentation

Surya Prakash and Sukhendu Das

Visualization and Perception Lab, Department of Computer Science and Engineering,

IIT Madras Chennai-600 036, India.

E-mail: [email protected], [email protected]

Snakes, also known as active contours, are extensively used in computer vision and image understanding applications. They are energy minimizing deformable contours that converge at the boundary of an object in an image. Deformation of the contour is caused by internal and external forces acting on it. The internal force is derived from the contour itself and the external force is invoked from the image. Traditional active contours proposed by Kass et al. only work for normal intensity images and fail to perform the segmentation task in the presence of texture. This limitation stems from the limited ability of the external force present in the traditional snake, which directly uses image pixel intensity information in its formulation. Here, we present a new external force modeling technique for snakes which can work in the presence of texture. It uses texture features for external force modeling, derived using the discrete wavelet transform. To demonstrate our model, we use various synthetic and natural texture images.

Keywords: Snakes, active contours, texture, object segmentation, wavelet, DWT

1. Introduction

Snakes,1 also known as active contours, are extensively used in computer vision and image understanding applications. They are energy minimizing deformable contours that converge at the boundary of an object in an image. Deformation of the contour is caused by internal and external forces acting on it. The internal force is derived from the contour itself and the external force is invoked from the image. The internal and external forces are defined so that the snake will conform to an object boundary or other desired features within the image. Snakes are widely used in many applications such as segmentation,2 shape modeling,3 edge detection1 and motion tracking.4 Active contours can be classified as either parametric active contours1,5 or geometric active contours6,7 according to their representation and implementation. In this work, we focus on parametric active contours, which synthesize parametric curves within the image domain and allow them to move towards the desired image features under the influence of internal and external forces. The internal force serves to impose piecewise continuity and smoothness constraints, whereas the external force pushes the snake towards salient image features like edges, lines and subjective contours. The external force in the traditional snake is defined as the negative of the image gradient. In the presence of such an external force, the snake is attracted towards large image gradients, i.e.

towards the edges in the image. So if it is applied to textured images, it will often get stuck on local texel (micro-units or cells of a texture) edges and converge at a non-object boundary.

To overcome this effect, we present here a new class of external force for textured images, which we name texture force. The snake in the presence of texture force runs over the texture image surface and detects the boundary of a texture object on a background texture. Texture force does not directly use the image pixel intensity values for its modeling; it considers the texture properties of the image.

2. Background

2.1. Parametric Snake Model

A traditional active contour is defined as a parametric curve $v(s) = [x(s), y(s)]$, $s \in [0,1]$, which minimizes the following energy functional

$$E = \int_0^1 \frac{1}{2}\left(\alpha\,|v'(s)|^2 + \beta\,|v''(s)|^2\right) + E_{ext}(v(s))\, ds \qquad (1)$$

where $\alpha$ and $\beta$ are weighting constants controlling the relative importance of the elastic and bending abilities of the snake respectively, $v'(s)$ and $v''(s)$ are the first and second order derivatives of $v(s)$, and $E_{ext}$ is derived from the image so that it takes its smaller values at features of interest such as edges and object boundaries. For an image $I(x,y)$, where $(x,y)$ are spatial co-ordinates, the typical external energy is


defined as follows to lead the snake towards step edges1

$$E_{ext} = -|\nabla I(x,y)|^2 \qquad (2)$$

where $\nabla$ is the gradient operator. A snake that minimizes $E$ must satisfy the following Euler equation

$$\alpha v''(s) - \beta v''''(s) - \nabla E_{ext} = 0 \qquad (3)$$

Eq. (3) can also be viewed as a force balance equation

$$F_{int} + F_{ext} = 0 \qquad (4)$$

where $F_{int} = \alpha v''(s) - \beta v''''(s)$ and $F_{ext} = -\nabla E_{ext}$. $F_{int}$, the internal force, is responsible for stretching and bending, and $F_{ext}$, the external force, attracts the snake towards the desired features in the image.
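For illustration, Eqs. (2)–(4) translate directly into a few lines of numpy; the Gaussian pre-smoothing below is a common practical choice and an assumption here, not part of the formulation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gradient_external_force(image, sigma=1.0):
    """E_ext = -|grad I|^2 (Eq. 2) and F_ext = -grad E_ext (Eq. 4).
    Gaussian pre-smoothing is an illustrative practical choice.
    Returns the (x, y) force components."""
    smoothed = gaussian_filter(image.astype(float), sigma)
    gy, gx = np.gradient(smoothed)      # image gradient
    e_ext = -(gx ** 2 + gy ** 2)        # external energy (Eq. 2)
    ey, ex = np.gradient(e_ext)         # grad E_ext
    return -ex, -ey                     # F_ext = -grad E_ext
```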

2.2. Discrete Wavelet Transform and Scalogram

The discrete wavelet transform (DWT) analyses a signal based on its content in different frequency ranges. Therefore, it is very useful in analyzing repetitive patterns such as texture. DWT decomposes a signal into different bands (approximation and detail) with different resolutions in frequency and spatial extent. Let $I(x)$ be the image signal and $\psi_{u,s}(x)$ be a wavelet function at a particular scale; then the signal filtered at point $u$ is obtained by taking the inner product $\langle I(x), \psi_{u,s}(x)\rangle$. This inner product is called the wavelet coefficient of $I(x)$ at position $u$ and scale $s$.8 The scalogram9 of a signal $I(x)$ is the variance of this wavelet coefficient:

$$w(u,s) = E\{|\langle I(x), \psi_{u,s}(x)\rangle|^2\} \qquad (5)$$

In practice, $w(u,s)$ is approximated by convolving the square modulus of the filtered outputs with a Gaussian envelope of a suitable width.9 The scalogram $w(u,s)$ gives the energy accumulated in a band with frequency bandwidth and center frequency inversely proportional to scale.


Fig. 1. (a) Synthetic texture image, (b) Magnified view of the 21 × 21 window of texture cropped at point P (marked in red), shown in Fig. 1(a).

3. Texture Feature Extraction

In this section, we explain how the wavelet transform is used to extract the texture features necessary for texture force estimation, and discuss the computational framework based on multi-channel processing. We use DWT-based dyadic decomposition of the signal to obtain texture properties. The simulated texture image shown in Fig. 1(a) is used to illustrate the computational framework with the results of intermediate processing.

Modeling of texture features at a point in an image involves two steps: scalogram estimation and texture feature estimation. To obtain texture features at a particular point (pixel) in an image, an $n \times n$ window is considered around the concerned point (see Fig. 1(b)). Intensities of the pixels in this window are arranged in the form of a vector of length $n^2$ whose elements are taken column-wise from the $n \times n$ cropped intensity matrix. This intensity vector (signal), which represents the textural pattern around the pixel, is subsequently used in the estimation of the scalogram.


Fig. 2. (a) 1-D texture profile of the texture window shown in Fig. 1(b), (b) Scalogram of the signal shown in Fig. 2(a), (c) Texture feature image for the image shown in Fig. 1(a).

Scalogram estimation: An input signal, obtained by arranging the pixels of the $n \times n$ window as explained above, is used for scalogram estimation. This signal is decomposed using a wavelet filter. We use an orthogonal Daubechies 2-channel wavelet filter (with dyadic decomposition). A Daubechies filter with level-$L$ dyadic decomposition yields the wavelet coefficients $\{A_L, D_L, D_{L-1}, \ldots, D_1\}$, where $A_L$ represents the approximation coefficients and the $D_i$'s are the detail coefficients. The steps of processing to obtain the scalogram from the wavelet coefficients are similar to those described in.10,11 Fig. 2(b) presents an example of the scalogram obtained for the signal shown in Fig. 2(a) using level-4 DWT decomposition.
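As an illustration of these two steps, the following sketch uses PyWavelets to go from a pixel's window to its DWT bands and an approximate scalogram; the wavelet name ('db4', an 8-tap Daubechies filter like the one used in the experiments below) and the Gaussian smoothing width are illustrative assumptions.

```python
import numpy as np
import pywt
from scipy.ndimage import gaussian_filter1d

def scalogram(image, y, x, n=15, wavelet="db4", level=4, smooth=2.0):
    """Texture profile and scalogram at pixel (y, x): crop an n x n window,
    flatten it column-wise into a 1-D signal of length n^2, decompose with
    a dyadic DWT, and approximate the scalogram (Eq. 5) by smoothing the
    squared coefficients of each band with a Gaussian envelope."""
    h = n // 2
    window = image[y - h:y + h + 1, x - h:x + h + 1].astype(float)
    signal = window.flatten(order="F")                   # column-wise vector
    coeffs = pywt.wavedec(signal, wavelet, level=level)  # [A_L, D_L, ..., D_1]
    return coeffs, [gaussian_filter1d(c ** 2, smooth) for c in coeffs]
```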

Texture feature estimation: Once the scalogram of the texture profile at a particular point is obtained, a post-processing step is carried out to eliminate non-significant bands of the scalogram; only significant bands are used for further processing. This is done since only the significant bands carry the major texture feature information. Removal of non-significant bands removes redundant information and speeds up the computation. Let the wavelet decomposition be done up to level $L$, giving the wavelet bands $B = \{A_L, D_L, D_{L-1}, \ldots, D_1\}$, and let $B_i$ be the $i$th wavelet band in set $B$. We use the following algorithm to determine significant and non-significant bands.

for each band Bi in B do
    if variance of band Bi < threshold then
        Bi is a non-significant band
    else
        Bi is a significant band
    end-if
end-for

where the threshold is decided empirically. The variance of band $B_i$ is defined as $\mathrm{var}(B_i) = E[(B_i - \mu_i)^2]$, where $\mu_i$ is the mean of all wavelet coefficients belonging to band $B_i$. Once the estimation of the scalogram and its significant bands is complete, the significant bands are used for texture feature estimation. Texture features are estimated from the "energy measure" of the wavelet coefficients of the significant bands. This texture feature is similar to the "texture energy measure" first proposed by Laws.12

Let $D_k$ be the set of all significant bands for pixel $k$ and $w$ a wavelet coefficient in a band. Then the energy measure of pixel $k$ is calculated as the averaged $l_1$-norm

$$e_k = \frac{1}{n}\left\{\sum_{x \in D_k}\sum_{w \in x} |w|\right\} \qquad (6)$$

where $n$ is the sum of the cardinalities of all the members of $D_k$. These energy measures for all pixels in an image constitute the "texture feature image", which is further used in texture force modeling. Fig. 2(c) shows the texture feature image for the texture image shown in Fig. 1(a). Pixels belonging to the same texture region exhibit the same energy level.

4. Modeling of Texture Force

This analysis is based on the gradient present in the texture feature image. For a given texture image $I(x,y)$, let $F(x,y)$ be the texture feature image obtained as explained in the previous section. The external energy of the snake based on the gradient of the texture feature image can be defined as follows (similar to Eq. (2))

$$E_{ext}^{tex} = -|\nabla F(x,y)|^2 \qquad (7)$$

As in Eq. (4), the texture force (external force), which causes the change in this energy (i.e. $E_{ext}^{tex}$), can be defined as

$$F_{ext}^{tex} = -\nabla E_{ext}^{tex} \qquad (8)$$

To find the object boundary, the active contour deforms, so it can be represented as the time-varying curve $v(s,t) = [x(s,t), y(s,t)]$, where $s \in [0,1]$ is arc-length and $t \in \mathbb{R}^+$ is time. The dynamics of the contour in the presence of texture force are governed by the following equation

$$\gamma v_t = F_{int} + F_{ext}^{tex} \qquad (9)$$

where $v_t$ is the partial derivative of $v$ w.r.t. $t$, $-\gamma v_t$ is the damping force, and $\gamma$ is an arbitrary non-negative constant. $F_{int}$ and $F_{ext}^{tex}$ are the internal and texture forces respectively. The contour comes to rest when the net effect of the damping, internal, and texture forces reaches zero, which eventually happens when the deforming contour reaches the texture object boundary. The texture force developed here pushes active contours towards the texture object boundary.
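A minimal explicit time-stepping of Eq. (9) for a closed contour is sketched below, assuming the texture force field has been precomputed from $F(x,y)$; the circular finite-difference stencils for $v''$ and $v''''$ and the nearest-neighbour force sampling are standard simplifications, not details taken from this paper.

```python
import numpy as np

def evolve_snake(pts, fx, fy, alpha=0.1, beta=0.05, gamma=1.0,
                 dt=1.0, iters=100):
    """Explicit Euler steps of gamma * v_t = F_int + F_ext^tex (Eq. 9).
    pts: (N, 2) closed contour as (x, y) points;
    fx, fy: texture force field components on the image grid."""
    for _ in range(iters):
        # internal force alpha * v'' - beta * v'''' via circular differences
        d2 = np.roll(pts, -1, 0) - 2 * pts + np.roll(pts, 1, 0)
        d4 = np.roll(d2, -1, 0) - 2 * d2 + np.roll(d2, 1, 0)
        f_int = alpha * d2 - beta * d4
        # texture force sampled at the nearest pixel (simplification)
        xi = np.clip(pts[:, 0].round().astype(int), 0, fx.shape[1] - 1)
        yi = np.clip(pts[:, 1].round().astype(int), 0, fx.shape[0] - 1)
        f_ext = np.stack([fx[yi, xi], fy[yi, xi]], axis=1)
        pts = pts + (dt / gamma) * (f_int + f_ext)
    return pts
```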

5. Experimental Results

To get the boundary of a particular object using active contours in the presence of texture, a contour is initialized near the desired object. The contour is then allowed to deform towards the object boundary until it


latches around the object. In the case of textured images, the object boundary is identified as the point where the texture property changes, i.e. where two texture regions meet. On a texture surface, the snake in the presence of texture force stops moving when it reaches a different texture region. For the snake to stop at the texture boundary, the net effect of the damping, internal and external forces should be zero for all snake points at the object boundary. To demonstrate the performance of the snake in the presence of texture force, various kinds of synthetic and natural textures are used. We have used a Daubechies 8-tap 2-channel filter for DWT decomposition.

The first example is of a texture image (Fig. 3) composed of two textures taken from the widely used Brodatz photographic album.13 A contour is initialized around the central texture and is allowed to shrink in the presence of texture force. The snake took 14 iterations to converge at the central texture boundary. Texture features at snake points are estimated by taking a 15 × 15 window at each point. DWT decomposition was done up to level-4. The texture feature image for this test image is shown in Fig. 4(a). The resulting segmentation is shown in Fig. 4(b), where the identified texture object boundary is shown in dark black color.

Fig. 3. Texture image composed of two Brodatz textures.


Fig. 4. (a) Texture feature image for the texture image shown in Fig. 3, (b) Final Segmentation result of Fig. 3. Dark black contour shows the estimated boundary of the central texture.

We present another segmentation result for an image (Fig. 5) composed of two Brodatz textures. Contour convergence to the central texture boundary, in presence of texture force, took 11 iterations. Texture features at contour points are estimated by

taking a 13 × 13 window. DWT decomposition was done up to level-4. The texture feature image for this test image is shown in Fig. 6(a). The resulting segmentation is shown in Fig. 6(b), where the identified texture object boundary is shown in dark black color.

Fig. 5. Texture image composed of two Brodatz textures.


Fig. 6. (a) Texture feature image for the texture image shown in Fig. 5, (b) Final segmentation result of Fig. 5. Dark black contour shows the estimated boundary of the central texture.

Fig. 7. Natural real life test image of zebra.

Fig. 8. (a) Texture feature image of zebra (shown in Fig. 7), (b) Final segmentation result for zebra. Dark black contour shows the estimated boundary of the zebra.

Fig. 7 shows a natural real life test image of zebra. Contour convergence to the boundary of zebra,


in the presence of texture force, took 15 iterations. Texture features at contour points are estimated by taking an 11 × 11 window. DWT decomposition was done up to level-4. The texture feature image of the zebra is shown in Fig. 8(a). The resulting segmentation is shown in Fig. 8(b), where the identified boundary of the zebra is shown in dark black color.


Fig. 9. Comparison of segmentation results of zebra image, (a) Segmentation result obtained using proposed technique, (b) Segmentation result obtained in.14 Dark black contour shows the estimated boundary of zebra in both the cases.

6. Conclusion

In this paper, we have introduced a new external force for snakes, which we call texture force. A snake in the presence of texture force can be used for texture object segmentation. To model the texture force, texture features are first estimated using wavelet decomposition and are then used in the texture force modeling. The texture force is subsequently used in parametric snakes for texture object segmentation. The main novelty of this study is in the representation of the texture features and the modeling of the texture force based on them. We validate our model with a few synthetic and natural texture images. Results obtained using the proposed technique are quite satisfactory. In Fig. 9, we compare our segmentation result for the zebra with the segmentation result obtained by Sagiv et al.14 for the same image. The result obtained by the proposed technique (Fig. 9(a)) is comparable with the result obtained in14 (Fig. 9(b)). Since the proposed segmentation technique uses a parametric snake, it is computationally less expensive compared to the technique presented in,14 which uses a geodesic active contour.

References

1. M. Kass, A. Witkin and D. Terzopoulos, International Journal of Computer Vision 1, 321(January 1988).

2. F. Leymarie and M. D. Levine, IEEE Transactions on Pattern Analysis and Machine Intelligence 15, 617 (1993).

3. D. Terzopoulos and K. Fleischer, The Visual Computer 4, 306 (1988).

4. D. Terzopoulos and R. Szeliski, Tracking with Kalman snakes, in Active Vision, eds. A. Blake and A. L. Yuille (MIT Press, Cambridge, MA, 1992) pp. 3-20.

5. L. D. Cohen, CVGIP: Image Understanding 53, 211 (1991).

6. V. Caselles, F. Catte and T. Coll, Numerische Mathematik 66, 1 (1993).

7. V. Caselles, R. Kimmel and G. Sapiro, International Journal of Computer Vision 22, 61 (1997).

8. S. Mallat, A Wavelet Tour of Signal Processing (Academic Press, 1999).

9. M. Clerc and S. Mallat, IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 536(April 2002).

10. T. Greiner and S. Das, Automatisierungstechnik (2005).

11. S. G. Rao, S. Das, T. Greiner and M. Kalra, Accepted in IEE VIE Conference, Bangalore, September 26-28 (2006).

12. K. Laws, Textured image segmentation, PhD thesis (Dept. of Electrical Engineering, University of Southern California, 1980).

13. P. Brodatz, Textures: A photographic album for artists and designers (1966).

14. C. Sagiv, N. Sochen and Y. Zeevi, IEEE Transactions on Image Processing 1, 1 (February 2004).


I-FISH: Increasing Detection Efficiency for Fluorescent Dot Counting in Cell Nuclei

Shishir Shah*

Quantitative Imaging Laboratory University of Houston

Department of Computer Science Houston, TX, U.S.A.

* E-mail: [email protected]

Fatima Merchant

Advanced Digital Imaging Research, LLC. 2450 South Shore Blvd., Suite 305

League City, TX U.S.A. E-mail: [email protected]

This paper presents a methodology and results for increased detection efficiency (IDE) in the analysis of Fluorescent in situ Hybridization (FISH) images by recovering the radiance curve of the imaging device (CCD camera) used to capture the fluorescence signals and remapping the nonlinear luminance to generate a high contrast high dynamic range image. The remapping is based on a criterion that allows for maximum detectability of the signals. This leads to images where saturated values are attenuated while the weak signals are amplified. An automated dot counting algorithm is used to process the remapped images and evaluate the dot detection efficiency. Results of dot counting on a set of 2000 images are presented and compared to those obtained without the use of the proposed methodology.

Keywords: Image Enhancement; Nonlinear Radiance Mapping; Fluorescence in situ Hybridization

1. Introduction

Chromosomal aberrations are variations from the normal, either in structure or number of chromosomes, which result from an exchange of genetic material between two or more chromosomes or from a rearrangement of genetic sequences contained in a single chromosome. The analysis of such aberrations can be useful both in a clinical and in a toxicological context. Different staining techniques allow analysis of different kinds of abnormalities. A particularly useful cytogenetic technique for the analysis of aberrations is Fluorescence in situ Hybridization (FISH).1

FISH relies on the use of fluorescent dyes that are attached to DNA probes and, when excited by one wavelength of light, emit light at a second, longer wavelength. In a typical FISH study, three fluorescent dyes are used: one to stain the cell nuclei, and two to detect DNA sequences. FISH signals in interphase cells become visible as colored dots. The analysis of a preparation consists of detecting the dots, after which the number of dots per cell can be counted or their relative positions can be measured. With chromosome enumeration, dots are counted for a large number of cells to determine the distribution of chromosomes per cell and to be able to detect small aberrant sub-populations. In a clinical setting, the number of cells to be analyzed depends on the frequency of aberrant cells and the count frequency. In practical situations, this can vary from a few to more than 10,000 cells.2-4

FISH has been applied to preimplantation genetic diagnosis of X-linked diseases,5-7 common aneuploidies using either human blastomeres or oocyte bodies8 and, more recently, translocations.9 Although FISH is widely used for genetic analyses, its reliability and accuracy depend on the types of probes, cells and their fixation. The major disadvantage associated with the use of fluorescent dyes is the weak signal emanating from a cell or its labeled constituents. In typical studies, a background pixel consists of 5,000 photons, a DAPI-stained nucleus pixel is an additional 3,000 photons, and a labeled chromosome pixel can be an additional 10,000 photons. These numbers are at least a factor of 100 smaller than the values that are obtained for images acquired through a conventional CCD camera at room-level illumination.10,11 Several investigators have evaluated the hybridization efficiencies of interphase FISH using probes specific for chromosomes 13, 18, 21, X and Y. The results reported vary from 60% to 98% for XY detection and 42% to 88% for autosomal aneuploidy detection depending on the type


of cells and the choice of probes.10,12-16 These results can be acceptable for rapid detection of chromosome aneuploidies; however, when the cells sought represent a very low proportion of the total cell population, a very high efficiency is desired while maximizing specificity and sensitivity. Proposed approaches target either the optimization of FISH protocols, including alternate selection of fluorophores, or identification of different cell types for investigation.8,17

In this paper, we present an approach to improve the efficiency of detecting fluorescent signals in FISH images by recovering the radiance map of the CCD camera. In general, there exists a nonlinear relationship between the true radiance of the fluorescent probe and the detected pixel value. Although the charge collected by a CCD element is proportional to its irradiance, most cameras apply a nonlinear mapping to the CCD outputs before they are written to the storage medium. The most significant nonlinearity in the response curve of any camera is at its saturation point, where any pixel with a radiance above a certain level is mapped to the same maximum image value. Further, due to the limited dynamic range of the camera, two signals having radiance values that exceed the mapping limits of the camera cannot be sensed simultaneously. This leads to saturation on one extreme or extremely weak or missing signals on the other. We use a simple technique to recover the response curve by using a set of images taken with varying but known exposure durations. This allows us to generate a high dynamic range (HDR) image that covers the full extent of the radiance captured by all the images. We further analyze the distribution of each of the fluorescent signals and remap the HDR image values to generate a high contrast image with compressed dynamic range. The remapping is based on a criterion that allows for maximum detectability of the signals. This leads to images where saturated values are attenuated while the weak signals are amplified. By testing our approach on a large dataset of FISH images, we show that dot counting specificity and sensitivity can be improved.

The rest of this paper is organized as follows: Section 2 describes the imaging setup and the method for recovering the CCD response curve to generate a high dynamic range image. Image contrast enhancement and remapping of the computed response curve are presented in section 3. Results of the developed

methodology are presented in section 4 along with comparison of results obtained on the same dataset processed without correcting for the camera response curve. Finally, conclusions and a summary of this study are presented in section 5.

2. High Dynamic Range Imaging

Computing and acquiring images with high dynamic range can be done in various ways. In this paper, we utilize the method proposed by Debevec & Malik.18

Interphase FISH slides labeled with DAPI, FITC, and Texas Red were used for analysis. A series of images at exposures varying from 0.03 seconds to 1.4 seconds in intervals of 0.04 seconds was collected. These images formed the input for recovering the camera response curve. If $E_i$ is the irradiance value at pixel $i$, $\Delta t_j$ is the set of known exposures, and $Z_{ij}$ is the image pixel value at spatial index $i$ at exposure $j$, the camera response function $f$ can be computed as:

$$Z_{ij} = f(E_i\,\Delta t_j) \qquad (1)$$

Assuming $f$ to be monotonic, the irradiances $E_i$ and the inverse function of $f$ can be solved for in the least-square error sense by rewriting equation (1) as:

$$g(Z_{ij}) = \ln E_i + \ln \Delta t_j \qquad (2)$$

where $g = \ln f^{-1}$. To better enhance the weak and the saturated intensities, a weighted quadratic function is used that allows the function $g$ and the irradiances $E_i$ to be recovered that best satisfy the set of equations arising from (2). Recovering $g$ only requires recovering the finite number of values that $g(Z)$ can take, since the domain of $Z$, pixel brightness values, is finite. Letting $Z_{min}$ and $Z_{max}$ be the least and greatest pixel values, $N$ be the number of pixel locations and $P$ be the number of images, the problem is formulated as one of finding the $(Z_{max} - Z_{min} + 1)$ values of $g(Z)$ and the $N$ values of $\ln E_i$. This is done by minimizing the following quadratic objective function:

$$O = \sum_{i=1}^{N}\sum_{j=1}^{P}\left\{w(Z_{ij})\left[g(Z_{ij}) - \ln E_i - \ln \Delta t_j\right]\right\}^2 + \lambda \sum_{Z = Z_{min}+1}^{Z_{max}-1}\left[w(Z)\,g''(Z)\right]^2 \qquad (3)$$

where $w$ defines the weight factor that moderates weak and saturated intensities and $\lambda$ is a regularization parameter that moderates the smoothness


term given by the sum of squared values of the second derivative of g to ensure that the function g is smooth.

Once the response curve $g$ is recovered, it can be used to quickly convert pixel values to relative radiance values, assuming the exposure $\Delta t_j$ is known. We compute the curve based on one set of images and use the same curve to recover the radiance values for every other image taken while imaging all the cells. The radiance curve is independently computed for each of the three image channels used to capture the fluorescence signal, based on the DAPI (blue), FITC (green), and Texas Red (red) labeling of the nuclei. A representative set of images taken to recover the response curve is shown in figure 1 and the corresponding response curve in figure 2.

Fig. 1. Images acquired to recover the camera response curve at 0.1, 0.5, 1.0, and 1.4 seconds.

[Plot: normalized response curve g(Z) versus pixel value Z, shown for the green, red, and blue channels.]

Fig. 2. Recovered response curve for the CCD for each of the three color channels.
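A compact single-channel version of this recovery, following the published least-squares formulation of Debevec and Malik,18 is sketched below; the hat-shaped weighting function, the mid-scale anchoring and the regularization weight are standard choices from that formulation and assumptions here, not values reported in this paper.

```python
import numpy as np

def recover_g(Z, log_dt, lam=50.0):
    """Least-squares solve of Eq. (3) for one channel (after Debevec & Malik).
    Z: (N, P) integer array of sampled 8-bit pixel values at N locations and
    P exposures; log_dt: (P,) log exposure times. Returns g(z) for z = 0..255."""
    N, P = Z.shape
    L = 256
    w = lambda z: np.minimum(z, 255 - z) + 1          # hat-shaped weighting
    A = np.zeros((N * P + L - 1, L + N))
    b = np.zeros(N * P + L - 1)
    k = 0
    for i in range(N):                                # data-fitting terms
        for j in range(P):
            wij = w(Z[i, j])
            A[k, Z[i, j]] = wij                       # w * g(Z_ij)
            A[k, L + i] = -wij                        # -w * ln E_i
            b[k] = wij * log_dt[j]                    # w * ln(dt_j)
            k += 1
    A[k, 128] = 1.0                                   # fix scale: g(128) = 0
    k += 1
    for z in range(1, 255):                           # smoothness terms
        wz = lam * w(z)
        A[k, z - 1], A[k, z], A[k, z + 1] = wz, -2 * wz, wz
        k += 1
    return np.linalg.lstsq(A, b, rcond=None)[0][:L]   # g(z); rest are ln E_i

def to_log_radiance(img, g, log_dt_j):
    """ln E = g(Z) - ln(dt_j), from Eq. (2), for one exposure's image."""
    return g[img] - log_dt_j
```

The same `recover_g` would be run once per color channel, as described above, and `to_log_radiance` then applied to every subsequently acquired image.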

3. Contrast Enhancement and Image Remapping

Processing an image globally based on its recovered irradiance values leads to computational complexities. It also generally results in loss of contrast, thereby resulting in the loss of image details. We present an irradiance profile mapping algorithm that combines local filtering and global processing based on center-surround Retinex filters to overcome dot detection problems commonly encountered, such as signal blooming due to the proximity of two fluorescence dots of very different intensity. For example, if a dim dot is close to a bright dot, the bright pixels can influence the processing of the dim area and can cause a black region around the bright area. On the other hand, local filtering alone tends to make pure black and pure white low contrast areas turn gray, which can lead to problems in the separability of two fluorescent dots.

In our approach, each of the image channels is processed independently. The luminance of each channel is treated by the Retinex-based adaptive filter method. The luminance $\psi$ is assumed to be linear with respect to scene radiance and is first globally compressed. Then, the Retinex-based adaptive filter method is applied in the log-domain to the globally corrected luminance $\psi'$ to give the treated luminance $\psi_{new}$. In general, the method consists of two parts: a preliminary global irradiance mapping followed by Retinex-based local processing. The global irradiance mapping that is applied to the linear image performs a first compression of the dynamic range. The mapping function used is a power function, and the curvature of the function, which determines the adaptation state, depends on the mean luminance of the total field of view.19 Consequently, we compute the exponent of the power function from the average luminance of the image. Let $\psi$ be the luminance image, whose maximum value is 1. The non-linear luminance $\psi'$ is given by

$$\psi' = \psi^{1/\gamma} \qquad (4)$$

where the value of $\frac{1}{\gamma}$ is an affine function of the average luminance of the image, $\psi_{av}$, defined as:

$$\psi_{av} = \frac{1}{N}\sum_{p \in \psi} p \qquad (5)$$

where $N$ is the number of pixels in the luminance image $\psi$ and $p$ is the pixel value in $\psi$. The coefficients of the affine function were defined experimentally, assigning $\frac{1}{\gamma} = 1$ since a high or average key image

Page 242: 01.AdvancesinPatternRecognition

Shishir Shah and Fatima Merchant 223

is not globally compressed. As the average luminance decreases, the exponent decreases, thereby increasing the sensitivity for dark areas.
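A sketch of this global stage (Eqs. 4 and 5) follows; since the affine coefficients relating average luminance to the exponent were determined experimentally and are not reported here, the values below are placeholders chosen only to reproduce the described behaviour (bright images barely compressed, darker images compressed more).

```python
import numpy as np

def global_compress(psi, a=-0.7, b=1.0):
    """Power-law compression psi' = psi**(1/gamma) (Eq. 4), with 1/gamma an
    affine function of the average luminance (Eq. 5). psi is normalised to a
    maximum of 1; a and b are illustrative affine coefficients: for psi_av
    near 1 the exponent approaches 1 (no compression), and it decreases as
    the image darkens, increasing sensitivity in dark areas."""
    psi_av = psi.mean()                                  # Eq. (5)
    inv_gamma = np.clip(a * (1.0 - psi_av) + b, 0.1, 1.0)
    return psi ** inv_gamma
```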

After global processing, local adaptation is performed using a surround-based Retinex method. Traditionally, surround-based Retinex methods20-22 compute a new value for each pixel by taking the difference between the log-encoded treated pixel and the log-encoded value of a mask. The mask represents a weighted average of the treated pixel's surrounding area. A drawback of the surround-based methods is that small filters tend to make pure black or pure white low contrast areas turn gray. This is due to local normalization. To overcome this problem, we use a weighted mask function $m(x,y)$ that is sensitive to the conservation of local details. Thus, each pixel in the new luminance image is computed as:

$$\psi_{new}(x,y) = \log(\psi'(x,y)) - \beta(x,y)\cdot\log(m(x,y)) \qquad (6)$$

The $\beta$ factor is based on a sigmoid function and maps white to white and black to black. For a pixel of high intensity, the mask is weighted by a value close to 0. Since the mask is subtracted from the log-encoded luminance, this effectively keeps the pixel bright. Similarly, a pixel of low intensity is weighted by a value close to 1, which has the effect of maintaining black. This function lets the middle gray values change without constraint while restricting black to remain black and white to remain white, thereby allowing maximum spread of the weak and very bright intensities. The function can be written as:

$$\beta(x,y) = 1 - \frac{1}{1 + e^{-a\,(\psi'(x,y) - 0.5)}} \qquad (7)$$

where $a$ modulates the slope of the sigmoid. To enhance the separability and detectability of fluorescent dots in the image, we further adapt the filter to follow image edges. In this way, a bright area has less influence on the treatment of a neighboring dim area. The mask is thus computed specifically for each pixel, where $m(x,y)$ is given by:

$$m(x,y) = \sum_{\theta=0}^{360}\sum_{r=0}^{r_{max}} \psi'(x + r\cos\theta,\; y + r\sin\theta)\, e^{-r^2/\sigma_{\theta,r}^2} \qquad (8)$$

where $\theta$ is the angle in the radial direction, $r$ is the distance to the central pixel and $\sigma_{\theta,r}$ is set to $\sigma_{high}$ or $\sigma_{low}$ based on the presence or absence of a high contrast edge, respectively. The numerical values of $\sigma_{high}$ and $\sigma_{low}$ are chosen to be fractions of the image size. In our implementation, we use the Canny edge detector to detect high contrast edges.23 The thresholds for strong and weak edges are fixed values chosen experimentally and kept the same for all images.
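The local stage (Eqs. 6 and 7) is sketched below with a single isotropic Gaussian surround in place of the edge-adaptive radial mask of Eq. (8); the choice between sigma_high and sigma_low along edge directions is omitted for brevity, so this illustrates the weighting scheme rather than the full edge-following filter.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def retinex_local(psi_p, a=10.0, sigma=15.0, eps=1e-6):
    """psi_new = log(psi') - beta * log(m)  (Eq. 6), with the sigmoid
    weight beta of Eq. (7). The mask m is approximated here by a single
    Gaussian surround instead of the edge-adaptive mask of Eq. (8)."""
    m = gaussian_filter(psi_p, sigma) + eps                  # surround mask
    beta = 1.0 - 1.0 / (1.0 + np.exp(-a * (psi_p - 0.5)))    # Eq. (7)
    return np.log(psi_p + eps) - beta * np.log(m)
```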

4. Results

The developed approach has been tested on a large dataset of images where DAPI was used as the nucleus counterstain, FITC, and Texas Red used to mark the desired chromosomes. In collecting the dataset, each image was captured based on a preset exposure and the HDR image generated from the pre-computed radiance profile. Figure 3 shows three images acquired based on low, medium, and high exposure settings. The corresponding remapped HDR image is shown in figure 4. All images are of lymphocytes from cultured blood from normal specimens.


Fig. 3. Images acquired at three exposure settings: (a) low; (b) mid; and (c) high.

Fig. 4. Image that has increased dynamic range based on the irradiance profile of the camera.

To analyze the dataset for dot counting, an algorithm similar to the one proposed by Netten et al.10 was implemented. A small set of 50 images was used to train the dot detection algorithms and 2,000 independent images were used to evaluate the performance. The dot count in all 2,000 images was performed in


Fig. 5. Example of dot counting efficiency based on low dynamic range image (a & b) vs. high dynamic range image (c & d).

Table 1. Summarization of dot detection efficiency across manual analysis (truth), automated analysis of low dynamic range images (FISH), and of high dynamic range images (FISH-IDE).

              # of Images   # of Detected Cells   Dot 1 Detected   Dot 2 Detected
Truth         2000          6,482                 9,076            7,778
FISH          2000          4,731 (73%)           6,534 (72%)      5,211 (67%)
FISH-IDE      2000          6,125 (94.5%)         8,440 (93%)      7,544 (97%)

Fig. 6. Example of dot counting efficiency based on low dynamic range image (a & b) vs. high dynamic range image (c & d).

three ways: manually, applying the automated algorithm to the low dynamic range images, and applying the automated algorithm to the remapped high dynamic range images. Manual results are used as "ground truth". The result of the dot count is shown in table 1. To gain a better understanding of the performance increase due to the proposed approach, the types of errors that occur are also evaluated. Dot counts can be erroneous when one of the following errors is made: a single dot is split and counted as two or more; two or more overlapping dots are not segmented properly and get counted as one dot; dots are not detected or segmented and missed in the dot count; or dust or debris is identified and counted as false dots. Table 2 gives the percentage error for each type, as obtained from 3,000 nuclei that were detected in both of the automated methods. Figures 5 and 6 present two examples of the result of automated image analysis and dot detection based on the low dynamic range image and the remapped HDR image as input.

Table 2. Summarization of errors seen in dot detection using the automated system to analyze the low dynamic range images and the remapped high dynamic range images.

                      # of Nuclei   Split Dots   Overlapping Dots   Missed Dots   False Dots
Automated FISH        3000          0.8%         3.2%               0.9%          3.4%
Automated FISH-IDE    3000          0.2%         0.1%               0.1%          0.4%

5. Conclusion

In this paper we have presented a methodology to increase the efficiency of detecting fluorescent signals in FISH images by recovering the radiance map of the CCD camera. We further analyzed the distribution of each of the fluorescent signals and remapped the HDR image values to generate a high contrast image with compressed dynamic range. The remapping was based on a criterion that allowed for maximum detectability of the signals. This resulted in images where saturated values were attenuated while the weak signals were amplified. We implemented a simple dot counting image analysis system and tested the detection efficiency based on both the original images and the remapped HDR images. Results obtained on a dataset of 2000 images clearly show that dot counting specificity and sensitivity can be improved by recovering the radiance curve of the CCD


and generating a high contrast remapped HDR image before using any algorithm for dot counting. The increased efficiency of detecting dots in FISH images can have critical implications in a range of clinical and research applications, from preimplantation genetic diagnosis to translocation analysis.

References

1. D. Pinkel, T. Straume and J. Gray, Proc. National Academy of Science 83, 2934 (1986).

2. A. C. Carothers, Cytometry 16, 298 (1994).

3. K. R. Castleman and B. White, Bioimaging 3, 88 (1995).

4. R. E. Kibbelaar, F. Kok, E. J. Dreef, J. K. Kleiverda, C. J. Cornelisse and A. K. Raap, Cytometry 14, 716 (1993).

5. D. K. Griffin, L. J. Wilton, A. H. Handyside, R. M. Winston and J. D. Delhanty, Human Genetics 89, 18 (April 1992).

6. S. Munne, J. Grifo, J. Cohen and H. U. Weier, American Journal of Human Genetics 55, 150 (July 1994).

7. J. C. Harper, E. Coonen, F. C. S. Ramaekers, D. A. J. Delhanty, A. H. Handyside, R. M. L. Winston and A. H. N. Hopman, Human Reproduction 9, 721 (1994).

8. S. Munne, C. Márquez, C. Magli, P. Morton and L. Morrison, Molecular Human Reproduction 4, 863 (1998).

9. C. M. Conn, J. C. Harper, R. M. Winston and J. D. Delhanty, Human Genetics 102, 117 (January 1998).

10. H. Netten, I. T. Young, L. J. van Vliet, H. J. Tanke, H. Vroljik and W. C. Sloos, Cytometry 28, p. 110 (1997).

11. J. C. M. et al., Methods for CCD camera characterization, in SPIE Proceedings: Image Acquisition and Scientific Imaging Systems (San Jose, CA, 1994).

12. P. N. Rao, R. Hayworth, K. Cox, F. Grass and M. J. Pettenati, Prenatal Diagnostics 13, 233 (1993).

13. K. Klinger, G. Landes and D. S. D. et al., Am J Human Genetics 51, 55 (1992).

14. K. W. Klinger, Ann NY Acad Sci 731, 48 (1994).

15. M. T. Little, S. Langlois, R. D. Wilson and P. M. Lansdorp, Blood 89, 2347 (1997).

16. D. van Opstal, J. O. van Hemel and B. H. E. et al., Prenatal Diagnostics 15, 705 (1995).

17. J. Yan, E. Guilbault, J. Masse, M. Bronsard, P. DeGrandpre, J.-C. Forest and R. Drouin, Clinical Genetics 58, 309 (2000).

18. P. Debevec and J. Malik, Recovering high dynamic range radiance maps from photographs, in Proceedings of the 24th annual conference on Computer graphics and interactive techniques., (ACM Press/Addison-Wesley Publishing Co., 1997).

19. D. Alleysson and S. Süsstrunk, On adaptive non-linearity for color discrimination and chromatic adaptation, in IS&T First European Conference on Color in Graphics, Image, and Vision (Poitiers, France, 2002).

20. L. Meylan and S. Süsstrunk, Bio-inspired image enhancement for natural color images, in IS&T/SPIE Electronic Imaging 2004: The Human Vision and Electronic Imaging IX (San Jose, CA, 2004).

21. E. H. Land, National Academy of Sciences of the United States of America 83, 3078 (1986).

22. Z. Rahman, D. J. Jobson and G. A. Woodell, Journal of Electronic Imaging 13, 100 (January 2004).

23. J. Canny, IEEE Transactions on Pattern Analysis and Machine Intelligence 8, 679 (November 1986).


Intuitionistic Fuzzy C Means Clustering in Medical Image Segmentation

T. Chaira

E-mail: [email protected]

A. K. Ray

Department of ECE IIT Kharagpur

Midnapore, WB-721302, INDIA E-mail: [email protected]

O. Salvetti

ISTI-CNR, Via G. Moruzzi, 56124 Pisa, Italy

E-mail: [email protected]

This paper presents a novel intuitionistic fuzzy C means clustering method using intuitionistic fuzzy set theory. The clustering method has also been used for the identification of mammogram cysts. In this method, the hesitation degree is taken into account along with the membership degree, whereby the cluster centers may converge to a more desirable location than the cluster centers obtained by the fuzzy C means algorithm. Experimental results on mammogram images show the effectiveness of the proposed method in contrast to existing fuzzy C means algorithms.

Keywords: intuitionistic fuzzy set, fuzzy clustering, image segmentation, hesitation degree.

1. Introduction

Today most biomedical images arising from diverse imaging modalities such as x-ray, magnetic resonance imaging (MRI), computed tomography (CT) and so on are interpreted visually and qualitatively by radiologists. A major area of research in the automated diagnosis of such images involves the detection of the region of interest, which includes segmentation of objects embedded in uncertainty. The uncertainty may include noise, vagueness in class definitions and imprecise gray levels. The membership degree for a given sample image depends on the distance between the image patterns and a cluster center, and denotes the degree to which a pattern belongs to a particular cluster. However, since membership degrees are imprecise and vary with a person's choice, there is some hesitation degree, which arises from the lack of precise knowledge.

Atanassov in 1986 introduced the concept of the intuitionistic fuzzy set, in which the non-membership degree is not equal to 1 − membership as in an ordinary fuzzy set. This is due to some hesitation present in assigning the membership degree. The first fuzzy method to segment the regions of an image by clustering was introduced by Bezdek.1 The algorithm requires an a priori definition of the number of classes, and its result depends on this number. Relating to past work, there have been many methods using fuzzy and crisp theory. Ferahta et al.2 gave the idea of optimal classes, where the number of classes is not predefined; they used a second criterion function related to Shannon's entropy. Rhee and Hwang3 introduced the idea of using type 2 fuzzy sets in constructing the clustering algorithm. According to them, the cluster centers using the type 2 fuzzy algorithm converge to more desirable locations than in the type 1 fuzzy algorithm. As intuitionistic fuzzy set theory is just beginning to develop in image processing, there has been no work relating to clustering in the literature. The objective of this paper is to develop a novel method, an intuitionistic fuzzy C means algorithm, to segment medical images. The construction of our algorithm requires the following steps:

(1) creating an intuitionistic fuzzy set using Sugeno's intuitionistic fuzzy generator;
(2) modifying the objective criterion;
(3) updating the cluster centers using the intuitionistic fuzzy set.

The paper is organized as follows: Section 2 presents the preliminaries of intuitionistic fuzzy sets, while Section 3 describes the


construction of an intuitionistic fuzzy set using Sugeno's generator. Section 4 provides the intuitionistic fuzzy C means algorithm and Section 5 introduces a new criterion function. Section 6 presents the results and discussion, and finally conclusions are drawn in Section 7.

2. Preliminaries

Atanassov's4 intuitionistic fuzzy set emerges from the simultaneous consideration of membership values $\mu$ and non-membership values $\nu$ of the elements of a set. An intuitionistic fuzzy set $A$ in $X$ is given by

$$A = \{\langle x, \mu_A(x), \nu_A(x)\rangle \mid x \in X\}$$

where $\mu_A : X \to [0,1]$ and $\nu_A : X \to [0,1]$, with the condition $0 \le \mu_A(x) + \nu_A(x) \le 1$, where $\mu_A(x)$ and $\nu_A(x)$ are the membership and non-membership degrees of an element $x$ to the set $A$. When $\nu_A(x) = 1 - \mu_A(x)$ for every $x$ in set $A$, the set $A$ becomes a fuzzy set. For all intuitionistic fuzzy sets, Atanassov also indicated a hesitation degree, $\pi_A(x)$, which arises due to lack of knowledge about the membership degree of each element $x$ in $A$ and is given by $\pi_A(x) = 1 - \mu_A(x) - \nu_A(x)$. Obviously, $0 \le \pi_A(x) \le 1$.

Due to the hesitation degree, the membership values lie in the interval $[\mu_A(x_i),\ \mu_A(x_i) + \pi_A(x_i)]$.

3. Construction of intuitionistic fuzzy set

In order to construct Atanassov's intuitionistic fuzzy set (IFS) from a fuzzy set, intuitionistic fuzzy generators are used. Following the definition of intuitionistic fuzzy generators by Bustince et al.,5 a function $\phi : [0,1] \to [0,1]$ is called an intuitionistic fuzzy generator if $\phi(x) \le 1 - x$ for all $x \in [0,1]$, with $\phi(0) \le 1$ and $\phi(1) = 0$.

Bustince et al. obtained an intuitionistic fuzzy generator from Sugeno's generating function, which is given as:

$$f_\lambda(x) = \frac{1}{\lambda}\ln(1 + \lambda x), \quad \text{with } \lim_{\lambda \to 0} f_\lambda(x) = x$$

For $\lambda > 0$, Sugeno's intuitionistic fuzzy generator is

$$\phi_\lambda(x) = \frac{1 - x}{1 + \lambda x}, \quad \lambda \ne 0$$

Then $\phi_\lambda(1) = 0$ and $\phi_\lambda(0) = 1$. This Sugeno intuitionistic fuzzy generator has been used to create an intuitionistic fuzzy set.

To construct an intuitionistic fuzzy set from a fuzzy set, the intuitionistic fuzzy generator is applied to the membership function $\mu_A(x)$ of the elements $x$ of the fuzzy set to determine the non-membership values. With the help of Sugeno's intuitionistic fuzzy generator, the IFS becomes:

$$A_{IFS} = \left\{\left\langle x,\ \mu_A(x),\ \frac{1 - \mu_A(x)}{1 + \lambda\,\mu_A(x)}\right\rangle \,\Big|\, x \in X\right\}, \quad \lambda > 0 \qquad (1)$$

The hesitation degree from (1) is:

$$\pi_A(x) = 1 - \mu_A(x) - \frac{1 - \mu_A(x)}{1 + \lambda\,\mu_A(x)} \qquad (2)$$

For every IFS, $0 \le \mu_A(x) + \nu_A(x) \le 1$ holds. Since the denominator $1 + \lambda\,\mu_A(x)$ is greater than 1, $\frac{1 - \mu_A(x)}{1 + \lambda\,\mu_A(x)}$ is less than $1 - \mu_A(x)$ for all $x \in X$. So the verification that $A_{IFS}$ is an IFS is as follows:

$$\mu_A(x) + \nu_A(x) = \mu_A(x) + \frac{1 - \mu_A(x)}{1 + \lambda\,\mu_A(x)} \le \mu_A(x) + 1 - \mu_A(x) \le 1$$

This Sugeno intuitionistic fuzzy generator has been used to create an intuitionistic fuzzy image for image clustering.
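Constructing the intuitionistic fuzzy image from a fuzzy membership image is direct; a small numpy sketch of Eqs. (1) and (2) follows, with lambda = 2 as used in the experiments below.

```python
import numpy as np

def intuitionistic_fuzzify(mu, lam=2.0):
    """From membership values mu in [0,1], compute the non-membership nu
    via the Sugeno intuitionistic fuzzy generator (Eq. 1) and the
    hesitation degree pi (Eq. 2)."""
    nu = (1.0 - mu) / (1.0 + lam * mu)   # Sugeno generator
    pi = 1.0 - mu - nu                   # hesitation degree
    return nu, pi
```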

4. Intuitionistic Fuzzy C-means Clustering

The conventional fuzzy C means algorithm clusters the feature vectors by searching for local minima of the following objective function:

$$J_m(U, V; X) = \sum_{i=1}^{c}\sum_{k=1}^{n} u_{ik}^m D_{ik}$$

where $D_{ik}$ is some similarity measure between $v_i$ (a cluster center) and $x_k$ (a point in the pattern), or between the attribute vectors and the cluster centers of each region; $c$ is the number of clusters and $n$ is the number of data points. Minimization of $J_m$ is based on suitable selection of $U$ and $V$ using an iterative process through the following equations:

$$u_{ik} = \left[\sum_{j=1}^{c}\left(\frac{D_{ik}}{D_{jk}}\right)^{\frac{1}{m-1}}\right]^{-1}, \qquad v_i = \frac{\sum_{k=1}^{n} u_{ik}^m x_k}{\sum_{k=1}^{n} u_{ik}^m} \quad \forall\, i, k.$$

In order to incorporate the intuitionistic property into the conventional fuzzy clustering algorithm, the cluster centers are to be updated. Initially the hesitation degree is calculated using Eq. (2). Due to the hesitation degree, the membership value is updated as

$$u_{ik}^* = u_{ik} + \pi_{ik}$$

where $u_{ik}$ denotes the conventional fuzzy membership.

Now, using the intuitionistic membership matrix, the cluster centers, as in conventional fuzzy clustering, may be given as:

$$v_i^* = \frac{\sum_{k=1}^{n} u_{ik}^{*m} x_k}{\sum_{k=1}^{n} u_{ik}^{*m}} \qquad (3)$$

Using Eq. (3) the cluster centers are updated, and the membership matrix is also updated. At every iteration the cluster centers and the membership matrix are updated. The algorithm stops when the updated membership matrix and the previous one satisfy

$$\max_{i,k}\left|U_{ik}^{new} - U_{ik}^{prev}\right| < \epsilon$$

where $\epsilon$ is a user defined value, selected here as $\epsilon = 0.03$.
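Putting the pieces together, a minimal version of the iteration might look as follows; the fuzzifier m = 2, the Euclidean distance and the random initialisation are standard FCM choices assumed here, and the entropy term of the next section (which does not enter the update equations directly) is omitted.

```python
import numpy as np

def ifcm(X, c, lam=2.0, m=2.0, eps=0.03, max_iters=100, seed=0):
    """Intuitionistic fuzzy C-means sketch. X: (n, d) feature vectors,
    c: number of clusters. Memberships are augmented by the hesitation
    degree (u* = u + pi, with pi from the Sugeno generator, Eq. 2)
    before the cluster centres are updated (Eq. 3)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.dirichlet(np.ones(c), size=n).T          # (c, n), columns sum to 1
    for _ in range(max_iters):
        pi = 1.0 - U - (1.0 - U) / (1.0 + lam * U)   # hesitation (Eq. 2)
        U_star = U + pi                              # intuitionistic membership
        W = U_star ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)   # centre update (Eq. 3)
        # standard FCM membership update from squared distances
        D = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1) + 1e-12
        inv = D ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0, keepdims=True)
        if np.abs(U_new - U).max() < eps:            # stop: max change < eps
            U = U_new
            break
        U = U_new
    return V, U
```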

5. Introduction of new criterion function

In the proposed method, the criterion function contains two terms: the first is the general term, as in the conventional clustering algorithm, which is an intra-class distance; the second term aims at maximizing the good points in the class. The goal is to minimize the entropy of the histogram of the image. The second function is:

$$\sum_{i=1}^{c} \pi_i^* e^{1 - \pi_i^*}$$

So the final criterion function that is minimized becomes:

$$J = \sum_{i=1}^{c}\sum_{k=1}^{n} u_{ik}^{*m} D_{ik} + \sum_{i=1}^{c} \pi_i^* e^{1 - \pi_i^*}$$

where $\pi_i^*$, interpreted as the probability (average hesitation) of the $i$th class, is $\pi_i^* = \frac{1}{N}\sum_{k=1}^{N} \pi_{ik}$, $k \in [1, N]$.

6. Results and Discussion

Tests have been performed on several medical mammographic images. The images and the image descriptions have been obtained from http://www.imaginis.com/breasthealth/ultrasound_images2.asp. The value $\lambda = 2$ in Eq. (2) has been chosen in the experimentation. Some of the results are shown below. The features of each image are energy, standard deviation and entropy. The proposed method has been compared with the conventional fuzzy C means and Type 2 fuzzy clustering algorithms. The image in Fig. (1) shows a breast cyst with tiny accumulations of fluid that are the most common cause of benign (non-cancerous) breast lumps in women. Complex cysts can be filled with debris and may sometimes require aspiration to confirm that they are indeed benign cysts. Both single and multiple cysts are very common. In the figure, two adjacent breast masses are shown: one a debris-filled cyst, the other a simple cyst. From the results it is observed that with the intuitionistic fuzzy clustering algorithm, the debris in the cysts is more clearly detected than with conventional fuzzy C means clustering, where more solid-particle areas are seen. Type 2 fuzzy clustering is somewhat similar to the proposed method, but the solid particles are not shown clearly and the outer boundary is not continuous.

Fig. (2) shows a breast mass, which contains a group of breast cells that are clustered together more densely than the surrounding breast tissue. Masses can be palpable (able to be felt) or nonpalpable (unable to be felt) and may be benign or malignant. The image contains large cyst with layered debris and a solid component. From the result of the proposed method using intuitionistic fuzzy C means clustering algorithm, the solid components or the debris inside



Fig. 1. (i) Image, (ii) conventional fuzzy cluster, (iii) type 2 cluster, (iv) intuitionistic fuzzy cluster.


Fig. 2. (i) Image, (ii) conventional fuzzy cluster, (iii) type 2 cluster, (iv) intuitionistic fuzzy cluster.


Fig. 3. (i) Image, (ii) conventional fuzzy cluster, (iii) type 2 cluster, (iv) intuitionistic fuzzy cluster.

the cyst are properly detected whereas with conventional fuzzy c means clustering and type 2 fuzzy clustering, the solid particles in the cyst are shown with more prominence.

The image in Fig. (3) shows a solid irregular breast mass with calcification (calcium deposits). Ultrasound does not reliably image calcifications in all cases. From the results it is observed that with intuitionistic fuzzy C means the calcification area is clearly detected, whereas with conventional fuzzy C means clustering the solid particles in the cyst are shown over more areas. In the type 2 fuzzy cluster, the calcification areas are somewhat better delineated than with conventional fuzzy C means but are still not properly detected, especially at the upper right and the lower right portions of the image. The area of micro-calcification is well delineated using the proposed approach based on intuitionistic fuzzy clustering. The spreading of the calcification is also visible with more prominence in the proposed method.


7. Conclusion

This paper provides a new approach to fuzzy clustering using intuitionistic fuzzy set theory. The method takes into account the hesitation in assigning the membership degree. The algorithm has been tested on mammogram images for the detection of micro-calcification, and the results have been observed to be better than those of the Type 2 fuzzy and conventional fuzzy C means algorithms.

References

1. J. C. Bezdek, L. O. Hall and L. P. Clarke, Review of MR image segmentation techniques using pattern recognition, Med Phys 20 (1993) pp. 1033-1048.

2. N. Ferahta et al., New fuzzy clustering algorithm applied to RMN segmentation, IEEE Trans. on Engineering, Computing and Technology V12 (2006) pp. 9-13.

3. F. C. H. Rhee and C. Hwang, A Type-2 fuzzy c means clustering algorithm, Proc. in Joint 9th IFSA World Congress and 20th NAFIPS International Conference 4 (2001) pp. 1926-1929.

4. K. T. Atanassov, Intuitionistic fuzzy sets, Theory and Applications, Series in Fuzziness and Soft Computing (Phisica-Verlag, 1999).

5. H. Bustince et al, Intuitionistic fuzzy generators Application to intuitionistic fuzzy complementation, Fuzzy sets and systems 114(2000) pp.485-504.


Remote Sensing Image Classification: A Wavelet-Neuro-Fuzzy Approach

Saroj K. Meher, B. Uma Shankar and Ashish Ghosh

Machine Intelligence Unit, Indian Statistical Institute
203 B. T. Road, Kolkata 700108, INDIA
E-mail: {saroj.t, uma, ash}@isical.ac.in

The present article proposes a wavelet-neuro-fuzzy (WNF) system for classification of land covers of remote sensing images. This classifier incorporates a new architecture for neuro-fuzzy (NF) systems that expands the input space of conventional NF (CNF) systems. The performance of this new NF classifier is compared with the CNF and the conventional multi-layer perceptron (MLP) with the original multispectral features of remote sensing images. Experimental study demonstrated the superiority of this NF classifier. Incorporation of wavelet features into this classifier improved its performance. Particularly, with the biorthogonal 3.3 (Bior3.3) wavelet the proposed NF classifier outperformed all others. Results are evaluated qualitatively and quantitatively.

Keywords: Remote sensing; pattern classification; fuzzy set; neural networks; neuro-fuzzy; wavelet transform

1. Introduction

Classification of land covers in remote sensing images is a complex task because of low illumination, low spatial resolution of sensors, rapid changes in environmental conditions and the overlapping nature of regions like vegetation, soil, water and concrete structures.1 Moreover, the gray value assigned to a pixel is an average reflectance of the different types of land covers present in the corresponding pixel area. Therefore, a pixel may represent more than one class with varying degrees of belonging. Thus assigning a unique class label to a pixel with certainty is one of the major problems. Conventional methods cannot deal with this imprecise representation of geological information. Fuzzy set theory introduced by Zadeh2 provides a useful technique to allow a pixel to be a member of more than one category or class with graded membership. The significance of fuzzy set theory in the realm of pattern classification3 is justified in different areas including representation of graded membership values of pixels in remote sensing imagery for various land cover classes.3,4 Neural networks (NNs),5 on the other hand, aim to emulate the biological nervous system to make collective decision making possible. Therefore, an integration of neural and fuzzy systems, known as the neuro-fuzzy (NF) technique, enables one to design intelligent decision making systems. Many research efforts have already been made for the design of NF systems.6-9

Further, remotely sensed images comprise information over a large range of frequencies that changes over different regions. They have both spectral features with correlated bands and spatial features correlated within the same band. An efficient utilization of this spectral and spatial (contextual) information can improve the classification performance compared to methods based on non-contextual information. Research efforts have been made to take advantage of neighboring pixel (textural) information1 computed from the gray level co-occurrence matrices10 for classification purposes. These methods are computationally expensive as they require an estimation of autocorrelation parameters and transition probabilities. Also, the texture elements are not easy to quantify. Later on, Gaussian Markov random fields (GMRF) and Gibbs random fields were proposed to characterize textures.1 The above-mentioned conventional statistical approaches to texture analysis are restricted to the analysis of spatial interactions over relatively small neighborhoods in a single scale. Recently the wavelet transform (WT)11 has received much attention as a promising multi-scale tool for texture analysis of remote sensing images, both in the spatial and frequency domains.12

In the present study we have explored the advantages of both wavelet features (WF) and a new NF architecture for classification of land covers of remote sensing images by proposing a wavelet-neuro-fuzzy (WNF) classifier. The number of input nodes of the NN of this NF classifier is equal to the product of the number of classes and the number of features of the input patterns. The number of nodes in the output layer remains the same as the number of classes. The proposed NF classifier is different from other existing NF classifiers.7-9,13-15 In most of the existing NF methods each feature of the input pattern vector is encoded with three linguistic variables: low, medium and high. Thus the number of input nodes of the network is equal to 3 x (number of input features). The output nodes of the networks are provided with fuzzy class labels (computed from the training data) for learning. On the contrary, in the proposed NF method each feature of the input pattern is encoded with a sequence of membership values where each element of the sequence represents the degree of belonging of that feature to the corresponding class. The number of input nodes of the network thus becomes equal to the product of the number of input features and the number of classes present in the data set. The output nodes are provided with the existing class labels of the input patterns during training.

The performance of the proposed NF classifier is compared with that of a neural network (i.e., MLP)5 and a conventional NF (CNF)16 classifier (where the number of input nodes of the NN is equal to the number of features of the input patterns). Use of WF in these classifiers provided better accuracy compared to the original features. Among the various wavelets used in the present study, the biorthogonal 3.3 (Bior3.3) wavelet is found to be the most suitable for the present problem.

2. WT based feature extraction

Many research efforts have been made to deal with non-stationary signals (e.g., remote sensing images). Efforts have been made to overcome the disadvantages of the Fourier transform,17 which assumes the signal to be stationary. The short time Fourier transform (STFT) and the wavelet transform (WT)11,17,18 are examples of such transforms. However, the selection of the window function is a major problem in STFT, which is eliminated in WT. WT tries to identify both the scale and space information of an event simultaneously, which makes it more useful for analysis of remote sensing images.

WT is used for feature extraction from the multi-spectral input images. In this process, each spectral (band) image is decomposed using 2-D WT which provides four sub-images (wavelet coefficients) from an input image: one in the low frequency range (LL) and three in the high frequency range (LH, HL and HH). After inverse WT (called reconstruction) of these sub-images we get wavelet features (WF) of the original image. The WF thus obtained are cascaded as shown in Fig. 1 and are used for classification.
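The decomposition-and-reconstruction step just described can be sketched as follows with PyWavelets; the library choice, the function name wavelet_features and the cropping of the reconstructed sub-images are our assumptions for illustration, not the authors' implementation.

    import numpy as np
    import pywt

    def wavelet_features(band, wavelet="bior3.3"):
        # 2-D DWT of one spectral band: LL plus the LH, HL and HH details.
        cA, (cH, cV, cD) = pywt.dwt2(band, wavelet)
        feats = []
        # Reconstruct each sub-band separately (inverse WT with the other
        # three coefficient sets suppressed) to obtain four feature images.
        for coeffs in [(cA, (None, None, None)),
                       (None, (cH, None, None)),
                       (None, (None, cV, None)),
                       (None, (None, None, cD))]:
            rec = pywt.idwt2(coeffs, wavelet)
            feats.append(rec[:band.shape[0], :band.shape[1]])
        return np.stack(feats, axis=-1)

    # The reconstructed features of all spectral bands are then cascaded
    # (cf. Fig. 1) before being fed to a classifier.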

Fig. 1. Classification procedure

3. Classification techniques

Here we describe the three classifiers used for the present investigation.

Neural classifier: Both the original features and the WF (generated by WT) are used as input to an MLP. The MLP used in the present experiment has three layers, namely input, one hidden and output. The number of input-layer nodes is equal to the number of features of the data patterns and the number of output-layer nodes is equal to the number of classes. The MLP uses the back-propagation (BP) learning algorithm5 for weight updating. The BP algorithm reduces the sum of squared errors, called the cost function (CF), between the actual and desired outputs of the output-layer neurons in a gradient descent manner.

Neuro-fuzzy classifier: In the present study a conventional NF technique similar to the architecture used by Lee et al.16 is considered. The NF classifier is based on the MLP described in the previous paragraph, with alterations in operation to accommodate fuzzy data. The NF system can be summarized as a three step process. At first the data pattern is preprocessed using an appropriate membership function (MF) to obtain the corresponding degree of belonging of each feature of a pattern to a particular class. These membership values of the features are then supplied as input to the NN in the next step, which produces fuzzy output. Finally, the outputs of the NN are defuzzified using a MAX (maximum) operation to assign a class label to the input pattern.

Proposed neuro-fuzzy classifier: The new architecture of the NF classification system extracts feature-wise fuzzy information of all classes. This classifier (Fig. 2) works in three steps and is described briefly in the following paragraphs.



Fig. 2. Neuro-fuzzy classification model

Fuzzification: The first step of the system performs the fuzzification of input patterns using a π-type MF2 that generates the feature-wise degree of belonging of the input pattern to the different classes. The membership $f_{d,c}(x_d)$ of the feature value $x_d$ of a pattern $\mathbf{x}$ thus generated expresses the degree of belonging of the $d$th feature to the $c$th class, where $\mathbf{x} = [x_1, x_2, \ldots, x_d, \ldots, x_D]^T$, with $d = 1, 2, \ldots, D$ and $c = 1, 2, \ldots, C$.

Thus for a multi-featured pattern $\mathbf{x}$, the membership matrix after the fuzzification process can be expressed as:

$$F(\mathbf{x}) = \begin{bmatrix} f_{1,1}(x_1) & f_{1,2}(x_1) & \cdots & f_{1,C}(x_1) \\ f_{2,1}(x_2) & f_{2,2}(x_2) & \cdots & f_{2,C}(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ f_{D,1}(x_D) & f_{D,2}(x_D) & \cdots & f_{D,C}(x_D) \end{bmatrix}$$

This fuzzified pattern matrix is used as input to a NN in the second step. The NN performs the classification task.
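A minimal sketch of this fuzzification step is given below, using Zadeh's π-type function; the class-wise centres and radii (e.g., estimated from training data) and all names are our illustrative assumptions.

    import numpy as np

    def pi_mf(x, c, r):
        # Zadeh's pi-type membership function centred at c with radius r.
        z = abs(x - c) / r
        if z >= 1.0:
            return 0.0
        if z <= 0.5:
            return 1.0 - 2.0 * z ** 2
        return 2.0 * (1.0 - z) ** 2

    def fuzzify(x, centres, radii):
        # Returns the D x C membership matrix F(x); entry (d, c) is the
        # degree of belonging of feature x[d] to class c. Flattened, the
        # D*C values feed the input layer of the NN.
        D, C = len(x), centres.shape[1]
        F = np.empty((D, C))
        for d in range(D):
            for c in range(C):
                F[d, c] = pi_mf(x[d], centres[d, c], radii[d, c])
        return F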

Defuzzification operation: The output of the NN provides the degrees of belonging of a pattern to the different classes, which can further be used for higher level image analysis. If one needs to assign a hard class label instead of a fuzzy label, a MAX operation is more appropriate for defuzzification. In this step the pattern is classified to the class c corresponding to the highest output node value. In case of ties, the tie is resolved arbitrarily.

Fig. 3. Original (a) IRS-1A (band-4) and (b) SPOT (band-3) images

4. Experimental results and analysis

Although experiments were performed on a number of images, we present here the results of two remote sensing images. The images cover an area around the city of Calcutta having six major land cover classes namely, pure water (PW), turbid water (TW), concrete area (CA), habitation (HAB), vegetation (VEG) and open spaces (OS). These images (Fig. 3) are obtained from IRS4 (Indian remote sensing) and SPOT4 (Systeme Pour d'Observation de la Terre) satellites. The IRS image has a spatial resolution of 36.25m x 36.25m and has four spectral bands,


Fig. 4. Classified IRS-1A image using new NF with (a) original features, and (b) Bior3.3 wavelet features

whereas the SPOT image has a resolution of 20m x 20m with three spectral bands.

Selection of training samples is made according to the ground truth of the land cover regions. After training, the classifier is used for classifying the land covers of the whole image (i.e., the unknown regions).

We have used wavelets from different groups (Daubechies (Db), biorthogonal (Bior), Coiflets, Symlets).18 However, results are given for four wavelets, namely Db3, Db6, Bior3.3 and Bior3.5, as their performances are found to be better than the others. These wavelets are implemented with the multiresolution scheme given by Mallat.11 During the decomposition of an image by WT, four sub-sampled versions of the original image are obtained. This decomposition can be extended to more than one level, which can provide more information. To have an objective evaluation, we computed the average entropy, providing a measure of the information content of the image. We found that the entropy value does not change significantly after two levels of decomposition.

Two performance measures, namely the β index4 of homogeneity and the Davies-Bouldin (DB) index19 of compactness and separability, are used. β is defined4 as the ratio of the total variation and the within-class variation. For a given image and a given number of classes, the higher the homogeneity within the classes, the higher would be the β value. The DB index is based on the evaluation of some measure of dispersion within the classes and the distance between the prototypes of the classes. The smaller the DB value, the better is the partitioning.19

4.1. Classification of images

The six land covers of both the images are classified using the above mentioned three classifiers. From the visualization point of view, it is observed that the new NF classification method performs better in classifying the land covers (i.e., segregating different regions) compared to the other two. Hence we have shown only the classified images using this method with the original spectral features in Figs. 4a and 5a. Further, the performance of all the classifiers improves with WFs, and we found that the Bior3.3 wavelet is the best for the current problem. Thus we have shown the classified images using the proposed WNF (with Bior3.3 wavelet) classifier in Figs. 4b and 5b, where various land cover classes and known objects are found to be clearly visible. These objects are more or less visible in the case of the other classifiers and with different WFs. But the segregation of land covers and some known structures is better for the WNF classifier. This finding is supported by a quantitative evaluation.

In Table 1 we put the values of the validity measures for different methods. From the table we find that the new NF classification method individually provides the maximum β and minimum DB values, which supports the superiority of the method. The indices improve further for all the classifiers with WF (Table 1). Bior3.3 is found to be a more suitable wavelet for land cover classification. With the use of WFs, the land cover classes became more clear and structures are found to be more prominent. Considering all cases we can infer that the proposed WNF classifier outperforms the other classifiers.

Table 1. β and DB values for different classifiers

Classification method    IRS image        SPOT image
                         β       DB       β       DB
Training patterns        9.42    0.56     9.33    1.49
MLP                      7.15    0.93     7.05    3.35
Conventional NF          7.75    0.81     7.69    2.45
New NF                   8.61    0.70     8.55    1.98
Db3 + MLP                7.69    0.90     7.50    3.19
Db6 + MLP                7.18    0.92     7.11    3.34
Bior3.3 + MLP            7.78    0.89     7.58    3.18
Bior3.5 + MLP            7.20    0.92     7.10    3.34
Db3 + CNF                8.13    0.78     7.98    2.34
Db6 + CNF                7.77    0.80     7.72    2.40
Bior3.3 + CNF            8.22    0.77     8.18    2.82
Bior3.5 + CNF            7.89    0.80     7.73    2.41
Db3 + new NF             8.89    0.68     8.77    1.91
Db6 + new NF             8.63    0.69     8.57    1.97
Bior3.3 + new NF         8.91    0.67     8.82    1.90
Bior3.5 + new NF         8.62    0.69     8.59    1.97

Fig. 5. Classified SPOT image using new NF with (a) original features, and (b) Bior3.3 wavelet features

It needs to be mentioned here that the computational complexity of the proposed NF method is higher compared to the existing methods if the data set has more than 3 classes. However, this increase in cost is marginal in comparison with the improvement in classification accuracy.

5. Conclusion

We have proposed a wavelet-neuro-fuzzy (WNF) classifier for land cover classification of remote sensing images. This has been compared with conventional NF and MLP based classifiers and found to be superior. It is observed that all the classifiers are providing better results with wavelet features compared to the original feature based classification. Bior3.3 wavelet based features are found to be more suitable. Particularly, Bior3.3 with new NF classifier (WNF) outperformed all others. The performances of the classifiers were evaluated visually and quantitatively.

Acknowledgments

The authors would like to thank DST, GoI and the University of Trento, Italy, for sponsoring the collaborative project titled Advanced Techniques for Remote Sensing Image Processing.

References

1. B. Tso and P. M. Mather, Classification Methods for Remotely Sensed Data (Taylor and Francis, 2001).

2. L. A. Zadeh, Information and Control 8, 338 (1965).

3. L. I. Kuncheva, Fuzzy Classifier Design (Springer-Verlag, 2000).

4. S. K. Pal et al., International Journal of Remote Sensing 21, 2269 (2000).

5. S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd edn. (Prentice Hall, 1998).

6. S. K. Pal and A. Ghosh, International Journal of Systems Science 27, 1179 (1996).

7. S. K. Pal and S. Mitra, Neuro-Fuzzy Pattern Recognition: Methods in Soft Computing (John Wiley & Sons, USA, 1999).

8. S. Abe, Pattern Classification: Neuro-Fuzzy Methods and Their Comparison (Springer-Verlag, 2001).

9. J. S. R. Jang et al., Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence (Pearson Education, 1996).

10. R. M. Haralick and K. S. Shanmugam, Remote Sensing of Environment 3, 3 (1974).

11. S. Mallat, A Wavelet Tour of Signal Processing, 2nd edn. (Academic Press, 1999).

12. J. Yu and M. Ekstrom, Pattern Recognition 36, 889 (2003).

13. V. Boskovitz and H. Guterman, IEEE Transactions on Fuzzy Systems 10, 247 (2002).

14. P. Gamba and F. Dellacqua, International Journal of Remote Sensing 24, 827 (2003).

15. A. Baraldi et al., IEEE Transactions on Geoscience and Remote Sensing 39, 994 (2001).

16. S. G. Lee et al., Proc. IEEE International Conference on Fuzzy Systems 2, 1063 (1999).

17. G. Strang and T. Nguyen, Wavelets and Filter Banks (Wellesley-Cambridge Press, 1996).

18. I. Daubechies, Ten Lectures on Wavelets (SIAM, 1992).

19. D. L. Davies and D. W. Bouldin, IEEE Transactions on Pattern Analysis and Machine Intelligence 1, 224 (1979).


PART H

Multimedia Object Retrieval


An Efficient Cluster Based Image Retrieval Scheme Using Localized Texture Pattern

Saumen Mandal and Sanjoy Kumar Saha

CSE Department, Jadavpur University
Kolkata - 700 032, India
E-mail: [email protected], [email protected]

Amit Kumar Das

CST Department, Bengal Engineering and Science University
Howrah - 711 103, India
E-mail: [email protected]

Bhabatosh Chanda

Electronics and Communication Science Unit, Indian Statistical Institute
Kolkata - 700 108, India

E-mail: [email protected]

We have designed and developed a CBIR system based on texture features, working with a database consisting of various texture images. An image is divided into a number of blocks. Based on its texture pattern, each block is replaced by a texture value and a 'texture image' corresponding to the original image is obtained. A set of descriptors obtained from this 'texture image' describes the image. Finally, a simple but novel cluster based retrieval scheme is presented which is free from the drawbacks of the k-means algorithm. Experiments show the effectiveness of the texture features and the clustering scheme.

Keywords: CBIR; clustering; texture co-occurrence matrix; texture pattern moment

1. Introduction

Texture is a feature that has been extensively explored by various research groups. Haralick et al.1 proposed the co-occurrence matrix representation of texture features, from which meaningful statistics are extracted. Smith and Chang2 proposed some statistics (e.g., mean and variance) computed from the wavelet sub-bands. A number of researchers have described texture using different forms of wavelet transformation.3,4 Fournier et al.5 have used Gabor filters. Kaplan et al.6 used fractal dimension as a measure of texture property. Tamura et al.7 presented visually meaningful texture properties, like coarseness, contrast, directionality, line-likeness etc. Sharing a similar view, Liu and Picard8 used periodicity, directionality and randomness as texture features. Thus, researchers have tried a wide variety of texture descriptors.

In the case of retrieval, precision and response time are the two major issues. Corresponding to the query image submitted by the user, similar images are retrieved from the database. As the similarity measure is a function of the features describing the image, the capability of the features has an impact on precision. Apart from that, the retrieval scheme also has an important role. In the case of linear search, all the images in the database are compared. Thus, in terms of precision, performance may be good. But, for a database of large volume the response time will be prohibitive. As an optimization between precision and response time, several indexing schemes have come up.

A number of tree based indexing schemes9,10 have been proposed by researchers. In these schemes, the distance space is divided into a number of spatial domains. But, they suffer from the curse of dimensionality. In refs. 3 and 11 the database is organised on the basis of the distance with respect to one or more key/pivot elements. In these schemes also, efficiency falls as the dimensionality of the features increases. A k nearest neighbour graph (knn graph) based scheme has been proposed by Sebastian and Kimia,12 where each node represents a database element and is connected to its k nearest neighbours. But, maintaining the graph for a large database is not trivial.

The basic purpose of indexing is to reduce the search space so that retrieval becomes fast. Clustering techniques can also serve the same purpose. Clustering is a tool for exploring the underlying structure of a given data set. Based on some measure of similarity, a set of rules is specified to assign the data elements to a cluster domain.13-17 Thus, the elements in the database are distributed into a number of cluster domains and the search operation is carried out within a few clusters instead of over the whole data set.

Distance based, density based and hierarchical clustering are the major approaches. The K-means algorithm17,18 is one of the most widely used distance based algorithms. Initial K cluster centres are provided and iteratively data points are assigned to the closest cluster centres. The centres are then recalculated. This goes on till convergence is met. K-medoids and CLARANS19 also follow a similar partitioning scheme with some modification. In the density based approach, for each point of a cluster the density of the data points in the neighbourhood has to exceed a threshold. DBSCAN20 and OPTICS21 are examples of density based algorithms. But, setting the density threshold is difficult and density based clustering is comparatively slow. In hierarchical clustering algorithms, the data points are decomposed into several levels of partitions which are represented by dendrograms. BIRCH22 and CURE23 follow this approach. The prohibitive cost of dendrogram creation is a disadvantage of this approach.

This paper is organized as follows. In section 2, we describe the formation of the texture co-occurrence matrix and the computation of the texture features including the texture pattern moment. It is followed by the description of the proposed clustering scheme in section 3. Experimental results are presented in section 4 and, finally, conclusions are drawn in section 5.

2. Texture Feature

We perceive texture as repetitive or quasi-repetitive patterns. However, as the gray-level co-occurrence matrix1 captures only the co-occurrence of the intensity values, it cannot reflect the repetitive nature of the patterns in the true sense. In order to overcome this limitation, we present a texture co-occurrence matrix24 for describing the texture of the image.

2.1. Texture Co-occurrence Matrix

Usually a small patch of finite area of an image is required to feel or measure the local texture value. The smallest region for such a purpose could be a 2 x 2 block. So, in order to compute the texture co-occurrence matrix, the intensity image is divided into blocks of size 2 x 2. Then the gray levels of the block are converted to binary by thresholding at the average intensity. The 2 x 2 binary pattern obtained this way provides an idea of the local texture within the block. By arranging this pattern in raster order a binary string is formed and the corresponding decimal equivalent is its texture value (ti). Thus we get 15 such texture values, as a block of all 1's does not occur.

A problem of this approach is that a smooth intensity block and a coarse textured block may produce the same binary pattern and hence the same texture value. To solve this problem, all such smooth blocks (i.e. those whose intensity variance is less than a threshold) are assigned the texture value 0.

The texture values of the blocks that we have obtained represent the texture pattern quite effectively. But, they do not reflect the contrast information of the blocks. To get rid of this problem, the texture value of a block is modified to incorporate the contrast information. Thus, $T_i$, the new texture value of a block, is computed as follows:

$$T_i = t_i \times 16 + S_i$$

where $S_i$ is the contrast in the $2 \times 2$ block. In our experiment,

$$S_i = (g_{\mu 1} - g_{\mu 0})/16$$

where $g_{\mu 1}$ is the average value of the pixels in the block marked as 1 and $g_{\mu 0}$ is the same for the pixels marked as 0. Thus, $g_{\mu 1} - g_{\mu 0}$ represents the contrast of the block. As the old value is multiplied by 16 and the contrast is divided by 16, it is ensured that the contrast information does not dominate over the contribution of the texture pattern and that blocks with different texture patterns have different values.

Thus, we get a scaled image whose height and width are half of those of the original image and whose pixel values are the texture values. This new image may be considered as the image representing the texture of the original image and is referred to as the 'texture image'. Once the 'texture image' is obtained, the co-occurrence matrix corresponding to it is computed. Care is taken to make it independent of translation, rotation and size.
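A runnable sketch of the construction of the 'texture image' follows; the smoothness threshold value and the function name are our assumptions.

    import numpy as np

    def texture_image(gray, var_th=4.0):
        # Replace every 2x2 block by T = t*16 + S, where t is the decimal
        # value of the raster-ordered binary pattern (thresholded at the
        # block mean) and S is the scaled block contrast.
        H, W = (gray.shape[0] // 2) * 2, (gray.shape[1] // 2) * 2
        tex = np.zeros((H // 2, W // 2))
        for i in range(0, H, 2):
            for j in range(0, W, 2):
                blk = gray[i:i + 2, j:j + 2].astype(float)
                if blk.var() < var_th:       # smooth block -> texture value 0
                    continue
                mask = blk >= blk.mean()     # 2x2 binary pattern
                bits = mask.flatten()        # raster order
                t = int("".join("1" if b else "0" for b in bits), 2)
                S = (blk[mask].mean() - blk[~mask].mean()) / 16.0
                tex[i // 2, j // 2] = t * 16 + S
        return tex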

2.2. Computation of Features

The texture co-occurrence matrix provides a detailed description of the image texture, but handling such a multivalued feature is always difficult. To obtain more perceivable features, statistical measures like energy, entropy, contrast in texture, homogeneity in texture and texture moments are computed from the matrix, as is done in the case of the gray-level co-occurrence matrix.

We further propose the texture pattern moment as a new texture descriptor. A pixel in the 'texture image' represents the texture pattern of a 2 x 2 block of the original image. Thus, the shape descriptor corresponding to the 'texture image' reflects the organisation of the block level texture patterns. On the other hand, such organisation reflects the overall texture of the original image. Thus, the shape descriptor for the 'texture image' acts as a texture descriptor for the original image. This analysis has led us to propose the texture pattern moment as a feature. As the moment is the most widely used shape descriptor, we compute moments of various orders for the 'texture image' and refer to them as the texture pattern moments.
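As a sketch, the texture pattern moments can be computed as the raw image moments of the 'texture image'; the naming and the absence of normalisation below are our illustrative choices.

    import numpy as np

    def texture_pattern_moment(tex, p, q):
        # Raw moment m_pq of the 'texture image' tex.
        ys, xs = np.mgrid[:tex.shape[0], :tex.shape[1]]
        return float(((xs ** p) * (ys ** q) * tex).sum())

    # e.g. a first order moment and the second order moment m11:
    # m10 = texture_pattern_moment(tex, 1, 0)
    # m11 = texture_pattern_moment(tex, 1, 1)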

2.3. Multiscalar Texture Co-occurrence Matrix

Depending on the size of the repetitive pattern, the texture of an image can be classified as micro or macro texture. The presented feature is capable of handling micro textures. In order to enhance its power to represent macro textures, we obtain the multiscalar texture co-occurrence matrix.25

It is computed as follows. In the first iteration, we obtain the texture co-occurrence matrix corresponding to the original gray-scale image and features are computed from it. In each subsequent iteration, we consider only the pixels of alternate rows and columns of the image used for feature computation in the previous iteration. The texture co-occurrence matrix, and subsequently the features, are computed corresponding to this new image.
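The iteration can be sketched as below, reusing the texture_image helper from the earlier sketch; the number of levels follows the experiments (three iterations), everything else is an assumption.

    def multiscalar_texture_images(gray, levels=3):
        # Texture image at each scale; the next scale keeps only the
        # pixels of alternate rows and columns of the previous image.
        out, img = [], gray
        for _ in range(levels):
            out.append(texture_image(img))
            img = img[::2, ::2]
        return out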

3. Clustering Technique

In the simplest way, clustering can be described as the grouping together of similar data elements in a cluster. The K-means algorithm is the most widely used distance based clustering scheme. But, it suffers from a number of disadvantages. First of all, the number of clusters required has to be specified by the user and initial cluster centres also have to be provided. This is a difficult task. On the other hand, the performance is very sensitive to these parameters. Moreover, the algorithm is strongly affected by outliers. Its major advantage lies in its simplicity. It is very easy to implement.

The proposed distance based clustering scheme draws motivation from the simplicity of the K-means algorithm. On the other hand, it releases the users from the burden of specifying the set of crucial parameters. In our work, we consider a new term called split point instead of the conventional cluster centres or centroids. The split points play a similar role to cluster centres at the time of splitting a cluster. But, the way the split points are chosen, they are not centrally located in the clusters.

In the proposed scheme, initially all the elements are considered to be in a single cluster. The most distant points in a cluster are considered as split points. Elements are assigned to the nearest split points. A cluster is split only if the variance of the distance of its elements from the split point exceeds a predefined threshold (th). SPL, the split point list, maintains the set of split points corresponding to the clusters. The process continues iteratively till splitting is required for none of the clusters. The algorithm is as follows:

1. Assume all elements belong to a single cluster.
2. Find the most distant elements (split points) in the cluster.
3. Assign the elements to the nearest split points.
4. Initialize the SPL.
5. flag = 0;
6. for each cluster {
       compute v, the variance of the distances
           of the elements from the split point;
       if (v > th) {
           flag = 1;
           find split points and update SPL;
       }
   }
7. if (flag == 1) {
       distribute all the elements to the
           nearest split points and go to step 5;
   }
8. Stop.
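A runnable Python sketch of this split-point clustering follows; the O(n^2) farthest-pair search, the singleton guard and all names are our illustrative choices.

    import numpy as np
    from itertools import combinations

    def split_point_clustering(X, th):
        # X: (n, d) array of feature vectors; th: variance threshold.
        def farthest_pair(idx):
            # O(n^2) search for the two most distant elements.
            return max(combinations(idx, 2),
                       key=lambda p: np.linalg.norm(X[p[0]] - X[p[1]]))

        SPL = list(farthest_pair(range(len(X))))       # initial split points
        while True:
            d = np.linalg.norm(X[:, None] - X[SPL][None], axis=2)
            assign = np.argmin(d, axis=1)              # nearest split point
            new_SPL, split = [], False
            for s in range(len(SPL)):
                members = np.where(assign == s)[0]
                if len(members) > 1 and d[members, s].var() > th:
                    new_SPL.extend(farthest_pair(members))   # split cluster s
                    split = True
                else:
                    new_SPL.append(SPL[s])
            SPL = new_SPL
            if not split:
                return assign, SPL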

The clustering algorithm is simple enough and easy to implement. It does not require any initial guess of split points or centroids. The scheme is applicable to various distance/similarity measures. The clusters may take polygonal shapes of wide variation depending on the distance measure. But, the scheme may be affected by outliers because the splitting operation is carried out with respect to the most distant points. To minimise this, a cluster merging algorithm may be taken up after obtaining all the clusters: if the cluster size falls below a limit then the elements are reassigned to the other nearest split points. In our experiment, the limit of cluster size has been specified in terms of a minimum number of elements. Thus, the effect of outliers has been marginalised, particularly in the context of the retrieval of top order similar images. Like the K-means algorithm, in this scheme also all the elements are distributed in each iteration. Moreover, finding the most distant points in a cluster has a complexity of O(n^2), where n is the number of elements in the cluster. Thus, the clustering process becomes slow. But, in the context of CBIR, the cluster formation can be done offline and, furthermore, it is not a frequent operation.

Table 1. Comparison of precision (in %) using different texture features.

Feature set                                               P(10)   P(20)   P(30)
Gray-level Co-occurrence Matrix                           56.81   44.54   38.18
Texture Co-occ. Matrix (without texture pattern moment)   70.81   61.59   52.18
Texture Co-occ. Matrix (with texture pattern moment)      74.18   64.86   55.78

In order to insert an element, it is first placed into the nearest cluster. In case the cluster becomes a candidate for splitting, the cluster formation algorithm from step 4 onward is to be followed. Deletion of an element may also lead to such splitting. It may lead to merging of clusters also. To find the similar elements against a query element, the nearest cluster is searched and, according to the distance measure, an ordered output may be obtained. In order to avoid the boundary problem and to improve the precision of retrieval, instead of only the nearest cluster, the search may be continued to a few neighbouring clusters also.
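A query against the clustered database can then be sketched as follows (the names and the top-k interface are our assumptions); setting n_neigh > 1 widens the search to neighbouring clusters, easing the boundary problem just mentioned.

    import numpy as np

    def retrieve(q, X, assign, SPL, k=10, n_neigh=1):
        # Rank the members of the cluster(s) with the nearest split point(s).
        d_sp = np.linalg.norm(X[SPL] - q, axis=1)   # query-to-split-point distances
        nearest = np.argsort(d_sp)[:n_neigh]        # candidate cluster indices
        cand = np.where(np.isin(assign, nearest))[0]
        order = cand[np.argsort(np.linalg.norm(X[cand] - q, axis=1))]
        return order[:k]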

4. Experimental Result

We have carried out the experiment using the images of the Brodatz texture database.* The database contains 112 images of size 640 x 640. Each image is divided into sub-images of size 64 x 64, 128 x 128 and 256 x 256. Taking samples from these sub-images of different sizes, we have formed a set of 50 images corresponding to each of the original 112 images. Thus, the experimental database contains 5600 images.

In order to study the performance of the 'texture image' based features, we have carried out Euclidean distance based linear search considering each database image as the query image. Performance is also compared with a gray-level co-occurrence matrix1 based system. To form the multiscalar version, we have considered up to the 3rd iteration for both systems. Along with energy, entropy, homogeneity and contrast, we have also considered 5 texture moments (up to 2nd order only). Thus, a 27-dimensional feature vector is formed. The second and third columns of Table 1 reflect that the precision of retrieval is better for the 'texture image' based features. Finally, the texture pattern moments are also included. As the high order moments are not perceivable, we have considered 3 values (only the first order moments and one of the second order moments (m11)) for both the texture moment and the texture pattern moment. Thus, after the third iteration we obtain a 30-dimensional feature vector. The last column of Table 1 shows that the performance further improves. Thus, the effectiveness of the proposed texture feature is established.

Selection of th determines the cluster size. A low value may be very conservative. On the other hand, a high value is too liberal and includes a lot of elements in a cluster. So, it is to be chosen judiciously. We have considered th = 0.01 in our experiment. The database has been partitioned into 5 clusters. The number of elements in the clusters varies from 800 to 1500. To obtain the similar images against a query, linear search is carried out in the cluster with the nearest split point.

* www.ux.his.no/~tranden/brodatz.html


Table 2. Comparison of precision (in %) using different search strategies.

Search strategy                  P(10)   P(20)   P(30)
Linear Search                    74.18   64.86   55.78
Proposed Cluster based Search    74.06   63.89   54.94

Table 2 shows that P(n) (precision after the retrieval of n images) in the case of cluster based retrieval is comparable with that of linear search. On the other hand, as the number of comparisons is drastically reduced, retrieval becomes faster. In our experiment, using each database image as the query image, on an average an improvement of speed by a factor of 5 has been obtained. Thus, the proposed scheme is capable of providing a faster response without compromising the accuracy at large.

5. Conclusion

In this paper, we have presented texture features based on a 'texture image' formed by dividing the original gray-scale image into 2 x 2 blocks and replacing each block by a texture value. We have also proposed the texture pattern moment as a texture descriptor. For retrieval, a simple but effective clustering scheme is proposed. Experiments show the usefulness of the proposed features and the cluster based retrieval strategy. Further work may be carried out to find an alternate approach for split point selection, so that the cluster generation algorithm becomes faster. Moreover, incorporation of clustered indexing (i.e. the use of an index within the cluster) may lead to further speed gain.

References

1. R. M. Haralick, K. Shanmugam and I. Dinstein, IEEE Trans. on SMC 3(11), 610 (1973).

2. J. R. Smith and S. F. Chang, Automated binary texture feature sets for image retrieval, in Proceedings of IEEE Intl. Conf. on ASSP (USA, 1996).

3. A. P. Berman and L. G. Shapiro, Computer Vision and Image Understanding 75, 175 (1999).

4. B. Ko, J. Peng and H. Byun, Pattern Analysis and Applications 4, 174 (2001).

5. J. Fournier, M. Cord and S. Philipp-Foliguet, Pattern Analysis and Applications 4, 153 (2001).

6. L. M. Kaplan, SPIE 3312, 162 (1998).

7. H. Tamura, S. Mori and T. Yamawaki, IEEE Trans. on SMC 8(6), 460 (1978).

8. F. Liu and R. W. Picard, IEEE Trans. on PAMI 18(7), 722 (1996).

9. A. Guttman, R-trees: a dynamic index structure for spatial searching, in Proc. of ACM SIGMOD Conf. on Management of Data, 1984.

10. S. T. Leutenegger, M. A. Lopez and J. M. Edgington, A simple and efficient algorithm for R-tree packing, in Proc. of 13th International Conf. on Data Engineering, 1997.

11. J. Barros, J. French, W. Martin, P. Kelley and M. Cannon, SPIE 2670, 392 (1996).

12. T. B. Sebastian and B. B. Kimia, Metric-based shape retrieval in large databases, in Proceedings of International Conf. on Pattern Recognition (Canada, 2002).

13. E. Ruspini, Information Control 15(1), 22 (1969).

14. R. Duda and P. Hart, Pattern Classification and Scene Analysis (Wiley, New York, 1992).

15. J. Hartigan, Clustering Algorithms (John Wiley, New York, 1988).

16. J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms (Plenum Press, New York, 1981).

17. A. K. Jain and R. C. Dubes, Algorithms for Clustering (Prentice Hall, New Jersey, 1988).

18. D. Pollard, The Annals of Statistics 9(1), 135 (1981).

19. R. T. Ng and J. Han, IEEE Trans. on Knowledge and Data Engineering 14(5), 1003 (1996).

20. M. Ester, H. P. Kriegel, J. Sander and X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in Proc. of Intl. Conf. on Knowledge Discovery and Data Mining, 1996.

21. M. Ankerst, M. Breunig, H. P. Kriegel and J. Sander, OPTICS: ordering points to identify the clustering structure, in Proceedings of ACM Conf. of Special Interest Group on Management of Data, 1999.

22. R. Ramakrishnan, T. Zhang and M. Livny, Data Mining and Knowledge Discovery (1997).

23. S. Guha, R. Rastogi and K. Shim, CURE: an efficient clustering algorithm for large databases, in Proc. of ACM SIGMOD, 1998.

24. S. K. Saha, A. K. Das and B. Chanda, CBIR using perception based texture and colour measures, in Proceedings of Intl. Conf. on Pattern Recognition (Cambridge, UK, 2004).

25. S. K. Saha, A. K. Das and B. Chanda, Image retrieval using multiscalar texture co-occurrence matrix, in Proceedings of the 6th Intl. Workshop on Pattern Recognition in Information Systems (Paphos, Cyprus, 2006).


Feature Selection Based on Human Perception of Image Similarity for Content Based Image Retrieval

P. Narayana Rao, Chakravarthy Bhagvati, R. S. Bapi, Arun K. Pujari, B. L. Deekshatulu

Dept. of Computer and Information Sciences, University of Hyderabad
Hyderabad 500046
E-mail: [email protected], {chakcs,bapics,akpcs,bides}@uohyd.ernet.in

In this paper, we present an approach that enables selection of appropriate low-level features based on human perception for measuring image similarity in a Content-based image retrieval (CBIR) system. Human perceptual information is captured by three psychophysical experiments designed to explore different aspects of similarity. Their outcomes are used to formulate fitness functions in a genetic algorithm that selects a subset from an initial collection of popular low-level image features such as colour, texture and structure. The reduced subset of features is, thus, correlated to human perception and represents an attempt to bridge the semantic gap. We quantitatively validate our approach by building a CBIR system using such features and evaluating the retrieval precision.

1. Introduction

Images are usually characterized by low-level descriptors such as color, texture and shape, which do not correspond to the high-level concepts that humans perceive. The difference between machine and human perception is known as the semantic gap. The extent to which it is handled determines the system's reliability and efficiency. The semantic gap may be narrowed by using methods such as Relevance Feedback (RF),1 linguistic indexing of images,2 or modelling human visual perception.3

Modelling human perception of similarity can be done by developing either a computational model for early human vision or a similarity function consistent with human perception. The former is related to the properties of the human eye and the latter is about extracting useful relations from similarity ratings. Though these methods cannot capture the subjectivity of the user, they can capture human judgments of visual similarity.

Several groups have been involved in research on incorporating human perception into CBIR systems. For example, Thomas et al.4 used a similarity metric based on a multiscale model of the human visual system. The multiscale model includes channels which account for perceptual phenomena such as color, contrast and orientation selectivity. Mojsilovic et al.5 used multi-dimensional scaling (MDS) to extract the perceptual dimensions and used hierarchical clustering to derive rules for similarity judgment. Guyader et al.6 trained Gabor filter parameters so that the filter categorization matches categorization by humans. Celebi and Aslandogan3 used human similarity ratings to develop a weighted Manhattan distance measure for focused monochrome images.

We extend the work of Celebi and Aslandogan3 by designing psychophysical experiments to model similarity in CBIR context from a clustering and classification perspective. We also experimented with several databases containing natural and remote sensing colour images to assess the generality of our approach.

The paper is organized into six sections. Section 2 is an overview of our approach. Section 3 describes the psychophysical experiments and their outcomes. Section 4 deals with the use of a genetic algorithm for finding the optimal set of low-level features based on the psychophysical experiments. Results on a content based image retrieval system (CBIR) that validate our approach to capture human perception are given in Section 5, followed by conclusions in Section 6.

2. Approach

Our approach is succinctly illustrated in Figure 1. Human perception is modelled through psychophysical experiments in which human users assess three different aspects of similarity in a CBIR context. The first, identical to that of Celebi et al.,3 is an estimation of similarity between pairs of images. Pairwise similarity between randomly selected pairs of training images is computed and labelled into four categories on a scale of 1 (extremely similar) to 4 (not similar). The results are coded in a similarity matrix. The second aspect, capturing the CBIR spirit, divides the set of training images into two classes, similar and dissimilar, with respect to a specific reference (query) image. The outcome is two clusters of images. The third, a classification approach, assigns class labels to the selected set of training images.

On the other hand, from the computer perspective, we start with a set of popular low-level colour, texture and structure features. A genetic algorithm (GA) is then used to select a subset of features. The GA objective or fitness functions are designed to reward conformity with the outcomes of the psychophysical experiments. Our hypothesis is that the reduced subset of features optimally captures the human notion of similarity expressed via the psychophysical experiments. The hypothesis is then validated by building a CBIR system and measuring the precision of the retrieved images.

Fig. 1. Overview of our approach (FF1, 2, 3: fitness functions corresponding to Expts. 1, 2, 3)

3. Psychophysical Experiments

In a traditional psychophysical experiment, a variable of physical stimulation is applied to an observer and then the corresponding variable of his phenomenal (or "sensory") experience is established. Here, the physical stimulation is supplied by a set of images that require an assessment of similarity from the user, while we determine a feature subset that correlates well with the assessed image similarities.

We designed three psychophysical experiments exploring different aspects of human visual similarity.

Expt 1: Pairwise similarities3

A subset of images is selected from a large database for the experiment. This subset contains images that cover all the major visual characteristics in the database. In this experiment, each image in the subset is compared with every other image in the subset and an observer judges the perceived similarity as 1 (extremely similar), 2 (considerably similar), 3 (partially similar) or 4 (not similar). The experiment is repeated multiple times with different observers and an aggregated response matrix (size M x M, where M is the number of selected images) is constructed. The (i, j) entry in the matrix, called the similarity matrix, gives the similarity score between images i and j. The similarity matrix is considered symmetric.

Expt 2: Clustering

Pairwise comparison of images is hard to do for all but small subsets. In this experiment, a randomly selected single image, called a query image, is shown and the observer selects similar and dissimilar images from a subset of images. That is, the images chosen for the experiment are clustered into two classes; this most closely resembles the function of a CBIR system.

Expt 3: Classification

In this experiment, the observer classifies and labels the selected images. A sufficient number of images is selected so that general properties may be inferred. This experiment is an extension of Expt 2 and provides explicit information to the system about the higher level semantics in the form of a labelled subset of images.

4. Genetic algorithm formulation

We start with a set of 15 popular low-level features. Hue, saturation and intensity averages and variances form two 3-dimensional colour features. Texture features are obtained by a bank of 12 Gabor filters of four orientations at three scales. The average and variance values of the responses give twelve 2-dimensional texture feature vectors. A 120-dimensional Colour Structure Descriptor (CSD), as defined by Manjunath et al.,7 is used as a structure descriptor.

We now formulate a genetic algorithm to select a subset, based on human perception as captured by the psychophysical experiments, from the initial 15 features that together result in a 150-dimensional feature vector (2 x 3 + 12 x 2 + 1 x 120). Each individual (chromosome) is a 15-bit long string with each bit representing a single feature (gene). "1" indicates that the particular feature is included in the subset and "0" otherwise. For example, Bit-0 represents the colour feature and if it is 1, then the 3-dimensional average colour feature is included in the calculations. Bit-1 represents the colour variance feature, Bits 2-13 represent the 12 Gabor texture features and Bit-14 stands for the CSD feature.

Sixteen individuals (chromosomes) are generated randomly in the first generation. These are ranked according to their fitness as evaluated by the objective functions defined below and the top 50% survive into the next generation. Mutation operation with a probability of 0.1 and a 1-point crossover operation with a probability of 0.5 are used to add a new set of individuals replacing the weakest 50% of the population for the subsequent generation. The fitness of the entire population is the average fitness of the 16 chromosomes. The algorithm terminates when it either runs for 500 generations or the average fitness converges.

Three objective or fitness functions that relate to the three psychophysical experiments are defined. In the first method, a similarity matrix is constructed for each 15-bit string using only the features indicated by 1s. The fitness function measures the Pearson coefficient of correlation between the human-generated similarity matrix from psychophysical experiment 1 and the one computed mathematically from the individual bit-string. A higher correlation indicates greater fitness. In the second method, corresponding to psychophysical experiment 2, we compute the intra-cluster distance for the images labelled as similar and the inter-cluster distance between the two clusters. The ratio of intra-cluster to inter-cluster distance is calculated for each individual in the population and the ones with the smallest ratio are assigned the highest fitness. In the third method, we construct a binary decision tree (DT) for each label assigned by the human in psychophysical experiment 3 and for each individual in the population, using only the features indicated by 1s in the individual. We compute an error bound using PAC learnability8 from the number of leaves and the height of each DT. The maximum error bound over all the classes is assigned as the error bound for the individual. Fitness is inversely proportional to this error bound.
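The GA loop itself can be sketched as follows; fitness stands for any of the three objective functions above (e.g., the Pearson correlation with the human similarity matrix), and the skeleton, including the omitted convergence test, is our illustrative reading of the description.

    import numpy as np

    N_BITS, POP, GENS = 15, 16, 500
    rng = np.random.default_rng(0)

    def run_ga(fitness):
        # 16 random 15-bit individuals; top 50% survive; 1-point crossover
        # (p = 0.5) and bit mutation (p = 0.1) refill the population.
        pop = rng.integers(0, 2, size=(POP, N_BITS))
        for _ in range(GENS):
            scores = np.array([fitness(ind) for ind in pop])
            elite = pop[np.argsort(scores)[::-1][:POP // 2]]
            children = []
            while len(children) < POP - len(elite):
                a, b = elite[rng.integers(len(elite), size=2)]
                if rng.random() < 0.5:                 # 1-point crossover
                    cut = int(rng.integers(1, N_BITS))
                    a = np.concatenate([a[:cut], b[cut:]])
                child = a.copy()
                child[rng.random(N_BITS) < 0.1] ^= 1   # mutation
                children.append(child)
            pop = np.vstack([elite] + children)
        return pop[int(np.argmax([fitness(ind) for ind in pop]))]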

5. Results and discussion

We used several databases for our experiments. Two are remote sensing databases containing high-resolution aerial view (aerial) and low-resolution thumbnail images (browse) respectively. Other databases contain flowers, a variety of objects such as buses, aeroplanes, elephants and others. In addition, we also experimented with the Brodatz texture database containing 116 texture classes each containing 16 images.

Fig. 2. Reduced feature sets selected by GA

Database   Pearson          PAC              PAC Unbiased
Aerial     4,7,9,10,11,14   3,7,8,11,12,13   2,4,7,8,9,11,14
Browse     0,1,3,6,7,8,10   0,1,4,6,9,10     0,1,10
Flower     0,1,6            2,3,4,8          1,2,3,8
Brodatz    0,1,3,13         1,2,5,9,14       7,8,9,10,14

For the aerial database, a group of 15 people participated in the psychophysical experiment that generated the similarity matrix. Each observer was shown a selected set of 20 images each from the three databases and was asked to assess the similarity between each pair of images. The experiment was repeated a few times for each individual. The results were aggregated and summarized in a 20 x 20 similarity matrix. For the second experiment, 40 randomly selected images were labelled similar or dissimilar to another randomly selected query image. In the third experiment, a set of 100 images was randomly selected from the databases, representing 10 samples each from 10 types of images. In the third experiment, we computed a decision tree using either all the 100 instances (PAC method) or a few instances of each class (PAC Unbiased method). In the latter case, we selected a number of samples from outside the class equal in number to the samples representing the class for which the DT is being constructed.

The feature subsets selected by the GA based on Experiments 1 and 3 for the aerial database are shown in Figure 2. In addition, we also show the results from limited experimentation done on the browse, flower and Brodatz databases for comparison. As the psychophysical experiment 2 (clustering) is not done for the other databases, its results are not shown in the table. The numbers refer to the genes in the 15-bit string used in the GA.

Fig. 3. Precision graph for Aerial view database using the 4 methods

In the aerial database, the images are at high resolution (1 m) and show large numbers of objects such as buildings and roads. Human observers determine similarity from the presence and arrangement of such structures, and it may be noted that none of the GAs select the colour features (Bit-0 and Bit-1) in the reduced subsets. Similarity is judged entirely by texture and the distribution of colour across the image given by the CSD (Bit-14) feature, which corresponds to human perception. The feature subset selected by the GA based on psychophysical experiment 2 for the aerial database is {1,11,12,13}, which also shows that texture features are more important.

In browse images, the areas seen span several kilometres and each image contains regions of several colours. The colours are also more pronounced, with arid land being yellow; waterbodies, black; and vegetation, red. It may be seen that both the average colour and colour variance features are in the reduced subset. Colour is seen as important in the third database, with flowers, too. For the Brodatz database, texture features are primarily selected because these images are in grayscale and colour corresponds only to intensity.

The selected feature subsets are quantitatively validated by using them in a CBIR system and evaluating the retrieval precision. A graph of precision as the number of retrieved images is varied from 5 to 30 is shown in Figure 3 for the aerial database. The precision plotted is the average over 5 queries. The graphs for the three experiments reveal that the highest precision is achieved for Experiment 3 with the DT constructed using all the samples. Experiment 2 (clustering) performs better than Experiment 1 (pairwise similarity) initially but then falls to the fourth place. As the similarity is based only on one query image, very little information is contained in the experiment. In fact, the precision values appear proportional to the amount of information captured in each experiment. Experiment 1 relies only on pairwise comparison, while Experiment 3 (PAC) is an explicit representation of the semantic classes and so performs better than the others.

Fig. 4. Results for a sample query image from the aerial view database: (a) retrieved images (RI) for Experiment 1, (b) RI for Experiment 2, (c) RI for Experiment 3 (PAC), (d) RI for Experiment 3 (PAC Unbiased). The query image is the first image on the left in each row.

A sample query and the results from all the experiments are shown in Figure 4. For each experiment, the top-5 retrieved images are shown in the figure. Only 2 of the top-5 images are relevant in Expt 2 (clustering). Three images are relevant in Expts 1 (pairwise similarity) and 3 (PAC Unbiased). However, it may be seen that the top-3 images in the PAC Unbiased method are relevant. The best performance is for Expt 3 (PAC), where 4 out of the top-5 images are relevant.

Experiment 1 is also repeated on the aerial database with population sizes of 50 and 100 individuals in the genetic algorithm. Several points were observed with the increased population sizes. The first is that the convergence was faster with larger population sizes. While it took nearly 129 generations for the experiment with a population of 16 to converge, it took only 55 and 27 generations for population sizes of 50 and 100 respectively. Secondly, the correlation coefficient with the human similarity matrix increased to 0.6 for 50 individuals from 0.56 for 16 individuals. The correlation decreased marginally to 0.59 for 100 individuals. Thirdly, and most importantly, the subset of features selected by the experiment with 16 individuals is substantially retained for larger population sizes too. These results are summarized in Figure 5.

Population                  16               50             100
Correl.                     0.5634           0.6084         0.5938
Generations (convergence)   129              55             27
Features                    4,7,9,10,11,14   0,9,10,11,14   0,1,3,4,9,10-14

Fig. 5. Effects of varying population sizes on GA results.

It is interesting to note that features 9, 10, 11 and 14, i.e., three Gabor texture features and the CSD feature, were retained in all three experiments, showing the importance of texture in human perception of high-resolution aerial imagery. The inclusion of the colour feature (feature 0) is intuitive if one looks at the example result shown in Figure 4, where the images show distinct colour in addition to the texture created by buildings and streets. The number of features selected increased significantly when 100 individuals were used in the experiment. However, the correlation with the human similarity matrix did not change much.

The features selected by the GA were also used for retrieving images from the aerial database. Average precision was calculated by retrieving the top-30 matching images for a random set of 5 query images. Average precision increased from approximately 45% for 16 individuals to about 70% for 50 individuals, and decreased back to 60% for 100 individuals. It is possible that the GA is overspecializing when a larger number of individuals is used in the experiment. However, the results are preliminary, and more experimentation is needed to study the full impact of population sizes on (a) capturing human perception (via the correlation value) and (b) the performance in a CBIR application (via the average precision).

6. Conclusions

In this paper, we demonstrated an approach, based on human perception, for finding a set of low-level image features useful in evaluating image similarity. Our three experiments and their quantitative validation via retrieval precision indicate that our genetic algorithm based approach is a successful attempt at bridging the semantic gap. That we are able to intuitively corroborate the features selected by the GA provides additional support that our approach is headed in the right direction.

References

1. N. Doulamis and A. Doulamis, Signal Processing: Image Communication 20, 334 (2006).

2. J. Li and J. Z. Wang, IEEE Trans. Pattern Analysis and Machine Intelligence 25, 1075 (2003).

3. M. E. Celebi and Y. A. Aslandogan, Content-based image retrieval incorporating models of human perception, in ITCC, 2004.

4. T. Frese, C. A. Bouman and J. P. Allebach, A methodology for designing image similarity metrics based on human visual system models, in Proc. of SPIE/IS&T Conf. on Human Vision and Electronic Imaging II, 1997.

5. A. Mojsilovic, Matching and retrieval based on the vocabulary and grammar of color patterns, in ICMCS, 1999.

6. N. Guyader, H. Le Borgne, J. Herault and A. Guerin-Dugue, Towards the introduction of human perception in a natural scene classification system, in IEEE Int. Workshop on Neural Networks for Signal Processing, September 2003.

7. B. S. Manjunath, J.-R. Ohm, V. V. Vasudevan and A. Yamada, IEEE Trans. Circuits Systems and Video Technology 11, 703 (2001).

8. T. Mitchell, Machine Learning (McGraw Hill, 1997).


Identification of Team in Possession of Ball in a Soccer Video Using Static and Dynamic Segmentation

V. Pallavi, Jayanta Mukherjee and A. K. Majumdar

Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, 721302 E-mail: {pallavi, jay, akmj}@cse.iitkgp.ernet.in

Shamik Sural

School of Information Technology Indian Institute of Technology, Kharagpur, 721302

E-mail: [email protected]

In this paper we describe a novel approach to identify the state of ball possession by the two teams in a soccer video. The proposed approach uses a combination of techniques to identify the presence of the players and the ball. We segment soccer video frames by a Markov Random Field (MRF) clustering technique and use a colour-texture feature for classifying the players of the two teams. The players are detected by a combination of static analysis, optical flow analysis and difference image analysis. The classification decisions from the three techniques are combined using the Dempster-Shafer theory of evidence. The ball in the video frames is identified using the Hough Transform for circle detection. The proposed approach of combining evidences from various techniques makes the analysis more robust.

Keywords: Video analysis; Segmentation; Ball possession; Dempster-Shafer theory

1. Introduction

Increasing availability and use of video has led to a demand for efficient and accurate automated analysis techniques. As the volume and complexity of video information grows, the need for more intelligent video manipulation techniques also increases. Sports videos are usually distributed to a large audience and have a wide appeal. Among various kinds of sports video, we concentrate on soccer video analysis.

Different approaches for analyzing sports video are reported in the literature. Motion based event recognition using a Hidden Markov Model (HMM) was proposed by Yang et al.1 to classify basketball video. Li and Sezan2 also proposed a Hidden Markov Model using colour, motion and shape features to discriminate between play and break sequences in baseball, soccer and sumo videos. Jonker et al.3 used Hidden Markov Models for recognizing strokes in tennis videos. Some of these works are also related to the analysis of soccer video. Shum et al.4 proposed an automatic technique for extracting colour models of the field and the team uniforms. Dominant color region detection and shot classification have been used by Tekalp et al.5 for automatic soccer video analysis and summarization. Most of the approaches use either static or dynamic features for soccer video analysis; we use a combination of static and dynamic features. In this paper we describe a new approach, based on a state based video model, to identify the team in possession of the ball in medium shots of a soccer video.

In the proposed scheme, shots in the soccer video are detected using a colour-texture histogram and are then classified by a dominant region histogram shot classification technique.6 The Hough Transform is applied to detect circles as probable ball candidates. We also use a combination of static and motion analysis to identify the presence of players in a video shot. Static analysis uses colour and texture features, while motion analysis includes difference image and optical flow techniques. The classification decisions of these individual techniques are combined by the Dempster-Shafer theory of evidence. The rest of the paper is organized as follows. Section 2 describes the state based video model for representing soccer video. Section 3 describes ball identification as well as our method for player detection. In Section 4, we identify the team in possession of the ball. We show our results in Section 5 and conclude the paper in Section 6.


2. State Based Video Model

A video data model is a means of extracting the information contained in unstructured video data and representing the relevant information in order to support users' queries. The major existing approaches for video modelling are temporal modelling, object-oriented modelling and algebraic modelling of video.

We adopt an object oriented approach for describing a soccer video. A video object is a spatio-temporal region that corresponds to a real-world object. In a soccer video, the players, referee, ball and goal area can be treated as objects. Soccer video objects can be described by low-level features such as colour, texture, motion and shape, so the object oriented approach is well suited for describing a soccer video. In our work, we consider three objects in a soccer video, namely the two teams and the background, which includes the field and the spectators.

A detailed study of soccer videos shows that the dynamic properties of the video objects dictate the video content. Hence, modelling the behaviour of video objects is very important. An approach to describing the video content is essential when a video is used as a medium to analyze the dynamic behaviour of a system. We take up a state chart based formalism7,8 to describe the semantics of a video in terms of the intrinsic behaviour of perceptible objects of interest, i.e. their states and state transitions. Fig. 1 shows the state chart diagram for possession of the ball by the two teams in a soccer video. The two classes, i.e. Team A and Team B, try to possess the ball during play. As the match continues, objects of either Team A or Team B are in the ball possessing state. There might also be a fight state, where objects of both classes fight to reach the ball possession state.

Fig. 1. State chart diagram for ball possession.

3. Object Detection and Classification

We partition a video stream into shots. The shots are then classified as long, medium or close shots. Since our work concentrates on identifying the team in possession of the ball only for medium shots, we detect medium shots in a soccer video. We perform shot detection using a colour-texture histogram from the HSV colour space.9 The shots identified here are then classified into close shots, medium shots or long shots by a shot classification technique6 based on the dominant color ratio. These medium shots are then processed to locate the ball and the players.

3.1. Ball Detection using Hough Transform

The fact that the ball is present in a shot is essential to identify the team in possession of the ball. We therefore identify all the frames in the medium shots where the ball is present. The ball is circular in shape and has significant motion; we use these shape and motion features to detect the ball in the frames of a soccer video. The Hough transform for the detection of circles is used to identify the ball in video frames. With this technique many probable ball candidates are detected. We then find the optical flow velocity for each candidate; the candidate having the maximum velocity is identified as the ball region. It has been observed that the ball is detected with an accuracy of 86.21%. Fig. 2 shows the ball identified in a medium view. We then detect and classify player objects using static and dynamic segmentation in the frames containing the ball.

Fig. 2. Ball identified using Hough Transform.
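A sketch of this candidate-filtering step is given below. It is illustrative only: OpenCV's HoughCircles stands in for the circle detection, dense Farneback flow is substituted for the Horn-Schunck flow used elsewhere in this paper, and all thresholds and radii are assumed values.

```python
import cv2
import numpy as np

def detect_ball(prev_gray, gray):
    """Hough circle candidates filtered by optical-flow speed:
    the fastest-moving candidate is taken as the ball region."""
    circles = cv2.HoughCircles(gray, cv2.HOUGH_GRADIENT, dp=1.2,
                               minDist=20, param1=100, param2=20,
                               minRadius=2, maxRadius=12)
    if circles is None:
        return None
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    speed = np.linalg.norm(flow, axis=2)       # per-pixel flow magnitude
    best, best_speed = None, -1.0
    for x, y, r in np.round(circles[0]).astype(int):
        mask = np.zeros_like(gray, dtype=np.uint8)
        cv2.circle(mask, (x, y), r, 255, -1)   # disk covering the candidate
        mean_speed = speed[mask > 0].mean() if (mask > 0).any() else 0.0
        if mean_speed > best_speed:
            best, best_speed = (x, y, r), mean_speed
    return best
```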

3.2. Identification of Objects using Static Segmentation

One of the basic requirements of soccer analysis is to group low-level pixel information into meaningful sets or partitions so that relevant information can be abstracted from the video frames. We have used a segmentation algorithm10 based on MRF processing, which partitions a video frame into segments or regions.

Each frame from a medium shot where the ball is identified by the ball detection algorithm is selected and segmented. The segments thus obtained belong to one of the three classes. Each segment is considered and its COLTEX9 feature vector, which is a colour-texture histogram, is generated. The 1-NN classification rule is applied to identify the class to which the segment belongs, using the Object Classifier.11 Since in our subsequent work we want to identify only the players, we consider only the segments classified to Team A or Team B. Experiments were carried out with medium shots from the soccer match played between Real Madrid and Manchester United (UEFA Champions League 2003). We use sensitivity and specificity as the performance measures, which are defined as:

a = number of correctly classified segments belonging to a class
b = number of actual segments of a class
c = number of segments classified to a class

Sensitivity = a/b,  Specificity = a/c

This method gives a sensitivity of 70.83% and specificity of 64.23% for Team A, and a sensitivity of 71.43% and specificity of 7.43% for Team B. It has been observed that some spectators wear jerseys similar to those of the players to cheer their favourite teams. As a result, in some frames a few spectator segments get classified as a team, causing the low specificity.

3.3. Identification of Objects using Dynamic Segmentation

The disadvantage of identifying objects with the static segmentation method of the above section is its low specificity, caused by the use of only colour and texture information. Players in a soccer match are usually moving, so we introduce dynamic image segmentation based on motion features to obtain more legitimate object segments. In the literature, moving object detection techniques fall under three main categories: optical flow segmentation, frame differencing and background subtraction. For static cameras, background subtraction is probably the most popular method. Since soccer video depends on parameters like camera motion, camera angles and focusing parameters, we use the optical flow segmentation12 and frame differencing13 methods. In a soccer video both the players and the camera move, so the task here is to extract only the moving objects.

3.3.1. Optical Flow Analysis

Optical flow is the distribution of apparent velocities of movement of brightness patterns in an image. Optical flow velocity can therefore give important information about the spatial arrangement of the objects viewed and the rate of change of that arrangement. We use Horn and Schunck's method12 to compute the optical flow in the soccer video frames to identify the moving objects, i.e. the players.

A frame from a medium shot is selected and its motion vectors are obtained. All the segments or regions formed by the MRF-based segmentation algorithm are then considered one by one. For each region we find the ratio of the number of moving pixels to the number of pixels belonging to that region. If the ratio lies between 0.2 and 0.6, the segment is considered to be moving. Thus we get some player segments from the frames, which are objects of both team classes.
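A minimal sketch of this moving-segment test is shown below. It is our illustration: the per-pixel motion threshold is an assumed value, since the paper specifies only the 0.2-0.6 ratio band.

```python
import numpy as np

def moving_segments(labels, flow, thresh=1.0, lo=0.2, hi=0.6):
    """Mark an MRF segment as moving when the fraction of its pixels
    whose flow magnitude exceeds `thresh` lies in [lo, hi].
    labels: (H, W) segment id per pixel; flow: (H, W, 2) vectors."""
    speed = np.linalg.norm(flow, axis=2)
    moving = speed > thresh                    # per-pixel motion test
    ids = []
    for seg_id in np.unique(labels):
        region = labels == seg_id
        ratio = moving[region].mean()          # moving-pixel ratio
        if lo <= ratio <= hi:
            ids.append(seg_id)
    return ids
```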

We have experimented with the medium shots from the same soccer video. The algorithm identifies some segments as moving objects of the soccer frame, and the Object Classifier then classifies them to Team A, Team B or the background. This method gives a sensitivity of 70.83% and specificity of 89.58% for Team A, and a sensitivity of 54.54% and specificity of 59.09% for Team B. It keeps the sensitivity for Team A the same as the static segmentation method while improving the specificity. The improved specificity of this technique compared to that of the static analysis is strong evidence that the identified segments belong to moving objects.

3.3.2. Difference Image Analysis

Difference image analysis13 is applied on frames within the medium shot. A negative and a positive difference image are computed, which show the moving objects in a frame. Difference image analysis gives good results for frames in a shot where there is no camera motion. Since camera motion is present even within a single shot, we apply camera motion correction11 to the frames. All the segments formed after segmentation10 are then considered one by one for integration with the difference image analysis.11 This gives some segments that are supposed to be players. These segments are then classified into teams using the Object Classifier.

This method gives a sensitivity of 70.83% and specificity of 89.58% for Team A, and a sensitivity of 67.86% and specificity of 69.05% for Team B. It increases the specificity while keeping the sensitivity almost the same as the static analysis method. It is also observed that the percentage of correct object coverage increases here, as the number of segments identified as moving objects is reduced. Thus, moving object detection using difference image analysis gives better performance than static segmentation.

It has been observed that in some frames where both teams are present, segments belonging to Team A or Team B are identified by more than one technique. As a result, combining all the evidences becomes essential.

4. Dempster-Shafer Theory for Combining Evidences

In the previous section we used static as well as dynamic segmentation for detecting the players. In this section we use the Dempster-Shafer theory to combine the evidences obtained from the different classification techniques: static segmentation, optical flow analysis and difference image analysis. The Dempster-Shafer theory is based on two ideas: the idea of obtaining degrees of belief for a hypothesis, and Dempster's rule for combining such degrees of belief when they are based on independent items of evidence. There are other possible approaches to combining the evidences, such as the voting principle, the Bayesian approach, logistic regression, or ad hoc methods like simple averages. But in these approaches either the decisions are treated equally or there is difficulty in accurately estimating the probabilities. So we use the Dempster-Shafer theory, where a probability function can define the degrees of belief.

4.1. Dempster-Shafer Theory

The Dempster-Shafer theory starts by assuming a Universe of Discourse \Theta, also called the Frame of Discernment, which is a set of mutually exclusive alternatives. It associates to each subset s of \Theta a basic probability m(s), a belief Bel(s) and a plausibility Pla(s). m(s), Bel(s) and Pla(s) take values in the interval [0,1], and Bel(s) is not greater than Pla(s). m represents the strength of some evidence; if a rule is applied, then m may represent the effect of that rule. Bel(s) summarizes all the reasons to believe s. Pla(s) expresses how much should be believed in s if all currently unknown facts were to support s. A map m : 2^{\Theta} \to [0,1] such that

m(\emptyset) = 0, \quad \sum_{X \subseteq \Theta} m(X) = 1

is called the basic probability assignment (bpa) for \Theta. Here \emptyset is the empty set.

Suppose we are interested in finding the combined evidence for a hypothesis C. Given two independent sources of evidence m_1 and m_2, Dempster's rule for combining evidences is

m_{1,2}(C) = \frac{\sum_{A \cap B = C} m_1(A)\, m_2(B)}{1 - \sum_{A \cap B = \emptyset} m_1(A)\, m_2(B)}

Here m_{1,2}(C) is the combined Dempster-Shafer bpa for C, and m_1 and m_2 are the basic probabilities assigned to sets A and B respectively by the two independent sources of evidence. A and B are sets that include C; they are not necessarily proper supersets and may as well be equal to C. The numerator accumulates the evidence which supports a particular hypothesis, and the denominator conditions it on the total evidence for the hypotheses supported by both sources.
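The combination rule can be implemented directly. The following sketch is our illustration (hypotheses represented as frozensets over \Theta); it combines two bpa's and extends associatively to the three sources used here.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule for two bpa's given as dicts mapping
    frozenset hypotheses (subsets of the frame Theta) to masses."""
    raw, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            raw[inter] = raw.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb                # mass falling on the empty set
    norm = 1.0 - conflict                      # renormalize over non-conflicts
    return {h: v / norm for h, v in raw.items()}

# The three sources (static, optical flow, difference image) combine
# associatively: m123 = combine(combine(m1, m2), m3).
```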

4.2. Identification of Team in Possession of Ball

Here we apply the Dempster-Shafer theory to identify the team in possession of the ball in medium shots. We have three sources of evidence. The first is the output of static segmentation, which identifies all the regions belonging to Team A or Team B and excludes all the regions classified as background. Then we have the optical flow analysis and difference image techniques, which identify the moving regions of a frame. Our objective is to integrate these evidences in support of our hypothesis of ball possession. We have assumed that if players of both teams exist in a frame then there is a fight for ball possession; on the other hand, a player of a single team indicates ball possession by that team only. Thus in a shot the ball may be in possession of Team A or Team B, or in the fight state, and even within a shot these states may change. We verify our hypothesis against the states of ball possession manually observed for the medium shots.

4.2.1. Basic Probability Assignment

By default, the three techniques operate independently; the results of any one technique do not affect the others. Hence the independence assumption of the Dempster-Shafer theory holds. We designate the basic probabilities for the three techniques as m_1, m_2 and m_3 respectively. A video frame F is segmented into N regions such that

F = \bigcup_{n=1}^{N} R_n

Each region R_n can be classified into Team A, Team B or background. The three classes constitute a set of mutually exclusive alternatives, so the Dempster-Shafer theory starts here by assuming \Theta such that

\Theta = { "TeamA", "TeamB", "Background" }

Therefore, the superset P(\Theta) is

P(\Theta) = { "TeamA", "TeamB", "Background", "TeamA,TeamB", "TeamA,Background", "TeamB,Background", "TeamA,TeamB,Background" }

Frames in a medium shot may contain one or more players from either of the two teams. We try to find the presence of a player in a frame by assigning basic probabilities to each of the three classes through a function that gives the evidence of the presence of a class in the frame. Let us denote this measure for the ith source of evidence as E_i(e), e \in P(\Theta). We model the function E_i(e) as

E_i(e) = A_e S_i, \quad i = 1, 2, 3

where i indexes the three independent sources of evidence, S_i is the sensitivity of e for the ith source of evidence and A_e is the area of pixels identified as e by the ith source. Let us also define the normalizing factor for assigning the basic probabilities to e \in P(\Theta) as K = \sum_{e \in P(\Theta)} A_e S_i. Hence the basic probability assigned to e \in P(\Theta) is m_i(e) = E_i(e)/K. It may be noted that in our model we accumulate the evidence for the presence of an object or objects by aggregating the areas of the segments related to those objects. At the same time, the measure is proportional to the sensitivity of the classifier that classifies those segments to the different objects. Consider the static analysis module for a frame F, which gives all the regions identified as Team A and Team B by the classifier:

E_1(TeamA) = (area of regions identified as Team A) \times S_{1A}
E_1(TeamB) = (area of regions identified as Team B) \times S_{1B}
E_1(TeamA, TeamB) = E_1(TeamA) + E_1(TeamB)
E_1(TeamA, BG) = E_1(TeamA) + E_1(BG)
E_1(TeamB, BG) = E_1(TeamB) + E_1(BG)
E_1(TeamA, TeamB, BG) = E_1(TeamA) + E_1(TeamB) + E_1(BG)

Therefore

m_1(TeamA) = \frac{E_1(TeamA)}{4(E_1(TeamA) + E_1(TeamB))}

m_1(TeamB) = \frac{E_1(TeamB)}{4(E_1(TeamA) + E_1(TeamB))}

m_1(BG) = 0

m_1(TeamA, TeamB) = \frac{E_1(TeamA) + E_1(TeamB)}{4(E_1(TeamA) + E_1(TeamB))}

Here E_1(BG) is zero because we do not consider the regions identified as background. Basic probabilities for the remaining subsets, and for the optical flow analysis and difference image analysis techniques, can be calculated in a similar way.
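Under the reconstruction above (with E_1(BG) = 0, so that K = 4(E_1(TeamA) + E_1(TeamB))), the bpa of one evidence source reduces to a few lines. The sketch below is illustrative, with argument names of our choosing.

```python
def bpa_from_areas(area_a, area_b, sens_a, sens_b):
    """Basic probability assignment for one evidence source:
    E(e) = area(e) * sensitivity(e), normalized over P(Theta).
    E(background) = 0 since background regions are discarded."""
    A, B = frozenset(['TeamA']), frozenset(['TeamB'])
    e_a, e_b = area_a * sens_a, area_b * sens_b
    # Each singleton appears in four subsets of P(Theta), so with the
    # background evidence zero, K = 4 * (e_a + e_b).
    k = 4.0 * (e_a + e_b)
    return {A: e_a / k, B: e_b / k, A | B: (e_a + e_b) / k}
```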

4.2.2. Combining Evidences

We combine m_1, m_2 and m_3, the basic probabilities of our three techniques, by the Dempster-Shafer rule of evidence combination to get m_{1,2,3}(TeamA), m_{1,2,3}(TeamB) and m_{1,2,3}(TeamA, TeamB). Since we try to find the presence of a team in medium shots of a soccer video, we consider only the subsets "Team A", "Team B" and "Team A, Team B". Finally we decide on the presence of Team A, Team B or both in a frame. If the basic probability of Team A is greater than that of Team B by T_{1DS}, the frame belongs to Team A; if the basic probability of Team B is greater than that of Team A by T_{1DS}, the frame belongs to Team B. We conclude that both teams are present if

\frac{m_{1,2,3}(TeamA) + m_{1,2,3}(TeamB)}{2} - m_{1,2,3}(TeamA, TeamB) < T_{2DS}

Empirically, T_{1DS} has been found to be 0.2 and T_{2DS} to be 0.12.

5. Results

We performed an experiment on all the medium shots identified by the shot classification technique in the soccer video played between Real Madrid and Manchester United. Frames from these medium shots where the ball is identified by the ball detection algorithm are then processed, and the Dempster-Shafer theory for combining the three evidences is applied to find the presence of Team A, Team B, or both in the frames, in order to identify the ball possession states. Out of 28 medium shots, 26 were classified as medium shots by the algorithm. There was a total of 32 ball possession states and substates in and within the shots, of which 23 were correctly identified. The true identification rate of ball possession states is 66.67% for Team A, 80% for Team B and 73.33% for fight. Table 1 shows the comparative results of combining the evidences, i.e. static analysis (SA), optical flow analysis (OF) and difference image analysis (DI), using the Dempster-Shafer theory. Fig. 3 shows a bar chart plotting the ball possession states and state transitions of the teams for the medium shots in a part of the match.

Fig. 3. Ball possession states of the teams for a part of the match.

6. Conclusions

In this paper we have proposed a new approach to identify the team in possession of the ball in a soccer video. It is shown that combining static and dynamic analysis with the Dempster-Shafer theory of evidence combination gives good results for detecting moving objects in a soccer video. This approach thus accurately identifies the ball possession states, substates and state transitions in a soccer video.

Table 1. Comparative results.

Evidences combined   Team A (%)   Team B (%)   Fight (%)
SA+OF                45.45        80           66.67
SA+DI                55.56        66.67        80
DI+OF                66.67        80           66.67
SA+DI+OF             66.67        80           73.33

References

1. G. Xu, Y. Ma, H. Zhang and S. Yang, Motion based event recognition using HMM, in Proc. ICPR, (Canada, 2002).

2. B. Li and I. Sezan, Event detection and summarization in sports video, in IEEE Workshop on Content-based access of Image and Video Libraries, (Hawaii, 2001).

3. M. Petkovic, Z. Zivkovic and W. Jonker, Recognizing strokes in tennis videos using hidden markov models, in IASTED International Conference Visualization, Imaging and Image Processing, (Spain, 2001).

4. L. Wang, B. Zeng, S. Lin, G. Xu and H. Shum, Automatic extraction of semantic colours in sports video, in ICASSP, (Canada, 2004).

5. A. Ekin, A. Tekalp and R. Mehrotra, Automatic soccer video analysis and summarization (IEEE Transactions on Image Processing, 2003).

6. V. Pallavi, J. Mukhopadhyay, A. K. Majumdar and S. Sural, Shot classification in soccer videos, in ReTIS, (Kolkata, India, 2006).

7. B. Acharya, A. K. Majumdar and J. Mukherjee, State chart based approach for modelling video of dynamic objects, in SPIE's International Symposium ITCOM, (Colorado, USA, 2001).

8. B. Acharya, A. K. Majumdar and J. Mukherjee, Modelling Dynamic Objects in Video Databases: A Logic Based Approach (Springer-Verlag: Lecture Notes in Computer Science, 2001).

9. A. Vadivel, M. Mohan, S. Sural and A. K. Majumdar, A colour-texture histogram from the hsv color space for video shot detection, in International Conference on Vision, Graphics and Image Processing, (Kolkata, India, 2004).

10. J. Mukherjee, MRF clustering for segmentation of colour images (Pattern Recognition Letters, 2002).

11. V. Pallavi, A. Vadivel, S. Sural, A. K. Majumdar and J. Mukhopadhyay, Identification of moving objects in a soccer video, in WCVGIP, (Hyderabad, India, 2006).


12. B. K. P. Horn and B. G. Schunck, Determining optical flow (Artificial Intelligence, 1981).

13. R. Jain, Difference and accumulative difference pictures in dynamic scene analysis (Image and Vision Computing, 1984).


Image Retrieval Using Color, Texture and Wavelet Transform Moments

R. S. Choras

Department of Telecommunications, University of Technology & Agriculture

S. Kaliskiego 7, 85-796 Bydgoszcz, Poland
E-mail: choras@mail.atr.bydgoszscz.pl

Content-based image retrieval systems allow the user to interactively search image databases for images similar to a specified query image. Similarity between images is assessed by computing the similarity between feature vectors. These features are represented in vector form and are often combined together. This paper explores a novel wavelet approach to image retrieval based on a combination of color, texture and wavelet moment features. Color moments are used as color features, which increase the precision of the retrieval process. Texture features are described by the mean, variance and energy of wavelet decomposition coefficients in selected subbands. For describing shape we propose a feature vector composed of a set of wavelet moment invariants.

Keywords: Image retrieval; wavelet transform; texture features; pseudo Zernike moments; wavelet moments.

1. Introduction

Visual information retrieval represents a new research direction in the field of information technology.1

In a typical content-based image database retrieval application, the user has an image and is interested in finding similar images from the entire database. For each image in the database, a feature vector characterizing some image properties is computed and stored in a feature database. For a query image, its feature vector is computed, compared to the feature vectors in the feature database, and images most similar to the query are returned to the user (Figure 1).

The general idea of retrieval mechanisms can be described in the following steps:

• for each image in the database, a feature vector characterizing image properties is computed and stored in a feature database,

• feature vector of a query image is computed and compared to the feature vectors in the feature database,

• most similar images to the query image are returned to the user.

The features and the similarity function of the feature vectors should be efficient to match similar images and be able to discriminate dissimilar ones.

This work has been motivated by previous results on image analysis and content description.2

This paper combines the color, texture and wavelet moments to image content representation for still image databases.

Fig. 1. CBIR system: query formation, visual content description, feature vectors, feature database, similarity comparison, indexing and retrieval, results.

Wavelet transform is a powerful analysis tool for image and signal processing. The wavelet transform decomposes an image into multiresolution subbands of lower dimensionality. It is known for its capability of multiresolution decomposition and coefficient decorrelation, and may characterize images better than other known techniques for content-based image retrieval. Therefore, features from different subbands can form a multidimensional feature vector whose components are most likely uncorrelated with each other. This multidimensional feature vector is suitable for representing the image.

Feature extraction is a data mapping procedure which determines an appropriate subspace of dimensionality M from the original feature space of dimensionality N (M < N). The local energy variation of an image in different spectral bands at each scale (or frequency) provides useful information for image classification.

2. Wavelet Transform

A continuous 2-D wavelet transform is a projection of the image function f(x,y) onto a family of functions which are linear combinations (dilations and translations) of a unique function \Psi(x,y).3 The function \Psi(x,y) is called a 2-D mother wavelet, and the corresponding family of wavelet functions is given by

\Psi_{s,u,v}(x,y) = \sqrt{s}\,\Psi(s(x-u),\, s(y-v)) \quad (1)

where s,u,v \in R, and s and (u,v) are called the scale and the translation. The 2-D wavelet transform of the function f(x,y) \in L^2 is defined by

T_\Psi f(s,u,v) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} f(x,y)\, \Psi_{s,u,v}(x,y)\, dx\, dy \quad (2)

This transform can be interpreted as a decomposition of the image function f(x,y) such that the frequency spectrum of the transformed function T_\Psi f(s,u,v) has the same bandwidth on a logarithmic scale s.

If the function \Psi is a differential operator, the transform T_\Psi f(s,u,v) will be a continuous scale-space representation of the image edges. If we use a low-pass approximation operator \Phi instead of the high-pass differential operator \Psi, the transform will produce a continuous scale-space representation of the image approximations.

Using the 2-D continuous wavelet transform T_\Psi f(s,u,v) or the 2-D continuous scale transform T_\Phi f(s,u,v) we can decompose the image function f(x,y) at different levels of resolution (depending on the scale s).

We obtain discrete dyadic wavelet functions by discretizing both the scale factor s and the translation (u,v). The corresponding set of dyadic wavelet functions is

\Psi_m(x) = \sqrt{s}\, \Psi(sx - k) \quad (3)

where s = 2^j and (u = k, v = l), with j,k,l \in Z, Z denoting the set of integers. Here m is a function of j and k: m = 2^j + k, k = 0,1,\ldots,2^{n-j} - 1 for j = 0,1,\ldots,n.

The coefficients of the wavelet function can be computed according to the formula

c_m = \int_{-\infty}^{+\infty} f(x)\, \Psi(2^j x - k)\, dx = \langle f, \Psi_m \rangle \quad (4)

where \langle \cdot,\cdot \rangle denotes the scalar product. The corresponding inverse wavelet transform is

f(x) = \sum_{m=0}^{\infty} c_m \Psi_m(x) \quad (5)

For multiresolution analysis of the function f(x,y) we introduce an approximation operator A_j; then A_j f is an approximation of f at the jth level of resolution, A_0 represents the identity, and A_j f \in V_j holds. The scale s at the jth level of approximation is s = 1/2^j. In practice we use a limited number of levels j = 0,1,\ldots,n, where the nth level is the coarsest resolution with the smallest scale s = 1/2^n. Next we introduce a difference operator D_j. The term D_j f denotes the difference between the two approximations A_{j-1} f and A_j f at the jth and (j-1)th levels of resolution, i.e.,

D_j f = A_{j-1} f - A_j f \quad \text{for } j = 1,2,\ldots,n \quad (6)

Each approximation A_j f(x,y) and difference D_{j,p} f(x,y) can be fully characterized by a 2-D scale function \Phi(x,y) and its associated wavelet functions \Psi^p(x,y), p = 1,2,3:

A_j f(x,y) = \sum_{k=-\infty}^{+\infty} \sum_{l=-\infty}^{+\infty} a_{j,k,l}\, \Phi_{j,k,l}(x,y)

D_{j,p} f(x,y) = \sum_{k=-\infty}^{+\infty} \sum_{l=-\infty}^{+\infty} d_{j,k,l,p}\, \Psi_{j,k,l,p}(x,y) \quad (7)

where


Fig. 2. Process of wavelet analysis.

Fig. 3. Wavelet pyramid.

\Phi_{j,k,l}(x,y) = \frac{1}{2^j}\, \Phi\!\left(\frac{x-k}{2^j}, \frac{y-l}{2^j}\right)

\Psi_{j,k,l,p}(x,y) = \frac{1}{2^j}\, \Psi^p\!\left(\frac{x-k}{2^j}, \frac{y-l}{2^j}\right), \quad k,l \in Z \quad (8)

a_{j,k,l} = \langle f(x,y), \Phi_{j,k,l}(x,y) \rangle

d_{j,k,l,p} = \langle f(x,y), \Psi_{j,k,l,p}(x,y) \rangle \quad (9)

Suppose that the coordinates x and y are not correlated, and that the 2-D scale function \Phi(x,y) and the 2-D wavelet functions \Psi^p(x,y), p = 1,2,3 are separable. Then we can write

\Phi(x,y) = \phi(x)\phi(y)

\Psi^1(x,y) = \psi(x)\phi(y)

\Psi^2(x,y) = \phi(x)\psi(y)

\Psi^3(x,y) = \psi(x)\psi(y) \quad (10)

where \phi(x) is a 1-D scale function and \psi(x) is a 1-D wavelet function. \Psi^1, \Psi^2, \Psi^3 extract the details of the 2-D image function f(x,y) along the x-axis, the y-axis and the diagonal directions, respectively.

This representation is called the wavelet pyramid of the 2-D image function. Given a discrete image f(x,y) with limited support x,y = 1,2,\ldots,2^n, the actual procedure for constructing this pyramid involves computing the coefficients a_{j,k,l} and d_{j,k,l,p}, which can be grouped into four matrices A_j, D_{j,p}, p = 1,2,3, on each level j:

A_j = (a_{j,k,l}), \quad D_{j,p} = (d_{j,k,l,p}) \quad \text{for } k,l = 1,2,\ldots,2^{n-j} \quad (11)

The DWT decomposes an image into a pyramid structure of subimages with various resolutions corresponding to different scales. Given an N \times N image, a fully decomposed wavelet transformation results in 3\lceil \log_2 N \rceil + 1 subimages called subbands. The computational flow of the pyramidal decomposition and the results for an image are shown in Figs. 2 and 3.

3 . Color features

Each pixel of the image can be represented as a point in a 3-D color space. Commonly used color spaces for image retrieval include RGB, Munsell, CIE L*a*b*, CIE L*u*v* and HSV (or HSL, HSB). The CIE L*a*b* and CIE L*u*v* spaces are device independent and considered to be perceptually uniform. They consist of a luminance or lightness component (L) and two chromatic components, a and b or u and v. HSV (or HSL, or HSB) space is widely used in computer graphics; its three color components are hue, saturation (lightness) and value (brightness). The hue is invariant to changes in illumination and camera direction and hence better suited to object retrieval.

Color in general will not allow us to determine an object's identity, and robust retrieval needs additional information such as texture and shape.4 The color feature vector may retrieve false positives, i.e. images which have a color composition similar to the query image but completely different content.

A color image I(x,y) of size X \times Y consists of three channels, I = (I_R, I_G, I_B). We compute a j-level 2-D wavelet transform on each of the three channel matrices. The subband A_j of each transform then represents the lowest frequency band and is used to extract color information.


Thus, if we interpret the color distribution of an image as a probability distribution, then the color distribution can be characterized by its moments as well.5,6 Color moments have been successfully used in many retrieval systems. Furthermore, because most of the information is concentrated in the low-order moments, only the first moment (mean), the second moment (variance) and the third central moment (skewness) are used. If the value of the ith color channel at the (x,y) image pixel is p_i(x,y) and the number of pixels in the image is N = X \times Y, then the index entries related to this color channel are:

E_i = \frac{1}{N} \sum_{x=1}^{X} \sum_{y=1}^{Y} p_i(x,y)

\sigma_i = \left[ \frac{1}{N} \sum_{x=1}^{X} \sum_{y=1}^{Y} (p_i(x,y) - E_i)^2 \right]^{1/2}

s_i = \left[ \frac{1}{N} \sum_{x=1}^{X} \sum_{y=1}^{Y} (p_i(x,y) - E_i)^3 \right]^{1/3} \quad (12)

Since only 9 numbers (three moments for the A_j of each of the three color components) are used to represent the color content of each image, color moments are a very compact representation compared to other color features.

The similarity function used for retrieval is a weighted sum of the absolute differences between the corresponding moments,

d_{color}(Q,I) = \sum_{i=1}^{3} \left[ w_{i1} |E_i^Q - E_i^I| + w_{i2} |\sigma_i^Q - \sigma_i^I| + w_{i3} |s_i^Q - s_i^I| \right] \quad (13)

where each moment entry is weighted by a value w_{i1}, w_{i2}, w_{i3} > 0 selected by the user. The images to be compared must be quantized in the same color space.
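A compact sketch of Eqs. (12)-(13) is given below; the function names are ours, and the weights are assumed to be supplied per channel and per moment.

```python
import numpy as np

def color_moments(channel):
    """Mean, standard deviation and skewness (cube root of the third
    central moment) of one color channel, as in Eq. (12)."""
    p = channel.astype(np.float64).ravel()
    mean = p.mean()
    var = ((p - mean) ** 2).mean()
    third = ((p - mean) ** 3).mean()
    return mean, np.sqrt(var), np.cbrt(third)

def d_color(q_moments, i_moments, weights):
    """Weighted L1 distance between moment sets, Eq. (13).
    q_moments, i_moments, weights: (3 channels, 3 moments) arrays."""
    return np.sum(weights * np.abs(np.asarray(q_moments)
                                   - np.asarray(i_moments)))
```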

4. Texture Features

Texture is another important property of images. While color is a point property, texture is a local neighborhood property. Basically, texture representation methods can be classified into two categories: structural and statistical.7 Texture refers to the visual patterns that have properties of homogeneity that do not result from the presence of only a single color.

The method of feature extraction of texture based on wavelet decomposition is as follows:

(1) Wavelet decomposition. We perform a multiresolution decomposition of the image using the DWT. The discrete wavelet transform can be thought of as filtering the image into sub-bands by an array of scale- and orientation-specific filters. Since wavelets have good spatial and frequency localization, the wavelet coefficients provide a good characterization of the image objects. For the DWT we used the D6 orthogonal wavelet, which prevents correlation between scales in the decomposition of the image. We perform a three-level wavelet decomposition of the image, which results in 10 sub-bands.

(2) Define the energy of the wavelet coefficients in each sub-band:

E(A_j) = \sum_{k,l} a_{j,k,l}^2 \quad (14)

E(D_{j,p}) = \sum_{k,l} d_{j,k,l,p}^2, \quad p = 1,2,3 \quad (15)

(3) The feature vector is constructed as (for p = 1,2,3)

B_j = [E(A_j), E(D_{j,p}), E(D_{j-1,p}), E(D_{j-2,p})] \quad (16)

(4) Compute the similarity distances

S_1(B_j) = |B_j - B_{j(ref)}| \quad (17)

and

S_2(j) = S_2(A_j) \cdot S_2(D_{j,p}) \quad (18)

where

S_2(A_j) = \frac{\sum_{(r,c) \in N} |A_j(k+r, l+c) \cdot A_{j(ref)}(k+r, l+c)|}{\sum_{(r,c) \in N} |A_j(k+r, l+c)|\, |A_{j(ref)}(k+r, l+c)|} \quad (19)

S_2(D_{j,p}) = |D_{j,p} - D_{j,p(ref)}| \quad (20)

and N is the neighbourhood of a given point (r,c). (A sketch of the energy-feature computation in steps (1)-(3) follows the list.)
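As an illustration of steps (1)-(3), the energy features can be computed with a standard wavelet library. The sketch below uses PyWavelets' 'db6' filter for the D6 orthogonal wavelet named above; it is our illustration, not the author's implementation.

```python
import numpy as np
import pywt

def texture_features(image, level=3):
    """Subband energies of a db6 wavelet decomposition (Eqs. 14-15):
    one approximation energy plus three detail energies per level,
    giving 10 values for a three-level decomposition."""
    coeffs = pywt.wavedec2(image.astype(np.float64), 'db6', level=level)
    energies = [np.sum(coeffs[0] ** 2)]            # E(A_j), Eq. (14)
    for (cH, cV, cD) in coeffs[1:]:                # E(D_{j,p}), Eq. (15)
        energies += [np.sum(cH ** 2), np.sum(cV ** 2), np.sum(cD ** 2)]
    return np.array(energies)
```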



In this approach we used information about the energy of the wavelet coefficients in 10 subbands to construct the feature vector. The discrimination of different textures may also be performed on the basis of only the D_{j,1} and D_{j,2} high frequency components (horizontal and vertical subband images). The low-pass component is not used because textures are better described through the higher frequency channels than through the approximation component. In this case, the steps involved in extracting the texture feature vector are the following:

(1) At each jth level calculate the mean and standard deviation of the detail subimages:

\mu_j(D_{j,p'}) = \frac{1}{2^{2(n-j)}} \sum_{k=1}^{2^{n-j}} \sum_{l=1}^{2^{n-j}} |d_{j,k,l,p'}| \quad (21)

\sigma_j(D_{j,p'}) = \left[ \frac{1}{2^{2(n-j)}} \sum_{k=1}^{2^{n-j}} \sum_{l=1}^{2^{n-j}} \left( |d_{j,k,l,p'}| - \mu_j(D_{j,p'}) \right)^2 \right]^{1/2} \quad (22)

for k,l = 1,2,\ldots,2^{n-j} and p' = 1,2.

(2) The feature vector is defined as follows:

B_j = [f_j^{\mu}, f_j^{\sigma}, \mu_f, \sigma_f] \quad (23)

where

f_j^{\mu} = \frac{1}{2}\left[ \mu_j(D_{j,1}) + \mu_j(D_{j,2}) \right]

f_j^{\sigma} = \frac{1}{2}\left[ \sigma_j(D_{j,1}) + \sigma_j(D_{j,2}) \right]

and \mu_f, \sigma_f are respectively the mean and standard deviation of the original image.

In this method, for the three-level decomposition, we have 8 features.

The texture similarity of a query image Q and an image I in the database is defined as

d_{texture}(Q,I) = \sum_m \sum_n d_{mn}(Q,I) \quad (24)

where

d_{mn}(Q,I) = \left| \frac{\mu_{mn}^{(Q)} - \mu_{mn}^{(I)}}{\alpha(\mu_{mn})} \right| + \left| \frac{\sigma_{mn}^{(Q)} - \sigma_{mn}^{(I)}}{\alpha(\sigma_{mn})} \right| \quad (25)

and \alpha(\mu_{mn}) and \alpha(\sigma_{mn}) are respectively the mean and the variance of the transform coefficients over the database.

5. Shape Feature Extraction Based On Wavelet Transform

Our approach to feature extraction is based on the DWT and on the calculation of moments of its coefficients. The DWT decomposes the image into several different sub-bands. Some sub-bands are fed into the moment descriptor block for analysis. The output of this block is a wavelet-moment-based feature, which is further used for dissimilarity distance computation in feature matching (Fig. 4).

" : • <

Fig. 4. Feature extraction based on DWT

Basically, shape based image retrieval is the measurement of similarity between shapes represented by their features. Shape is an important visual feature and one of the primitive features for image content description. However, shape content description is a difficult task because it is difficult to define perceptual shape features and measure the similarity between shapes. To make the problem more complex, shape is often corrupted by noise, defects, arbitrary distortion and occlusion. To characterize the shape we used the following descriptors: pseudo-Zernike moments of wavelet coefficients and wavelet moments.

5.1. Feature Extraction Based on Moments of Wavelet Coefficient

Pseudo-Zernike moments are used in several pattern recognition applications as feature descriptors of the image shape, and have proven to be superior to other moment functions such as Zernike moments in terms of their feature representation capabilities and robustness in the presence of image quantization error and noise.8

Pseudo-Zernike moments offer more feature vectors than Zernike moments and are less sensitive to image noise than conventional Zernike moments. However, pseudo-Zernike moments have not been effectively used in image processing applications because their computation with present methods is more complex and lengthy compared to regular moments, Legendre moments and even Zernike moments. To reduce the computational requirements of pseudo-Zernike moments we used the wavelet transform, which decomposes the image into multiresolution subbands of lower dimensionality. A subband A_j is then input to the pseudo-Zernike moment computation for feature description.

The 2-D pseudo-Zernike moments of order s with repetition q of the subband A_j = (a_{j,r,\theta}) are defined as

PZ_{sq} = \frac{s+1}{\pi} \int_0^{2\pi} \int_0^1 V_{sq}^*(r,\theta)\, A_j\, r\, dr\, d\theta \quad (26)

where the pseudo-Zernike polynomials are

V_{sq}(r,\theta) = R_{sq}(r)\, e^{-jq\theta}; \quad j = \sqrt{-1} \quad (27)

and r = \sqrt{x^2 + y^2}, \theta = \arctan(y/x); -1 < x,y < 1.

The real-valued radial polynomials are defined as

R_{sq}(r) = \sum_{t=0}^{s-|q|} (-1)^t \frac{(2s+1-t)!}{t!\,(s+|q|+1-t)!\,(s-|q|-t)!}\, r^{s-t} \quad (28)

with 0 \le |q| \le s, s \ge 0.

As the shape feature vector we used pseudo-Zernike moments of order 7 (eight moments), which possess the features that optimally describe the shape.
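The radial polynomial of Eq. (28) and a discrete version of Eq. (26) can be sketched as follows. This is our illustration, mapping the subband onto the unit disk with an assumed pixel-area weighting.

```python
import numpy as np
from math import factorial

def radial_poly(s, q, r):
    """Pseudo-Zernike radial polynomial R_sq(r), Eq. (28)."""
    q = abs(q)
    out = np.zeros_like(r)
    for t in range(s - q + 1):
        c = ((-1) ** t * factorial(2 * s + 1 - t)
             / (factorial(t) * factorial(s + q + 1 - t)
                * factorial(s - q - t)))
        out += c * r ** (s - t)
    return out

def pseudo_zernike(subband, s, q):
    """Moment PZ_sq of an approximation subband mapped onto the
    unit disk, a discrete version of Eq. (26)."""
    h, w = subband.shape
    y, x = np.mgrid[-1:1:h * 1j, -1:1:w * 1j]
    r = np.hypot(x, y)
    theta = np.arctan2(y, x)
    disk = r <= 1.0
    v = radial_poly(s, q, r) * np.exp(-1j * q * theta)   # V_sq, Eq. (27)
    # Discrete integral: sum over the disk with pixel area 4/(h*w).
    return (s + 1) / np.pi * np.sum(subband[disk] * v[disk]) * 4.0 / (h * w)
```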

5.2. Feature Extraction Based On Wavelet Moments

The generalized formula for rotation-invariant moments of order pq is

F_{pq} = \int \int f(r,\theta)\, e^{jq\theta} g_p(r)\, r\, dr\, d\theta \quad (29)

where g_p(r) is a function of the radial variable r.

From expression (29) we can easily obtain other moments, e.g. Hu's moments and Zernike moments. The function g_p(r) may be defined globally (on the whole domain of r) or locally, and F_{pq} is then respectively a global or a local feature.

The functions g_p(r) may be replaced by wavelet basis functions, and then we have the wavelet moments, expressed as9,10

FW_{m,n,q} = \int \int f(r,\theta)\, e^{jq\theta} \psi_{m,n}(r)\, r\, dr\, d\theta \quad (30)

where m = 0,1,2,3, n = 0,1,\ldots,2^{m+1}, f(r,\theta) is the image in polar coordinates, and q (an integer) represents the qth frequency feature of the image in the phase domain \{0 \le \theta < 2\pi\}.

Here \psi_{m,n}(r) is a wavelet basis function. We considered the wavelet family \psi_{a,b}(r) = \frac{1}{\sqrt{a}}\, \psi\!\left(\frac{r-b}{a}\right), where a and b are respectively dilation and shifting parameters. The image is considered in the domain r \le 1, and the wavelet defined along the radial axis is \psi_{m,n}(r) = 2^{m/2}\, \psi(2^m r - 0.5n).

Wavelet moments are new moment features which combine wavelet characteristics with moment traits. Considering the respective characteristics of moment features and wavelet analysis, wavelet moments not only have the invariance properties of moments (translation, scaling and rotation invariance) but also the multiresolution trait of wavelet analysis.

5.3. Feature Vectors Classification

A suitable classifier must be chosen based on the characteristics of feature vector.

The distance between two feature vectors is considered as the sum of the distances between each pair of components. We use the Bhattacharyya distance as the distance D_i between the ith components of the two feature vectors, for i = 0,1,\ldots,n, where n+1 is the size of the feature vectors.

The distance D between two feature vectors is chosen as

D = \sum_{i=0}^{n} D_i \quad (31)

Generally, given two images I_1, I_2 represented by the feature vector sets Feat_1, Feat_2, the similarity d(Feat_1, Feat_2) is measured by the sum of cross-correlations between the corresponding components,

d(Feat_1, Feat_2) = \sum_k \frac{Feat_1^k \cdot Feat_2^k}{\|Feat_1^k\|\, \|Feat_2^k\|} \quad (32)

where Feat_1^k, Feat_2^k denote the feature vectors of the kth component of the images I_1, I_2, respectively.

6. Conclusions and Future Development

The total similarity distance between a query image and an image in the database, including the color, texture and shape information, is

D(Q,I) = d_{color} + d_{texture} + d_{shape} \quad (33)

For evaluating the performance of the algorithm, we used the Corel Photo Gallery data set published in.11 There are twenty semantic categories, each of which consists of 100 images. All images are of size 384 x 256 pixels. Figure 5 shows the query result for a Tiger image, consisting of the top nine similar images. The image in the first row is the query image. The images in the next two rows are displayed in descending rank order.

."•J.J}, ••••••4

Fig. 5. Sample query result

The combination of color, texture and shape features gives a percentage of correct classification of the first retrieved subimage equal to 95.7%. Other images were retrieved and ranked based on their similarity to the query image.

While the results in this paper indicate that content-based image retrieval over an image database is a promising approach, much further work needs to be done before we can fully validate the effectiveness of content-based retrieval on other multimedia databases such as video.

References

1. M. S. Lew, Principles of Visual Information Retrieval (Springer-Verlag, London, 2001).

2. R. S. Choras, Content-based retrieval using color, texture and shape information, in Progress in Pattern Recognition, Speech and Image Analysis (Springer-Verlag, 2003).

3. S. Mallat, A Wavelet Tour of Signal Processing (Academic Press, San Diego, 1998).

4. M. Flickner et al., IEEE Computer 28, 23 (1995).

5. M. J. Swain and D. H. Ballard, International Journal of Computer Vision 7 (1991).

6. M. Stricker and M. Orengo, Similarity of color images, in SPIE Proc. Storage and Retrieval for Image and Video Databases III, 1995.

7. W. Niblack et al., The QBIC project: Querying images by content using colour, texture and shape, in SPIE Proc. Storage and Retrieval for Image and Video Databases, 1993.

8. R. Mukundan and K. R. Ramakrishnan, Moment Functions in Image Analysis: Theory and Applications (World Scientific Publishing, Singapore, 1998).

9. D. Shen and H. H. S. Ip, Pattern Recognition 33, 319 (1997).

10. G. Zhao, L. Cui, S. Xiang and H. Li, Journal of Information & Computational Science 2, 421 (2005).

11. X. Li, An efficient multi-filter retrieval framework for large image database, in Proc. of 2002 Int. Symposium on Information Systems and Engineering (San Diego, USA, 2002).

12. C. Carson, M. Thomas, S. Belongie, J. M. Hellerstein and J. Malik, Blobworld: A system for region-based image indexing and retrieval, in Proc. Third Int. Conf. on Visual Information Systems (Springer, 1999).

13. E. P. Simoncelli, Modeling the joint statistics of images in the wavelet domain, in Proc. of SPIE vol. 3813 - Wavelet Applications in Signal and Image Processing VII (Denver, USA, 1999).

14. M. Unser, IEEE Trans. on Image Processing 4, 1549 (1995).

15. G. V. de Wouver, P. Scheunders and D. V. Dyck, IEEE Trans. on Image Processing 8, 592 (1999).

16. B. M. Mehtre, M. S. Kankanhalli and W. P. Lee, Information Processing and Management 33, 319 (1997).

17. J. E. Gary and R. Mehrotra, Information Systems 18, 525 (1997).


Integrating Linear Subspace Analysis and Iterative Graphcuts For Content-Based Video Retrieval

P. Deepti, R. Abhilash and Sukhendu Das

Visualization and Perception Lab, Department of Computer Science and Engineering,

IIT Madras Chennai-600036, India.

E-mail: [email protected], [email protected] and [email protected]

With the availability of large and steadily growing amounts of visual and multimedia data, the need to create methods to query this data efficiently becomes significant. Consequently, content-based retrieval of video data turns out to be a challenging and important problem. In this paper, we propose a system for retrieving video data using either a query-by-example image or query-by-set of consecutive frames (or query-by-example video). Our technique explores the use of Fisher's Linear Discriminant Analysis (2D-LDA) for feature extraction and segmentation using iterative graphcuts, to retrieve the video similar to the given query. Our proposed approach consists of two different types of searches in the database relative to the type of query. For a query-by-example image we assume that the user provides the segmented semantic object, otherwise a preprocessing (foreground object detection) stage is used to extract the semantic object. The features of the query image are matched with the Fisher's feature vectors of the extracted object and the video containing the query image is retrieved. Given a complete video or a set of consecutive frames as a query, our system extracts the features from the video using iterative graphcuts and 2D-LDA and retrieves the video along with the time stamps at which the frames occur in the video.

Keywords: Content-based video retrieval, 2D LDA, Graphcuts, foreground extraction

1. Introduction

Fast and efficient storage, indexing, browsing and retrieval of video is a necessity for the development of various multimedia database applications. While there are efficient search engines for text documents today, there are no satisfactory systems for retrieving visual information. Many interesting mechanisms for retrieving videos based on their content, such as the color of a background or the motion of an object, have been developed in the recent past.

Content-Based Image Retrieval (CBIR) and Content-Based Video Retrieval (CBVR) have been among the most vivid research areas in the field of computer vision over the last 10 years. Content-Based Visual Queries (CBVQ) has emerged as a challenging research area in the past few years.1 While there has been substantial progress, with systems such as QBIC,2 PhotoBook3 and VisualSeek,4 most systems only support the retrieval of still images. CBVQ research on video databases has not been fully explored yet. VideoQ1 is one of the most well-known video retrieval systems. It allows a user to specify the motion of an object, color, shape, texture, and spatio-temporal relationships between objects. In VIOLONE,5 the trajectory of a moving object and the related object properties (specifically, speed and size) can be specified as a query condition.

Finding the similarity of video sequences can easily be done by a human, using the complex human visual perception system.2 But when it comes to autonomous search and retrieval systems, the problem becomes very complicated. When a person compares two video sequences for matching, two important points are considered: first, the similarity between single frames, and second, the similarity between the order of the frames in the two video sequences. When both of these similarities are present, the video sequences are considered similar.

We propose an advanced content-based video search system with the following unique features: automatic video object segmentation and tracking, and querying with a single image, multiple frames or a full video.

2. System Overview

In our content-based retrieval system, the user queries the system using an image, a full video or a set of consecutive frames of a video. The database consists of videos with a single moving object (person, car, etc.) taken against different backgrounds. The extracted features vary with the type of query. For a query-by-example image, the stored video frames are tracked and segmented and then given for training. For a query-by-example video or set of consecutive frames, the frames in each video are given for training. Once a query is given, the features of the query object are matched against the features of the objects in the database, and the video with the nearest matched features is retrieved. A block diagram of our system is shown in Fig. 1.

Fig. 1. (a) Feature extraction module for database creation. (b) Score computation module.

3. Video retrieval for Query-by-example Video and Query-by-set of consecutive frames

Given a complete video or a set of consecutive frames as a query, our system extracts the features from the videos using 2D-LDA and retrieves the video and, respectively, the time stamps at which the frames occur in the video.

Training Phase

Training is a process of acquiring features from available training images and storing them in a knowledge base for the purpose of recognizing an unknown future scene image. Given a set of frames of each video stored in the database, the 2D-LDA6 approach extracts the most informative features, which establish a high degree of similarity between samples of the same class and a high degree of dissimilarity between samples of two different classes.

Formally, let there be T classes, each with k_i, i = 1,\ldots,T, training images. Therefore we have in total N = \sum_{i=1}^{T} k_i training images. Let A_j^i be an image of size m \times n representing the jth sample in the ith class. Let C be the average image of all N training images, and C_i the average image of the k_i training images of the ith class. The image between-class scatter matrix G_b and the image within-class scatter matrix G_w are computed as

G_b = \frac{1}{N} \sum_{i=1}^{T} k_i (C_i - C)^T (C_i - C) \quad (1)

G_w = \frac{1}{N} \sum_{i=1}^{T} \sum_{j=1}^{k_i} (A_j^i - C_i)^T (A_j^i - C_i) \quad (2)

Once G_b and G_w are computed, we find the optimal projection axis X such that the total scatter of the projected samples of the training images is maximized. For this purpose the Fisher criterion

J(X) = \frac{X^T G_b X}{X^T G_w X} \quad (3)

is used. It is a well-known fact that the eigenvector corresponding to the maximum eigenvalue of G_w^{-1} G_b is the optimal projection axis which maximizes J(X). Generally, as it is not enough to have only one optimal projection axis, we usually take d projection axes, say X_1, X_2, \ldots, X_d, which are the eigenvectors corresponding to the first d largest eigenvalues of G_w^{-1} G_b. In 2D-LDA, once X_1, X_2, \ldots, X_d are computed, each training image A_j^i is projected onto these X's to obtain the feature matrix Y_j^i of size m \times d. So, during training, for each training image A_j^i a corresponding feature matrix of size m \times d, d < n, is constructed and stored in the knowledge base for matching at the time of recognition.
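The training computation of Eqs. (1)-(3) reduces to a few matrix operations. The following sketch is illustrative (names ours), solving the eigenproblem of G_w^{-1} G_b directly.

```python
import numpy as np

def lda_2d_axes(images, labels, d):
    """2D-LDA projection axes: the eigenvectors of inv(Gw) @ Gb
    belonging to the d largest eigenvalues (Eqs. 1-3).
    images: (N, m, n) stack; labels: length-N class ids."""
    labels = np.asarray(labels)
    N, m, n = images.shape
    C = images.mean(axis=0)                  # global average image
    Gb = np.zeros((n, n))
    Gw = np.zeros((n, n))
    for cls in np.unique(labels):
        Ai = images[labels == cls]
        Ci = Ai.mean(axis=0)                 # class average image
        Gb += len(Ai) * (Ci - C).T @ (Ci - C)
        for A in Ai:
            Gw += (A - Ci).T @ (A - Ci)
    Gb /= N
    Gw /= N
    evals, evecs = np.linalg.eig(np.linalg.inv(Gw) @ Gb)
    order = np.argsort(evals.real)[::-1]     # largest eigenvalues first
    return evecs[:, order[:d]].real          # W = [X1, ..., Xd]

# The feature matrix of a frame A is then Y = A @ W, of size m x d.
```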

Retrieval Phase

Let I be a frame of the query video given for retrieval, and let I' = W^T I be its projection onto the d optimal projection axes, where W = [X_1, X_2, ..., X_d]. Given two frames v_1 and v_2 of two different videos, represented by feature matrices Z^{v_1} = [z_1^{v_1}, z_2^{v_1}, ..., z_d^{v_1}] and Z^{v_2} = [z_1^{v_2}, z_2^{v_2}, ..., z_d^{v_2}], the similarity measure used is

dist(Z^{v_1}, Z^{v_2}) = \sum_{k=1}^{d} \| z_k^{v_1} - z_k^{v_2} \|_2

where \|a - b\|_2 denotes the Euclidean distance between the two vectors a and b. If the feature matrices of the training frames are Z_1, Z_2, ..., Z_N, each belonging to some video V_l, then for a given test frame I', if dist(I', Z_i) = \min_j dist(I', Z_j) and Z_i \in V_l, then the resulting decision is I' \in V_l.
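A correspondingly small sketch of this matching step, assuming each stored feature matrix keeps its d projected vectors as columns (names ours):

import numpy as np

def nearest_video(query_feat, train_feats, video_ids):
    # query_feat:  m x d feature matrix of the projected query frame
    # train_feats: list of m x d feature matrices Z_1..Z_N
    # video_ids:   video label for each training frame
    def dist(Z1, Z2):
        # Sum of Euclidean distances between corresponding column vectors.
        return np.linalg.norm(Z1 - Z2, axis=0).sum()

    scores = [dist(query_feat, Z) for Z in train_feats]
    return video_ids[int(np.argmin(scores))]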

4. Video retrieval for Query-by-example Image

Given an image as a query, the system is supposed to retrieve all the videos which contain the object (e.g., a person) present in the image. Here we assume that the query image contains a single object, e.g. a person, and the system is required to return all the videos of this person.

Training Phase

All the videos containing single moving objects are processed, and the moving object is tracked and extracted from the video. The tracking is done using graph cuts: the graph-cut algorithm gives the foreground layer in each frame. To obtain an accurate extraction of the moving object, the segmentation is done using iterative graph cuts (GrabCut).7 To make the extraction of the object automatic, we use the motion information obtained at the tracking stage along with the GrabCut algorithm. Fig. 2 shows the foreground layer obtained from the tracking stage, the regions assigned and the segmented output. The major steps for tracking and extraction of the moving object from the video are listed below, followed by a code sketch of the segmentation stage.

A. Tracking

(1) Compute the background frame from the video. The weighted average of the scene is used as the background reference frame.

(2) A graph is constructed for each frame, and weights are assigned for the edges of the graph based on the difference between the frame and the background frame.

(3) Minimum GraphCut8 is performed to obtain the blob of the moving object (foreground layer) in each frame.

B. Segmentation

(1) Pixels outside the foreground layer obtained in the tracking stage are marked as known background region.

(2) A small rectangular patch at the center of the foreground layer is considered as known foreground region and rest of the pixels in the foreground layer are considered as unknown region.

(3) K components of Gaussian Mixture Models (GMMs) are created for initial foreground and background classes. We first divide both regions into K pixel clusters. The Gaussian components are then initialized from the colors in each cluster.

(4) Each pixel in the foreground class is assigned to the most likely Gaussian component in the foreground GMM. Similarly, each pixel in the background is assigned to the most likely background Gaussian component.

(5) The GMMs are discarded and new GMMs are learned from the pixel sets created in the previous step.

(6) A graph is built and Graph Cut is run to find a new tentative foreground and background classification of pixels.

(7) Steps 4-6 are repeated for each frame until the classification converges.

This will give us an object segmented from the video. For each video in the database, a segmented video is created.
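As an illustration of the segmentation stage, the following is a minimal sketch using OpenCV's grabCut implementation of iterative graph cuts, initialized from the tracked foreground layer; the helper name segment_frame and the size of the known-foreground patch are our own assumptions.

import cv2
import numpy as np

def segment_frame(frame, tracked_fg):
    # frame:      H x W x 3 BGR image
    # tracked_fg: H x W boolean mask -- foreground layer from tracking
    mask = np.full(frame.shape[:2], cv2.GC_BGD, np.uint8)  # known background
    mask[tracked_fg] = cv2.GC_PR_FGD                       # unknown region

    # Small rectangular patch at the blob centre -> known foreground.
    ys, xs = np.nonzero(tracked_fg)
    cy, cx = int(ys.mean()), int(xs.mean())
    y0, x0 = max(cy - 5, 0), max(cx - 5, 0)
    mask[y0:cy + 6, x0:cx + 6] = cv2.GC_FGD

    bgd_model = np.zeros((1, 65), np.float64)  # GMM parameters (internal)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(frame, mask, None, bgd_model, fgd_model,
                5, cv2.GC_INIT_WITH_MASK)      # 5 graph-cut iterations

    # Pixels labelled (probably) foreground form the segmented object.
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD))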

Given the frames of each segmented video, the 2D-LDA approach extracts the most informative features, which establish a high degree of similarity between samples of the same class and a high degree of dissimilarity between samples of two different classes.

Retrieval Phase

The query image is projected onto the Fisher optimal projection axes as described in Section 3. Using the nearest neighbor technique, the training sample nearest to the projected image is retrieved.

5. Results

In our experimental setup, we have a collection of 50 videos to evaluate our system. The number of frames


Fig. 2. Left to Right: (a) Original Scene (b) Extracted Foreground Layer From Tracking Stage Showing Foreground (Known Foreground Region), Background (Known Background Region) and Unknown Regions (c) Segmented output obtained only through tracking (d) Segmented output using our proposed method.

Fig. 3. Left to Right: Sample videos in our database, Query Images, Retrieved video.

for the videos ranges from 25 to 30. The source videos were taken with a camera of 3000x3000 resolution. The overall processing time for retrieving a video of 30 frames was less than half an hour: about 30% of this time went to tracking, 50% to segmentation and 20% to recognition. Sample queries were performed on the database. Fig. 3 shows sample videos from the database, preprocessed query images and the results for query-by-example image. Fig. 4 shows a query-by-example video and the retrieved set of frames (along with their frame numbers). The timing requirements on a P-IV 3.4 GHz machine with 2 GB RAM are 10 minutes for a single image query and 20 minutes for a video query of 5 frames. We are yet to observe any failures, as our videos have only a single moving object in any shot.

6. Conclusion

Content-based video retrieval in large archives is an emerging research area. Our experiments with the proposed method show considerable success in retrieving videos with a single moving object against diverse backgrounds at low computational cost. Other interesting features and unique contributions include a fully automated video analysis algorithm for object segmentation and feature extraction. In future, we plan to extend our work to multiple moving objects.


Fig. 4. Left to Right: Query sequence of frames (frame nos. 1, 2, 3 of video 12), Retrieved frames of the respective video.

References

1. S.-F. Chang, W. Chen, H. Meng, H. Sundaram and D. Zhong, VideoQ: An automated content-based video search system using visual cues, in ACM 5th Multimedia Conference, Seattle, WA, 1997.
2. M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele and P. Yanker, IEEE Computer Magazine 28, 23 (1995).
3. A. Pentland, R. W. Picard and S. Sclaroff, International Journal of Computer Vision 18, 233 (1996).
4. J. R. Smith and S. F. Chang, A fully automated content based image query system, in ACM Multimedia Conference, Boston, 1996.
5. A. Yoshitaka, Y. Hosoda, M. Yoshimitsu, M. Hirakawa and T. Ichikawa, Journal of Visual Languages and Computing 7, 423 (1996).
6. M. Li and B. Yuan, Pattern Recognition Letters 26, 527 (2005).
7. C. Rother, V. Kolmogorov and A. Blake, ACM Transactions on Graphics (SIGGRAPH) 23, 309 (August 2004).
8. Y. Boykov and V. Kolmogorov, IEEE Trans. on Pattern Analysis and Machine Intelligence 26, 1124 (Sep. 2004).


Organizing a Video Database Around Concept Maps

K. Shubham, L. Dey, R. Goyal and S. Gupta

Department of Mathematics Indian Institute of Technology, Delhi Hauz Khas, New Delhi-110016, India

E-mail: [email protected], [email protected], rohitgyl@gmail.com, [email protected]

S. Chaudhury

Department of Electrical Engineering Indian Institute of Technology, Delhi Hauz Khas, New Delhi-110016, India

E-mail: [email protected]

Though a number of retrieval mechanisms have been designed to work with video features like shot detection etc., text-based search facilities for video are mostly restricted to text matching against video descriptions. Text matching severely restricts retrieval performance. Since the text descriptions associated with videos are small, devoid of hypertext or fancy formatting, the usual relevance computation mechanisms used by search engines also fail to achieve the desired retrieval performance. In this paper, we propose a conceptual organization scheme for video databases around concept maps generated from noun phrases extracted from the descriptions. Since the subjects and objects of sentences are usually noun phrases, these phrases can capture the contents of a collection quite effectively. We show how the concept maps are generated, and how they are used for computing the relevance of documents with respect to a query.

Keywords: Concept Maps; Concept Based Retrieval

1. Introduction

The search engine has changed the face of information retrieval by providing efficient access to text-based information. However, the same cannot be said of access to non-text data, of which there is also no dearth on the Internet. Since search engines are designed to perform text-based searches, non-text data is also indexed using textual descriptions, and the search mechanisms follow the same paradigms as text-searching mechanisms. One key difference between text data sources and the text used to describe non-text elements like images or videos is that the latter is much smaller in size. Special features like paragraph headings, emphasized text, or the relative font size of a text with respect to its neighbors can be exploited while searching for text relevant to a query; such features are rarely encountered in non-text data descriptions. The descriptions accompanying images or videos are usually short, describing only some key aspects of the contents. Due to the extremely small lengths of the accompanying texts, the usual relevance-computing measures, like those based on term frequency, also fail on these descriptions. Thus, the search for a relevant video or image is usually purely text-match based, where the text could come from associated descriptions, file names, neighboring text, etc. Hence, an image query "sun" on Google fetches an image of a logo from java.sun.com for obvious reasons. Some search engines try to circumvent this with content-based image or video retrieval, where image or video features are used to describe the contents. However, the query system in such cases requires the user to input queries in terms of such features as well, and that is not an easy task, since the user may not be able to find an appropriate image to describe his or her requirements. It is much more intuitive and simpler to describe the requirements using words and then judge the results for relevance, rather than hunting for an element that approximately matches the requirement and inputting it as a query.

In this paper, we propose a concept-based organization of a video database, where each video is accompanied by a short text description. The text description usually comprises one or two sentences. The proposed conceptual organization of the video descriptions is aimed at providing a concept-based retrieval facility rather than one based on text match. For example, if the query is for "funny movie", then all video descriptions that match this description conceptually should be retrieved, whether they contain the term funny or not. Again, when a query contains terms like "heart + disease", a video description not containing these terms, but containing the description "Doctor discusses common symptoms and treatments for cardiac patients", should be judged as relevant. This is possible only when the underlying collection supports a conceptual organization.

This paper proposes a conceptual organization of video databases through analysis of noun phrases appearing in video descriptions. Since these descriptions are very short, repeated occurrence of terms is a very rare event in this database, other than for very generic terms like "films" or "music video". However, even singular occurrences like "Tom and Jerry" or "Elizabeth Taylor" can be quite informative in this context. The noun phrases extracted from the text descriptions are subjected to a co-occurrence analysis. The co-occurrence data is then analyzed to generate concept maps. A concept map is a collection of concepts which may be implicitly or explicitly related to each other. Concepts are the important noun phrases present in the description set, selected using a ranking mechanism described in Section 3. Concept maps are graphs showing possible relationships between concepts. Concept graphs also highlight the possible overlap of domains through shared noun phrases. To achieve a conceptual organization, the text descriptions are matched against the concept maps, and each description is judged for coherence against the various concept maps. A document may have the same number of phrase matches in two different concept maps, yet its coherence with respect to the two can differ. Queries, on the other hand, are mapped to phrases and then enhanced by incorporating more terms from the underlying concept maps. The relevance of a document to a query is a function of the degree of coherence between the query, the document and a concept map.

One of the interesting outcomes of the proposed work is the evolution of concept maps which, though extracted from video descriptions, are general enough to be extensible to other domains. Concept-based retrieval can play a key role in reducing information overload for the user, by guiding the user to input terms relevant to the context in which the search is to be conducted. Earlier uses of WordNet for enhancing queries with relevant terms have not proved too successful, since WordNet can at times defocus the search to a large extent. Besides, along with a term, the set of terms indicated as coordinates or synonyms of the term by WordNet may also be absent in the underlying collection, even though there may be documents with contents similar in intent to the user query. Since in our case the concept maps are extracted from the underlying database itself, the related terms provided to enhance a query give the user a more relevant set of terms. Though building topic models through co-occurrence analysis has been pursued for quite some time in text mining, its applicability to retrieving video data has not been explored. Besides, the exploitation of noun phrases for topic map generation is also a novel idea presented in this paper.

2. Related Work

A survey of related work shows that most video retrieval systems have been designed either to use video features or to browse through a content hierarchy. Zhang et al.1 presented the first video parsing, indexing and retrieval framework, which uses annotations and visual features of key-frames for video browsing and retrieval. The Photobook system2 allows users to plug in their own content analysis procedures. WebSEEK3 is a web-based retrieval system that has several indexes built for images and video, based on visual and non-visual features. VideoQ4 is a more advanced system that supports video query using single or multiple objects, also allowing a set of desired feature values for visual features like color, texture, shape, and motion. The Informedia digital video library project5 built an exhaustive video knowledge base by integrating visual features, closed captions, speech recognition etc. Zhu et al. designed InsightVideo,6 which integrates hierarchical video browsing and feature-based video retrieval. However, none of these systems considers a concept-based organization of the video collection built from the textual descriptions of the videos themselves.

3. Generating Concept Maps

In the proposed framework, concept maps are constructed by analyzing the co-occurrence of noun phrases in the description database. The noun phrases are extracted using the Stanford Parser.7 It is a lexicalized probabilistic parser that implements a factored product model, with separate Probabilistic Context Free Grammar (PCFG) phrase structure and lexical dependency experts, whose preferences are combined by efficient exact inference using an A* algorithm. This parser works reasonably well for a large variety of sentences.

The noun phrases extracted from a video database description are stored along with their document frequencies. The noun phrase collection is then ordered in decreasing order of the information content of the phrases, using the following information-theoretic function:8

I(t_i) = -P(t_i) \log_{10} P(t_i)   (1)

where P(t_i) denotes the probability of occurrence of term t_i.

This weighting mechanism eliminates all frequently occurring terms which have very little information content. All noun phrases with positive information content are then considered for constructing the concept maps. Each phrase is assigned a weight based on frequency normalization,9 computed as follows:

W(t_i) = (1 + \ln(1 + \ln f(t_i))) \times \frac{1}{0.8 + 0.2 \times \frac{N}{d(t_i)}}   (2)

where f(t_i) is the total number of occurrences of t_i in the corpus, d(t_i) is the total number of documents in which t_i occurs, and N is the total number of documents in the corpus. Let \theta denote a threshold representing the minimum weight required for a phrase to be included in the concept map. The cut-off value \theta was set to the weight of a phrase occurring exactly once:

\theta = (1 + \ln(1 + \ln 1)) \times \frac{1}{0.8 + (0.2 \times N)}   (3)

Thus, all phrases that have only a single occurrence in the whole collection are eliminated, while all others are retained. Higher values of \theta may be chosen for other collections with more text.
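The following small sketch implements this weighting and cut-off step under the reading of Eqs. (2) and (3) given above; the function names are our own illustrative choices.

import math

def phrase_weight(f_ti, d_ti, N):
    # f_ti: total occurrences of the phrase in the corpus
    # d_ti: number of documents containing the phrase
    # N:    total number of documents in the corpus (Eq. 2)
    tf_part = 1.0 + math.log(1.0 + math.log(f_ti))
    norm_part = 0.8 + 0.2 * (N / d_ti)
    return tf_part / norm_part

def filter_phrases(phrase_stats, N):
    # Keep phrases whose weight exceeds the single-occurrence cutoff (Eq. 3).
    theta = phrase_weight(1, 1, N)
    return {p: s for p, s in phrase_stats.items()
            if phrase_weight(s["f"], s["d"], N) > theta}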

The surviving noun phrases are now analyzed for co-occurrence. Given n noun phrases extracted from a domain, the co-occurrence matrix is an n \times n matrix which stores the strength of association between two noun phrases. The strength of association between two noun phrases n_1 and n_2 is computed as follows:

S(n_1, n_2) = \frac{P(n_1 \cap n_2)}{P(n_1) \times P(n_2)}   (4)

where P(n_1 \cap n_2) denotes the probability of co-occurrence of the terms n_1 and n_2, and P(n_1) and P(n_2) denote the individual probabilities of occurrence of n_1 and n_2 respectively.

As defined earlier, a concept map is a collection of concepts which are related to each other; these concepts may be related either strongly or weakly. Strongly connected concepts are those that co-occur in the description collection. Concepts that do not co-occur in the description database may still be related to each other through their associations with other co-occurring concepts.

Based on these observations, a concept map is constructed as follows; a sketch of the construction appears below. The set of all strongly connected components is identified from the co-occurrence graph using topological sorting for finding maximal cliques. For each maximal clique, the set of concepts that are not part of the clique but are related to one or more concepts in the clique is associated as the weak components of the graph. Finally, for each concept, the graph in which it appears with the maximal degree is identified and associated with the concept as its key graph. Starting from the key graph, a union of all graphs in which the concept occurs is constructed to obtain the complete concept map associated with the concept.
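A minimal sketch of this construction, assuming a thresholded co-occurrence graph and using networkx's maximal-clique enumeration in place of the clique-finding procedure described above; all function names are ours.

import networkx as nx

def build_concept_maps(strength, threshold):
    # strength:  dict mapping (phrase_a, phrase_b) -> association strength
    # threshold: minimum strength for an edge in the co-occurrence graph
    G = nx.Graph()
    G.add_edges_from(pair for pair, s in strength.items() if s >= threshold)

    # Strongly connected cores: maximal cliques of the co-occurrence graph.
    cliques = [set(c) for c in nx.find_cliques(G)]

    concept_maps = []
    for clique in cliques:
        # Weak components: neighbors of the clique that are not inside it.
        weak = set().union(*(G[v] for v in clique)) - clique
        concept_maps.append({"core": clique, "weak": weak})

    # Key graph of a concept: the map where the concept has maximal degree.
    key_map = {}
    for cm in concept_maps:
        for v in cm["core"]:
            deg = sum(1 for u in cm["core"] if u in G[v])
            if deg > key_map.get(v, (-1, None))[0]:
                key_map[v] = (deg, cm)
    return concept_maps, key_map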

The interesting aspect of the concept map construction mechanism is that a single concept may appear in multiple concept maps with different relevance with respect to the concept around which the map is built. Thus, the concept "woman" appears in health-related concept maps as well as in other maps. Hence the relevance of the term "woman" can be judged in the light of the other terms present in the query. Similarly, the relevance of a term in a document is also judged by the presence of other strongly or weakly associated terms and not by a single term alone. Fig. 1 and Fig. 2 show two samples of concept maps generated from a video database containing text descriptions of 4939 videos. The collection includes news casts, movie or song videos and also documentaries. It may be observed that the terms (concepts) strongly associated with the concept "osteoporosis" are mostly very relevant terms like "current treatment options", "medication", "bone loss" etc.

Fig. 1. Concept map generated around the concept "osteoporosis".

Fig. 2. Concept map generated around the concept "sex".

The second graph also gives some interesting insights into the document collection. Though no single document has the terms "sex" and "adult" occurring together, these two terms get associated through the term "girl". We have observed that this can have an interesting impact on the organization of retrieval results, as will be discussed in the next section.

In this section, we show how a conceptual organization is imposed on a video collection using the concept maps generated earlier. For each document, the phrases (concepts) extracted from that document have some concept maps associated with them. A document is judged to be relevant to each concept map with which it has some intersection. The coherence of a document D with respect to a concept map C is computed as a coherence value C(D, C), given by:

C(D, C) = \frac{1}{k} \sum_{i=1}^{k} \frac{m_i}{\text{degree}(t_i)}   (5)

where t_i is a matching concept, m_i is the number of matching concepts adjacent to t_i, and k is the total number of matching concepts.

Using this function, the coherence of a document with respect to a concept map is higher when the document shares edges with the concept map than when it merely shares weakly connected concepts of the map. Every document thus has the potential to be associated with any concept map, provided it shares either strongly or weakly associated concepts with that map.
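A compact sketch of this coherence computation, under the reconstruction of Eq. (5) given above (names ours):

def coherence(doc_concepts, cmap):
    # doc_concepts: set of document phrases; cmap: networkx concept-map graph
    matching = [t for t in doc_concepts if t in cmap]
    if not matching:
        return 0.0
    score = 0.0
    for t in matching:
        # m_i: matching concepts adjacent to t_i in the concept map.
        m_i = sum(1 for u in cmap[t] if u in doc_concepts)
        score += m_i / (cmap.degree(t) or 1)   # guard isolated nodes
    return score / len(matching)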

To select documents in response to user queries, the system presently locates the query concepts on the concept maps. A spreading-activation-based mechanism is then used to compute the relevance of all documents associated with the activated concept maps. Let t_i be a query term present in a concept map C. Such terms are assumed to be activated in the relevant concept maps with an initial activation value of 1. The total activation value at a node is spread to the other adjacent nodes, using the spreading activation theory.10 Each adjacent node receives an equal fraction of the total activation at its parent node.

The activation of the other terms in the concept map C is computed as a function of their distance from the activated terms. Let t_j be a non-query term in a relevant concept map. For each activated term in the map, t_j receives an activation along the minimal path from the activated term to t_j. The total activation of t_j is the value to which the activation converges after being propagated through the entire concept map. Thus the level of activation at a node y is given by:

a_y = \sum_{x} a_x \times \frac{1}{\delta(x, y) + 1} + c_y   (6)

where c_y is the initial activation of the node and \delta(x, y) is the distance between nodes x and y. Each node of an activated concept map thus has a particular activation value between 0 and 1. The activation value is higher for nodes that are closer to the query nodes.
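A sketch of this activation spread, approximating Eq. (6) with shortest-path distances on the concept map; the names and the clipping of values to [0, 1] are our own assumptions.

import networkx as nx

def activate(cmap, query_terms):
    # Initial activation: 1 for query terms present in the map, else 0.
    activation = {v: (1.0 if v in query_terms else 0.0) for v in cmap}
    for src in query_terms:
        if src not in cmap:
            continue
        # Received activation decays with shortest-path distance from src.
        dist = nx.single_source_shortest_path_length(cmap, src)
        for node, d in dist.items():
            if node != src:
                activation[node] += 1.0 / (d + 1)
    # Keep activations in [0, 1], as in the paper's description.
    return {v: min(a, 1.0) for v, a in activation.items()}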


Table 1. Results of categorizing "Osteoporosis" related descriptions.

Category of match    Ratio of documents
Good matches         70%
Medium matches       10%
Bad matches          20%

The relevance of a document to a query is thereafter computed using the earlier coherence function for the document and the activated concept map, but now the activation value associated with each node of the concept map is also taken into consideration. That is, a concept present in both the document and the activated concept map is considered in the coherence function only if its activation is greater than a certain threshold, which can be provided by the user. Thus, in response to a particular query, concepts of the activated concept map that are not present in the query but have an activation higher than the threshold (and are thus potentially relevant to the query) are also considered in judging the relevance of a document to the query.

4. Results

We now show some results of categorizing the collection of videos using the concept maps generated earlier. The average accuracy of the coherence value computation is found to be around 80% for a randomly chosen collection of descriptions. Due to lack of space, we present here summarized results for the two concept maps that were presented in Section 3. "Osteoporosis" was chosen due to the predominance of this and related queries in the database we worked with. The second map was chosen to illustrate that the proposed scheme is capable of extracting documents even when they contain terms not directly related to the query.

Table 1 summarizes the effectiveness of the coherence value computation for a set of documents which are known to talk about the problem of osteoporosis. It can be seen that 10% of the documents were wrongly categorized by the proposed scheme. The medium categorization was due to the presence in these documents of more general phrases related to the medical domain, rather than terms directly related to "Osteoporosis". Some failures could also be attributed to the parser's inability to recognize all phrases correctly.

To highlight the use of the scheme itself, we consider the following description of a video which was retrieved by text match with the query "Office 2003": "Music video for Marques Houston's 'Pop That Booty' from the rapper's 2003 album, 'MH.' In the sexually explicit video, Houston and friends drive up to an office building and are pampered with massages and treated to erotic dances by scantily clad women." This is an obviously wrong result. Using the proposed scheme, the coherence values of this document are found to be maximal with respect to the concept maps around the terms "sex" and "music video".

We are extending the proposed scheme to identify sets of phrases which can be indicative of the genre of a film. Some of the major genres are action, romance, biopic etc. Though we have identified a set of such phrases for each genre using web sources, the phrase sets are yet to be standardized through verification of the sources. Each document is then assigned membership to genres through the concept maps. Since the phrase sets for the genres are intersecting, each video receives memberships to all the genres.

5. Conclusions

Concept maps have been predicted to play a crucial role in the realization of the semantic web, where an increasing number of knowledge-exchange transactions are expected to be based on semantic content analysis rather than on syntactic analysis. However, one of the biggest bottlenecks in building concept maps lies in accumulating domain concepts and extracting their inter-relationships. The proposed methodology attempts to generate concept maps representative of a domain in an automated fashion. Each document in a collection can be assigned a membership to each concept map. The relevance of a document to a specific query is computed by expanding the query with respect to a map.

The proposed mechanism also facilitates a querying mechanism in which the user can associate weights with the related terms and thus orient the query in a desired direction. Considering that natural language is by nature ambiguous, this is a particularly important advantage. For example, it is quite obvious that though the two queries "woman+problems" and "woman+films" have a 50% overlap in terms, the intents of the queries are quite different. Thus, on being shown the concept maps, one user is likely to choose terms related to the health domain, while the other is likely to choose terms related to adult content.

References

1. H. J. Zhang, C. Y. Low, S. W. Smoliar and D. Zhong, Video parsing, retrieval and browsing: an integrated and content-based solution, in Proc. ACM Multimedia Conf., San Francisco, CA, 1995.
2. A. Pentland, R. Picard and S. Sclaroff, Photobook: content-based manipulation of image databases, Int. J. Comput. Vision, Vol. 18 (1996), pp. 233-254.
3. J. R. Smith and S. F. Chang, Searching for Images and Video on the World-Wide Web, Technical Report 459-96-25, Columbia University, 1996.
4. S. F. Chang, W. Chen, H. J. Meng, H. Sundaram and D. Zhong, VideoQ: an automated content based video search system using visual cues, in Proc. ACM Multimedia Conf., Seattle, WA, 1997.
5. D. W. Howard, K. Takeo, A. S. Michael and M. S. Scott, Intelligent access to digital video: Informedia project, IEEE Computer, Vol. 29 (1996), pp. 46-52.
6. X. Zhu, A. K. Elmagarmid, X. Xue, L. Wu and A. C. Catlin, InsightVideo: Toward Hierarchical Video Content Organization for Efficient Browsing, Summarization and Retrieval, IEEE Trans. on Multimedia, Vol. 7 (2005), pp. 648-666.
7. http://nlp.stanford.edu/software/lex-parser.shtml.
8. C. E. Shannon, A mathematical theory of communication, Bell System Technical Journal, Vol. 27 (1948), pp. 379-423 and 623-656.
9. A. Singhal, G. Salton, M. Mitra and C. Buckley, Document Length Normalization, Information Processing & Management, Vol. 32 (1996), pp. 619-633.
10. J. R. Anderson, A Spreading Activation Theory of Memory, Journal of Verbal Learning and Verbal Behavior, Vol. 22 (1983), pp. 261-295.


Statistical Bigrams: How Effective Are They in Text Retrieval?

Prasenjit Majumder Mandar Mitra

Computer Vision and Pattern Recognition Unit Indian Statistical Institute

Kolkata {prasenjit-t, mandar}@isical.ac.in

Kalyankumar Datta

Electrical Engineering Department Jadavpur University

Kolkata
kalyandatta@debesh.wb.nic.in

Phrases have generally been found to be useful as indexing units (or keywords) in text retrieval. Broadly, they can be classified as statistical or linguistic phrases, based on their extraction method. Statistical phrases are word n-grams (n = 2, 3, 4) that fulfil certain statistical criteria; linguistic phrases are word sequences which satisfy certain syntactic or linguistic rules. Statistical phrases are often preferred in retrieval systems because they are usually easier to identify, and have been found to be as effective as linguistic phrases for indexing and retrieval. This paper presents a comparative study of various methods for identifying statistical phrases, based on raw frequency, log-likelihood, mutual information, the Dice coefficient, etc. Our experiments show that mutual information and log-likelihood are the most effective measures for identifying phrases. When phrases identified using these methods are used as keywords, retrieval performance improves significantly.

Keywords: phrase detection, information retrieval, statistical association, empirical study

1. Introduction

In many text-processing applications, such as searching, automatic text classification, document clustering, etc., it is customary to represent text documents as lists of keywords or terms, with associated weights. These keywords or terms indicate the subject matter of the document's information content, and the weight corresponding to a term is a measure of its importance in representing the information content of the given text.a

Keywords assigned to a document may include single words as well as multi-word phrases. Multiword phrases (such as "pattern recognition", "data mining", etc.) are useful as keywords since a phrase as a whole usually denotes a more specific concept or entity than the constituent words taken individually (e.g. compare "data mining" to "data" and "mining" as two separate words).

Multi-word phrases are usually identified using two broad classes of techniques: statistical and syntactic. Statistical phrases are identified by selecting word n-grams (sequences of n (= 2, 3, 4) words that occur side-by-side) that fulfil certain statistical criteria. Linguistic phrases are word sequences which satisfy certain syntactic/linguistic rules. Clearly, in order to identify syntactic phrases, some linguistic processing, e.g. part-of-speech (POS) tagging, is needed. Typically, this type of processing is computationally expensive. Further, while the necessary language-processing tools are available for some languages (including English), they are either not available or not yet mature for many less well-studied languages. Thus, syntactic methods are not feasible for identifying phrases in these languages. Moreover, earlier studies have shown that statistical phrases are as effective as linguistic phrases for indexing and retrieval. For these reasons, statistical phrases are often preferred over linguistic phrases in practical retrieval systems.

a This method of representation is often referred to as the vector space model or 'bag of words' approach in the literature.

In this empirical study, we investigate various alternative methods for identifying statistical phrases, and compare their effectiveness and efficiency. The overall approach, and the different statistical methods that we tried are described in Section 2; our experimental setup and results are reported in Section 3; finally, Section 4 summarizes our conclusions.


Related Work

Over the last three decades, several phrase identification methods employing various statistical measures of association have been proposed, the simplest measure being co-occurrence frequency. If the co-occurrence frequency of two words exceeds a certain limit, it can generally be assumed that the words have a special meaning or usage when they occur together.1 Pecina et al.2 empirically studied several collocation extraction methods. They started with a list of word bi-grams extracted from a collection of documents using various statistical association measures. Certain POS patterns that can never form valid collocations were identified, and bi-grams matching these patterns were then filtered out from the original list. Note that this approach is feasible for resource-rich languages, for which POS taggers are available. For resource-poor languages (e.g. many Indian languages), this filtering is not possible; rather, we have to depend entirely on the statistical methods.

A straightforward method for evaluating different statistical association measures was also proposed earlier by Evert et al.3 The n "best" bi-grams were extracted from a corpus using different association measures, and the quality of the resulting lists was judged manually. In our work, we use standard, quantitative evaluation measures to compare the effectiveness of the different measures in an actual retrieval application.

2. Our Approach

We start with a corpus, or a collection of text documents. After some basic preprocessing (e.g. all uppercase letters are converted to lowercase, words are stemmed to a canonical form (e.g. mining -> mine), etc.), we construct a list of all the word bi-grams (pairs of words that occur side-by-side in at least one document) contained in this collection. Any bi-gram containing a stopword (non-informative words such as 'the', 'in', etc.) is then eliminated from the list. Let the final list thus obtained be {p_1, p_2, ..., p_N}. For each bi-gram, some statistical property (say f) is calculated. The entire list of bi-grams is then ranked according to the f values. The M top-ranked bi-grams are selected as valid phrases that may be assigned as keywords to represent a document's information content. Using these phrases (along with single terms), the document collection is indexed, i.e. the 'bag of words' representation is computed for each document.

Finally, we take a set of test queries, which are also indexed using single terms and the identified phrases, and for each query we retrieve a certain number of documents from the given collection using the standard inner-product similarity measure.4

The search results are evaluated using standard measures like recall, precision and mean average precision (MAP).

2.1. Statistical Measures

Several statistical properties or association measures for identifying phrases have been proposed in earlier work. In our experiments, eight different measures were used as the function f in the approach outlined above. Several of these (e.g. chi-square, odds ratio) were found to be ineffective for phrase identification, and details of these measures are not reported in this paper. Below, we briefly describe the measures that yielded the best performance in our retrieval experiments.

Raw frequency (RF): f_raw(p) is simply the number of times that the bi-gram p occurs in the entire document collection. All bi-grams occurring in the collection are ranked in decreasing order of occurrence frequency. We choose a threshold frequency \theta, and all bi-grams that occur at least \theta times in the collection are selected as candidate phrases. Table 1 shows a list of some of the most frequent phrases in the corpus used in our experiments.

Table 1. Most frequent bi-grams in the WSJ Corpus.

Bi-Gram             Frequency
vice president      36721
staff reporter      32945
executive officer   18608
west germany         6176

Dice Coefficient (DC): Let p = (a, b) be a bi-gram consisting of the words a and b. Let N be the total number of bi-grams in the corpus, n_a the number of bi-grams whose first word is a, n_b the number of bi-grams whose second word is b, and n_{(a,b)} the number of occurrences of (a, b) in the corpus. Then f_dice(p) is given by

f_dice(a, b) = 2 \times n_{(a,b)} / (n_a + n_b)   (1)

Note that f_dice(a, b) can also be viewed as 2P(a, b)/(P(a) + P(b)), where P(a) and P(b) are the probabilities of occurrence of a and b in the corpus, and P(a, b) is the probability of occurrence of the bi-gram p.

Mutual Information (MI): The mutual information MI(X, Y) between two discrete random variables X and Y is defined as follows:

MI(X, Y) = \sum_{y \in Y} \sum_{x \in X} P(x, y) \log_2 \frac{P(x, y)}{P(x) P(y)}   (2)

If a is a word, then we associate a random variable X_a with it. Given a word w chosen at random from the corpus, if w = a, then X_a takes the value 1, and 0 otherwise. Then, for a bi-gram p = (a, b), f_mi(p) is given by

f_mi(p) = \sum_{X_a \in \{0,1\}} \sum_{X_b \in \{0,1\}} P(X_a, X_b) \log_2 \frac{P(X_a, X_b)}{P(X_a) P(X_b)}   (3)

Log-likelihood ratio (LL): This is similar to mutual information, but uses a different scaling factor:

LL(X, Y) = 2N \times \sum_{y \in Y} \sum_{x \in X} P(x, y) \log_e \frac{P(x, y)}{P(x) P(y)}   (4)

where N denotes the total number of bi-grams in the document collection.
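For concreteness, the following is a small sketch computing these association scores from a corpus bi-gram list; the counting scheme and function names are our own, and only the dominant (1,1) term of the MI and LL sums of Eqs. (2)-(4) is shown.

import math
from collections import Counter

def association_scores(bigrams):
    # bigrams: list of (word_a, word_b) pairs extracted from the corpus
    n_pair = Counter(bigrams)                    # n_(a,b)
    n_first = Counter(a for a, _ in bigrams)     # n_a
    n_second = Counter(b for _, b in bigrams)    # n_b
    N = len(bigrams)

    scores = {}
    for (a, b), nab in n_pair.items():
        p_ab = nab / N
        p_a, p_b = n_first[a] / N, n_second[b] / N
        dice = 2 * nab / (n_first[a] + n_second[b])      # Eq. (1)
        # Leading term of the MI / LL sums; the full sums also include
        # the complementary (0/1) outcomes of X_a and X_b.
        mi_term = p_ab * math.log2(p_ab / (p_a * p_b))
        ll_term = 2 * N * p_ab * math.log(p_ab / (p_a * p_b))
        scores[(a, b)] = {"rf": nab, "dice": dice,
                          "mi": mi_term, "ll": ll_term}
    return scores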

3. Experiments

3.1. Data

We used the Wall Street Journal corpus from the TIPSTER document collection.5 This corpus contains a total of 173,252 documents. Queries 1-200 of the TREC collection5 were used in our retrieval experiments. All documents and queries were indexed using the SMART information retrieval system.6 Stopwords were removed, and words were stemmed to their root forms. For the baseline experiment, no phrases were used; documents and queries were represented using single terms only. Next, phrases were identified using the approaches outlined in the preceding section, and documents and queries were indexed using both single terms and phrases.b For each experiment, the SMART system was used to retrieve 1000 documents for each of the 200 test queries. The retrieval results were evaluated using recall (proportion of relevant documents retrieved), precision at a 20-document cutoff (proportion of relevant documents within the top-ranked 20 documents), and mean average precision.4

3.2. Results

For each measure f, several values of M (the number of top-ranked phrases that are selected as candidate keywords) were tried. Similarly, various values of the cutoff frequency were tried for the raw frequency based approach. This allows us to observe the impact of the size of the phrase dictionary (the list of candidate phrases) on performance.

Figures 1-3 show the results obtained using various association measures as the phrase dictionary size is varied. The number of phrases in the phrase dictionary is plotted along the x-axis; the mean average precision (resp. precision at 20 docs, and number of relevant documents retrieved) obtained in the corresponding retrieval experiment is plotted along the y-axis.

The figures show that the performances of the raw-frequency, mutual-information and log-likelihood based methods are comparable. Both the Dice coefficient and chi-square based approaches perform worse than these methods. Also, in Figures 1 and 3, the curve for MI has the sharpest rise. This suggests that the MI approach ranks the most useful phrases at the top. A more detailed comparison of the three best methods is presented in Table 2.

Table 2. Summary of best results.

             No Phrase    RF        LL        MI
P20          0.5258       0.5320    0.5318    0.5320
Avg. P       0.3629       0.3739    0.3752    0.3750
Rel Ret      16872        17096     17100     17101
#Phrases     0            149688    184786    190078

b Identification of phrases was done using the NSP package.7

We performed a paired t-test on the MAP values to determine whether the differences in performance obtained using the various approaches are statistically significant. The difference between not using phrases and using phrases (identified using any of the three


Fig. 1. MAP values for various measures.

Fig. 2. Precision at 20 docs. for various measures.

measures mentioned in Table 2) was found to be statistically significant, whereas the differences between these three methods were not found to be significant. However, a large difference in performance is noted for some individual queries. Consider test queries 36 and 37, for example.

Num: 36 Topic: How Rewritable Optical Disks Work Document describes the principles and mechanisms behind rewritable optical disk technology.

Num: 37 Topic: Identify SAA components Document identifies software products which adhere to IBM's SAA standards.

For query 36, the phrase rewritable optical was not identified in the RF approach, whereas the LL approach correctly identified this phrase. Conversely, for query 37, the phrase software products was identified only in the RF approach. Thus, the performance of these methods varies greatly for these queries, even though the overall performance differences were not significant.

4. Conclusion

Experiments with a part of the TREC dataset (WSJ corpus, TREC queries 1-200) show that raw frequency, mutual information, and log-likelihood are useful measures that can be used in statistical approaches to phrase identification. Statistically significant performance improvements are obtained when


Fig. 3. # relevant documents retrieved for various measures.

phrases identified using these methods are used for indexing and retrieval. However, no statistically significant differences were observed in the performance achieved by these three measures. Thus, any one of them could be used in an operational setting.

References

1. C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing (The MIT Press, Cambridge, Massachusetts, 1999).
2. P. Pecina, An extensive empirical study of collocation extraction methods, in Proceedings of the ACL Student Research Workshop (Association for Computational Linguistics, Ann Arbor, Michigan, June 2005).
3. S. Evert and B. Krenn, Methods for the qualitative evaluation of lexical association measures, in Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (Toulouse, France, 2001).
4. G. Salton and M. McGill, Introduction to Modern Information Retrieval (McGraw Hill Book Co., New York, 1983).
5. D. K. Harman, Overview of the third Text REtrieval Conference (TREC-3), in Proceedings of the Third Text REtrieval Conference (TREC-3), ed. D. K. Harman (NIST Special Publication 500-225, April 1995).
6. G. Salton (ed.), The SMART Retrieval System - Experiments in Automatic Document Processing (Prentice Hall Inc., Englewood Cliffs, NJ, 1971).
7. S. Banerjee and T. Pedersen, The design, implementation, and use of the Ngram Statistic Package, in Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics (Mexico City, 2003).


PART I

Pattern Recognition


Adaptive Nearest Neighbor Classifier

Anil K. Ghosh

Department of Mathematics and Statistics Indian Institute of Technology

Kanpur, Uttar Pradesh 208016, India. E-mail: [email protected]

In nearest neighbor classification, one normally uses cross-validation type methods to estimate the optimum value of k, and that estimated value is used for classifying all observations. However, in classification problems, in addition to depending on the training sample, a good choice of k depends on the specific observation to be classified. In this article we propose an adaptive nearest neighbor classification technique, where the value of k is selected depending on the distribution of the competing classes in the vicinity of the observation to be classified. Results on some simulated and benchmark examples are presented to show the utility of the proposed method.

Keywords: Bayesian strength function; cross-validation; posterior probability; prior distribution; sequential technique

1. Introduction

In nearest neighbor classification,1,2 the posterior probabilities of the competing classes are assumed to be constant over a small neighborhood around a query point x, and one estimates these probabilities using the proportions of the different classes among the training sample points lying in that neighborhood. As a result, the most frequent class in that neighborhood has the largest estimated posterior p(· | x), and it turns out to be the winner. Usually, a closed ball of radius r_k(x) is taken as this neighborhood, where r_k(x) is the distance between x and its k-th nearest data point in the training sample. Clearly, the size of this neighborhood, and hence the classification result, depends on the parameter k. Existing asymptotic results2,3 suggest that k should vary with the training sample size n in such a way that k \to \infty and k/n \to 0 as n \to \infty. However, when the sample size is small or moderately large, there is no theoretical guideline for choosing the value of k. For a given data set, the optimum value of k is usually estimated by minimizing the cross-validation estimate4 of the misclassification probability. Of course, one can also follow the idea of the likelihood cross-validation technique (LCV),5,6 which estimates the optimum value of k by maximizing the log-likelihood score \sum_{t=1}^{n} \log\{\hat{p}_{-t}(c_t | x_t)\}, where c_t is the class label of the observation x_t, and \hat{p}_{-t}(j | x_t) is the posterior probability estimate for the j-th class at x_t when x_t is not used as a training data point. These cross-validation and LCV methods use the training data to select a single value of k, and then that selected value is used for classifying all observations. However, one should note that in classification problems, in addition to depending on the training sample, a good choice of k depends on the specific observation to be classified. Therefore, instead of using a fixed value of k over the entire measurement space, a spatially adaptive choice of k may be more useful for classification. In this article, we propose one such adaptive nearest neighbor classifier, which adopts a sequential approach (to be discussed in Section 3) to select the optimum value of k for classifying an observation.

Throughout this article, we will assume continuity of all population density functions and use the Euclidean distance for classification. However, the description of the adaptive nearest neighbor classifier (see Section 2) will make it quite clear that the use of the proposed classifier is not restricted to any special type of distance function, and one may use the Mahalanobis distance7 or any other flexible and adaptive metric8,9 as well.

2. Methodology

As we have mentioned before, in usual nearest neighbor classification, the posterior probabilities of the competing classes are assumed to be constant over a neighborhood around x. Here, we deviate from this usual notion: instead of considering the p(j | x)'s as fixed and non-random, we assume a prior distribution on p(1 | x), p(2 | x), ..., p(J | x). Since it is quite evident that the calculations have to be done at each x separately, for convenience of notation we will drop the term x and denote p(j | x) by p_j. Similarly, the vector of conditional probabilities (p_1, p_2, ..., p_J) [\sum_{j=1}^{J} p_j = 1] will be denoted by p.

Suppose that for some given k, \xi_k(p) is the prior distribution of p in the neighborhood (ball of radius r_k(x)) around x. If t_{jk} of these k neighbors come from the j-th class, the conditional distribution of t_k = (t_{1k}, t_{2k}, ..., t_{Jk}) [\sum_{j=1}^{J} t_{jk} = k] given p is multinomial, and its probability mass function can be expressed as

\psi_k(t_k | p) = \frac{k!}{t_{1k}! \, t_{2k}! \cdots t_{Jk}!} \, p_1^{t_{1k}} p_2^{t_{2k}} \cdots p_J^{t_{Jk}}.

Therefore, for some fixed k and t_k, the conditional distribution of p is given by

\xi_k(p | t_k) = \xi_k(p) \, \psi_k(t_k | p) \Big/ \int \xi_k(p) \, \psi_k(t_k | p) \, dp.

In a two-class problem, one will naturally prefer the first class over the second if P(p_1 > p_2 | t_k) > P(p_2 > p_1 | t_k), and the evidence in favor of the first class increases with the probability P(p_1 > p_2 | t_k). Following a similar idea, one can use \xi_k(p | t_k) to define a Bayesian measure of strength for the different populations, where the strength function for the j-th (j = 1, 2, ..., J) population is given by

S_k(j) = \int_{p_j = \max\{p_1, p_2, ..., p_J\}} \xi_k(p | t_k) \, dp.

It is quite transparent from the definition that 0 < S_k(j) < 1 and \sum_{j=1}^{J} S_k(j) = 1 if \xi_k(p) is the probability density function of a continuous distribution. Also, these Bayesian strength functions (BSF) preserve the ordering of the usual posterior probability estimates (i.e., the proportions of the different classes in the neighborhood).5 Therefore, for any fixed value of k, classification based on the values of S_k(j) leads to the same result as usual k-nearest neighbor classification. Note that the value of S_k(j) depends on the prior distribution \xi_k(p) as well, and it has to be chosen appropriately. The uniform prior distribution is the most convenient one to handle. Not only does it make the computation of the BSF simpler (as we will discuss in Section 3), it is non-informative and gives no preference to any of the classes. Throughout this article, for all data analytic purposes, we will assume that \xi_k(p) is uniform for all values of k.

In adaptive nearest neighbor classification, instead of using a fixed value of k, we study the results for a sequence of values of k and choose the one which leads to the maximum BSF in favor of the winning population. However, it is not useful to consider all possible values of k, since the use of very large values not only increases the computational cost but also fails to represent the local pattern of the measurement space. Moreover, if the training samples from the competing classes are not of comparable size, the use of a large k leads to a decision in favor of one of the larger classes. Therefore, one should set some reasonable upper bound for k. This upper bound is expected to increase with the sample size n. In nearest neighbor classification, for the consistency of the posterior probability estimates, one needs to shrink the neighborhood to zero as the sample size increases. So, it is reasonable to use a function \beta_n of n as the upper bound, which increases with n but satisfies \beta_n/n \to 0 as n \to \infty. We look at the results for all k \le \beta_n, and finally use the one which leads to the highest strength in favor of the winning population. Unlike usual nearest neighbor classification, here we do not need a cross-validation type algorithm to predefine the value of k, and that leads to a substantial saving in computational cost, especially when the training sample is large.

3. Data analytic implementation

We adopt a sequential approach to find adaptive values of k. For classifying an observation x, we start with the data point nearest to x and then gradually increase the value of k by considering the other data points one by one according to their distances from x. At each stage, we compute the BSFs for the different populations and keep track of the population which has attained the maximum BSF up to that stage. Given the value of \beta_n and the first k neighbors, one can easily compute the largest possible future strength in favor of the other populations. Clearly, the largest value of this future strength is attained if all the remaining observations come from the second best class (i.e. the class with the maximum number of representatives among the challengers at that stage). If this largest possible value is smaller than the observed maximum BSF up to that stage, one can stop there and classify the observation to the class that achieved the maximum BSF. For instance, in a two-class problem, if the upper bound \beta_n is set at 5, and the first two neighbors of x come from the first population, one can stop there and take the decision in favor of the first population. Note that given the first two neighbors of x, the second population can at most enjoy a 3-2 majority, but the BSF in favor of the winning class in that case is smaller than that in the 2-0 case. Therefore, in this case, there is no need for further calculations.

From our discussion in the previous section, it seems that \beta_n = C\sqrt{n} could be a good choice for the upper bound of k, where C is an appropriate constant. For this choice of \beta_n, the adaptive nearest neighbor classifier requires only O(n) calculations to classify an observation, whereas it requires O(n \log n) computations if we use all possible values of k. Note that this order of computation, O(n), is the same as that required by the nearest neighbor classifier with a fixed value of k (this follows from the computational complexity of order statistics10). Our empirical experience suggests that the final result is reasonable and not very sensitive to the choice of C if it lies between 2 and 3. Throughout this article, for all data analytic purposes, we use \beta_n = 2\sqrt{n}. However, if this value of \beta_n exceeds n_0 = \min\{n_1, n_2, ..., n_J\}, we take \beta_n = n_0. Though this choice of upper bound is somewhat subjective, it led to fairly good results in our experiments, as we will see in Section 4.

To compute BSFs for different populations, one can use the numerical integration method. Given the value of k and the vector tfc, the required number of computations for this approximation is proportional to 7"7-1, where 7 is the number of grid points chosen on each axis. Clearly, this computational cost grows up rapidly with the number of classes J. Therefore, in the presence of several competing populations, one has to look for an alternative approximation procedure. If £fc (p) is uniform, it is easy to see that given the value of k and tfc, p follows a Dirichlet distribution with parameters k+1; tik + l,i2fc + l, • • • ,tjk + 1 (X)i=i tjk = k)- Therefore, one can easily generate observations from the appropriate Dirichlet distribution to approximate BSFs for different populations. We know that if Xi, X2, • • •,Xj are independent and Xj ~ Gamma{tjk + 1) for j = 1,2,..., J , {Xi/S,X2/S,...,Xj/S) jointly follows a Dirichlet distribution with parameters k + 1; t\k + l,*2fc + l,...,tJk + l (J2j=i tjk = k), where S = £ j = 1 Xj. Therefore, P{pj > pt Vi ^ j \ tk} = P{Xj > Xi Vi j}, and hence BSF can be approximated by

generating observations from gamma distributions. Now, noting that a Gamma(t_jk + 1) variate is a sum of t_jk + 1 i.i.d. exponential variables with unit mean, one can keep generating observations from exponential distributions to approximate the BSFs for different populations and for different values of k in a sequential way. For our data analytic purposes, we adopted the numerical integration method when J ≤ 3. For J ≥ 4, BSFs were approximated using 10,000 sets of observations generated from appropriate gamma distributions.
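A minimal sketch of this Monte Carlo approximation (the function name and the default seed are ours; the 10,000-draw setting mirrors the text):

```python
import numpy as np

def bsf_monte_carlo(t_counts, n_draws=10000, seed=0):
    """Approximate P{p_j > p_i for all i != j | t_k} for every class j by
    drawing X_j ~ Gamma(t_jk + 1): normalizing the X_j gives a Dirichlet
    vector, and the argmax of X equals the argmax of the normalized vector."""
    rng = np.random.default_rng(seed)
    t_counts = np.asarray(t_counts, dtype=float)
    X = rng.gamma(shape=t_counts + 1.0, size=(n_draws, len(t_counts)))
    winners = np.argmax(X, axis=1)
    return np.bincount(winners, minlength=len(t_counts)) / n_draws

# Example: k = 10 neighbors split 6/3/1 among three competing classes
print(bsf_monte_carlo([6, 3, 1]))
```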

4. Results

We use some simulated and benchmark data sets to study the performance of the proposed classifier. For the simulated example (waveform data11), taking an almost equal number of observations from the competing classes, we generated 1000 different training and test sets of sizes 100 and 1000, respectively. In the cases of the salmon data12 and the Iris data,13 we formed these training and test sets by randomly partitioning the data set 1000 times into two equal halves consisting of equal numbers of observations from the competing classes. For these data sets, average test set misclassification rates of the adaptive nearest neighbor classifier over these 1000 trials are reported in Table 1 along with their corresponding standard errors in parentheses. However, for the vowel recognition data,14 there are specific training and test sets, and in this case we report the test set error rate only. Error rates are also reported for LCV and the usual cross validation techniques. A brief description of the data sets that we use in this section is given below.

Waveform data:11 This is a simulated data set. For i = 0, 1, ..., 20, define h_1(i) = max{6 − |i − 10|, 0}, h_2(i) = h_1[(i − 4) mod 21] and h_3(i) = h_1[(i + 4) mod 21]. Suppose that U ~ U(0, 1) and the ε_i's (i = 0, 1, ..., 20) are i.i.d. N(0, 1). Now, for the three competing classes, the measurement variables X_i (i = 1, 2, ..., 21) are distributed as

Class 1: X_i = U h_1(i − 1) + (1 − U) h_2(i − 1) + ε_{i−1}.
Class 2: X_i = U h_1(i − 1) + (1 − U) h_3(i − 1) + ε_{i−1}.
Class 3: X_i = U h_2(i − 1) + (1 − U) h_3(i − 1) + ε_{i−1}.
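For completeness, a short generator for this simulated data set, following the definitions above (a sketch; names are ours):

```python
import numpy as np

def waveform_sample(cls, rng):
    """Generate one 21-dimensional waveform observation for class 1, 2 or 3."""
    i = np.arange(21)
    h1 = np.maximum(6 - np.abs(i - 10), 0)       # triangular base waveform
    h2, h3 = h1[(i - 4) % 21], h1[(i + 4) % 21]  # shifted versions
    u = rng.uniform()                            # U ~ U(0, 1)
    a, b = {1: (h1, h2), 2: (h1, h3), 3: (h2, h3)}[cls]
    return u * a + (1 - u) * b + rng.standard_normal(21)

rng = np.random.default_rng(0)
X_train = np.array([waveform_sample(1 + j % 3, rng) for j in range(100)])
```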

Salmon data:12 It consists of 100 bivariate observations on the growth ring diameter (freshwater and marine water) of salmon fish coming either from Alaskan or from Canadian waters.


Iris data:13 It contains measurements on four different features (sepal length, sepal width, petal length and petal width) for each of 150 observations coming from three different types of Iris plants: 'Iris Setosa', 'Iris Virginica' and 'Iris Versicolor' (50 from each class).

Vowel recognition data:14 It was created by a spectrographic analysis of vowels in words formed by an 'h' followed by a vowel and then by a 'd'. There were 67 persons who spoke different words, and the two lowest resonant frequencies of a speaker's vocal tract were noted for 10 different vowels. This data set was split into a training and a test set of sizes 338 and 333, respectively.

As discussed before, in adaptive nearest neighbor classification, we generate an ensemble of classifiers and then choose the one which leads to the maximum BSF in favor of the winning class. Instead of taking only one value of k, one may also like to consider the results for all possible values of k (1 ≤ k ≤ n − 1) and use voting or other aggregation methods to arrive at the final decision. Bagging15

and boosting16,17 are the two most popular aggregation methods available in the literature. They use the bootstrap18 or a weighted bootstrap technique to generate different subsamples from the training data, and based on those subsamples, different classifiers are developed. Results of these classification rules are aggregated using some weight function. Depending on the error rate λ of a classifier, boosting assigns it a weight w = log{(1 − λ)/λ}. Bagging, on the other hand, uses equal weights for all classifiers. In our method, we do not require any resampling technique for generating the classifiers. Use of different values of k leads to different classification rules, which can be aggregated using some weight function. Here, we adopt the weight functions used by bagging (i.e. uniform weight) and boosting (i.e. weight proportional to the log odds ratio) for aggregation, as sketched below. This essentially leads to simple voting and weighted voting of all nearest neighbor classifiers. In Table 1, we report the error rates for these two voting methods as well.
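The two aggregation rules can be sketched as follows, assuming the prediction and an error-rate estimate of each k-NN rule are already available (all names are illustrative):

```python
import numpy as np

def aggregate_knn_votes(preds, error_rates, n_classes):
    """Combine the decisions of the k-NN rules for k = 1, ..., n-1 by simple
    voting (bagging-style uniform weights) and by weighted voting with the
    boosting-style weight w = log((1 - lambda) / lambda)."""
    preds = np.asarray(preds)
    lam = np.asarray(error_rates, dtype=float)
    w = np.log((1.0 - lam) / lam)                  # log odds of being correct
    simple = np.bincount(preds, minlength=n_classes)
    weighted = np.zeros(n_classes)
    for p, wi in zip(preds, w):
        weighted[p] += wi
    return simple.argmax(), weighted.argmax()
```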

Apart from the salmon data, in all other cases our proposed method led to the best error rate among the nearest neighbor classifiers considered here. In most of these cases, its performance was significantly better than that of its competitors. The weighted voting method had the best error rate on the salmon data, but even in that case the adaptive nearest neighbor classifier did a reasonably good job. In view of the above data analysis, adaptive nearest neighbor classification seems to be better than using a single value of k for the classification of all observations.

Table 1. Misclassification rates of different classifiers

                 Waveform      Salmon        Iris          Vowel

LCV              20.1 (.06)    9.2 (.10)     3.9 (.06)     25.5
Cross-valid.     20.1 (.07)    10.1 (.11)    4.2 (.06)     21.9
Voting           26.7 (.09)    9.1 (.11)     6.3 (.08)     67.2
Wt. voting       22.9 (.08)    8.9 (.11)     5.6 (.08)     22.8
Adaptive         18.9 (.06)    9.2 (.09)     3.8 (.06)     21.3

References

1. E. Fix and J. L. Hodges, Project 21-49-004, Rep. 4, pp. 261-279. US Air Force School of Aviation Medicine, Randolph Field (1951).

2. T. M. Cover and P. E. Hart, IEEE Trans. Info. Theory, 13, 21 (1967).

3. D. O. Loftsgaarden and C. P. Quesenberry, Ann. Math. Statist., 36, 1049 (1965).

4. P. A. Lachenbruch and M. R. Mickey, Technometrics, 10, 1 (1968).

5. A. K. Ghosh, Comput. Statist. Data Anal., 50, 3113 (2006).

6. B. W. Silverman, Density Estimation for Statistics and Data Analysis (Chapman and Hall, London, 1986).

7. P. C. Mahalanobis, Proc. Nat. Acad. Sci., India, 12, 49 (1936).

8. J. H. Friedman, Technical Report 113, Stanford University Statistics Department (1994).

9. T. Hastie and R. Tibshirani, IEEE Trans. Pattern Anal. Machine Intell., 18, 607 (1996).

10. A. V. Aho et al., Design and Analysis of Computer Algorithms (Addison-Wesley, London, 1974).

11. L. Breiman et al., Classification and Regression Trees (Chapman and Hall, London, 1984).

12. R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis (Prentice Hall, New Jersey, 1992).

13. R. A. Fisher, Ann. Eugenics, 7, 179 (1936).

14. G. E. Peterson and H. L. Barney, J. Acoust. Soc. Amer., 24, 175 (1952).

15. L. Breiman, Machine Learning, 24, 123 (1996).

16. R. E. Schapire et al., Ann. Statist., 26, 1651 (1998).

17. J. H. Friedman et al., Ann. Statist., 28, 337 (2000).

18. B. Efron and R. Tibshirani, An Introduction to the Bootstrap (Chapman and Hall, New York, 1993).


Class-Specific Kernel Selection for Verification Problems

Ranjeeth Kumar and C. V. Jawahar

Center for Visual Information Technology, International Institute of Information Technology,

Gachibowli, Hyderabad - 500032, INDIA
E-mail: jawahar@iiit.ac.in

The single-class verification framework is gaining increasing attention for problems involving authentication and retrieval. In this paper, nonlinear features are extracted using the kernel trick. The class of interest is modeled using all the available samples rather than a single representative sample. Kernel selection is used to enhance the class-specific feature set. A tunable objective function is used to select the kernel, which enables the adjustment of the false acceptance and false rejection rates. The errors caused by the presence of highly similar classes are reduced by using a two-stage hierarchical authentication framework. The performance of the resulting verification system is demonstrated on the hand-geometry based authentication problem with encouraging results.

Keywords: Kernel Methods; Kernel Selection; Biased Discriminant Analysis; Hierarchical Classification

1. Introduction

The single class recognition framework is gaining increasing attention for problems such as biometric authentication1 where the task is to verify the claimed label of a previously unseen sample. In problems involving multiple classes, posing the authentication problem as a single-class recognition problem needs a feature set that discriminates between the class of interest (positive class) and the rest (negative classes). When the features are highly discriminative, as in the case of strong biometrics such as fingerprints,3

this is not a serious problem. However, when the feature distributions of different classes overlap substantially, a class-specific feature selection scheme is essential for reliable authentication. Feature selection algorithms such as multiple discriminant analysis (MDA), which identify a global set of features that aid in discriminating between a set of classes, are conventionally used for biometric based recognition.4

A class-specific feature selection scheme using biased discriminant analysis (BDA) is applied to hand-geometry based authentication in Ref. 5 with significant improvement in the false acceptance and false rejection rates (FAR and FRR respectively). It is known that different applications require different combinations of these rates.1,2 In this paper, a nonlinear variant of biased discriminant analysis using the kernel trick6 is used for identification of class-specific features. Kernel selection is used both to identify the appropriate feature set and to tune the FAR and FRR as desired. A hierarchical verification scheme is used to reduce the number of errors

caused by the presence of highly similar classes.

The use of kernel variants of the biased discriminant algorithm is motivated by several factors that aid in enhancing the performance of a verification system: i) The use of kernel functions enables working with features that are nonlinearly related to the input features. This helps in capturing higher order correlations in the features, unlike the linear biasmap algorithm, which considers only the covariance structure of the data. ii) The use of class-specific kernel functions provides an elegant way to identify class-specific features. Selection of the kernel function and its parameters to minimize a weighted combination of the FAR and FRR provides a framework to identify the appropriate feature set and achieve the desired FAR and FRR. iii) As kernel methods are nonparametric, they are free of assumptions on the distribution of the data, unlike linear algorithms, which are designed for the normal distribution. Further, modeling the positive class using all the positive examples and not by their mean alone allows us to handle distributions that deviate from the normal distribution. The contributions of this work are: i) a novel way of using kernel selection as a tool both for identification of a class-specific feature set and for tuning the FAR and FRR to suit the application; ii) modeling the positive class using a set of positive examples rather than a single representative sample, i.e. the mean of the samples; iii) a two-stage authentication framework that reduces the errors due to highly similar classes without sacrificing the efficiency provided by the single class framework. Our results demonstrate that the proposed scheme results in improved performance of the verification system.

Section 2 reviews the biased discriminant algorithm (BDA) and its kernel variant; a modified version which uses all the samples of the positive class, and its kernel variant, are then derived. Section 3 describes the way the kernel function is chosen. A hierarchical authentication algorithm that retains the efficiency of the single class framework while giving more accurate results is introduced. Section 4 gives the experimental results on hand-geometry data.

2. Kernel BDA and Multiple Exemplar BDA

Kernelization of linear algorithms using the kernel function6 has resulted in nonlinear methods that saw great success, most notably the support vector machine (SVM)7 and kernelized subspace methods.8-10

Kernelization involves reformulating a linear algorithm in terms of dot products between the available data points and then using the kernel function κ(x_1, x_2) = ⟨φ(x_1), φ(x_2)⟩ to compute the inner product between the images of vectors x_1 and x_2 under a nonlinear transformation φ(·), without explicit computation of φ(·). Let a d-dimensional data collection of N points be represented as a matrix X = [x_1 ··· x_N] with column x_i representing a data point. The kernelized version K of the Gram matrix X^t X can be computed using the kernel function as K_ij = κ(x_i, x_j). This matrix is central to kernel-based algorithms as it encodes the geometry between all pairs of points in φ(X). The matrix J_{m×n}, which denotes an m × n matrix with all elements equal to 1, is frequently used to make the kernelization easier.

2.1. BDA and Nonlinear BDA

Biased discriminant analysis (BDA)11 extracts features that best separate the positive class from the negative class. Let X_p and X_n denote the data matrices consisting of the positive and negative examples. The scalars p and n denote the number of positive and negative examples respectively. The goal of BDA is to find the discriminant direction w which

maximizes

J(w) = (w^t S_n w) / (w^t S_p w) = (w^t D_n D_n^t w) / (w^t D_p D_p^t w)

where the scatter matrices S_p = D_p D_p^t and S_n = D_n D_n^t are expressed in terms of the positive and negative data matrices centered around the positive mean, D_p = X_p − (1/p) X_p J_{p×p} and D_n = X_n − (1/p) X_p J_{p×n}. The kernel variant of BDA was suggested in Ref. 12 to overcome the small sample problem in retrieval applications. The kernelization of BDA has been approached in various ways.13,14 We derive it using the dual formulation. Since the discriminant vector w lies in the space spanned by the input data, w = Σ_{j=1}^N x_j α^j = Xα,

where α = [α^1 α^2 ··· α^N]^t. The problem now is to solve for the α that maximizes

J(α) = (α^t X^t D_n D_n^t X α) / (α^t X^t D_p D_p^t X α).

Substituting the values of D_p and D_n, and using the notation K_p and K_n for the kernelized versions of the matrices X^t X_p and X^t X_n, the kernel BDA problem reduces to finding the α that maximizes

J(α) = [α^t (K_n − (1/p) K_p J_{p×n})(K_n − (1/p) K_p J_{p×n})^t α] / [α^t (K_p − (1/p) K_p J_{p×p})(K_p − (1/p) K_p J_{p×p})^t α].

The solution to this problem is obtained by solving a generalized eigenvalue problem involving the two matrices in the above equation. A method to overcome singularity and small-sample issues proposed for MDA14 is used for solving this equation. The projection of a test sample x onto w can be evaluated as α^t X^t x, since the kernelized version of X^t x is computable using κ(·,·).
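A sketch of this dual solution using SciPy's symmetric generalized eigensolver; a simple ridge term stands in for the GSVD-based remedy of Ref. 14, and all names are ours:

```python
import numpy as np
from scipy.linalg import eigh

def kernel_bda_directions(K_p, K_n, n_dirs=15, reg=1e-6):
    """Compute the dual coefficients alpha of kernel BDA.  K_p (N x p) and
    K_n (N x n) are kernel matrices between all N training points and the
    positive / negative samples respectively."""
    p, n = K_p.shape[1], K_n.shape[1]
    Dn = K_n - K_p @ (np.ones((p, n)) / p)      # K_n - (1/p) K_p J_{p x n}
    Dp = K_p - K_p @ (np.ones((p, p)) / p)      # K_p - (1/p) K_p J_{p x p}
    A = Dn @ Dn.T                               # numerator matrix
    B = Dp @ Dp.T + reg * np.eye(K_p.shape[0])  # denominator matrix, regularized
    vals, vecs = eigh(A, B)                     # generalized eigenvalue problem
    return vecs[:, ::-1][:, :n_dirs]            # top discriminant coefficients
```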

2.2. Kernel Multiple Exemplar BDA

Two well-known drawbacks of traditional discriminant analysis schemes are the assumption that the data is distributed normally and the small sample size problem.15 The assumption that the data follows a Gaussian distribution is violated in the case of face images and similar biometrics. Similarly, the number of samples available is often insufficient to model the positive class in BDA. Moreover, the fact that the positive class is modeled by the mean of the available samples alone implies that the positive samples are not used completely to describe the distribution of the positive class. A technique to address


these problems by modeling the class of interest using the set of available exemplars, and not just their mean, has been proposed in the case of linear discriminant analysis.15 Using the same idea for BDA results in multiple exemplar BDA (MEBDA). Denoting the ith positive sample by x_i^p for i = 1, ..., p, we define the data matrices centered around this sample as D_pi = X_p − x_i^p J_{1×p} and D_ni = X_n − x_i^p J_{1×n}. Accordingly, the dual formulation of MEBDA reduces to the problem of solving for the α that maximizes

J(α) = [α^t X^t (Σ_{i=1}^p D_ni D_ni^t) X α] / [α^t X^t (Σ_{i=1}^p D_pi D_pi^t) X α],

where the key difference from traditional BDA is that the scatter matrices S_p = Σ_{i=1}^p D_pi D_pi^t and S_n = Σ_{i=1}^p D_ni D_ni^t are computed using all the samples of the positive class. This algorithm can be kernelized by using the notation k_i^p for the vector denoting the kernelized version of X^t x_i^p. The kernelized version of MEBDA reduces to solving for the α that maximizes

J(α) = [α^t Σ_{i=1}^p (K_n − k_i^p J_{1×n})(K_n − k_i^p J_{1×n})^t α] / [α^t Σ_{i=1}^p (K_p − k_i^p J_{1×p})(K_p − k_i^p J_{1×p})^t α].

As in the previous case, α can be found by solving the generalized eigenvalue problem involving the matrices given above. The manner in which the projections of a test sample on the discriminant directions are computed does not change, as the discriminants are a linear combination of all the samples present in the dataset in both cases. A comparison of the results of the normal KBDA and kernel MEBDA is given in Table 2.
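The only change for kernel MEBDA is how the two matrices of the generalized eigenvalue problem are assembled; a sketch, with names ours:

```python
import numpy as np

def mebda_dual_matrices(K_p, K_n):
    """Build the numerator and denominator matrices of the kernel MEBDA
    criterion by summing over every positive exemplar instead of centering
    on the positive mean alone."""
    N, p = K_p.shape
    n = K_n.shape[1]
    S_num = np.zeros((N, N))
    S_den = np.zeros((N, N))
    for i in range(p):
        k_i = K_p[:, [i]]                    # kernelized version of X^t x_i^p
        Dn = K_n - k_i @ np.ones((1, n))     # K_n - k_i^p J_{1 x n}
        Dp = K_p - k_i @ np.ones((1, p))     # K_p - k_i^p J_{1 x p}
        S_num += Dn @ Dn.T
        S_den += Dp @ Dp.T
    return S_num, S_den                      # feed to the same eigensolver as before
```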

3. Kernel Selection for Verification

The kernel function plays an important role in kernel-based algorithms as it encodes the similarity between the images of the samples in a different feature space. Selecting the appropriate kernel for a specific problem is a challenging and widely researched issue.6 In the case of single-class problems the objective of kernel selection differs from other problems, since the goal is to choose a kernel that maximizes the similarity between positive samples and the dissimilarity between the positive and negative samples, while the (dis)similarities between negative samples bear no significance. Using a different kernel for each class implies that a different set of class-specific features is extracted for each class, which is desirable for verification algorithms. Moreover, the nonparametric nature of kernel-based algorithms may lead to over-fitting. For instance, by using a Gaussian-RBF kernel of sufficiently small width σ it is possible to obtain an arbitrarily small FAR while the FRR might increase. To address these problems we select the optimal kernel function for each class by performing a search over a set of kernel functions (the parameters of which are varied smoothly during the search) such that the quantity

J(κ) = (FAR + η FRR)^2

over a validation dataset is minimized. The parameter η allows us to change the relative importance of FAR and FRR in determining the feature set. The quantities FAR and FRR are dependent on the data set used, and hence it must be ensured that they reflect the true values. This is done by partitioning the data set randomly into train and validation sets several times and taking the mean values of FAR and FRR. The objective function used above was found to vary smoothly as the parameters of the kernel function are varied. Offline and gradient-descent based methods for searching for the optimal kernel are being explored. Figure 1 shows the variation of FAR and FRR as the kernel parameters are varied, and the optimal FAR and FRR as the value of η is varied. It can be seen that by selecting an appropriate kernel and η it is possible to obtain acceptable FAR and FRR.
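A sketch of this class-specific kernel search; evaluate_far_frr is a hypothetical helper that trains KBDA with a candidate kernel on one random split and returns the validation (FAR, FRR):

```python
import numpy as np

def select_kernel(kernel_grid, evaluate_far_frr, eta=0.5, n_splits=20):
    """Pick the kernel minimizing J(kappa) = (FAR + eta * FRR)^2, averaging
    both rates over several random train/validation partitions so that they
    reflect their true values."""
    best_kernel, best_J = None, np.inf
    for kernel in kernel_grid:
        rates = np.array([evaluate_far_frr(kernel, split) for split in range(n_splits)])
        far, frr = rates.mean(axis=0)
        J = (far + eta * frr) ** 2
        if J < best_J:
            best_kernel, best_J = kernel, J
    return best_kernel
```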


Fig. 1. (a) The variation of the FAR and FRR using the KBDA algorithm as the width of the Gaussian kernel is varied. (b) The FAR and FRR at the optimal value of J(κ) obtained for different values of η using the linear, polynomial and Gaussian kernels. It can be seen that the polynomial and Gaussian kernels perform better in general.


3.1. Hierarchical Authentication

A major source of errors in a verification system is the presence of similar classes in the data, which affects the performance in two ways. Since the biasmap algorithm treats all the classes other than the positive class as a single class, the presence of data distributed similarly to the positive class results in a higher number of false acceptances and also corrupts the computed biased discriminants, resulting in further errors. The reason for this is that the small number of features that help in discriminating between two similar classes are overlooked in the presence of a large number of negative classes. A direct way to overcome this is to introduce hybrid classes consisting of classes that are difficult to separate using the biasmap. The authentication is done in two stages after forming the hybrid classes. In the first stage, a biasmap-based method verifies whether the sample belongs to the correct hybrid class (a hybrid class may be homogeneous if it is well separated from the other classes). In the second stage the sample is recognized by combining multiple dichotomizers that use discriminative features for separating two classes, extracted using kernel Fisher discriminant analysis.10 The second stage requires the more discriminative framework owing to the high similarity between classes within a hybrid class. This hierarchical framework is similar, in spirit, to the one used for classification.16 The resulting algorithm provides increased performance, particularly in the case of similar classes.
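In outline, the two-stage decision can be sketched as follows (all callables and the hybrid-class mapping are assumptions named for illustration):

```python
def authenticate(sample, claimed_class, hybrid_of, stage1_verify, stage2_classify):
    """Two-stage verification: first check membership of the claimed class's
    hybrid group with the biasmap-based test, then resolve the claim inside
    the group with the pairwise KFDA dichotomizers."""
    group = hybrid_of[claimed_class]          # hybrid class containing the claim
    if not stage1_verify(sample, group):      # cheap single-class verification
        return False
    if len(group) == 1:                       # homogeneous hybrid class: accept
        return True
    return stage2_classify(sample, group) == claimed_class
```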

4. Experimental Results

The proposed scheme is used to perform biometric authentication using the hand-geometry features used in Ref. 5. Figure 2 shows the image acquired and the contour extracted. The raw features describing the hand geometry are not rich enough to discriminate between the subjects accurately. Alternative sets of features were extracted using the linear and nonlinear variants of BDA. Figure 3 depicts the resulting feature distributions of a subset of the classes. The authentication experiments were performed using the features extracted by projecting onto the top 15 discriminants. The complete dataset is randomly split into two sets: the test and train datasets. The biased discriminants for each class are learned using the train set. Then, for each class, the test samples are all claimed to have the label of that class and

are authenticated by projecting on the discriminant space of that class. The mean FRR and FAR values over all the classes are taken as the FRR and FAR values for the dataset. The average values of FRR and FAR over 100 such trials using the raw, BDA-transformed and kernel BDA-transformed features are shown in Table 1. A second degree polynomial kernel and a Gaussian-RBF kernel of width 0.5 were used. Note the increase in FRR and decrease in FAR with kernel BDA. This is expected, as the estimated distribution is closer to the distribution of the train samples. The kernel selection experiments were done in a similar manner; the kernel was selected to minimize (FAR + η FRR)^2, and the results using the optimal kernel selected are shown in Table 3. Observe that with kernel selection both the FAR and FRR reach an acceptable value, while this may not be so in the general case. Further, the parameter η allows us to weigh the importance of these two rates. A value η = 0.5 was used for the experiments. This means that the FAR is given more importance than the FRR. The results obtained reflect this fact. When the two-stage hierarchical authentication scheme was used, the results improved further. The 40 classes resulted in 24 hybrid classes, with all the hybrid classes containing fewer than 3 homogeneous classes. The kernels for the second level (in KFDA) were chosen using cross validation. The resulting FAR and FRR are shown in Table 3. Observe the improvement in both the FRR and FAR over the non-hierarchical scheme.

Fig. 2. The hand image acquired and the raw hand-geometry features extracted from these images.

Table 1. FAR and FRR rates using linear and kernel BDA with different kernels.

                   FRR       FAR

Raw                3.3%      15%
BDA                0.8%      8.3%
(1 + ⟨x, y⟩)^2     1.4%      2.9%
RBF σ = 0.5        1.9%      1.4%



Fig. 3. Five classes from the hand-geometry data projected onto the top two directions found by the linear (a) and kernel (b) variants of BDA. A Gaussian-RBF kernel with σ = 2 was used. It can be seen that the positive class (shown in red) is better separated in the second case.

Table 2. FAR and FRR rates using KBDA and kernel MEBDA with different kernels. The first two columns show the rates with KBDA and the next two columns the rates with kernel MEBDA.

                   KBDA                 Kernel MEBDA
                   FRR       FAR        FRR       FAR

Linear             0.8%      8.3%       0.5%      4.3%
(1 + ⟨x, y⟩)^2     1.4%      2.9%       0.7%      1.9%
RBF σ = 0.5        1.9%      1.4%       0.8%      1.4%

Table 3. Comparison of results using KBDA, KBDA with kernel selection and the hierarchical scheme.

                   FRR       FAR

RBF σ = 0.5        1.9%      1.4%
Optimal            0.96%     0.28%
Hierarchical       0.62%     0.13%

5. Conclusion

In this paper, techniques for improving the performance of a verification system are investigated. Nonlinear biased discriminants based on the kernel trick are used for authentication. The feature selection problem is posed as a kernel selection problem. A tunable objective function using the FAR and FRR is used to perform the selection. The full set of available samples is used to describe the distribution of the positive class. A hierarchical authentication framework is introduced to reduce the errors caused by highly similar classes. Efficient search techniques for kernel selection and for learning a class-specific kernel matrix are promising directions for future research.

References

1. A. Ross, A. K. Jain and S. Prabhakar. An introduction to biometric recognition. IEEE Trans. Circuits and Systems for Video Technology, Special Issue on Image- and Video-Based Biometrics 14(1), 4-20 (2004).

2. A. K. Jain and A. Ross. Learning user-specific parameters in a multibiometric system. Intl. Conf. on Image Processing (ICIP 2002) 1, 57-60 (2002).

3. A. K. Jain and D. Maltoni. Handbook of Fingerprint Recognition. (Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2003).

4. C. S. Avilla and R. S. Rello. Biometric identification through hand geometry measurements. IEEE Trans. on Pattern Analysis and Machine Intelligence 22, 1168-1171 (2000).

5. V. Roy and C. V. Jawahar. Hand-geometry based person authentication using incremental biased discriminant analysis. In National Conference on Communications (NCC 2006), 261-265 (2006).

6. J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. (Cambridge University Press, New York, NY, USA, 2004).

7. B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Computational Learning Theory, 144-152 (1992).

8. B. Scholkopf, A. Smola, and K.-R. Muller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10(5), 1299-1319 (1998).

9. F. R. Bach and M. I. Jordan. Kernel independent component analysis. JMLR 3, 1-48 (2002).

10. S. Mika, G. Ratsch, J. Weston, B. Scholkopf, A. Smola, and K.-R. Muller. Constructing descriptive and discriminative nonlinear features: Rayleigh coefficients in kernel feature spaces. IEEE Trans. Pattern Analysis and Machine Intelligence 25(5), 623-633 (2003).

11. P. J. D. Pillo. Biased discriminant analysis: Evaluation of the optimum probability of misclassification. Communications in Statistics-Theory and Methods A8, 1447-1457 (1979).

12. X. Zhou and T. Huang. Small sample learning during multimedia retrieval using biasmap. In IEEE Conference on Computer Vision and Pattern Recognition 1, 11-17 (2001).

13. Dacheng Tao and Xiaoou Tang. A direct method to solve the biased discriminant analysis in kernel feature space for content based image retrieval. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing 3, 441-444 (2004).

14. C. H. Park and H. Park. Nonlinear discriminant analysis using kernel functions and the generalized singular value decomposition. SIAM J. Matrix Anal. Appl. 27(1), 87-102 (2005).

15. S. K. Zhou and R. Chellappa. Multiple-Exemplar Discriminant Analysis for Face Recognition. Intl. Conf. on Pattern Recognition 4, 191-194 (2004).

16. M. N. S. S. K. P. Kumar and C. V. Jawahar. Configurable hybrid architectures for character recognition applications. In Intl. Conf. on Document Analysis and Recognition , 1199-1205 (2005).


Confidence Estimation in Classification Decision: A Method for Detecting Unseen Patterns

Pandu R. Devarakota and Bruno Mirbach

IEE S.A., ZAE Weiergewan, 11, rue Edmond Reuter,

L-5326 Contern, Luxembourg
E-mail: {pdv,bmi}@iee.lu

Bjorn Ottersten

Signal Processing Group, School of Electrical Engineering,

Royal Institute of Technology (KTH), SE-100 44 Stockholm, Sweden

The classification task for a real-world application should include a confidence estimate to handle unseen patterns, i.e., patterns which were not considered during the learning stage of a classifier. This is important especially for safety-critical applications, where the goal is to label such situations as "unknown" before they can lead to a false classification. Several methods proposed in the past were based on choosing a threshold on the estimated class membership probability. In this paper we extend the use of the Gaussian mixture model (GMM) to estimate the uncertainty of the estimated class membership probability in terms of a confidence interval around it. This uncertainty measure takes into account the number of training patterns available in the local neighborhood of a test pattern. Accordingly, the lower bound of the confidence interval, or the number of training samples around a test pattern, can be used to detect unseen patterns. Experimental results on a real-world application are discussed.

Keywords: Pattern classification, confidence based classifier, density estimation, confidence intervals

1. Introduction

A statistical classifier estimates a class membership probability for each pattern. Correspondingly, the pattern is assigned to the class which has the maximum class membership probability. In this respect, the challenges are to reduce the misclassification rate and to reject unseen patterns by assigning them a low confidence (reliability). To meet these challenges, a classifier with a reject option is one approach that is often exercised in the literature.2 If the conditional probability fails to exceed a required minimum threshold, then the respective decision is rejected. As a consequence of this method, a test pattern close to a decision border implicitly defined by the classifier is prone to be rejected, while a test pattern far away from the border will be assigned to a class. This approach, however, needs an optimum Bayesian decision rule.2 It is known that it is very difficult to determine the Bayesian error in real problems,1 as it needs complete knowledge of the distribution of the data. As a consequence, a statistical classifier in practice overestimates the class membership and thus always has an uncertainty attached to each estimate. In order to

use the classifier with a reject option, the above uncertainty should be subtracted upon classification.

To the best of our knowledge, there exists as yet no method which is able to statistically define the uncertainty in the estimated class membership probability, and thereby a confidence interval for it. In this paper we extend the use of the Gaussian mixture model to estimate the uncertainty of the estimated class membership probability. The basic idea of the proposed approach is that it takes into account the number of training patterns falling in the local neighborhood of a test pattern. The motivation behind this approach is: if only a small number of training patterns fall in the local neighborhood of a test pattern, the uncertainty in the estimate will be large, and if a large number of training patterns lie around the test pattern, the uncertainty will be small.

The paper is organized as follows: the proposed method for estimating the number of training patterns in the neighborhood of a test pattern is presented in Section 2, and its simplification using Gaussian mixture models is described in Section 3. The computation of confidence intervals taking this uncertainty into account is also introduced in the


same section. Results are presented in Section 4, and the concluding remarks are given in Section 5.

2. The density estimation of a test pattern

Let D be a training set {(x_i, y_i)}_{i=1}^N with each sample x_i ∈ R^d and the corresponding output label y_i ∈ {1, ..., N_c}, where N_c is the number of classes. Then, the density of training patterns at a data point x is given by

p(x) = Σ_{i=1}^N δ(x − x_i)    (1)

where δ denotes the Dirac delta function. Accordingly, the number of training patterns falling in a region R centered at x can be calculated as

N_R(x) = ∫_{x' ∈ R(x)} p(x') dx'    (2)

The solution to Eq. 2 may result in a discrete value, while in practice a continuous value of the density is desirable. This can be achieved by considering a window function φ which is centered at x and covers the region R. Equation 2 may now be rewritten as

N_R(x) = ∫ p(x') φ(x − x') dx'    (3)

The window function φ offers the flexibility that the region R can be tuned by including a width parameter r in it.

Consider the special case where a radial basis function of width r is selected as the window function. The number of training patterns falling in a sphere of radius r centered at x is then calculated as

N_r(x) = ∫ p(x') exp(−||x' − x||^2 / 2r^2) dx'    (4)

Furthermore, Eq. 4 can be generalized by considering a hyperellipsoid centered at x as the window function (nothing but a Mahalanobis distance):

N_r(x) = ∫ p(x') exp(−(x' − x)^T C^{-1} (x' − x) / 2r^2) dx'    (5)

This approach should be considered if the components of the training data show some correlation. In that

case C should be chosen to be the covariance matrix of the training set D. The formulation in Eq. 5 makes it straightforward to compute the number of training patterns in the hyperellipsoid centered at the test pattern x. However, it needs the entire training data to be stored in memory, which is computationally very demanding and hence not useful for a real-world application. In this work, we consider a Gaussian mixture model approximation to estimate p(x), which is reasonable and sufficient in practice, allows us to develop explicit expressions to compute N_r(x), and requires only a small number of estimated parameters as input.

3. The Gaussian mixture model

Assume that the density function of the training data is expressible as a mixture of Gaussian functions in the following way:

p(x) = Σ_{k=1}^K [N_k / √((2π)^d det S_k)] exp[−(x − μ_k)^T S_k^{-1} (x − μ_k) / 2]    (6)

where K represents the number of Gaussian functions in the mixture, μ_k the center of the k-th Gaussian function, S_k a matrix describing the widths of the k-th Gaussian function, d the dimension of the training data, and the N_k are normalization factors fulfilling

Σ_{k=1}^K N_k = N    (7)

where N is the total number of training patterns. The parameters μ_k, S_k and N_k can be computed offline using the Expectation-Maximization algorithm.3

With p(x) as defined in Eq. 6, the integration in Eq. 5 can easily be solved by the convolution of two Gaussian functions, and the final expression may be written as

N_r(x) = Σ_{k=1}^K N_k (det(T_k S_k^{-1}))^{1/2} exp[−(x − μ_k)^T S_k^{-1} (1 − T_k S_k^{-1}) (x − μ_k) / 2]    (8)

where T_k^{-1} is given as follows:

T_k^{-1} = S_k^{-1} + (r^2 C)^{-1}    (9)


Finally, Eq. 8 for computing N_r(x) can be brought into the form given below:

N_r(x) = Σ_k N'_k exp[−(x − μ_k)^T S'_k^{-1} (x − μ_k) / 2]    (10)

where N'_k = N_k (det(T_k S_k^{-1}))^{1/2} and S'_k^{-1} = S_k^{-1} (1 − T_k S_k^{-1}).

The parameters N'_k and S'_k can be computed off-line, and the number of training samples around a test sample can now easily be calculated using the explicit form of Eq. 10.
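A sketch of this off-line/on-line split, assuming GMM parameters (μ_k, S_k, N_k) from EM; since Eq. 9 was partly garbled, the inverse-sum form T_k^{-1} = S_k^{-1} + (r^2 C)^{-1} used below is our reading, consistent with Eqs. 5, 8 and 10, and should be treated as an assumption:

```python
import numpy as np

def n_r(x, mus, S_list, N_list, C, r=1.0):
    """Number of training patterns in the hyperellipsoidal neighborhood of x,
    evaluated in the explicit form of Eq. 10 from GMM parameters."""
    d = len(x)
    rC_inv = np.linalg.inv(r * r * C)         # inverse window covariance
    total = 0.0
    for mu, S, Nk in zip(mus, S_list, N_list):
        S_inv = np.linalg.inv(S)
        T = np.linalg.inv(S_inv + rC_inv)     # our reading of Eq. 9 (assumption)
        Nk_prime = Nk * np.sqrt(np.linalg.det(T @ S_inv))
        Sk_prime_inv = S_inv @ (np.eye(d) - T @ S_inv)
        diff = x - mu
        total += Nk_prime * np.exp(-0.5 * diff @ Sk_prime_inv @ diff)
    return total
```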

Let p_e(x) be the estimated class membership probability and N_r(x) be the number of training patterns in the neighborhood of the test pattern x. Then, one can calculate the confidence interval for p_e using the standard Wilson interval for a binomial distribution:a,4

p_±(p_e) = [p_e + λ^2/(2N_r) ± (λ/√N_r) √(p_e(1 − p_e) + λ^2/(4N_r))] / (1 + λ^2/N_r)    (11)

where p_+ and p_− are the upper and lower bounds of the confidence interval (CI), and λ is determined by the confidence level that has been chosen. Δ(p_e) = p_+ − p_− can be regarded as the uncertainty in the estimated class membership probability p_e. A rejection of a pattern as "unseen" can then easily be established by applying a threshold to the lower bound of the confidence interval. Alternatively, one could also establish a rejection based on the width Δ of the confidence interval or on the number of training samples N_r(x) itself. Here, the lower bound of the confidence interval is chosen.
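A direct transcription of Eq. 11 (λ = 1.96 for the 95% confidence level used later):

```python
import numpy as np

def wilson_interval(p_e, n_r, lam=1.96):
    """Wilson confidence interval for the estimated class membership
    probability p_e, given n_r training patterns around the test pattern."""
    center = p_e + lam**2 / (2.0 * n_r)
    half = (lam / np.sqrt(n_r)) * np.sqrt(p_e * (1.0 - p_e) + lam**2 / (4.0 * n_r))
    denom = 1.0 + lam**2 / n_r
    return (center - half) / denom, (center + half) / denom

# Reject a pattern as "unseen" when the lower bound falls below a threshold:
lower, upper = wilson_interval(p_e=0.9, n_r=12.0)
```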

4. Experimental Results

The aim of the experiments is not to improve the classification rate but to see how the classifier reacts to unknown patterns which were not included in the training data. The effectiveness of the proposed confidence estimation method is tested on a real-world application5 and the results are compared with a rejection threshold on the class membership probability alone, which is the state of the art in the literature.2

a) This problem of estimating the confidence interval is well established for a so-called Bernoulli process, which statistically follows a binomial distribution.

In Ref. 5, an occupant classification system was evaluated where the goal is to detect the occupancy of the passenger seat by an optical system and then classify it into one of the following four classes: 1. Empty seat, 2. Rearward facing infant seat (RFIS), 3. Forward facing child seat (FFCS), 4. Adult (P) (see Fig. 1). A low-resolution range sensor based on the time-of-flight principle is employed for capturing the image sequences. For more details about the system see Ref. 5. The challenge is to cope with the large variation in the scene. If designed properly, in addition to the classification task, such a system should also be able to handle unexpected situations that may occur in real life and were not defined in the training data. The goal of the confidence measure is to reject those situations by labeling them "unseen".

The data set used for training and testing the classifier is shown in Table 1. The entry n × m represents the number of sequences n and the number of frames m in each sequence, respectively. In order to take the variation of occupant scenes into account, different occupants with varying hand postures, leg postures, and torso gestures were recorded. Two-thirds of the data was used to train the classifier and one-third for testing. Refer to Ref. 5 for details about the camera used in this application, the pre-processing, and the feature extraction. As unseen data, we recorded a few sequences where the passenger seat is occupied by an object, for instance boxes or a rucksack, which were not part of the training data. One example of unseen data, where the passenger seat is occupied by a box, can be seen in Fig. 1. The unseen data are also listed in Table 1.

Table 1. Dataset used for evaluating the confidence measure

Class               Empty      RFIS       FFCS      P          Unseen data

No. of Sequences    18 x 50    236 x 20   30 x 50   45 x 200   4 x 50

For the classification task, a classifier based on polynomial regression is considered. The advantage of the polynomial classifier is that it makes no assumptions on the statistical distribution of the data and leads, at least when using the least


Fig. 1. Range images of possible occupancy in a vehicle: (a) Empty, (b) RFIS, (c) FFCS, (d) Adult. Blue represents the closest point, and red represents the furthest point to the camera. One possibility of an unseen class is shown in (e), where the passenger seat is occupied with a box. Note that the shown images are preprocessed images.

mean-square-error optimization criterion, to a closed-form solution of the optimization problem. More details about the polynomial classifier can be found in Ref. 6. However, the output of a polynomial classifier does not correspond to a probabilistic output; thus the estimate of the class membership probability is not readily available. A method to transform the SVM classifier output into a probabilistic output using a sigmoid approximation was proposed in Ref. 7. The same method is adapted here to produce probabilistic outputs for the polynomial classifier.

The evaluation of the confidence measure is as follows: on the basis of the performance on the training set patterns, we establish a reject criterion once based on the estimated class membership probability and once based on the lower bound of the CI. In both cases the rejection threshold is chosen such that less than 5% of the training data is rejected. The reject criterion with the chosen threshold is then applied to the test data and to the unseen data. A radius of r = 1 in Eq. 9 and K = 10 are chosen for all experiments, and a 95% confidence level is chosen to calculate the λ value in Eq. 11.

In Fig. 2, the percentage of patterns rejected based on the class membership probability is plotted at different thresholds for the train data, test data and unseen data. The vertical dotted line shows the threshold value corresponding to rejecting 5% of the training data. It can be seen that at this threshold, 5.5% of the test data and only 3% of the unseen data are rejected. Figure 3 shows the percentage of patterns rejected based on the lower bound of the CI for the train data, test data and unseen data, again at different thresholds. At the respective threshold, the lower bound of the CI is able to reject 86.5% of the "unseen" data at the expense of only 4% rejection on the test data. Thus taking the uncertainty of the class membership into account clearly improves the ability to detect "unseen" patterns.
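The threshold calibration used in both experiments amounts to taking a quantile of the training scores; a sketch (names ours):

```python
import numpy as np

def calibrate_threshold(train_scores, max_train_reject=0.05):
    """Choose the rejection threshold so that at most 5% of the training data
    is rejected; `train_scores` is either the class membership probability or
    the lower CI bound, and patterns scoring below the threshold are rejected."""
    return np.quantile(np.asarray(train_scores), max_train_reject)

def reject_rate(scores, threshold):
    return float(np.mean(np.asarray(scores) < threshold))
```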


Fig. 2. Percentage of patterns rejected based on the class membership probability.


Fig. 3. Percentage of patterns rejected based on the lower bound of the confidence interval.

5. Conclusion

In this paper, we have formulated a method for estimating the uncertainty (reliability) of the class membership probability estimated by a statistical classifier. The basic idea of the approach is that it takes into account the number of training patterns in the neighborhood of a test pattern. The density of the training data around a test pattern is defined, and an explicit expression is then derived for calculating the number of training patterns in a neighborhood of a test pattern. The use of the Gaussian mixture model is extended for this purpose. It is shown that this method needs only a small number of estimated parameters, and thus does not require the training patterns to be stored in memory. Furthermore, the uncertainty is represented in terms of a confidence interval for the estimated class membership probability. For this, the standard Wilson interval for a binomial distribution is used. Though we have demonstrated the presented method on one particular pattern recognition application, it is applicable to many fields of application where "novel pattern" detection is important.

6. Acknowledgements

This project is funded by IEE S.A. and the Luxembourg International Advanced Studies in Information Technology (LIASIT), Luxembourg.

References

1. R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis (Wiley, 1991).

2. C. Chow, IEEE Trans. on Information Theory IT-16, 41 (1970).

3. M. A. Figueiredo and A. K. Jain, IEEE Trans. on Pattern Analysis and Machine Intelligence 24, 1 (March 2002).

4. L. D. Brown, T. T. Cai and A. DasGupta, Statistical Science 16, 101 (May 2001).

5. P. R. Devarakota, B. Mirbach, M. Castillo-Franco and B. Ottersten, to appear in IEEE Trans. on Vehicular Technology (2006).

6. J. Schürmann, Pattern Classification: Statistical and Neural Network based Approach (John Wiley and Sons, Inc., New York, 1990).

7. J. Platt, Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods, in A. Smola, P. Bartlett, B. Scholkopf, and D. Schuurmans (eds.): Advances in Large Margin Classifiers (Cambridge, MA, 2000).

Page 314: 01.AdvancesinPatternRecognition

ECG Pattern Classification Using Support Vector Machine

S. S. Mehta and N. S. Lingayat

Electrical Engineering Department, MBM Engineering College, Jai Narain Vyas University

Jodhpur - 342 001 (Raj.), India
Email: [email protected], [email protected]

This paper presents a new algorithm for reliable pattern classification in the electrocardiogram (ECG) based on the Support Vector Machine (SVM). Among all ECG components, the QRS complex is the most significant feature. Once the positions of the QRS complexes are found, a more detailed examination of the ECG signal can be carried out in order to study the complete cardiac period. This paper presents the SVM as a QRS detector for the ECG signal. Two different preprocessing methods are applied for the generation of features. The first involves digital filtering to remove baseline wander and power line interference, while the second involves an entropy criterion for feature generation. The processed signal is further used for QRS detection using the SVM. This algorithm implements the idea of supervised learning, i.e. learning through examples. The algorithm's performance was evaluated against the CSE ECG database. The numerical results indicate that the algorithm achieves a detection rate of about 98.7% and that it functions reliably even under conditions of poor signal quality in the ECG data. Successful detection depends strongly on the quality of the learning set (selection of cases), the data representation and the mathematical basis of the classifier.

Keywords: SVM, ECG, QRS Complex

1. Introduction

The ECG is the electrical manifestation of the contractile activity of the heart and can be recorded easily. It has been widely used by physicians for a variety of diagnostic purposes. The performance of computerized ECG processing systems relies heavily upon the accurate and reliable detection of the cardiac complexes, and mainly upon the detection of the QRS complexes. Among all ECG components, the QRS complex is the most significant feature. Once the positions of the QRS complexes are found, a more detailed examination of the ECG signal can be carried out in order to study the complete cardiac period. QRS detection provides an important basis for instantaneous heart rate (HR) computation, since the accuracy of instantaneous heart period estimation relies on the performance of QRS detection [1]. On the other hand, it is acknowledged that the QRS complex varies with physical variations and is also affected by noise as time evolves. Therefore, seeking a reliable QRS detection algorithm is essential to the realization of automatic ECG diagnosis. QRS detection has been researched for over three decades, and numerous approaches have been proposed [2-16]. The algorithms that QRS detectors employ can generally be divided into three categories, viz. non-syntactic, syntactic and hybrid. A common technique

utilized in the non-syntactic approach is to employ a scheme that consists of a preprocessor and a decision rule [17]. The purpose of the preprocessor is to enhance the QRS while suppressing the other complexes as well as the noise and artifacts. The preprocessor consists of a linear filter and a transformation. The purpose of the decision rule is to determine whether or not QRS complexes are present in the signal. In this paper the SVM is proposed for QRS detection in the ECG signal. It is a relatively new technique for data classification. The main idea of the SVM is to construct a hyperplane as a decision surface in such a way that the margin of separation between positive and negative examples is maximized. Section 2 presents a brief description of the machine. The methodology is provided in Section 3. The performance of the proposed algorithm is demonstrated in Section 4 with the aid of computer simulations.

2. Support Vector Machine

The SVM is a new paradigm of learning system. The technique, developed by Vapnik [18], was proposed initially for two-class classification problems. SVMs use geometrical properties to calculate the optimal separating hyperplane directly from the training data. They also introduce methods to deal with cases that are not linearly separable, i.e., where no separating straight line can be found, as


well as with cases in which there is noise and/or outliers in the training data, i.e. some of the training samples may be wrong. Basically, the SVM is a linear machine working in the highly dimensional feature space formed by the nonlinear mapping of the n-dimensional input vector x into a K-dimensional feature space (K > n) through the use of a mapping φ(x). The following relation gives the equation of the hyperplane separating the two classes:

y(x) = w^T φ(x) = Σ_{j=1}^K w_j φ_j(x) + w_0 = 0    (1)

where the vector φ(x) = [φ_0(x), φ_1(x), ..., φ_K(x)]^T with φ_0(x) = 1, and w = [w_0, w_1, ..., w_K]^T is the weight vector of the network. Fulfillment of the condition y(x) > 0 means one class and y(x) < 0 means the opposite one. The most distinctive fact about the SVM is that the learning task is reduced to quadratic programming by introducing so-called Lagrange multipliers. All operations in the learning and testing modes are done in the SVM using kernel functions. The kernel is defined as K(x, x_j) = φ^T(x_j) φ(x). The best known kernels include radial Gaussian, polynomial and sigmoid functions. The problem of learning the SVM, formulated as the task of separating learning vectors x_i into two classes with destination values either d_i = 1 or d_i = −1 with maximal separation margin, is reduced to the dual problem of maximizing the function defined as follows:

Q(α) = Σ_{i=1}^P α_i − (1/2) Σ_{i=1}^P Σ_{j=1}^P α_i α_j d_i d_j K(x_i, x_j)    (2)

with the constraints

Σ_{i=1}^P α_i d_i = 0,    0 ≤ α_i ≤ C    (3)

where C is a user-defined constant and P is the number of learning data pairs (x_i, d_i). C is the regularizing parameter and determines the balance between the complexity of the network, characterized by the weight vector w, and the classification error on the data. For a normalized input signal the value of C is usually much bigger than one. The solution with respect to the Lagrange multipliers gives the optimal weight vector w_opt as

w_opt = Σ_{i=1}^{N_s} α_si d_si φ(x_si)    (4)

In the above equation the index s points to the set of N_s support vectors, i.e. the learning vectors x_i for which the relation

d_i (Σ_j w_j φ_j(x_i) + w_0) ≥ 1 − ξ_i    (5)

(ξ_i ≥ 0 are the slack variables) is fulfilled with the equality sign. The output signal y(x) of the SVM in the retrieval mode after learning is determined as a function of the kernels

y(x) = Σ_{i=1}^{N_s} α_si d_si K(x_si, x) + w_0    (6)

and the explicit form of the nonlinear function φ(x) need not be known. A value of y(x) greater than 0 is associated with 1 (membership of the particular class) and a negative one with −1 (membership of the opposite class).
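A direct transcription of the retrieval-mode decision of Eq. 6, with the two nonlinear kernels mentioned in the text (parameter values are illustrative):

```python
import numpy as np

def svm_decision(x, support_vectors, alphas, ds, w0, kernel):
    """Evaluate y(x) = sum_i alpha_si d_si K(x_si, x) + w0 and map its sign
    to the class labels +1 / -1."""
    y = sum(a * d * kernel(sv, x)
            for a, d, sv in zip(alphas, ds, support_vectors)) + w0
    return 1 if y > 0 else -1

rbf = lambda u, v, gamma=0.5: np.exp(-gamma * np.sum((u - v) ** 2))   # radial Gaussian
poly = lambda u, v, deg=2: (1.0 + np.dot(u, v)) ** deg                # polynomial
```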

3. Methodology

3.1. Signal Preprocessing

A raw digital sample of the 12-lead ECG signal of a patient is acquired. Every record picked from the CSE ECG database is of 10 seconds duration sampled at 500 Hz, thus giving 5000 samples. The power line interference and baseline wander in the ECG signal are removed using low pass and high pass filters [19,20]. The slope at every sampling instant of the filtered ECG signal is calculated, and the slopes are clustered into two classes, namely QRS and non-QRS, using K-means clustering [21]. The probability of the slope at each sampling instant belonging to each of the two classes is calculated using

Pr_i(x) = [1 / (√(2π) σ_i)] exp[−(s(x) − m_i)^2 / (2σ_i^2)],    i = 1, 2;  x = 1, 2, ..., 5000    (7)

where σ_i and m_i are the standard deviation and mean of the ith class and s(x) is the slope at instant x. The entropy h_i(x) at each sampling instant for both classes and for all 12 leads is calculated using the following equation [21]:

h_i(x) = −Pr_i(x) log_e Pr_i(x),    i = 1, 2;  x = 1, 2, ..., 5000    (8)


The average entropy over all 12 leads is calculated at each sampling instant (to nullify any spurious signal) for both classes. These entropies are normalized. The combined entropy is calculated using

h_cn(x) = (1 − h_1n(x)) · h_2n(x)    (9)

The filtered ECG signal and the combined entropy values for the record MO1_015 of the CSE database are plotted in Fig. 1.
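A compact sketch of this preprocessing chain (Eqs. 7-9), assuming the per-lead slope signals and the K-means cluster labels are already computed; the label convention and all names are ours:

```python
import numpy as np

def combined_entropy(slopes, labels):
    """slopes: 12 x 5000 array of per-lead slopes; labels: length-5000 K-means
    assignment (0 = non-QRS, 1 = QRS).  Returns the combined entropy of Eq. 9."""
    h_norm = []
    for i in (0, 1):
        vals = slopes[:, labels == i]
        m, s = vals.mean(), vals.std()
        pr = np.exp(-((slopes - m) ** 2) / (2 * s**2)) / (np.sqrt(2 * np.pi) * s)
        h = -pr * np.log(np.clip(pr, 1e-12, None))     # entropy, Eq. 8
        h = h.mean(axis=0)                             # average over the 12 leads
        h_norm.append((h - h.min()) / (h.max() - h.min()))
    return (1 - h_norm[0]) * h_norm[1]                 # Eq. 9
```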

3.2. Decision Making

During the training of the SVM, a sliding window of size ten sampling instants is moved over the combined entropy curve. When the window lies completely in the QRS region, the desired output of the SVM is set to 1, and when it lies completely in the non-QRS region, the desired output is set to −1. The SVM is trained on a set of training data covering a wide variety of ECG signals picked from the CSE ECG database. A set of ten combined entropy values forms the input vector of the SVM. On testing, a train of 1s is obtained while the window traverses a QRS region and −1s over the non-QRS regions. The trains of 1s are picked and, using their durations, an average pulse width is evaluated. Those trains of 1s whose duration turns out to be more than the average pulse width are detected as QRS regions; the others are treated as non-QRS regions.
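The decision rule just described can be sketched as follows; svm_predict stands for the trained SVM applied to one window of ten entropy values (an assumption, named for illustration):

```python
import numpy as np

def detect_qrs_regions(entropy, svm_predict, win=10):
    """Slide a ten-sample window over the combined entropy curve, classify each
    window (+1 = QRS, -1 = non-QRS), and keep runs of +1 longer than the
    average run length as detected QRS regions."""
    flags = np.array([svm_predict(entropy[i:i + win])
                      for i in range(len(entropy) - win + 1)])
    runs, start = [], None
    for i, f in enumerate(np.append(flags, -1)):   # sentinel closes a final run
        if f == 1 and start is None:
            start = i
        elif f != 1 and start is not None:
            runs.append((start, i)); start = None
    avg = np.mean([e - s for s, e in runs]) if runs else 0.0
    return [(s, e) for s, e in runs if e - s > avg]
```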


Fig. 1. Filtered ECG and the combined entropy curves.


Fig. 2. QRS detection in the ECG signal.

4. Results

The algorithm for QRS detection has been tested on 50 12-lead ECG records from the CSE database. The SVM with a sigmoid kernel function gives a detection rate of 98.7%. The percentages of false positive and false negative detections were very low. The false negative detections were mostly confined to ECG signals with a low signal-to-noise ratio. Figure 2 shows QRS detection in the record MO1_015 of the CSE database.

5. Conclusion

The SVM provides a new approach to QRS detection in the ECG signal. It provides a flexible and open tool for cardiologists to perform classification of heartbeats. The test results are consistent and encouraging. Successful detection depends strongly on the quality of the learning set (selection of cases), the data representation and the choice of kernel function.

Acknowledgement. The authors wish to acknowledge the All India Council for Technical Education (AICTE), New Delhi, for providing a research grant vide project no. 8022/RID/NPROJ/RPS-10/2003-04 titled "Development of computer based techniques for identification of wave complexes and analysis of multi-lead ECG".


References

1. B. U. Kohler, C. Hennig, and R. Orglmeister, The principles of software QRS detection, IEEE Eng. Med. Biol. Mag., vol. 21, no. 1, pp. 42-57, 2002.

2. N. V. Thakor, J. G. Webster and W. J. Tompkins, Optimal QRS detector, Med. Biol. Eng. Comput., vol.21, pp.343-350, 1983.

3. J. Pan and W. J. Tompkins, A real time QRS detection algorithm, IEEE Trans. Biomed. Eng., vol. BME-32, no. 3, pp. 230-235, 1985.

4. O. Pahlm and L. Sornmo, Software QRS detection in ambulatory monitoring, Med. Biol. Eng. Comput., vol. 22, pp. 289-297, 1984.

5. N. V. Thakor, J. G. Webster and W. J. Tompkins, Estimation of QRS complex power spectra for the design of a QRS filter, IEEE Trans. Biomed. Eng., vol. BME-31, no. 11, pp. 702-706, 1984.

6. S. S. Mehta and N. S. Lingayat, Computational techniques for QRS detection in ECG: A review, National Conference on Emerging Computational Techniques and their applications, MBM Engg. College, Jodhpur, 2005.

7. S. S. Mehta, S. C. Saxena and H. K. Verma, Computer-aided interpretation of ECG for diagnostics, International Journal of Systems Science, vol. 27, pp. 43-58, 1996.

8. E. Pietka, Feature extraction in computerized approach to the ECG analysis, Pattern Recog., vol.-24, pp. 139-146, 1991.

9. G. M. Friesen, T. C. Jannett, M. A. Jadallah, S. L. Yates, S. R. Quint and H. T. Nagle, A Comparison of noise sensitivity of nine QRS detection algorithms, IEEE Trans Biomed eng, vol.37, pp. 85-98,1990.

10. J. Fraden and M. R. Neurman, QRS wave detection, Med. and Bio comp., vol. 18, pp. 125-132, 1980.

11. K. Lin and W. Chang, QRS feature extraction using linear prediction,/.Ei?i? Trans. Biomed. Eng., vol-36, pp. 1050-1055, 1989.

12. K. M. McClelland and J. M. Arnold, A QRS detection algorithm for computerized ECG monitoring, Computers in cardiology, IEEE computer Soc, pp.447-450, 1976.

13. L. Sornmo, O. Pahlm and M. NyGards, Adaptive QRS detection : A study performance, IEEE Trans Biomed. Eng., vol. BME-32, no. 6, pp. 392-401, June 1985.

14. P. S. Hamilton and W. J. Tompkins, Quantitative investigation of QRS detection rules using the MIT/BIH arrhythmia database, IEEE Trans. Biomed. Eng., vol-33, pp.1157-1165, 1986.

15. P. Trahamias and E. Skordalakis, Syntactic pattern recognition of the ECG, IEEE Trans, on Pattern Analysis and Machine Intelli.,\o\. PAMI-12.no.7,pp.648-657, July 1990.

16. Q. Xue, Y. M. Hu and W. J. Tompkins, Neural network based adaptive matched filtering for QRS detection, IEEE Trans. Biomed. Eng, vol 39, pp 317-329, 1992.

17. F. Gritzali, Towards a generalized scheme for QRS Detection in ECG waveforms, Signal Processing, vol.15, pp.183-192, 1988.

18. V. Vapnik, Statistical Learning Theory, New York: Wiley, 1998.

19. N. V. Thakor and J. G. Webster, Design and evaluation of QRS and noise detectors for ambulatory ECG monitors, Med. Biol. Eng. Comput, vol.20, no.6, pp.709-714, 1982.

20. J. A. Van Alste and T. S. Schilder Removal of baseline wander and power line interference from the ECG by an efficient FIR filter with a reduced number of taps,IEEE Trans. Biomed. Eng., vol BME-32, no. 12, pp 1052-1059, 1985.

21. J. L. Tou and R. C. Gonzalez, Pattern Recognition Principles, Addison-Wesley Publishing Company MA, 1974.

Page 318: 01.AdvancesinPatternRecognition


Model Selection for Financial Distress Classification

Srinivas Mukkamala

Institute for Complex Additive System Analysis, New Mexico Tech, Socorro, New Mexico 87801, USA

Andrew H. Sung

Department of Computer Science, New Mexico Tech, Socorro, New Mexico 87801, USA

Ram B. Basnet

Institute for Complex Additive System Analysis, New Mexico Tech, Socorro, New Mexico 87801, USA

Bernadette Ribeiro

University of Coimbra, P-3030-290 Coimbra, Portugal

Armando S. Vieira

ISEP and Computational Physics Centre, University of Coimbra

P-3004-516 Coimbra, Portugal

This paper describes results concerning the robustness and generalization capabilities of supervised machine learning methods for classifying the financial health of French companies. Financial data were obtained from Diana, a large database containing financial statements of French companies. Classification accuracy is evaluated with Artificial Neural Networks, TreeNet, Random Forests and Linear Genetic Programs (LGPs). LGPs achieve the best accuracy on both the balanced and the unbalanced dataset. Our results demonstrate the potential of learning machines for solving important economic problems such as financial distress classification. Feature selection is as important for financial distress classification as it is for many other problems, and we present several feature selection methods for this task. It is demonstrated that, with appropriately chosen features, the financial health of a company can be detected. Experiments on the Diana dataset have been carried out to assess the effectiveness of this criterion.

1. Introduction

Bankruptcy prediction is a very hard classification problem as it is high-dimensional, most data distributions are non-Gaussian and exceptions are common [1]. A nonlinear classifier should be superior to a linear approach due to saturation effects and multiplicative factors in the relationships between the financial ratios. ANNs, implemented as multilayer perceptrons, have been increasingly used for default prediction as they generally outperform other existing methods [2,3,4]. Recent methods, such as Support Vector Machines, Genetic Algorithms and Genetic Programming, have also been applied to this problem with success [5,6]. In general all these approaches outperform Multiple Discriminant Analysis. However, in most cases the datasets used are very small (sometimes with fewer than 100 cases) and often highly unbalanced, which does not allow a fair comparison [7,8].

In this work we compare the efficiency of four machine learning approaches to bankruptcy detection using a large database of French private companies. This database is very detailed as it contains a wide set of financial ratios spanning a period of three years, corresponding to more than one thousand healthy and distressed companies. The approaches used are: Artificial Neural Networks in two versions, multilayer perceptrons and Hidden Layer Learning Vector Quantization; TreeNet; Random Forests; and Linear Genetic Programs (LGPs).

This paper is organized as follows: Section 2 describes the dataset used for analysis. Section 3 presents the Artificial Neural Networks; a brief introduction to TreeNet is given in Section 4. Random Forests are described in Section 5. Section 6 describes LGPs. Section 7 presents the results and discussion. Feature selection and ranking are described in Section 8. Finally, Section 9 presents the conclusions.

2. Dataset

We used a sample obtained from Diana, a database containing the financial statements of about 780,000 French companies. The initial sample consisted of financial ratios of 2,800 industrial French companies, for the years 1998, 1999 and 2000, with at least 35 employees. Of these companies, 311 were declared bankrupt in 2000 and 272 presented a restructuring plan ("Plan de redressement") to the court for approval by the creditors. We decided not to distinguish these two categories as both signal companies in financial distress. The sample used for this study has 583 financially distressed firms, most of them small to medium sized, with between 35 and 400 employees, corresponding to the year 1999; thus we are making bankruptcy prediction one year ahead.

Most companies on the verge of bankruptcy have heterogeneous patterns which are difficult to identify by any learning machine; therefore the type I error is in general higher than the type II error. Since the cost associated with this type of error is in general higher, in real applications global accuracy may not be the best performance indicator of the algorithm.

To study the effect of unbalanced datasets, we randomly added healthy companies in order to obtain the following ratios of bankrupt to healthy firms: dataset 1 (50/50), dataset 2 (36/64) and dataset 3 (28/72). Lower ratios put a stronger bias towards healthy firms, deteriorating the generalization capability of the network and increasing the type I error, which is undesirable.
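For illustration, such ratio-controlled samples can be drawn as in the following sketch (array names are hypothetical):

import numpy as np

def ratio_sample(distressed, healthy, ratio=(28, 72), seed=0):
    # Keep all distressed firms and randomly add healthy firms until
    # distressed:healthy matches `ratio`, e.g. (50, 50) for dataset 1
    # or (28, 72) for dataset 3.
    n_healthy = int(len(distressed) * ratio[1] / ratio[0])
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(healthy), size=n_healthy, replace=False)
    return distressed, healthy[idx]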

3. Neural Networks

The Hidden Layer Learning Vector Quantization (HLVQ) is an algorithm recently proposed for the classification of high dimensional data [9,10]. HLVQ is implemented in three steps. First, a multilayer perceptron is trained using back-propagation. Second, supervised Learning Vector Quantization is applied to the outputs of the last hidden layer to obtain the code-vectors $w_c$ corresponding to each class $c$ in which data are to be classified. Each example $x_i$ is assigned to the class $c_k$ having the smallest Euclidean distance to the respective code-vector:

$c_k = \arg\min_{c} \left\| w_c - h(x_i) \right\| \qquad (1)$

where $h$ is a vector containing the outputs of the hidden layer and $\|\cdot\|$ denotes the usual Euclidean distance. In the third step the MLP is retrained, but with two differences with respect to conventional multilayer training. First, the error correction is not applied to the output layer but directly to the last hidden layer, the output layer being ignored from then on. The second difference is in the error correction back-propagated to each hidden node:

$E = \frac{1}{N_h} \left\| w_{c_k} - h(x_i) \right\|^2 \qquad (2)$

where $N_h$ is the number of hidden nodes. After retraining the MLP, a new set of code-vectors,

$w_{c_i}^{new} = w_{c_i} + \Delta w_{c_i} \qquad (3)$

is obtained according to the following training scheme:

$\Delta w_{c_i} = \alpha(n)\,(x - w_{c_i})$ if $x \in$ class $c_i$,
$\Delta w_{c_i} = 0$ if $x \notin$ class $c_i$. $\qquad (4)$

The parameter $\alpha$ is the learning rate, which should decrease with the iteration number $n$ to guarantee convergence. Steps two and three are repeated in an iterative process. The stopping criterion is met when a minimum classification error is found.

The distance of a given example $x$ to each prototype is:

$d_i = \left\| h(x) - w_{c_i} \right\| \qquad (5)$

This is a proximity measure to each class. After HLVQ is applied, only a small fraction of the hidden nodes is relevant for the code-vectors. HLVQ therefore simplifies the network, reducing the risk of overfitting.
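For illustration, a minimal sketch of the HLVQ assignment and update steps of Eqs. (1), (4) and (5), with hypothetical array names:

import numpy as np

def hlvq_assign(h_x, code_vectors):
    # Eqs. (1)/(5): distances from the last-hidden-layer activations
    # h(x) to each class code-vector; the nearest one gives the class.
    d = np.linalg.norm(code_vectors - h_x, axis=1)
    return int(np.argmin(d)), d

def lvq_update(w_c, x, alpha):
    # Eq. (4): move the winning code-vector towards the example by the
    # learning rate alpha(n); non-winning code-vectors stay unchanged.
    return w_c + alpha * (x - w_c)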


4. TreeNet

In a TreeNet model, classification and regression models are built up gradually through a potentially large collection of small trees, typically from a few dozen to several hundred trees, each normally no larger than two to eight terminal nodes. The model is similar to a long series expansion (such as a Fourier or Taylor series): a sum of terms that becomes progressively more accurate as the expansion continues. The expansion can be written as [11,12]:

$F(X) = F_0 + \beta_1 T_1(X) + \beta_2 T_2(X) + \dots + \beta_M T_M(X) \qquad (6)$

where $T_i$ is a small tree. Each tree improves on its predecessors through an error-correcting strategy. Individual trees may be as small as one split, but the final models can be accurate and are resistant to overfitting.

5. Random Forests (RF)

A random forest is a classifier consisting of a collection of tree-structured classifiers $\{h(x, \Theta_k),\ k = 1, \dots\}$, where the $\{\Theta_k\}$ are independent identically distributed random vectors and each tree casts a unit vote for the most popular class of input $x$. The common element is that for the $k$-th tree, a random vector $\Theta_k$ is generated, independent of the past random vectors $\Theta_1, \dots, \Theta_{k-1}$ but with the same distribution, and a tree is grown using the training set and $\Theta_k$, resulting in a classifier $h(x, \Theta_k)$ where $x$ is an input vector. For instance, in bagging the random vector $\Theta$ is generated as the counts in $N$ boxes resulting from $N$ darts thrown at random at the boxes, where $N$ is the number of examples in the training set. In random split selection $\Theta$ consists of a number of independent random integers between 1 and $K$. The nature and dimensionality of $\Theta$ depend on its use in tree construction. After a large number of trees are generated, they vote for the most popular class [11,13].

The random forest error rate depends on two things:
1. The correlation between any two trees in the forest. Increasing the correlation increases the forest error rate.
2. The strength of each individual tree in the forest. A tree with a low error rate is a strong classifier. Increasing the strength of the individual trees decreases the forest error rate.

6. Linear Genetic Programs

Linear Genetic Programming (LGP) is a variant of the genetic programming technique that acts on linear genomes [Koza, 1992]. The linear genetic programming technique used in our current experiments is based on machine-code-level manipulation and evaluation of programs. Its main characteristic, in comparison to tree-based GP, is that the evolvable units are not the expressions of a functional programming language (like LISP); instead, programs of an imperative language (like C) are evolved [14,15,16].

In GP an intron is defined as a part of a program that has no influence on the calculation of outputs for all possible inputs. The fitness $F$ of an individual program $p$ is calculated as the mean square error (MSE) between the predicted output $o_{ij}^{p}$ and the desired output $o_{ij}^{des}$ over all $n$ training samples and $m$ outputs, plus the mean classification error (MCE), whose contribution is determined by the absolute value of a weight $w$ [14]:

$F(p) = \frac{1}{n\,m} \sum_{i=1}^{n} \sum_{j=1}^{m} \left( o_{ij}^{p} - o_{ij}^{des} \right)^2 + w \cdot \mathrm{MCE}$

where the classification error (CE) is defined as the number of misclassifications.

7. Results

We applied Artificial Neural Networks in two versions (multilayer perceptrons and Hidden Layer Learning Vector Quantization), TreeNet, Random Forests, and Linear Genetic Programs (LGPs) to two datasets: a balanced dataset (580 healthy companies and 583 companies in financial distress) and an unbalanced dataset (1470 healthy companies and 583 companies in financial distress). In the balanced dataset, 500 randomly selected samples are used for training and 660 samples for testing. In the unbalanced dataset, 950 randomly selected samples are used for training and 1110 samples for testing.

Detection rates and false alarms are evaluated on the Diana data, and the results obtained are used to form the ROC curves. LGPs perform the best on both the balanced and the unbalanced datasets. The results obtained are presented in Table 1.


Table 1. Summary of classification accuracies (%).

                     TreeNet   RF     LGPs   HLVQ    MLP
Balanced Dataset     83.0      87.3   92.1   77.03   75.20
Unbalanced Dataset   91.8      86.1   96.6   80.94   78.85

Figures 1 and 2 show the ROC curves of the detection models of LGP, Random Forests and TreeNet. In each of these ROC plots, the x-axis is the false positive rate, calculated as the percentage of normal companies considered as bankrupt; the y-axis is the detection rate, calculated as the percentage of bankrupt companies detected. A data point in the upper left corner corresponds to optimal performance, i.e., a high detection rate with a low false alarm rate.
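For reference, these two quantities correspond directly to the output of a standard ROC routine, e.g. scikit-learn's roc_curve (toy data shown for illustration only):

import numpy as np
from sklearn.metrics import roc_curve

# Toy data: y_true marks bankrupt (1) vs. healthy (0) companies and
# scores are hypothetical classifier outputs.
y_true = np.array([1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.7, 0.4, 0.2, 0.6, 0.8])
fpr, tpr, thresholds = roc_curve(y_true, scores)
# fpr: fraction of healthy firms flagged as bankrupt (x-axis);
# tpr: fraction of bankrupt firms detected (y-axis).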

Figure 1. Classification accuracy for the balanced dataset (ROC curves for LGP, Random Forests and TreeNet against the no-discrimination line; x-axis: 1 - specificity (false positives)).

Figure 2. Classification accuracy for the unbalanced dataset (ROC curves for LGP, Random Forests and TreeNet against the no-discrimination line; x-axis: 1 - specificity (false positives)).

Neural Networks

Multilayer Perceptrons (MLP) containing a single hidden layer of 5 to 20 nodes were tested on this problem. The best performing network had a hidden layer of 15 neurons trained by back-propagation with a learning rate of 0.1 and a momentum term of 0.25.

Figure 3. HLVQ code-vectors for the healthy and bankrupted classes (x-axis: hidden neuron #, 1-15).

HLVQ was applied upon this MLP with a very fast convergence of only 8 iterations. The results obtained with MLP and HLVQ are presented in Table 1.

Figure 3 presents the code-vectors obtained by HLVQ corresponding to the two categories: healthy and bankrupt companies. Note that of the 15 components, five are very similar and thus redundant. The remaining ten components are the effective features used by HLVQ to classify the data.

8. Feature Selection and Ranking

The feature ranking for financial distress classification is similar in nature to various engineering problems that are characterized by:
• having a large number of input variables x = (x1, x2, ..., xn) of varying degrees of importance to the output y; i.e., some elements of x are essential, some are less important, some of them may not be mutually independent, and some may be useless or irrelevant (in determining the value of y); and
• lacking an analytical model that provides the basis for a mathematical formula that precisely describes the input-output relationship, y = F(x).

Table 2. Key features identified by LGPs, TreeNet and Random Forests on the unbalanced dataset.

LGPs:
• Altman ratios (Z-Score)
• Debt ratio
• Return on equity before extra items and taxes

TreeNet:
• Return on equity
• Financial autonomy
• Collection period
• Altman ratios (Z-Score)
• Net margin
• Financial debt to cash earnings
• Value added margin

Random Forests:
• Total debt to cash earnings
• Number of employees
• Value added margin
• Altman ratios (Z-Score)
• Net margin
• Cumulative depreciation rate
• Financial debt to cash earnings
• Quick ratio

9. Conclusions

The performance of the five methods is comparable across all datasets. The Hidden Layer Learning Vector Quantization algorithm did not perform well when compared to TreeNet, Random Forests and LGPs.

LGP performed the best on both datasets: on the unbalanced dataset with an overall accuracy of 91.5% (0 false positives, 24 false negatives), and on the balanced dataset with an overall accuracy of 91.2% (0 false positives and 26 false negatives).

For unbalanced samples the overall accuracy improves. However, the type I error, the most costly for banks, degrades in all the machine learning methods applied. Therefore unbalanced samples should be avoided.

Financial distress classification is an important, interesting but difficult problem, and further investigation is still needed. As future work we plan to use a more complete dataset including annual variations of important ratios over two or more years. As more inputs are added, feature selection will have to undergo more stringent scrutiny.

References

1. C. Zavgren, The Prediction of Corporate Failure: The State of the Art, Journal of Accounting Literature 2, 1 (1983).
2. P. K. Coats and L. F. Fant, Recognizing Financial Distress Patterns Using a Neural Network Tool, Financial Management (Autumn), 142 (1996).
3. F. Atiya, Bankruptcy prediction for credit risk using neural networks: A survey and new results, IEEE Transactions on Neural Networks 4, 12 (2001).
4. G. Udo, Neural Network Performance on the Bankruptcy Classification Problem, Computers and Industrial Engineering 25, 377 (1993).
5. A. S. Vieira, B. Ribeiro, S. Mukkamala, J. C. Neves and A. Sung, On the Performance of Learning Machines for Bankruptcy Detection, Proceedings of the IEEE International Conference on Computational Cybernetics, IEEE Computer Society Press, 323 (2004).
6. S. Mukkamala, G. D. Tilve, A. H. Sung, B. Ribeiro and A. S. Vieira, Computational Intelligent Techniques for Financial Distress Detection, International Journal of Computational Intelligence Research (IJCIR) 2, no. 1, 61 (2006).
7. J. S. Grice and M. T. Dugan, The limitations of bankruptcy prediction models: Some cautions for the researcher, Review of Quantitative Finance and Accounting 17, no. 2, 151 (2001).
8. F. Cucker and S. Smale, On the mathematical foundations of learning, Bulletin of the American Mathematical Society 39, 1 (2001).
9. A. Vieira and N. P. Barradas, A training algorithm for classification of high dimensional data, Neurocomputing 50C, 461 (2003).
10. A. Vieira, P. Castillo and J. Merelo, Comparison of HLVQ and GProp in the problem of bankruptcy prediction, IWANN03 - International Workshop on Artificial Neural Networks, Springer-Verlag, 665 (2003).
11. Salford Systems, TreeNet and Random Forests manuals.
12. J. H. Friedman, Stochastic Gradient Boosting, Journal of Computational Statistics and Data Analysis, Elsevier Science, 38, 367 (2002).
13. L. Breiman, Random Forests, Machine Learning 45, 5 (2001).
14. J. R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press (1992).
15. D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley (1989).
16. AIM Learning Technology, http://www.aimlearning.com.


Optimal Linear Combination for Two-Class Classifiers

O. Ramos Terrades, S. Tabbone* and E. Valveny

Computer Vision Center - Dept. Informàtica, Universitat Autònoma de Barcelona, Edifici O, Campus UAB, 08193 Bellaterra, Catalonia (Spain)

E-mail: {oriolrt,ernest}@cvc.uab.es

*LORIA-Université Nancy 2, Campus scientifique BP 239, 54506 Vandœuvre-lès-Nancy Cedex, France

E-mail: [email protected]

Combining the outputs of classifiers has been one of the strategies used to improve classification rates on general purpose classification problems. In this paper we propose two linear combination rules that minimize the classification error under some constraints for two-class classifiers.

Keywords: Classifier fusion; Shape recognition

1. Introduction

In the literature1,2 many types of shape descriptors have been described but, unfortunately, one kind of descriptor is usually not enough to achieve satisfactory results in general pattern recognition problems. We cannot find a general descriptor able to properly represent all kinds of shapes. The "best" descriptor will depend very much on the kind of application and, sometimes, more than one descriptor may be necessary in order to detect all relevant features of shapes. Consequently, much research has been done on finding and proposing classifiers that compensate for the shortcomings of descriptors and improve recognition rates. Thus the information from different sources is combined to improve the performance of single descriptors.

In this sense, there are several starting points for facing this problem. We can find strategies consisting of merging features from different descriptors into a single descriptor and then training high-performance classifiers like neural networks,3,4 boosting-based classifiers5-7 or Support Vector Machines,8 which have proved to be suitable in many applications, especially those that are specifically designed to manage problems with a small number of classes in closed environments. However, for general purpose problems in which the number of classes begins to be high and the shapes to be recognized can be counted by thousands, these expert classifiers begin to fail.9 Therefore, we have to find strategies that easily permit us to adapt to the user's needs, avoiding retraining the systems each time the problem conditions change.

One possibility, which is the one we have followed in this work, is to use different classifiers and combine them in such a way that the global performance increases. The goal is to take advantage of the skills of each classifier to improve overall performance. This strategy is known in the community under several different names: aggregation operators, combination rules or classifier fusion, depending on the context and the characteristics of the classifiers.

In this work, we have tackled the problem of combining information from different sources from the perspective of classifier fusion methods. We have faced this problem from a probabilistic point of view in a supervised framework.10 The two main reasons that have motivated us to work with binary classifiers are simplicity and flexibility. Simplicity because two-class classifiers, a.k.a. binary classifiers, are easier to formalize than multi-class classifiers. Flexibility because each time we want to introduce a new class into our system it is enough to train the new classifier for the new class, without needing to retrain those classifiers fitted for other classes. Thus, we focus on the linear combination of binary classifiers at the measurement level,11 choosing a parallel architecture for classifier fusion.

The paper is organized as follows: in Section 2 we define some random variables and the problem of linear combination of classifiers is posed. In Section 3 we introduce the two combination rules proposed in this paper. The experimental evaluation is presented in Section 4 and we finish with the conclusions and perspectives in Section 5.


Table 1. Definition of random variables

Name         Notation        Meaning
Shape        S               the shape to recognize
Label        Y = {-1, 1}     the class of the shape
Descriptor   X = FEM(S)      the computed descriptor
Prediction   Z = C(S)        the classifier output
Validation   U = YZ          the validity of the prediction

2. The problem of classifier fusion: definitions

To face the problem of classifier fusion, we have considered the classifier output as a random variable (r.v.). More specifically, we propose a probabilistic framework based on five r.v., summarized in Table 1, defined as follows:

Shape variable, S: We consider that the shape to recognize is given by the random variable S. All other random variables depend on this variable.

Label variable, Y, corresponds to the set of class labels: Y = {-1, 1}.

Descriptor variable, X, is given by the feature extraction method (FEM) used to extract the descriptors, X = FEM(S), when it is applied to the shape r.v.

Prediction variable, Z, is the r.v. of the classifier output, Z = C(S); hence this variable depends on S. We prefer to denote the prediction r.v. as a function of the descriptor X by writing Z = C(X).

Validation variable, U, tells whether the prediction is correct or not. It is defined as the product of the label variable and the prediction variable, U = YZ.

The validation r.v. plays a central role in our theoretical development. It is not, however, a completely new notion in classification problems, as it is related to the concept of margin in classifiers based on neural networks4 and support vector machines.8

Thus, given a set of L binary classifiers $Z_l$, we can denote the L prediction and the L validation r.v. as random vectors: $Z = (Z_1, \dots, Z_L)$ and $U = (U_1, \dots, U_L)$. Moreover, as we have constrained our approach to linear combinations of classifiers, we can express the linear combination of any type of classifier using the standard dot product defined in $\mathbb{R}^L$: $Z_a = \langle Z, a \rangle$, where $a$ is a vector of weights. Thus, given a set of L binary classifiers $Z_l$, the problem of looking for an optimal operator to apply to the L classifiers turns into the problem of finding an optimal vector of positive weights $a$ minimizing the probability of error of the linear combination. Therefore, according to the definition of the label r.v. $Y$, a classification error occurs when $U_a$ is negative:

$\langle U, a \rangle < 0 \iff \begin{cases} \langle Z, a \rangle < 0 & \text{if } Y = 1 \\ \langle Z, a \rangle > 0 & \text{if } Y = -1 \end{cases}$

Problem 2.1. With the preceding definitions of r.v., the problem of linearly combining classifiers can be expressed as the optimization of the following objective function:

$\hat{a} = \arg\min_{a} P\left( \langle U, a \rangle < 0 \mid S \right) \qquad (1)$

with constraints:

$a_l \geq 0$ for all the weights, and $\|a\|_{L^1} = 1$ (i.e., $\sum_l a_l = 1$). $\qquad (2)$

3. Optimal Linear Combination Rules: IN and DN

In this section we introduce two linear combination rules, namely IN and DN, which find the weight vector $a$ minimizing expression (1) under two sets of hypotheses on the classifiers. On the one hand, the IN method assumes that the validation r.v. are independent and normally distributed. On the other hand, the DN method assumes that the validation r.v. are dependent and multivariate normally distributed. Under these assumptions, we can simplify expression (1) in such a way that we are able to compute the optimal weights.

The IN and DN methods are obtained after the "normalization" of the combined validation r.v. $U_a$ into $\tilde{U}_a$, in such a way that it is centered at 0 with variance 1. Thus, if we denote by $\mu$ the mean vector of $U$ and by $\Sigma$ the covariance matrix of $U$, then using expectation and variance properties we can compute the mean $\mu_a$ and the variance $\sigma_a^2$ of the combination of validation r.v., $U_a$:

$\mu_a = \langle a, \mu \rangle, \qquad \sigma_a^2 = a^{T} \Sigma a = \|a\|_{\Sigma}^2 \qquad (3)$

Therefore, the probability of expression (1) is equivalent to:


Algorithm 3.1: IN

Input: {(X_n, Y_n) | X_n is a vector of L descriptors}
Output: H, combination of classifiers

Begin:
Train L classifiers for the L descriptors, h_l;
for l = 1 ... L,
  Get hypothesis z_{l,n} = h_l(X_{l,n});
  Obtain the validation values: u_{l,n} = y_n z_{l,n};
  Compute the mean and variance of u_l: μ_l, σ_l²;
  If σ_l = 0,
    a_l^N = μ_l;
  else,
    a_l^D = μ_l / σ_l²;
  endif;
endfor;
Set: A = ||a^N||_{L¹} and B = ||a^D||_{L¹};
if A > B,
  λ_N = (A - B)/(2A - B) and λ_D = A/(2A - B);
else
  λ_N = 0 and λ_D = 1;
endif;
Update: a_l^N = λ_N · a_l^N; a_l^D = λ_D · a_l^D;
Normalize a such that Σ_l a_l = 1;
return H = Σ_l a_l h_l;
End.

$P\left( \langle U, a \rangle < 0 \mid X \right) = \int_{-\infty}^{0} g_a(u \mid x)\, du = \int_{-\infty}^{-\phi(a)} \tilde{g}_a(v \mid x)\, dv \qquad (4)$

Finally, imposing that $g_a$ is the density of a normal distribution, $\tilde{g}_a$ does not depend on $a$ and we can find the optimal weights $a$ by maximizing the function $\phi$ defined by:

$\phi(a) = \frac{\mu_a}{\sigma_a} = \frac{\langle a, \mu \rangle}{\|a\|_{\Sigma}} \qquad (5)$

Then, imposing either dependence or independence of the validation r.v. has permitted us to maximize expression (5) in two different ways. In the IN method we have found an explicit solution to Problem 2.1, whereas in the DN method we must solve a constrained optimization problem.

3.1. IN method

In the IN method, we can observe that the weight of each classifier depends on the variance of the validation r.v., as sketched in Algorithm 3.1. If the variance is 0, we have considered that the classifier is "almost" perfect, obtaining in this case that the optimal weight is $a_l^N = \mu_l$. Otherwise, we have found that the optimal weight is $a_l^D = \mu_l / \sigma_l^2$. Then, once we have computed the weights for all classifiers, we have to compute $\lambda = (\lambda_N, \lambda_D)$ such that:

$\lambda = \left( \frac{A-B}{2A-B}, \frac{A}{2A-B} \right)$ if $A > B$, and $\lambda = (0, 1)$ if $A < B$, $\qquad (6)$

where we denote $A = \|a^N\|_{L^1}$ and $B = \|a^D\|_{L^1}$; the proofs of these results are given in a technical report.

In the IN method, we train each classifier using a learning database and then we estimate the validation r.v. from the same database by multiplying the label $y_n$ and the prediction value returned by each classifier, for each element in the database. Afterwards, we compute the mean and the variance, $\mu_l$ and $\sigma_l$, respectively, to compute the weights of the classifiers according to the value of $\sigma_l^2$. Finally, once we have computed the weights for all classifiers, we update each weight according to the values of $\lambda_N$ and $\lambda_D$ computed as explained in Eq. (6).

3.2. DN method

Conversely, if we assume dependence of the $U_l$ r.v., the optimal weights are obtained by maximizing expression (5):

$\hat{a} = \arg\max_{a} \frac{\langle a, \mu \rangle}{\|a\|_{\Sigma}} \qquad (7)$

with the constraints of Problem 2.1. In the DN method, we train the L classifiers using the same learning database used for computing the optimal weights given by the IN method. Similarly, we estimate the validation r.v. by multiplying the label $y_n$ and the prediction value returned by each classifier: $u_{l,n} = y_n z_{l,n}$. In this way, $U$ is a matrix where each column represents the validation r.v. of each classifier. Then, the function DependentWeight called in Algorithm 3.2 is the numerical implementation of the constrained optimization problem raised above (cf. Eq. (7)).


Algorithm 3.2: DN

Input: {(X_n, Y_n) | X_n is a vector of L descriptors}
Output: H, combination of classifiers

Begin:
Train L classifiers for the L descriptors, h_l;
for l = 1 ... L,
  Get hypothesis z_{l,n} = h_l(X_{l,n});
  Obtain the validation values: u_{l,n} = y_n z_{l,n};
endfor;
Obtain weights: a = DependentWeight(U);
return H = Σ_l a_l h_l;
End.

4. Experimental results

In order to test the validity of this approach, we rely on our previous work on symbol recognition. Therefore, we have used the database defined for the contest on symbol recognition in the GREC Workshop in 2003.12 Besides, we have also used the MNIST digit database as it is a reference dataset for shape recognition.

We have used several descriptors developed in some of our previous works in order to define a set of descriptors to evaluate the IN and DN methods. This set of descriptors is composed of the Fourier R-Signature,13 the Local Norm Features based on the ridgelets transform14 and the Angular Radial Transform.15 The two former descriptors are based on our previous work, while the last one is a standard shape descriptor used in MPEG-7. With these databases and sets of descriptors we have defined two different tests, called "GREC-descriptors" and "MNIST-descriptors".

For each model, we train a binary classifier for each class using a learning dataset consisting of positive samples from the class itself and negative samples, from all other classes, randomly chosen. The test dataset is defined in the same way for each class, but it depends on the shape database considered in each case. In this sense, we have divided the GREC database into two halves: one half for training and the other half for testing. In this database, there are few positive examples for each class, so we have included all of them in the training set, whereas the negative examples have been randomly chosen up to 200 examples per class. Conversely, the MNIST database already contains a training set formed by 60,000 hand-written examples of the ten numerals and a test set composed of 10,000 examples. For this database we have randomly chosen 200 examples of the numerals 0, 1, 2, 3, 6 and 8. We have not considered all ten numerals because the database is composed of hand-written numerals and our descriptors are invariant to rotation. More specifically, we have discarded the numerals 4 and 9 to avoid misclassification with the numeral 6. The numeral 7 is discarded to avoid misclassification with 1 and, finally, 5 as it is symmetric to 2.

Then, we have trained three types of binary classifiers: the Discrete Adaboost (DAB),5 a Linear Classifier (LinClass) and a normal classifier (CNormal), which are applied to the distribution of the Euclidean distances between the shape query and the shape model. The LinClass classifier is a linear classifier defined as:

$h(D_n) = \begin{cases} r_{pos}(D_n) & \text{if } D_n \leq d_T \\ r_{neg}(D_n) & \text{otherwise} \end{cases} \qquad (8)$

where the distance threshold $d_T$ is obtained from the intersection of two straight lines, $r_{pos}$ and $r_{neg}$. We have defined both lines by implicitly assuming that the margin distributions of $Z|_{Y=1}$ and $Z|_{Y=-1}$ are normal distributions. Thus, $r_{pos}$ is the tangent to the pdf of $Z|_{Y=1}$ at the point $t_{pos} = \mu_{pos} + \sigma_{pos}$ and $r_{neg}$ is the tangent to the pdf of $Z|_{Y=-1}$ at the point $t_{neg} = \mu_{neg} - \sigma_{neg}$.

Conversely, in the CNormal classifier we have also looked for a distance threshold, $d_T$, for determining whether the query shape belongs to a given class or not. In this case, we have applied the same weak classifier used in the learning process of the DAB classifier. Afterwards, we transform the distribution of distances in order to obtain a distribution consisting of a mixture of two normal distributions.

The combination rules introduced in this paper are suitable for binary classification problems. Thereby, we have compared the IN and DN rules with other combination rules: max, mean and median16 (Figure 1). Results are quite satisfying. The IN and DN rules outperform the other combination rules even though the hypotheses of independence and normal distribution have not been verified, except for the "GREC-descriptors" test using the CNormal classifier; in this case, the median rule outperforms the IN and DN methods.


Fig. 1. Mean misclassification rates for binary classification for (a) "GREC-descriptors" and (b) "MNIST-descriptors" using IN, DN, max, mean, median and single (from left to right columns).

5. Conclusion and Perspectives

In this work two optimal linear combination rules for binary classifiers have been introduced. Besides, experimental results show that the IN and DN methods outperform other usual non-linear combination rules even though the classifiers do not verify the condition of normal r.v. However, shape recognition is essentially a multi-class recognition problem. Thus, we must extend this approach from binary classifiers to multi-class classification problems. Then, experimental evaluation and a deeper study for non-normal classifiers have to be done.

References

1. S. Loncaric, Pattern Recognition 31, 983 (1998).
2. D. Zhang and G. Lu, Pattern Recognition 37, 1 (2004).
3. R. P. Lippmann, SIGARCH Computer Architecture News 16, 7 (March 1988).
4. K. Turner and J. Ghosh, Connection Science 8, 385 (1996).
5. R. E. Schapire and Y. Singer, Machine Learning 37, 297 (March 1999).
6. M. Skurichina and R. P. W. Duin, Pattern Analysis and Applications 5 (2002).
7. Y. Freund and R. E. Schapire, Experiments with a new boosting algorithm, in Thirteenth International Conference on Machine Learning (1996).
8. C. J. C. Burges, Data Mining and Knowledge Discovery 2, 1 (1998).
9. J. Kittler, A framework for classifier fusion: Is it still needed?, in Proceedings of the Joint IAPR International Workshops on Advances in Pattern Recognition (Springer-Verlag, 2000).
10. A. K. Jain, R. P. W. Duin and J. Mao, IEEE Transactions on PAMI 22 (January 2000).
11. L. Xu, A. Krzyzak and C. Y. Suen, IEEE Transactions on Systems, Man and Cybernetics 22 (1992).
12. E. Valveny and P. Dosch, Symbol recognition contest: A synthesis, in Graphics Recognition: Recent Advances and Perspectives, eds. J. Llados and Y. Kwon, Lecture Notes in Computer Science, Vol. 3088 (Springer-Verlag, 2004), pp. 368-386.
13. S. Tabbone and L. Wendling, Technical symbols recognition using the two-dimensional Radon transform, in Proceedings of the 16th International Conference on Pattern Recognition, Montreal, Canada (August 2002).
14. O. Ramos Terrades and E. Valveny, Local norm features based on ridgelets transform, in 8th International Conference on Document Analysis and Recognition (2005).
15. W.-Y. Kim, Y.-S. Kim and K. Y. S., A new region-based shape descriptor, tech. rep., Hanyang University and Konan Technology (December 1999).
16. J. Kittler, M. Hatef, R. P. W. Duin and J. Matas, IEEE Transactions on PAMI 20, 226 (1998).


Support Vector Machine Based Hierarchical Classifiers for Large Class Problems

Tejo Krishna Chalasani, Anoop M. Namboodiri and C.V. Jawahar

Center for Visual Information Technology International Institute of Information Technology, Hyderabad, India

E-mail: [email protected], [email protected], [email protected]

One of the prime challenges in designing a classifier for large-class problems such as Indian language OCRs is the presence of a large set of similar looking characters. The nature of the character set introduces problems with the accuracy and efficiency of the classifier. Hierarchical classifiers such as Binary Hierarchical Decision Trees (BHDTs) using SVMs as component classifiers have been effectively used to tackle such large-class classification problems. The accuracy and efficiency of a BHDT classifier depend on: i) the accuracy of the component classifiers, ii) the separability of the clusters at each node in the hierarchical classifier, and iii) the balance of the BHDT. We propose methods to tackle each of the above problems in the case of binary character images. We present a new distance measure, which is intuitively suitable when Support Vector Machines are used as component classifiers. We also propose a novel method for balancing the BHDT to improve its efficiency while maintaining the accuracy. Finally we propose a method to generate overlapping partitions to improve the accuracy of BHDTs. Comparison of the method with other classifier combination techniques such as 1vs1, 1vsRest and Decision Directed Acyclic Graphs shows that the proposed approach is highly efficient, while being comparable with the more expensive techniques in terms of accuracy. The experiments are focused on the problem of Indian language OCR, while the framework is usable for other problems as well.

Keywords: Binary Hierarchical Decision Tree, Support Vector Machine, Decision Directed Acyclic Graph.

1. Introduction

Efficient and accurate OCR engines play a highly critical role in making the information present in large quantities of document images available for searching and indexing. The challenges in developing OCR systems for Indian languages differ from those for English for a variety of reasons. Unlike English, most Indian languages have a large number of characters in their scripts, which makes the task of designing a classifier for them much more difficult. The characters in Indian languages are formed as compositions of basic shapes and sometimes also compositions of basic characters. This composition leads not only to a large number of characters, because of the numerous possible combinations, but also to similar looking characters, which makes the design of the classifier more difficult.

Support Vector Machines (SVMs)1 build large margin classifiers for the binary class classification problem, and they have proved to have high generalization performance both theoretically and empirically. The SVM formulation tries to find a hyperplane that divides a set of two classes with the largest margin. Extending this formulation of SVM directly to more than two classes is generally avoided due to the complex optimization problem it leads to. Instead, the multi-class SVM problem is dealt with by using an ensemble of two-class SVMs. There are various strategies to achieve the combination, of which 1vs1, 1vsRest, and hierarchical classification are the popular methods. Consider a problem with N classes. In the 1vsRest strategy, N two-class classifiers are trained, where the i-th classifier is trained considering the i-th class samples as the positive class and all the other samples as the negative class. When an unseen sample is given for testing, the distance from the separating plane is calculated for each classifier and the sample is assigned the label of the classifier for which it is farthest from the separating plane on the positive side. In the 1vs1 strategy, a classifier is trained for each pair of classes, resulting in NC2 classifiers. When a sample is given for classification, all the NC2 classifiers are used and a vote is taken from each classifier. The sample is assigned the label of the class that has the maximum votes.

Hierarchical decision classifiers divide a complex problem into simpler ones and tackle the sub-problems thus created. The results from these sub-problems are integrated to solve the main problem. Decision Directed Acyclic Graphs (DDAGs) and Binary Hierarchical Decision Trees (BHDTs) are two such popular hierarchical classification methods, which extend a binary classifier to multi-class classification using different ensemble strategies.


Fig. 1. A BHDT that classifies the first 8 letters of the English alphabet (root: {a, c, e, g} vs {b, d, f, h}; second level: {a, c} vs {e, g} and {b, d} vs {f, h}; leaf level: a vs c, e vs g, b vs d, f vs h).

A Decision Directed Acyclic Graph using SVMs has been used successfully for multi-class classification problems.3 The disadvantage of this approach is that the evaluation time for an unseen sample is O(N), where N is the number of classes. Since the number of classes is huge for IL-OCR, it is advisable to look for a logarithmic order; this is the prime motivation for exploring Binary Hierarchical Decision Trees.

Hierarchical classification has been used for character recognition with considerable success using DDAGs, BHDTs or a hybrid of both.4,5 The emphasis of the existing work has been on the design of DDAGs3 or on hybrid techniques,4,5 where at each node in the tree a decision is made whether to choose a DDAG or a BHDT, depending on the complexity of the decision boundary. Another possible approach is to build regular decision trees for a particular problem and then replace the classifier at each node with a large-margin classifier6 such as SVMs. However, many issues in designing BHDTs for applications such as character recognition have not been examined in sufficient detail.

A BHDT for N classes has N - 1 binary classifiers arranged as a binary tree with N leaf nodes, where each leaf node represents a class. The BHDT is built by combining the classes recursively, two groups at a time, and learning a classifier between the two groups. Figure 1 gives an example of a BHDT that classifies the first 8 letters of the English alphabet.
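For illustration, a sketch of how a trained BHDT labels a sample, assuming a hypothetical node structure with svm, left, right and label attributes (label set only at leaves):

def bhdt_classify(node, x):
    # Walk from the root to a leaf: each internal node holds a binary
    # SVM that routes the sample to the subtree containing its group
    # of classes; the leaf reached gives the final label. On a
    # balanced tree this takes O(log N) decisions.
    while node.label is None:               # internal node
        side = node.svm.predict([x])[0]     # +1 -> left group, -1 -> right
        node = node.left if side > 0 else node.right
    return node.label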

The running times of algorithms are often described using the Big-Oh (O(.)) notation, which gives the time required to solve an instance of the problem in terms of its size N.7 Running times are usually reported for the worst case and the average case. The average running time for labeling an unseen sample is O(log(N)), while the worst case time complexity is O(N). However, computing the optimal partition for a BHDT is NP-Complete,8 which makes the problem intractable as the number of classes increases. Hence, one needs to resort to approximate solutions in practice.

The primary issues that affect the accuracy and efficiency of a BHDT that uses SVMs as component classifiers are: i) the use of an appropriate distance measure in computing the clusters of a BHDT, ii) maintaining the balance of cluster sizes to improve efficiency, and iii) dealing with error rates that cascade down the levels of the tree.

In this paper, we present a new distance measure (Section 2.1) that intuitively suits the SVM, the binary classifier used at each node of the binary decision tree. Section 2.2 presents a novel approach to balancing the tree. Section 2.3 proposes an approach for improving the accuracy of the BHDT using overlapping partitions at each node of the tree. Our conclusions and future directions of work are presented in Section 3.

All the experiments are performed on a Telugu OCR data set with 329 classes. Each segmented character is scaled to a size of 25x25 pixels, maintaining the aspect ratio, and then binarized. The 25x25 matrix of each character is converted to a single row by appending the rows consecutively, resulting in a 625-dimensional feature vector. A linear kernel is used for the SVMs in all the experiments since its performance was comparable to (or sometimes better than) any polynomial kernel. All the results are provided in Section 2.4. As a reference point for the results, we provide the accuracies of various classifiers such as K-Nearest Neighbors, Artificial Neural Networks, and various ensembles of SVMs in Table 1.

Table 1. Comparison of accuracies of various classifiers: K-Nearest Neighbors, Artificial Neural Networks, and different ensemble strategies for SVMs.

Training samples per class   KNN      ANN      1vsRest   1vs1     DDAG
15                           80.29%   85.78%   20.31%    93.28%   91.97%
20                           83.87%   88.45%   22.31%    97.91%   95.89%
25                           87.34%   90.38%   25.31%    98.01%   97.52%
30                           91.81%   89.64%   25.97%    98.74%   98.86%

The results for the 1vsRest ensemble strategy are particularly bad because of the imbalance in training samples between the positive and negative classes. For example, consider the case of 20 samples per class: when designing the i-th classifier, it is given 20 positive samples and 6560 negative samples, because of which the positive class is not represented properly. Though the 1vs1 ensemble strategy gives good experimental results, every test sample has to go through NC2 classifiers to be labeled, making it quadratic in time and impractical.

2. BHDT for IL-OCR using SVMs

In this section, we describe the specific approaches that we propose to improve the accuracy and efficiency of the BHDT classifier:

• the use of an appropriate distance measure,
• improving the balance of the BHD tree, and
• dealing with cascading error rates within the tree.

2.1. Distance Metric for Binary Decision Tree

A BHDT contains a decision tree that partitions the classes into two sets at each node. The partitioning is often achieved by a hierarchical clustering algorithm,8 and the accuracy of the classifier depends on the clusters generated. Different clustering algorithms such as k-means, agglomerative clustering and graph-cut based partitioning can be used to cluster the classes, and they use a distance measure between samples to achieve the clustering. A good distance metric should be invariant to the type of features, and should be robust and easy to compute on small training sets. It should also be compatible with the binary classifier used at each node. Commonly used distance metrics such as the Euclidean distance do not work well with binary features for clustering.9 Other metrics such as the Mahalanobis distance and the Kullback-Leibler distance10 need a large number of samples per class for robust estimates. However, the segmented characters used for recognition are binary, and having a large number of samples per class with an already huge number of classes often makes the training process prohibitively expensive.

As the clustering algorithm tries to group character classes, we propose a new distance measure based on the separability of the classes. To compute the distance measure, an SVM classifier is trained between each pair of classes. The margin of the classifier for a class pair is used as the distance between the two classes. We use the single-link clustering algorithm with the above distance measure to build the tree. This clustering algorithm considers all classes to be different clusters, merges the nearest clusters and continues the process till two clusters are left.2

Fig. 2. Each row contains a character cluster that is placed near the leaves by the proposed clustering scheme. Note the similarity in shapes.

Since the margin can be used as a measure of classifiability, the classes at the leaf nodes should be those that are most difficult to classify, and this observation has been consistent with our experiments (see Fig. 2). The results on the Telugu OCR data set when the proposed metric is used as the distance measure, in comparison with the Euclidean distance, are provided in Section 2.4.

2.2. Balanced BHDTs (BBHDT)

A balanced tree can bring down the bound on the worst case time complexity from O(N) to O(log N). This can considerably increase the speed of classification in large class problems such as IL-OCR. We present an algorithm that balances the tree with only a minimal reduction in accuracy. Let P(w_i) be the a priori probability of the class w_i ∈ Ω and d_{x,i} be the distance from the current node x to its successor corresponding to class w_i. Each node x is given a weight W_x = Σ_{i ∈ subtree(x)} d_{x,i} P(w_i). The weight of the node is the expected time to classify a sample that arrives at x. Let the imbalance of a node x be defined as imB_x = |W_{x→left-child} - W_{x→right-child}|.

Let T be the root node of the tree built using some clustering algorithm for the given classes. With this notation, Algorithm 2.1 describes the balancing procedure, which balances the tree in a recursive fashion. The idea is to rectify the imbalance at a node by pushing the classifier at that node to a lower position in the heavier subtree.

Fig. 3. Push operation: the darker nodes represent the roots of heavier subtrees. Note that the overall imbalance is reduced after the push.

Algorithm 2.1 BalanceTree(T, δ)

if T = leaf node then
  return T
end if
BalanceTree(T → left-child, δ)
BalanceTree(T → right-child, δ)
if imB_T > δ then
  T' = push(T)
  if imB_{T'} < imB_T then
    T = T'
    updateWeight(T)
  end if
end if

Let C_max denote the child of node C that has the maximum weight and C_min the child with the minimum weight; the push operation is defined in Algorithm 2.2 using this notation. Figure 3 shows an illustration of the push operation and the balance change that results.

Algorithm 2.2 push(T)

T' = T_max
tmp = T'_min
T'_min = T
T_max = tmp
return T'

The algorithm first balances the left and right children at each node and then calculates the imbalance. If the imbalance is greater than δ, a parameter set by the user, the push operation takes place. The push is committed only if the imbalance after the push operation is smaller than the imbalance present before. The push operation also ensures that the natural clustering is not disturbed if the value of the imbalance is large. The accuracies before and after balancing the decision tree using the above algorithm are reported in Section 2.4.

2.3. Overlapping BHDT (OBHDT)

One of the reasons for the lower accuracy of a BHDT when compared to a DDAG is the presence of overlapping clusters at certain nodes of the BHDT. We now present an algorithm that takes this overlap into consideration and modifies a BHDT to improve its accuracy. When classifiers are designed between two large clusters of classes, the accuracy often suffers as some classes will overlap the cluster decision boundaries. To get around this problem, we introduce the overlapping subset of classes into both clusters. This postpones the classification of such classes until the clusters are smaller and more manageable. The algorithm builds on the already existing BHDT. We use an evaluation set to identify the nodes at which misclassifications occur. Algorithm 2.3 shows how to build the OBHDT.

Algorithm 2.3 Build OBHDT(T)

1: for each class w_i do
2:   Let t' be the first node at which w_i gets misclassified
3:   pushclass(t', w_i)
4: end for

Algorithm 2.4 pushclass(T, w_i)

1: if T is a leaf node with class w_x then
2:   build classifier w_i vs w_x at T
3:   return
4: end if
5: if above r% of samples of w_i fall to the x-child then
6:   pushclass(T → x-child, w_i)
7: else
8:   pushclass(T → left-child, w_i)
9:   pushclass(T → right-child, w_i)
10: end if

The parameter r controls the overlap at which we decide to add a class to both the clusters at a node. The results of the OBHDT algorithm in comparison with those of a DDAG and a BHDT on the Telugu OCR data set are presented in Section 2.4.

2.4. Experimental Results

In this section we present the results of the various algorithms and experiments on the Telugu OCR data set. The Telugu script has 329 character classes, which makes it challenging to classify and correctly label the unseen data.

Table 2 shows the improvement in classification when the margin, as specified in Section 2.1, is used as the distance measure to partition the data set, as opposed to the Euclidean distance. The Mahalanobis distance based metric was not able to give sufficient separation between clusters for the SVM classifier to converge.

Table 2. Comparison of accuracies with different distance metrics used for clustering.

Training samples per class   Euclidean dist.   Margin
15                           80.04%            87.70%
20                           83.27%            89.27%
25                           84.82%            92.78%
30                           86.95%            96.30%

Table 3 gives the accuracies and the time taken, measured as the weight of the root node, before and after applying the Balance Tree Algorithm 2.1. Since we do not have any a priori information about the data, we consider all classes equally probable, so the value of P(w_i) is set to 1/n for all i when calculating the weight at each node, as defined in Section 2.2. The time column in Table 3 is the weight (calculated according to the formula in Section 2.2) of the root node of the BHDT, which is the expected number of classifications each sample requires.

Table 3. Accuracy (acc) and time taken before and after applying the balance tree algorithm.

Train. smpls          BHDT               BBHDT
20   acc / time       89.27% / 30.32     90.03% / 23.37
25   acc / time       92.78% / 32.25     91.31% / 25.21
30   acc / time       96.30% / 32.01     95.71% / 23.04

Table 4 shows the improvement in accuracy of the overlapped version of the BHDT over the regular BHDT, while maintaining the efficiency. The value of r in Algorithm 2.3 is set to 75% for the experiment, and the time taken in the case of the OBHDT was measured using a test set with 10 samples per class, noting the average number of classifications.

Table 4. Comparison of accuracies and time taken for the various ensembles of SVM classifiers.

Training samples per class   DDAG     BHDT     OBHDT
15                           91.97%   87.70%   91.34%
20                           95.89%   89.27%   96.11%
25                           97.52%   92.78%   98.05%
30                           98.86%   96.30%   98.91%
time                         328      32       34.5

Table 5 shows the performance of the BHDT classifier on various datasets chosen from the UCI machine learning dataset repository. The chosen datasets have larger numbers of classes, to demonstrate the ability of the proposed classifier design algorithm. Note that the classifier under comparison is the non-overlapping BHDT; the accuracies can be further improved by considering the overlapping version, at the cost of a slight increase in classification time.

Table 5. Performance comparison with popular classifiers on various datasets.

    Classifier    optdigits    pendigits    glass     yeast
    ANN           89.76%       90.34%       55.65%    73.35%
    KNN           93.43%       97.74%       76.47%    48.36%
    DDAG          95.46%       96.32%       80.41%    76.01%
    BHDT          95.53%       97.01%       81.31%    75.89%

3. Conclusions and Future Work

We note that a hierarchical classifier system performs better when separability (margin) is used as the distance metric for partitioning the data sets. An overlapping partitioning scheme is proposed that increases the accuracy of a BHDT with only a minor loss of efficiency. The resulting classifier performs better than DDAGs, while being an order of magnitude smaller in both memory footprint and time taken for classification. A further increase in the efficiency of the classifier is obtained using a balancing algorithm. This design suits large-class classification problems such as OCR, and it can be applied to other problems as well.

One could combine balancing with the clustering algorithm to directly generate more balanced clusters, which could possibly result in efficient classifiers of higher accuracy.

References

1. C. J. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, vol. 2, 121-167 (1998).
2. R. Sibson, SLINK: An optimally efficient algorithm for the single-link cluster method, The Computer Journal, vol. 16, 30-34 (1973).
3. J. C. Platt, N. Cristianini and J. Shawe-Taylor, Large Margin DAGs for Multiclass Classification, Advances in Neural Information Processing Systems, vol. 12, 547-553 (2000).
4. M. N. S. S. K. Pavan Kumar and C. V. Jawahar, Design of Hierarchical Classifier with Hybrid Architectures, PReMI, 276-279 (2005).
5. M. N. S. S. K. Pavan Kumar and C. V. Jawahar, Configurable Hybrid Architectures for Character Recognition Applications, International Conference on Document Analysis and Recognition, vol. 1, 1199-1203 (2005).
6. D. Wu, K. P. Bennett, N. Cristianini and J. Shawe-Taylor, Enlarging the Margins in Perceptron Decision Trees, Machine Learning, vol. 41, 295-313 (2000).
7. T. H. Cormen, C. E. Leiserson, R. L. Rivest and C. Stein, Introduction to Algorithms, 2nd ed., MIT Press and McGraw-Hill (2001).
8. L. Hyafil and R. L. Rivest, Constructing optimal binary decision trees is NP-complete, Information Processing Letters, vol. 5, 15-17 (1976).
9. V. Vural and J. G. Dy, A Hierarchical Method for Multi-Class Support Vector Machines, International Conference on Machine Learning, 105 (2004).
10. Y. Chen, M. M. Crawford and J. Ghosh, Integrating Support Vector Machines in a Hierarchical Output Space Decomposition Framework, IEEE International Geoscience and Remote Sensing Symposium, vol. 2, 949-952 (2004).


Unsupervised Approach for Structure Preserving Dimensionality Reduction

Amit Saxena

Computer Science and Information Technology Department G. G. University

Bilaspur 495001, India. Email: [email protected]

Megha Kothari

Computer Science and Engineering Department Jadavpur University

Kolkata 700032, India

In this paper, a new technique is presented that reduces the dimensionality of a large data set without disturbing its topology while maintaining high classification accuracy. For this, a Genetic Algorithm is used with the Sammon error as the fitness function. The proposed technique is tested on four real and one synthetic data set. A high value of the correlation coefficient between the proximity matrix of the original data set and that of the data set with a reduced number of features ensures that the topology of the data set is preserved even with the reduced number of features. A comparative study of the clustering results obtained with the reduced and original data sets justifies the capability of the proposed technique to give good classification accuracy even with a reduced number of features.

Keywords: Dimensionality reduction; Feature analysis; Genetic Algorithm; Classification techniques

1. Introduction

One of the major problems in data mining of large databases is the high dimensionality of the data. It is frequently observed that a number of features of the patterns do not affect the decision of classifying a pattern, yet these features make the process of classification, and subsequently of mining the database, quite complex. It is therefore realized that selecting a subset of features from the original database is of much importance, as these features alone can be decisive in classifying the patterns with maximum or sufficiently high accuracy. Feature extraction and dimensionality reduction, two important issues, have therefore been associated with the mining process. Another concern due to a large number of features is that extra or redundant features not only lead to more computational overhead but can also create confusion by degrading the performance of the classifier designed to work on them. Dimensionality reduction is mainly done in two ways: selecting a small but important subset of features, and generating (extracting) lower dimensional data preserving the distinguishing characteristics of the original higher dimensional data.1 Dimensionality reduction not only helps in the design of a classifier, it also helps in other exploratory data analysis: it aids in assessing the clustering tendency as well as in deciding the number of clusters by looking at a scatter plot of the lower dimensional data. Feature extraction and data projection can be viewed as an implicit or explicit mapping from a P-dimensional input space to a Q-dimensional output space (P > Q) such that some criterion is optimized. This formulation is similar to a function approximation problem.

A large number of approaches for feature extraction and data projection are available in the pattern recognition literature.2-4 These approaches differ in aspects such as the characteristics of the mapping function, how it is learned, and what optimization criterion is used. The mapping function can be either linear or nonlinear, and can be learned through either supervised or unsupervised methods. Feature selection, too, can be supervised or unsupervised. Much of the work published so far belongs to the supervised approach, which includes supervised feature selection using Neural Networks, Fuzzy, Neuro-fuzzy, and other similar soft computing methods.3-5

Unsupervised feature selection has rarely been addressed, although recent publications have shown some contributions in this area.6-8 Unsupervised feature selection methods can be classified into two categories. In the first category, the methods involve maximization of clustering performance based on some index.7-9 In the other category, the correlation coefficient,10 measures of statistical dependence11 or linear dependence12 are used. The proposed scheme falls in the second category.

In this paper we propose a new approach to unsupervised feature selection which preserves the topology of the data set. A Genetic Algorithm (GA) is used to select a subset of features, taking the Sammon error as the fitness function. The data set with the reduced number of features is then tested for accuracy using the well-known k-means clustering technique, and the correlation coefficient between the proximity matrices of the original and reduced data sets is computed to ensure that the inter-point distances, i.e., the topology of the data set, are preserved. The paper is organized as follows. Section 2 describes the selection of features using a Genetic Algorithm; Section 3 presents the experimental results, including the data sets, simulations and analysis; Section 4 concludes.

2. Selection of Features using Genetic Algorithm

The Genetic Algorithm (GA)13 is a powerful evolutionary algorithm which searches for the best solution to a problem among a population of candidate solutions. In a typical GA, each chromosome in a set of chromosomes represents a prospective solution of the problem. The problem is associated with a fitness function which serves as its objective. The set of chromosomes, usually called a population, goes through repeated iterations of crossover and mutation operations to find a better solution after each iteration. At a certain fitness level, or after a certain number of iterations, the procedure is stopped and the chromosome giving the best solution is preserved as the best solution of the problem.

Let X = {x_k | x_k = (x_{k1}, x_{k2}, ..., x_{kp}), k = 1, 2, ..., n} be the set of n input vectors and let Y = {y_k | y_k = (y_{k1}, y_{k2}, ..., y_{kq}), k = 1, 2, ..., n} be the unknown vectors to be found. Let d*_{ij} = d(x_i, x_j) for x_i, x_j ∈ X and d_{ij} = d(y_i, y_j) for y_i, y_j ∈ Y, where d(·,·) denotes the Euclidean distance. The Sammon error E is given by

E = \frac{1}{\sum_{i<j} d^*_{ij}} \sum_{i<j} \frac{(d^*_{ij} - d_{ij})^2}{d^*_{ij}}.   (1)
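For concreteness, the following Python sketch evaluates Eq. (1) when the reduced vectors y_k are taken to be the original vectors restricted to a binary feature mask, which is how the chromosomes are used in the fitness computation below; the implementation choices are ours, not the authors'.

```python
# Minimal sketch of the Sammon error of Eq. (1) for a binary feature mask.
import numpy as np
from scipy.spatial.distance import pdist

def sammon_error(X: np.ndarray, mask: np.ndarray) -> float:
    """Sammon stress between X and X restricted to the features where mask == 1."""
    d_star = pdist(X)                          # pairwise distances, all features
    d_red = pdist(X[:, mask.astype(bool)])     # pairwise distances, selected features
    nz = d_star > 0                            # guard against coincident points
    return np.sum((d_star[nz] - d_red[nz]) ** 2 / d_star[nz]) / d_star[nz].sum()
```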

The entire procedure is outlined below.

• Initialize the population: The length of each chromosome equals the number of features in the original data set; if p is the total number of features, the chromosome length is p. Let N_p be the number of chromosomes in the population. Each chromosome is initialized by randomly assigning a zero or a one to each of its positions. A 1 represents the presence of the corresponding feature, whereas a 0 means its absence, which is termed reducing the feature in the present context.

• Measure of fitness: If the ith bit of the chromosome is a 1, then the ith feature of the original data set participates in computing the Sammon error; otherwise that feature does not participate. Compute d*_{ij} and d_{ij} for the entire and the reduced data set respectively, and calculate the fitness of each chromosome by computing the Sammon error E given by Eq. (1).

• Selection, crossover and mutation: Using the roulette wheel selection method, pairs of chromosomes are selected and subjected to single-point crossover. The new offspring are then allowed to mutate. The fitness of each chromosome of the new population is computed, and the N_E best chromosomes (those with the lowest Sammon error) of the previous generation replace the worst N_E chromosomes of the new population.

• Termination of GA: The GA is terminated when a predefined number of generations is completed.

• Validation: The correlation between the proximity matrices with all features and with the reduced features (d*_{ij} and d_{ij}) is computed to test the ability of the proposed technique to preserve the topology of the data set. Moreover, k-means clustering is used to compare the accuracies obtained on the reduced and the entire data sets. The entire loop is sketched in code below.
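The sketch below follows the procedure above, using the parameter values reported in Section 3.1 (population size 50, crossover probability 0.8, mutation probability 0.07, elitism with N_E = 2, 20 generations) and the sammon_error function sketched earlier; the GA operators are standard textbook versions, not the authors' exact implementation.

```python
# Hedged sketch of the GA-based feature selection outlined above.
import numpy as np

def run_ga(X, n_pop=50, n_gen=20, p_cross=0.8, p_mut=0.07, n_elite=2, seed=0):
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    pop = rng.integers(0, 2, size=(n_pop, p))
    pop[pop.sum(axis=1) == 0, 0] = 1                  # avoid empty feature sets
    for _ in range(n_gen):
        err = np.array([sammon_error(X, c) for c in pop])
        elite = pop[np.argsort(err)[:n_elite]].copy() # best of this generation
        fit = 1.0 / (1e-12 + err)                     # lower error -> higher fitness
        idx = rng.choice(n_pop, size=n_pop, p=fit / fit.sum())  # roulette wheel
        children = pop[idx].copy()
        for i in range(0, n_pop - 1, 2):              # single-point crossover
            if rng.random() < p_cross:
                cut = rng.integers(1, p)
                tmp = children[i, cut:].copy()
                children[i, cut:] = children[i + 1, cut:]
                children[i + 1, cut:] = tmp
        flips = rng.random(children.shape) < p_mut    # bit-flip mutation
        children ^= flips.astype(children.dtype)
        children[children.sum(axis=1) == 0, 0] = 1
        # elitism: best previous chromosomes replace the worst new ones
        err_new = np.array([sammon_error(X, c) for c in children])
        children[np.argsort(err_new)[-n_elite:]] = elite
        pop = children
    err = np.array([sammon_error(X, c) for c in pop])
    return pop[np.argmin(err)]                        # best feature mask found
```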

3. Experimental Results

3.1. Experiment Scheme

Four real data sets obtained from the UCI Machine Learning Repository14 and one synthetic data set have been used for the experiments. The synthetic data set has 588 patterns with five features, distributed in two classes representing two isolated hyperspheres.


Table 1. Description of data sets used.

    Data set      #Classes    #Features    Size
    WBC           2           9            683 (444+239)
    Wine          3           13           178 (59+72+47)
    Ionosphere    2           34           351 (225+126)
    Sonar         2           60           208 (97+111)
    Synthetic     2           5            588 (252+336)

All these data sets are summarized in Table 1. For each of the five data sets, the population size N_p is set to 50; the number of chromosomes retained through elitism, N_E, is set to 2; and the crossover and mutation probabilities are set to 0.8 and 0.07, respectively. With these parameter settings the GA is evolved for 20 generations, and the chromosome of the final population with the minimum Sammon error is marked to record the selected features. The value of the Karl Pearson correlation coefficient is then computed between the proximity matrix with all the features (d*_{ij}) and that with the reduced number of features (d_{ij}). The mean values along with standard deviations (in brackets) reported are averaged over ten such independent runs of the GA.
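In code, this topology check is a one-line correlation between the condensed distance vectors of the full and reduced data sets (our own sketch, not the authors' code):

```python
# Pearson correlation between the proximity matrices of the full data set
# and of the data set restricted to the GA-selected features.
import numpy as np
from scipy.spatial.distance import pdist

def proximity_correlation(X: np.ndarray, mask: np.ndarray) -> float:
    return float(np.corrcoef(pdist(X), pdist(X[:, mask.astype(bool)]))[0, 1])
```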

Table 2. Sammon error and correlation values for different data sets.

    Data set      # Reduced features    Sammon error       Correlation coefficient
    WBC           2                     0.26061 (0)        0.8167 (1.9E-08)
    Wine          3                     0.0001 (0.0003)    0.9999 (3.0E-05)
    Ionosphere    8                     0.2385 (0.0042)    0.9122 (0.0122)
    Sonar         10                    0.2614 (0.0141)    0.8531 (0.0400)
    Synthetic     3                     0 (0)              1 (0)

Table 3. k-means classification accuracies for data sets used.

    Data set      # Features    Mean error
    WBC           2             5.139 (1.909)
                  9             4.348 (1.250)
    Wine          3             29.213 (1.45)
                  13            30.056 (0.8882)
    Ionosphere    8             31.908 (1.414)
                  34            28.774 (6.36E-07)
    Sonar         10            43.269 (2.584)
                  60            44.711 (0)
    Synthetic     3             0 (0)
                  5             0 (0)

3.2. Results and Analysis

Table 2 reports the Sammon error for all the data sets with the reduced number of features. The last column of the same table shows, for each data set, the correlation coefficient between the proximity matrix with all the features and that with the reduced number of features. A high correlation between the proximity matrices implies that the relative inter-point distances in the reduced feature space are the same as those in the original feature space, indicating that the topology of the data set is well preserved even in the reduced feature space. Further, to assess the proposed scheme in terms of classification accuracy, we compare in Table 3 the classification results obtained using the k-means method with all and with the reduced number of features. From this table it is noted that the mean errors and standard deviations obtained with the reduced features and with all features are in good agreement.
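The paper does not state how k-means clusters are mapped to class labels when computing these errors; the sketch below assumes the common majority-vote mapping and uses scikit-learn's KMeans.

```python
# Sketch of the k-means validation step under a majority-vote label mapping.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_error(X: np.ndarray, y: np.ndarray, k: int) -> float:
    """Clustering error (%): samples disagreeing with their cluster's majority
    label; y must contain non-negative integer class labels."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    wrong = 0
    for c in range(k):
        members = y[labels == c]
        if members.size:
            wrong += int((members != np.bincount(members).argmax()).sum())
    return 100.0 * wrong / len(y)
```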

Table 4 presents the best results obtained for all the data sets in 10 runs of the GA with minimum Sammon error, with the corresponding correlation coefficient values and the k-means accuracies with all and with the reduced features. For each data set, shown in column 1 of the table, the least number of selected features is shown in column 2, with the total number of features immediately below it. It is observed that for the Synthetic data set, the number of reduced features is around half the total number of features; for the rest of the data sets, the number of features is reduced to roughly a quarter or less of the total. The k-means accuracies for the various data sets are shown in the last column. From these values it is seen that for the WBC data set the accuracy is quite high. For the remaining cases the accuracy is not that high, but it is important to note that in these cases the accuracy obtained with all features is also not high, and is very close to that obtained with the reduced number of features.

Table 4. Best results obtained in 10 runs of GA with minimum Sammon error (for each data set, the first row corresponds to the reduced feature set and the second to the full feature set).

    Data set      # Features    Sammon error    Correlation coefficient    k-means accuracy
    WBC           2             0.26061         0.81678                    96.046
                  9             0               1                          96.046
    Wine          3             0.00001         1                          69.662
                  13            0               1                          69.662
    Ionosphere    8             0.23117         0.90731                    70.370
                  34            0               1                          71.225
    Sonar         10            0.24735         0.88696                    56.730
                  60            0               1                          55.288
    Synthetic     3             0               1                          100
                  5             0               1                          100

4. Conclusions

In this paper, we propose a dimensionality reduction technique based on unsupervised feature selection. To select the minimum possible number of features from a large data set, a Genetic Algorithm is used such that the fitness, given by the Sammon error, is minimized after iterating the population of solutions a sufficient number of times. The high value of the correlation coefficient between the proximity matrices of the reduced and original data sets ensures that the topology of the data set is preserved even in the reduced dimension. The classification accuracies obtained with the reduced and the complete data sets are compared using the unsupervised k-means clustering technique, and are found to be quite close to each other. The technique is tested on four real and one synthetic, diverse data sets. It follows that selecting a subset of features from the entire feature set does not significantly affect the classification accuracy. We therefore conclude that the proposed technique can reduce the dimension of a data set to a great extent without disturbing its structural topology, while maintaining a high accuracy of classification.

References

1. N. R. Pal, V. K. Eluri and G. K. Mandal, IEEE Transactions on Fuzzy Systems 10, 277 (June 2002).
2. A. K. Jain and R. C. Dubes, Algorithms for Clustering Data (Prentice Hall, 1988).
3. N. R. Pal, Fuzzy Sets and Systems 103, 201 (1999).
4. D. P. Muni, N. R. Pal and J. Das, IEEE Transactions on Evolutionary Computation 8, 183 (2004).
5. D. P. Muni, N. R. Pal and J. Das, IEEE Transactions on Systems, Man and Cybernetics 36, 106 (2006).
6. P. Mitra, C. A. Murthy and S. K. Pal, IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 301 (2002).
7. M. Dash and H. Liu, Unsupervised feature selection, in Asia Pacific Conference on Knowledge Discovery and Data Mining, 2000.
8. S. K. Pal, R. K. De and J. Basak, IEEE Transactions on Neural Networks 11, 366 (2000).
9. J. Dy and C. Brodley, Feature subset selection and order identification for unsupervised learning, in Proc. 17th International Conference on Machine Learning, 2000.
10. M. A. Hall, Correlation-based feature selection for discrete and numeric class machine learning, in Proc. 17th International Conference on Machine Learning, 2000.
11. R. P. Heydorn, IEEE Transactions on Computers, 1051 (1971).
12. S. K. Das, IEEE Transactions on Computers, 1106 (1971).
13. D. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning (Addison Wesley, New York, 1989).
14. D. J. Newman, S. Hettich, C. L. Blake and C. J. Merz, UCI Repository of Machine Learning Databases, University of California, Irvine, Department of Information and Computer Sciences (1998), http://www.ics.uci.edu/~mlearn/MLRepository.html.


PART J

Shape Recognition



A Beta Mixture Model Based Approach to Text Extraction from Color Images

Anandarup Roy

Electronics and Communication Sciences Unit Indian Statistical Institute,

Kolkata-700108, India. E-mail: [email protected]

Swapan Kumar Parui

Computer Vision and Pattern Recognition Unit Indian Statistical Institute,

Kolkata-700108, India. E-mail: [email protected]

Utpal Roy

Dept. of Computer and System Sciences Visva-Bharati,

Santiniketan, Birbhum-731235, India. E-mail: [email protected]

Text regions are to be separated from a document image before it is sent to an OCR system for further processing. The problem of text separation takes a different dimension when the input is a color image. In the present study, a new statistical mixture model, namely a mixture of multivariate Beta distributions, is proposed to approximate the distribution of the spectral features of pixels in a color document image. An unsupervised algorithm, based on the Expectation Maximization framework, for learning the parameters of a multivariate Beta mixture distribution is presented. Individual components of the mixture correspond to the different colors present in the image. Color clustering is performed on the basis of maximum a posteriori probabilities. After color clustering, a geometric analysis for elongatedness is done on each connected component of each cluster. Based on the degree of elongatedness of a connected component, the text portion is identified and extracted. A comparison is made between the results obtained with Beta and Gaussian mixture models.

Keywords: Beta Mixture Model; Color Segmentation; Text Extraction.

1. Introduction

As compared to binary and gray scale document images, color document images contain much richer information as far as segmentation of text regions is concerned. Color allows the text regions to be more distinguishable from the background and the non-text regions. Color segmentation techniques can be divided into feature-space based, image-domain based and physics based techniques.1 Feature-based methods focus only on the color features, where color similarity is the sole criterion to segment an image. Image-domain based methods take spatial factors into consideration. Physics based techniques are mainly used to process real scene images, where physical models of the reflection properties of materials are utilized. Text extraction from images and video sequences finds many useful applications in document processing2 and content-based

image/video retrieval from image/video databases.3

There have been several studies on text segmentation in the last few years. Wu et al.4 use a local threshold method to segment texts from gray image blocks containing text. Observing that texts in images and videos are usually colorful, Tsai et al.5 develop a threshold method using intensity and saturation features to segment texts in color document images. Lienhart et al.6 and Sobottka et al.7 use color clustering algorithms for text segmentation. The present article attempts to extract the text regions embedded in a complex colored background. The proposed method is based on a finite mixture of multivariate Beta distributions, where each component of the mixture represents a homogeneous cloud in the color spectral space. The reason for using a Beta distribution instead of a Gaussian distribution is that a Beta probability density function (pdf) has a more versatile shape than a Gaussian pdf: a Beta pdf need not be symmetric, can be skewed either way, and can even be bimodal. Thus, a Beta mixture can approximate the distribution of a real-life dataset more closely, and with fewer components, than a Gaussian mixture. We assume each color corresponds to one component of a p-variate Beta mixture. By estimating the parameters of the mixture, the components can be identified and separated; thus we perform color clustering. The connected components are then detected for each cluster, and based on the degree of elongatedness of a connected component, the text portions are detected and extracted from the image. A color segmentation scheme based on a mixture of multivariate Gaussian distributions has been reported earlier,8 and we make a comparative study in this paper between the two mixture models for color segmentation.

2. Mixture of Multivariate Beta Distributions

A p-variate Beta distribution of a random vector X = (X_1, X_2, ..., X_p) is given by the pdf

f(x_1, \dots, x_p) = \prod_{h=1}^{p} \frac{\Gamma(m_h + n_h)}{\Gamma(m_h)\,\Gamma(n_h)}\, x_h^{m_h - 1} (1 - x_h)^{n_h - 1}

with 0 < x_h < 1, m_h > 0, n_h > 0. The random variables X_1, X_2, ..., X_p are assumed to be independent. The first and second order raw moments of X_h are

E(X_h) = \frac{m_h}{m_h + n_h}, \qquad E(X_h^2) = \frac{m_h (m_h + 1)}{(m_h + n_h)(m_h + n_h + 1)}.   (1)

A mixture of K p-variate Beta distributions is given by

g(\mathbf{x}|\Theta) = \sum_{j=1}^{K} p_j f(\mathbf{x}|j, \Theta_j),   (2)

where p_j (0 < p_j < 1 and \sum_{j=1}^{K} p_j = 1) are the mixing proportions, f(\mathbf{x}|j, \Theta_j) is the p-variate Beta distribution representing the jth component of the mixture, and \Theta_j = (m_{j1}, ..., m_{jp}, n_{j1}, ..., n_{jp}) is the set of parameters of the jth component. The symbol \Theta refers to the entire set of parameters to be estimated, i.e. \Theta = (\Theta_1, ..., \Theta_K, p_1, ..., p_K).

3. Expectation Maximization Algorithm

Let \Xi = \{x_1, ..., x_N\} be a set of N p-dimensional feature vectors on the basis of which the parameters of the mixture model in Eq. (2) are to be estimated. Suppose the conditional distribution q(j|x_i, \Theta_j), for all j and i, of the so-called hidden variables is known. Then the Expectation Maximization (EM) algorithm9 maximizes the log-likelihood function given by

\Phi(\Theta) = \sum_{i=1}^{N} \sum_{j=1}^{K} q(j|x_i, \Theta_j) \log\big(p_j f(x_i|j, \Theta_j)\big).   (3)

First, the K-means algorithm is applied to obtain an initial partition of \Xi. From this partition, the parameters in \Theta are obtained from Eq. (4) below and by setting p_j = N_j/N, where N_j is the number of elements in the jth cluster. Note that Eq. (4) can be obtained from Eq. (1):

m_{jh} = E[X_h]\, \frac{E[X_h] - E[X_h^2]}{E[X_h^2] - E[X_h]^2}, \qquad n_{jh} = (1 - E[X_h])\, \frac{E[X_h] - E[X_h^2]}{E[X_h^2] - E[X_h]^2}.   (4)

3.1. Expectation-Step: Distribution Estimation

Given the set of parameters \Theta, the distribution of the hidden variables can be computed as:

q(j|x_i, \Theta_j) = \frac{p_j f(x_i|j, \Theta_j)}{\sum_{l=1}^{K} p_l f(x_i|l, \Theta_l)}.   (5)

3.2. Maximization-Step: Parameter Estimation

The parameters of the mixture model are now re-estimated. Let \mu_j and u_j be respectively the first and second order raw moment vectors for the jth component of the Beta mixture. The expressions for \mu_j and u_j are:

\mu_j = \frac{1}{N_j} \sum_{i=1}^{N} q(j|x_i, \Theta_j)\, x_i,   (6)

u_j = \frac{1}{N_j} \sum_{i=1}^{N} q(j|x_i, \Theta_j)\, x_i^2,   (7)

where x_i^2 denotes the component-wise square and

N_j = \sum_{i=1}^{N} q(j|x_i, \Theta_j).   (8)


The parameters in \Theta are re-estimated by Eq. (4) and by setting p_j = N_j/N, with N_j as in Eq. (8). The EM algorithm performs the Expectation (E) step and the Maximization (M) step successively until the log-likelihood value in Eq. (3) stabilizes. In order to obtain an optimal value of K, the Bayesian Inference Criterion (BIC)10 is used. The BIC function for a p-variate Beta mixture is given as:

BIC(K) = -2 \sum_{i=1}^{N} \log g(x_i|\Theta) + ((2p+1)K - 1) \ln N.

The value of K that minimizes the BIC function value is chosen as the optimum value of K.

4. Color Segmentation and Text Extraction

The rgb color model is used here for the color segmentation. The rgb model is derived from the RGB color model by the expressions

r = \frac{R+1}{257}, \qquad g = \frac{G+1}{257}, \qquad b = \frac{B+1}{257},

so that the (r,g,b) values lie strictly between 0 and 1. We assume the (r,g,b) values of pixels in a color image arise from a finite mixture of trivariate (p = 3) Beta distributions as described in Sec. 2. The parameters of this mixture are estimated on the basis of the (r,g,b) values of the pixels of an image using the EM algorithm described below.

Estimation of Trivariate Beta Mixture Parameters:
Input: A set \Xi = \{x_1, ..., x_N\} of the (r,g,b) values of N pixels of a color image.
Output: \Theta = (\Theta_1, ..., \Theta_K, p_1, ..., p_K).

(1) Perform the initialization process described in Sec. 3 for the trivariate case.
(2) {The E-Step}: Using Eq. (5), compute q(j|x_i, \Theta_j).
(3) {The M-Step}: Using Eq. (6) and Eq. (7), compute the moments of each of the three variables, for each component.
(4) Update \Theta_j and p_j using Eq. (4) and Eq. (8).

Repeat Steps 2 to 4 until convergence.
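The following Python sketch puts Eqs. (4)-(8) and the BIC of Sec. 3 together. The random hard-assignment initialization (the paper uses K-means), the numerical guards, and the helper names are our own assumptions, not the authors' code.

```python
# Hedged sketch of the EM procedure and BIC for a p-variate Beta mixture.
import numpy as np
from scipy.stats import beta as beta_dist
from scipy.special import logsumexp

def normalize_rgb(img):
    """Map 8-bit RGB into (0,1) as in Sec. 4: (R+1)/257, etc."""
    return (np.asarray(img, dtype=np.float64) + 1.0) / 257.0

def moments_to_params(mu, u):
    """Eq. (4): method-of-moments Beta parameters from raw moments."""
    common = (mu - u) / np.maximum(u - mu ** 2, 1e-12)
    return np.maximum(mu * common, 1e-6), np.maximum((1.0 - mu) * common, 1e-6)

def em_beta_mixture(X, K, n_iter=100, tol=1e-6, seed=0):
    """X: N x p array with entries strictly in (0,1), e.g. normalized rgb."""
    N, p = X.shape
    rng = np.random.default_rng(seed)
    z = rng.integers(0, K, size=N)                 # crude initial hard assignment
    pi = np.array([(z == j).mean() for j in range(K)])
    m, n = np.empty((K, p)), np.empty((K, p))
    for j in range(K):
        Xj = X[z == j]
        m[j], n[j] = moments_to_params(Xj.mean(axis=0), (Xj ** 2).mean(axis=0))
    prev = -np.inf
    for _ in range(n_iter):
        logf = np.stack([beta_dist.logpdf(X, m[j], n[j]).sum(axis=1)
                         for j in range(K)], axis=1)
        logw = np.log(pi) + logf                   # E-step, Eq. (5)
        q = np.exp(logw - logsumexp(logw, axis=1, keepdims=True))
        Nj = q.sum(axis=0)                         # Eq. (8)
        for j in range(K):                         # M-step, Eqs. (6)-(7), then (4)
            mu = (q[:, j:j + 1] * X).sum(axis=0) / Nj[j]
            u = (q[:, j:j + 1] * X ** 2).sum(axis=0) / Nj[j]
            m[j], n[j] = moments_to_params(mu, u)
        pi = np.maximum(Nj / N, 1e-12)             # guard against empty components
        ll = logsumexp(logw, axis=1).sum()         # mixture log-likelihood
        if abs(ll - prev) < tol:
            break
        prev = ll
    return pi, m, n, q

def bic_beta_mixture(X, K):
    """BIC(K) = -2 sum_i log g(x_i|Theta) + ((2p+1)K - 1) ln N."""
    pi, m, n, q = em_beta_mixture(X, K)
    N, p = X.shape
    logf = np.stack([beta_dist.logpdf(X, m[j], n[j]).sum(axis=1)
                     for j in range(K)], axis=1)
    ll = logsumexp(np.log(pi) + logf, axis=1).sum()
    return -2.0 * ll + ((2 * p + 1) * K - 1) * np.log(N)
```

MAP color clustering (Sec. 4) is then simply labels = q.argmax(axis=1), and the optimal K is taken at the first local minimum of bic_beta_mixture(X, K) for K = 2, 3, ....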

The EM algorithm is employed on the (r,g,b) values for several values of K and the corresponding BIC(K) values are computed. We start with K = 2 and then successively increase it by 1. The BIC(K) value first decreases and then starts increasing again.

The value of K at the first local minimum of BIC(K) gives the optimal value of K, i.e., the optimal number of colors present in the input image. After finding the optimum number of clusters and estimating all the parameters of the trivariate Beta mixture distribution, we segment the image into K clusters or colors. We use the maximum a posteriori (MAP) probabilities for this purpose: a pixel x_i ∈ C_j, where C_j is the jth cluster, if and only if q(j|x_i, \Theta_j) ≥ q(k|x_i, \Theta_k) for all k ≠ j. In this way we perform color clustering on the input image. We make the following observations on the clusters obtained for a color image containing text:

(1) Normally the text parts make up one single cluster.
(2) However, the text parts may belong to several clusters. This happens when the texts in the image have more than one color.
(3) A single cluster may have both text and non-text parts. This happens when some text region and some non-text region in the image have the same color.

Each cluster is now separately studied. The connected components are obtained from each cluster, and sufficiently small components are removed. We take the degree of elongatedness of a connected component into account to distinguish between text and non-text regions of the image, since text-like patterns are usually elongated. A measure of elongatedness of a component is defined as:

Elongatedness ratio (ER) = No. of boundary points / Total no. of points.

Empirically it has been found that a component with ER value greater than 0.7 is a text part. Otherwise, it is a non-text part.
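A minimal sketch of this test on binary component masks follows; the 4-connected definition of a boundary pixel, the use of scipy.ndimage.label to obtain components, and the small-component threshold min_size are our assumptions, as the paper does not specify them.

```python
# Hedged sketch: elongatedness ratio of connected components in a cluster mask.
import numpy as np
from scipy import ndimage

def elongatedness_ratio(comp: np.ndarray) -> float:
    """ER = boundary pixels / total pixels for a 2-D boolean component mask."""
    padded = np.pad(comp, 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])   # all 4-neighbours inside
    return float((comp & ~interior).sum()) / float(comp.sum())

def text_components(cluster_mask: np.ndarray, min_size: int = 30,
                    thresh: float = 0.7):
    """Return labels of the components classified as text (ER > thresh)."""
    labels, n = ndimage.label(cluster_mask)
    keep = []
    for c in range(1, n + 1):
        comp = labels == c
        if comp.sum() >= min_size and elongatedness_ratio(comp) > thresh:
            keep.append(c)
    return keep
```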

5. Results and Discussions

The Beta mixture model based color segmentation scheme proposed above has been tested on several images, and the results obtained are satisfactory. In many cases, the results obtained with the Beta mixture model are nearly the same as those obtained with the Gaussian mixture model. However, in some cases the former gives visually better results than the latter. The (r,g,b) values of an image need not follow Gaussian distributions and are free to form arbitrary multimodal skewed distributions. Figure 1 below shows the normal probability plot of the (r,g,b) values of a sample image.


Fig. 1. The (r,g,b) normal probability plots for the image in Fig. 2.

The plot clearly indicates that no component of (r,g,b) follows a Gaussian distribution. For such images, in which the spectral distribution is not symmetric for some color clusters, the Beta mixture performs better than the Gaussian mixture. This is because the former has a better capability to approximate asymmetric distributions than the latter, and hence produces better estimates of an arbitrary mixture distribution.

Experimental results are shown in the following figures. Figure 2 displays an original image. The images obtained after color segmentation with the Beta and Gaussian mixture models (with K = 4) are shown in Fig. 3 and Fig. 4 respectively.

Fig. 2. Original Image.


Fig. 3. Segmentation result from trivariate Beta mixture model.

Fig. 4. Segmentation result from trivariate Gaussian mixture model.

Using BIC, the optimal value of K is found to be 4 for the Beta mixture model. In Fig. 5 and Fig. 6, two sample clusters (indicated by black pixels) generated from Fig. 3 are separately displayed. The text regions of the original image belong to the cluster shown in Fig. 6.

Fig. 5. A sample cluster from the segmented image in Fig. 3.

"SR^^ent Fig. 6. Another clusters from the segmented image in Fig. 3.

The cluster containing text regions in Fig. 4 is separately shown in Fig. 7. This cluster corresponds to the cluster in Fig. 6.

Fig. 7. A cluster from Fig. 4.

Table 1 shows the values of the elongatedness ratio for the clusters in Fig. 5 and Fig. 6 respectively. In Fig. 5 only one component is large enough for computation of ER. On the other hand, we find four such large components in Fig. 6; starting from the leftmost, we denote these components by 'A', 'B', 'C' and 'D' respectively. We can verify from Table 1 that for the cluster containing text, the elongatedness ratio is high for each component.

Table 1. Results of cluster analysis for elongation computation.

    Component                         ER value
    Single component (from Fig. 5)    0.120332
    Component 'A'                     0.941003
    Component 'B'                     0.955882
    Component 'C'                     0.854772
    Component 'D'                     0.952381

The following figure shows the four connected components from Fig. 6. However, when the cluster in Fig. 7 is subjected to the elongatedness computation, very few text-like components survive, and these carry no vital information about the original text. Visually comparing Fig. 6 and Fig. 8 with Fig. 7, we can conclude that the trivariate Beta mixture model gives better results for both color clustering and text extraction.

Fig. 8. Connected components of Fig. 6.

One more example is shown below for the sake of a clear comparison between the Beta and Gaussian mixture models (BMM and GMM respectively).

Fig. 9. Original Image.

Fig. 10. Segmentation result with BMM (K=3).


Fig. 11. Text extracted from Fig. 10.

Fig. 12. Segmentation result with GMM (K=3).

Fig. 13. Text extracted from Fig. 12.

References

1. L. Lucchese and S. Mitra, Color image segmentation: A state-of-the-art survey, in Proc. of the Indian National Science Academy, 67(A)-2 (2001).
2. K. C. Fan, L. S. Wang and Y. K. Wang, Signal Processing 45, 329 (1995).
3. R. Lienhart, Indexing and retrieval of digital video sequences based on automatic text recognition, in Proc. of the ACM International Multimedia Conference and Exhibition, 2001.
4. V. Wu, R. Manmatha and E. M. Riseman, IEEE Trans. on Pattern Analysis and Machine Intelligence 21, 1224 (1999).
5. C. Tsai and H. Lee, IEEE Trans. on Image Processing 11, 434 (2002).
6. R. Lienhart and F. Stuber, Automatic text recognition in digital videos, in Proc. of the SPIE (Image and Video Processing IV 2666), 1996.
7. K. Sobottka, H. Bunke and H. Kronenberg, Identification of text on colored book and journal covers, in Proc. of the International Conference on Document Analysis and Recognition, 1999.
8. C. Carson, S. Belongie, H. Greenspan and J. Malik, IEEE Trans. on Pattern Analysis and Machine Intelligence 24, 1026 (2002).
9. A. P. Dempster, N. M. Laird and D. B. Rubin, Journal of the Royal Statistical Society B 39 (1977).
10. G. Schwarz, Annals of Statistics 6, 461 (1978).


A Canonical Shape-Representation for a Polygon

Sukhamay Kundu

Computer Science Dept, Louisiana State University, Baton Rouge, LA 70803, USA, Email: [email protected]

We give an efficient O(n log n) algorithm for determining a canonical shape-representation of a polygon of n nodes which is invariant under rotation, translation, and scaling. The canonical representation can be used for matching image objects in retrieval operations and in image conflation.

Keywords: canonical representation, image conflation

1. Introduction

Shape representation plays a critical role in many image analysis tasks, particularly in image matching and retrieval [1-3, 5-6]. Objects in an image are typically represented by a polygonal curve (for roads and coast lines) or by a (closed) polygon. A polygon P can be represented using a combination of distances of its nodes from a suitably chosen origin, lengths of its edges, angles between successive edges, angles at the origin between successive radial lines from the origin to the nodes, etc. (The origin itself can be either the centroid of the nodes of P or the centroid of the edges of P, making it independent of the coordinate system. The latter is more robust than the former, and it equals the weighted average of the mid-points of the edges, with the weight for an edge being its length.) Such a representation is preferred when the actual positions of the nodes of P are not important, as is the case for most applications in image processing. In some special applications, such as the determination of the best conflation transformation T for two polygonal curves [4], the simple representation giving the xy-coordinates of the successive nodes of the polygonal curve, starting with one end-node of the curve, suffices. The same method cannot easily be used, however, for conflating two polygons because a polygon does not have a specific starting node.

Our main contribution here is a canonical representation of a polygon P which is invariant under rotation, translation, and scaling and which can be computed in O(n log n) time for a polygon with n nodes. We can use this representation to match two n-node polygons in time O(n). A canonical representation of P distinguishes certain nodes of P as the anchor-nodes (see Section 3), which have the following property: if p_i and p_{i+k} are two anchor-nodes, then P looks geometrically identical when viewed from p_i and from p_{i+k}. That is, for each q = 1, 2, ..., n, the qth edge of P starting at p_i and going in the anti-clockwise direction (say) has the same length as the qth edge starting at p_{i+k} and, moreover, the internal angle at p_{i+q} is the same as that at p_{i+k+q}. See Fig. 1.

Fig. 1. Illustration of anchor nodes: (i) a polygon with two anchor nodes p_1 and p_3; (ii) moving p_2 slightly towards p_1 makes p_1 the unique anchor-node; (iii) rotating (i) by 90 degrees does not change the anchor nodes.

2. Properties of a Shape-Representation

The simplest representation of a polygon P, as a list of the xy-coordinates of the successive nodes of P in the anti-clockwise direction starting at some node of P, does not have property (R1) below because a representation must be independent of the coordinate system in order for (R1) to hold. Properties (R1)-(R3) are needed for most applications. Although property (R3) is less critical than (R1)-(R2), we feel that the notion of an anchor-node should be defined in such a way that one can identify an anchor-node easily for simple cases without elaborate computations.

(R1) Two polygons P and Q have the same representation if and only if they are equivalent (denoted by P ≈ Q), i.e., one can be obtained from the other by a combination of rotation, translation, and scaling.
(R2) A small change in the relative position of a node makes only a small change in the representation.
(R3) For simple cases, it should be possible to determine the representation (i.e., an anchor node) easily by visual inspection.

We can obtain an initial representation which satisfies (R1)-(R3), except for the invariance under scaling,


as follows. Choose an arbitrary node of P as p_1 and let p_2, p_3, ..., p_n (n ≥ 3) denote the successive nodes of P in the anti-clockwise direction starting at p_1. Let L_i be the line segment from p_i to p_{i+1} (p_{n+1} being the same as p_1), d_i = length of L_i = |L_i|, and θ_i = the interior angle at p_i. Moreover, we define the ordered pair δ_i = (d_i, θ_i) corresponding to p_i. It is clear that the sequence (δ_1, δ_2, ..., δ_n) uniquely determines P, except for a cyclic rotation. For example, if we relabel the node p_2 as p_1, p_3 as p_2, etc., then the new sequence (δ_2, δ_3, ..., δ_n, δ_1) is simply a cyclic rotation of the previous one. A simple way to make the sequence of δ_i's scale invariant is to divide d_i in each δ_i by d = d_1 + d_2 + ... + d_n, which essentially makes the perimeter of P of length 1. Henceforth, we assume this to be the case, i.e., d = 1.

3. The Canonical Representation

Given a polygon P = (δ_1, δ_2, ..., δ_n) as a sequence of successive δ_i's starting at some node of P, we write P^(i) = (δ_i, δ_{i+1}, ..., δ_n, δ_1, δ_2, ..., δ_{i-1}), which is just a cyclic rotation of P = P^(1) representing the same polygon P. This simply means we regard the list of δ_i's as a circular list, with δ_1 following δ_n. To match two n-node polygons P and Q using this representation, we can compare P^(i) and Q term by term for each i; this requires a total of O(n²) time. To reduce the computation time, we choose among the cyclic rotations P^(i), 1 ≤ i ≤ n, of P a unique canonical one in some way and denote it by P_c. It has the property that two n-node polygons P and Q are equivalent, i.e., P ≈ Q, if and only if P_c = Q_c. We will show that P_c can be determined in time O(n log n). This reduces the computation time for determining P ≈ Q to O(n log n) from O(n²).

Definition 1. The canonical representation P_c of P = (δ_1, δ_2, ..., δ_n) is the unique rotation P^(i) which is the minimum in the lexicographic ordering among the n rotations. (Here, the ordering δ_i = (d_i, θ_i) < (d_j, θ_j) = δ_j is also taken to be lexicographic.)

Definition 2. The start-node p_i of P corresponding to P_c = P^(i) is called an anchor-node. A polygon P may have more than one anchor-node, and for any two anchor-nodes p_i and p_j we have P^(i) = P_c = P^(j).

Example 1. The rectangle in Fig. 1(i) has two anchor-nodes p_1 and p_3. Here, each internal angle θ_i = 90° and thus only the lengths d_i of the edges matter. In Fig. 1(ii), which is obtained by moving p_2 in Fig. 1(i) slightly towards p_1 (making the edge p_1p_2 the unique shortest edge), the only anchor-node is p_1. Note that had we defined δ_i = (θ_i, d_i), keeping the lexicographic ordering δ_i < δ_j for the new δ_i's, we could make the unique anchor-node in Fig. 1(ii) any one of {p_1, p_2, p_3} depending on how p_2 is moved. This is one good reason why we chose δ_i = (d_i, θ_i), in keeping with property (R2). Another reason for choosing δ_i = (d_i, θ_i) instead of δ_i = (θ_i, d_i) is that the distance d_i is more easily visualized than the angle θ_i (which requires the consideration of three nodes compared to two for the distance). This makes it easier to determine visually the ordering of the δ_i's and hence easier to determine the anchor-node(s).

Given a random polygon P, the probability that two of its edges have the same length, i.e., d_i = d_j for i ≠ j, is zero, and the same is true for θ_i = θ_j. Thus, with probability one, there will be a unique anchor-node. Moreover, if all d_i are distinct and likewise all θ_i are distinct, then a small change in d_i or θ_i will not affect the ordering δ_i < δ_j and hence the selection of the anchor-node for the canonical form. This, in turn, means that if the relative position of a node p_i changes slightly, then with probability one the anchor node will not change, and thus the only change in the representation will be in the values of δ_{i-1}, δ_i, and δ_{i+1}. This shows Def. 1 satisfies (R2); that it also satisfies properties (R1) and (R3) is easily verified.

Although the probability of two or more δ_i's being identical is theoretically zero, in practice the finite resolution in measurements can cause one or more δ_i's to appear more than once in the δ-sequence of a P. It is therefore important that an algorithm for computing the anchor node(s) works correctly even when not all δ_i's are distinct. Our algorithm CANONICAL given in Section 4 satisfies this property.

It is worth pointing out that, except for some minor problems, we can replace the distance-angle pair δ_i = (d_i, θ_i) throughout the paper by the distance-distance pair δ'_i = (d_i, d'_i), where d'_i is the distance of node p_i from the centroid of the nodes (or edges) of P. There are two reasons why we prefer δ_i to δ'_i. First, because d'_i involves the notion of a centroid (which is a global concept), it is easier to visualize δ_i than δ'_i (see property (R3)). Second, a small change in the relative position of a node p_i of P can affect the centroid and hence several d'_i; in contrast, as we noted earlier, it can affect only δ_{i-1}, δ_i, and δ_{i+1}. Note that (d_i, d'_i) is preferred to (d'_i, d_i) for reasons similar to why we prefer (d_i, θ_i) to (θ_i, d_i).

Since the unique rotation P_c of P depends only on the relative ordering δ_i < δ_j of the distance-angle pairs in P, we can simply use the consecutive integers 1, 2, ... to represent the distinct δ_i's in P once we sort them in increasing order and determine the rank of each δ_i.
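For instance, a minimal sketch of this ranking step (our own illustration, not the author's code):

```python
# Replace each delta pair by its rank among the distinct pairs; only the
# relative (lexicographic) order matters for computing P_c.
def ranks(deltas):
    order = {d: r + 1 for r, d in enumerate(sorted(set(deltas)))}
    return [order[d] for d in deltas]

# e.g. ranks([(0.3, 90.0), (0.2, 90.0), (0.3, 90.0), (0.2, 90.0)]) == [2, 1, 2, 1]
```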

3.1. Valid δ-sequences

Let S(n, k) be a sequence of length n ≥ k ≥ 1 which includes at least one occurrence of each of the items K = {1, 2, ..., k} and which is valid in the sense that it corresponds to a polygon P with n nodes as indicated above when we replace each δ_i by its rank after sorting all the distinct δ_i's in increasing order. Although a complete characterization of the sequences S(n, k) seems difficult, the result below shows that the pattern in a valid S(n, k) can be arbitrarily complex.

Let Π = (j_1, j_2, ..., j_m), m ≥ k ≥ 2, be any sequence which includes each item of K at least once; in particular, for m = k, Π is simply a permutation of K. We show that S_Π(2m, k) = (j_1, ..., j_m, j_1, ..., j_m) = Π.Π, formed by repeating the sequence Π twice, is valid by constructing a corresponding (convex) polygon P with 2m nodes and each θ_i = θ = π(m-1)/m. First, we lay the edge p_1p_2 of length d_1 = j_1 arbitrarily, and then we successively lay the edges p_ip_{i+1} with internal angle θ at p_i and length d_i = j_i for 2 ≤ i ≤ m and d_i = d_{i-m} for m+1 ≤ i ≤ 2m. The configuration of p_1 to p_{m+1} is identical with that of p_{m+1} to p_{2m+1}, with p_{2m+1} = p_1. See Fig. 2 for Π = (3, 2, 1, 2, 1), with m = 5 and k = 3. We can normalize the perimeter length of P to 1 by a simple scaling. Since δ_i = (d_i, θ), P corresponds to S_Π(2m, k). The above method can be used even when the pattern Π is repeated q ≥ 2 times, which would give θ = (qm-2)π/(qm). If we account for finite resolution (error) in the measurements of d_i and θ_i, we get many more valid S(n, k) than in the ideal case of zero error.

Figure 2. Illustration of the construction of a polygon P = Π.Π, where Π = (3, 2, 1, 2, 1).

The situation is similar when we use the pairs δ'_i = (d_i, d'_i) instead of δ_i = (d_i, θ_i). Once again, we show that the sequence S_Π(2m, k) = Π.Π is valid by constructing a corresponding convex polygon P. This time all the nodes p_i, 1 ≤ i ≤ 2m, lie on a circle C of radius ρ. Let r_i denote the radial line from the center of C to the node p_i. Having positioned p_1 arbitrarily on C, we successively determine the positions of p_2, p_3, ..., p_{m+1} by choosing the angle from the radial line r_i to the radial line r_{i+1} to be θ_i = j_i·π/σ, where σ = j_1 + j_2 + ... + j_m. Note that p_{m+1} is diametrically opposite to p_1 on C. To complete P, we simply repeat the placement pattern of p_1 to p_{m+1} on the other half of C starting from p_{m+1}. The symmetrical placement of the second set of m points (with p_{2m+1} coinciding with p_1) around C means that the center of C is the centroid of the points of P. This, in turn, means each d'_i = ρ, and the relative ordering of the lengths d_i = 2ρ·sin(θ_i/2) (equivalently, the ordering of the δ'_i) is the same as that in S_Π(2m, k). This shows S_Π(2m, k) is valid. We can normalize the perimeter length of P to 1 by scaling the radius ρ of C. A similar construction applies if the pattern Π is repeated q ≥ 2 times.

Example 2. We now give the anchor-nodes for a few complex patterns P made out of {1, 2, ..., k} for some k. We ignore for the moment whether or not P is valid and instead focus on some simple observations which help us in the efficient determination of the anchor-nodes. For a regular polygon, we have P = (1, 1, ..., 1) = P_c, with every node an anchor-node. At the other extreme, P = Π.Π, where Π is a permutation, has exactly two anchor-nodes; for example, P = (1, 3, 2, 1, 3, 2) has the anchor-nodes p_1 and p_4. If P = (1, 1, 3, 3, 3, 3, 2, 1, 1, 1, 4), then P_c = P^(8) = (1, 1, 1, 4, 1, 1, 3, 3, 3, 3, 2), where the maximal repeating group "1, 1, 1" of the smallest term "1" forms the starting items in P_c. The same property holds if we replace the smallest item "1" by anything which is itself in a canonical form, such as "1, 5, 2" or "1, 5, 2, 2" or "1, 1, 5", but not "1, 5, 1".

4. Computation of P_c

Our basic approach is to successively obtain a reduced index set I^(k+1) ⊆ I^(k), k ≥ 1, for the candidate indices i corresponding to the anchor-nodes p_i, starting with I^(1) = {i : δ_i = 1}. The process is continued until we get an I^(k) with a single item. There are two reduction methods for deriving I^(k+1) from I^(k).

First method. For a given q ≥ 1, we consider the qth neighbors δ_{i+q} of δ_i, i ∈ I^(k), and if δ_{i+q} < δ_{j+q} for some i, j ∈ I^(k), then we eliminate j from I^(k). We repeat this for q = 1, 2, ... as long as i + q < i' for each pair of successive items i and i' in I^(k) and |I^(k)| > 1. Note that deletion of items from I^(k) might create opportunities for using larger values of q, and this in turn may help the deletion of more items. This is illustrated in Table 1, where the current positions in I^(1) are marked ↑ and in each row the q-neighbors that are compared are marked •.

Table 1. Illustration of the first reduction method; initially, I^(1) = {1, 4, 6, 8, 11, 14}.

    P' = (1 3 2 1 4 1 5 1 3 2 1 5 4 1 3 3 5)

The comparisons for q = 1 eliminate 4, 6, and 11, and


those for q = 2 eliminate 14; finally, the comparisons for q = 4 eliminate 8, giving I^(1) = {1} of size one, and we stop. The case P' = (1, 3, 2, 1, 2, 1, 5, 1, 3, 2, 1, 2, 1, 5, 4, 3, 5) shows another interesting behavior, reducing I^(1) = {1, 4, 6, 8, 11, 13} to {4} after q = 4. As another example, let P = (1, 5, 2, 1, 5, 2, 3, 3, 3, 3, 2, 1, 5, 2, 1, 5, 2, 1, 5, 2, 4). Here, I^(1) = {1, 4, 12, 15, 18} and we successively compare the neighbors for q = 1, 2, 3 without getting any reduction; since we cannot use q = 4, we stop. If we modify δ_2 in P to be any of 2, 3 or 4, then we start with the same I^(1) = {1, 4, 12, 15, 18} as before, and q = 1 reduces it to I^(1) = {1}.

Lemma 1. The total number of look-ups of items in P in the reduction of I^(k) by the first method, counting multiple look-ups of items, is O(n).

Proof. As can be seen from Table 1, for an item in P to be looked at a second time as a q-neighbor of i ∈ I^(k), we must have an i' ∈ I^(k), i < i' < i + q, which was eliminated earlier for some q' < q. Let L be the minimum distance between two successive items in I^(k) before we apply the method; that is, if I^(k) = {i_1, i_2, ..., i_m}, where i_t < i_{t+1} for t < m, then L = min {i_2 - i_1, i_3 - i_2, ..., i_m - i_{m-1}, n + i_1 - i_m}. Note that L ≥ 2 for the first method to be applicable. It follows that for an item in P to be seen an rth time as the q-neighbor of some i ∈ I^(k), we must have q ≥ L^{r-1}. This means the total number of look-ups is no more than n(1 + 1/L + 1/L² + ...) = nL/(L-1) = O(n). □

Second method. We apply this method if |I^(k)| ≥ 2 after reducing I^(k) by the first method; it uses the notion of repeating groups as indicated in Example 2. For the large P shown above, the pattern "1, 5, 2", i.e., "δ_1δ_5δ_2", repeats twice starting at position 1 ∈ I^(1) and repeats three times starting at position 12 ∈ I^(1). The corresponding groups of indices in I^(1) are {1, 4} and {12, 15, 18}. Here, the successive items 4, 12 ∈ I^(1) are not in the same group because their distance is 12 - 4 = 8 > 3 = L^(1) = the minimum distance between successive items in I^(1) after it is reduced by the first method. Likewise, the successive items 18 and 1 (going around the circular list P) are in different groups because their distance 4 > L^(1). On the other hand, the items in the same group have at least one neighbor at distance exactly L^(1). We now eliminate all but the starting position i ∈ I^(k) of the maximum-size repeating groups; the reduced set is called I^(k+1). For the above P, this gives I^(2) = {12}, and since it has only one index, we stop. If we slightly modify P by replacing the rightmost 1 = δ_18 with δ_18 = 4, then we would have I^(1) = {1, 4, 12, 15}, which would not be reduced by the first method but would be reduced by the second method, giving I^(2) = {1, 12} because of the groups {1, 4} and {12, 15}. Next, we would reduce I^(2) by the first method to I^(2) = {1} and we would stop.

Lemma 2. |I^(k+1)| ≤ |I^(k)|/2.

Proof. Since each repeating group, if any, has at least two indices, the number of maximum-size repeating groups in I^(k), after I^(k) has been reduced by the first method, is at most half the original size of I^(k) before the reduction by the first method. Since we keep only the start position of each maximum-size repeating group to form I^(k+1), we have |I^(k+1)| ≤ |I^(k)|/2. □

before the reduction by the fi rst method. Since we only keep the start position of the maximum size repeating groups to form 7(*+1), we have |/(A+1)| < |/ ( i ) | /2. •

Algorithm CANONICAL:

Input: A vector P = (Siy S2, —, Sn) of distance-angle pairs Sj = (dh 0,).

Output: The unique canonical form Pc of P.

1. [Initialization.] Let S = min {<?,: 1 < i < n] and let 7(1> = {/: Si = S). If |7(1)| = 1, then stop; else let k = 1.

2. [Compute 7(*+l) from 7(t), k > 0.]

(a) Reduce 7(t) by the first reduction method described above. If |7 ( t ) | = 1, stop.

(b) [There are repeating groups.] Reduce 7W by the second method and call the reduced set 7(*+1)andlet/t = * + l.

Example 3. Table 2 shows the computations of CANONICAL for a fairly large and complex hypothetical pattern P. Note that as we form repeated groups to form 7(2) after reducing 7(1) by the first method and we proceed to apply the fi rst method to 7<2) we can start considering ^-neighbors of i e 7(2) from q = m(1)L(1) instead of q = 1, where L(I) = length of the repeating group-units in 7(1) after reduction by the first method and m(1) = #(group-units in the maximum group). Here, L(1) = 3 corresponding to the group-unit "1 , 6, 2" and mm = 2, and thus in applying the fi rst method to 7(2) we start with q = 6. A similar remark holds for 7( t), k > 2.

5. Complexity of CANONICAL First, we have 0(n. log n) computation for the ini

tial sorting of St in 7* and assigning them the ranks 1, 2, —. By Lemma 2, step (2) is applied 0(log n). By Lemma 1 each application of step 2(a) takes O(n) time and each application of step 2(b) clearly takes O(n) time. This gives the total computation time of the algorithm 0(n. log n).

6. Conclusion We have presented here a canonical shape-repre

sentation of a polygon that can be computed in 0(n. log n) time for a polygon P with n nodes and which is suitable for matching image objects in image retrieval and for the confktion of two polygons.

Page 350: 01.AdvancesinPatternRecognition

Sukhamay Kundu 331

Table 2. Illustration of algorithm CANONICAL; t shows the positions i in current / ( t )

and • shows the various ^-neighbors compared in reducing I(k) by the fi rst method.

p

Initial /(1)

Apply fi rst method: q=\

Updated /(1)

q = 2 Final /(1)

Apply second method: 7(2)

Apply fi rst method: q = 6

Final /(2)

Apply Second method: /(3)

1 6 2 1 6 2 5 3 1 6 2 1 6 2 5 1 7 1 7 3 2 2 1 6 2 1 6 2 5

TT TT t t I T • • • • • • • •

• • • • • •

Three maximal repeated groups, each with 2 repetitions of "1, 6, 2". T T T

• • •

T t T One maximal repeated group, with two repetitions of "1, 6, 2, 1, 6, 2, 5".

T

7. References

1. E. M. Arkin, L. P. Chew and D. P. Huttenlocher, An efficiently computable metric for comparing polygonal shapes, IEEE Trans. PAMI, 13 (1991), pp. 209-216.
2. T. Bernier and J. A. Landry, A new method for representing and matching shapes of natural objects, Pattern Recognition, 36 (2003), pp. 1711-1723.
3. J. M. Chen and J. A. Ventura, Optimization models for shape matching of non-convex polygons, Pattern Recognition, 28 (1995), pp. 1053-1063.
4. S. Kundu, Conflating two polygonal lines, Pattern Recognition, 39 (2006), pp. 363-372.
5. H. Nishida, Matching and recognition of deformed closed contours based on structural transformations, Pattern Recognition, 31 (1998), pp. 1557-1571.
6. I. Schreiber and M. Ben-Bassat, Polygonal object recognition, in Proc. 10th Int. Conf. on Pattern Recognition, 1990, pp. 852-859.


A Framework for Fusion of 3D Appearance and 2D Shape Cues for Generic Object Recognition

Manisha Kalra and Sukhendu Das

VP Lab, Dept. of Computer Science & Engineering, Indian Institute of Technology Madras

Chennai - 600036. E-mail: {mkalra, sdas}@cse.iitm.ernet.in

This paper addresses the problem of pose invariant generic object recognition by proposing a novel framework which fuses information obtained from 3D appearance and 2D shape cues. The 3D appearance model of the object is captured using linear subspace analysis techniques (2D PCA and ICA) and is used to reduce the search space to a few rank-ordered candidates. For matching based on 2D shape features, we propose the use of distance transform based correlation. We have explored various measures for the fusion of 3D appearance and 2D shape cues. Fusion of the two linear subspace classifiers, 2D PCA and ICA, gives the best set of rank-ordered samples, and a combined decision of distance transform based shape matching and the subspace classifiers then selects the object class and pose for recognition. Experiments were conducted on the COIL-100 database, which contains a wide variety of objects from different classes with complex appearance and shape characteristics. The proposed technique is shown to outperform the other existing techniques in the literature. Experimentation also shows that the proposed method is more robust to noise.

Keywords: Generic Object Recognition, Linear Subspace Analysis, Appearance, Manifolds, Shape, Distance Transform, Classifier Fusion

1. Introduction

The existing object recognition techniques focus on recognition of a particular class of objects where the exact features to be used for recognition are known, for example, principal components for faces, minutiae features for fingerprints, shape features for handwritten characters. However, the focus of generic object recognition is on domains where the exact features distinguishing one class of objects from others are unknown. Thus, the problem is not restricted to a single class of objects, say only face recognition or vehicle recognition. Rather, it involves recognition across multiple categories of objects. Content based image retrieval, infant perception and recognition are potential areas of its application. The various approaches for object recognition can be grouped into structural decomposition based approaches,1 appearance based approaches,2-4 shape based approaches,5,6 and model based approaches7

based on the type of features and matching strategies used. Appearance based techniques are reported as the most robust techniques for object recognition. Murase and Nayar2 have addressed the problem of automatically learning object appearance models for recognition and pose estimation using 1D PCA. From the set of 100 objects in the COIL database, the authors picked 20 objects (the COIL-20 database) that do not possess pose ambiguities and reported a recognition rate of 100% on these 20 objects. They also report good recognition accuracy on the COIL-100 database. Nagabhushan et al.3 have experimented with the use of 2D Principal Component Analysis (2D PCA) on the COIL-20 database for object recognition and have reported that 2D PCA gives better recognition accuracy than 1D PCA. They also report a 100% recognition rate on the 20 object database for noise-free test samples. They did not report results on the COIL-100 database. However, the existing appearance based techniques (1D and 2D PCA) summarized above do not use any shape cues for verifying and improving recognition accuracy.

2. Proposed Framework

We propose a two stage framework based on studies in cognitive psychology, where we try to model the 'human visual pathway' from low-level processing like feature extraction (using appearance based cues) to high-level object representation in the human brain, such as perception (using shape cues) and recognition. The flowchart of the overall framework for generic object recognition is shown in Fig. 1. The entire framework can be logically divided into three phases: (a) Memory, (b) Appearance Based Representation and Classification, (c) Shape Perception. Below, we describe the three phases.


[Fig. 1 flowchart: the test image is compared against the image gallery by appearance (appearance based distances) to select a set of rank-ordered samples; shape features of these samples are extracted and matched (shape based distances); the two distances are summed and the sample with the minimum distance is reported as the recognized object and pose.]

Fig. 1. Flowchart for Generic Object Recognition framework using a combination of appearance (Linear Subspace Analysis) and Shape (Distance Transform based matching) cues.

2.1. Memory : 2D Image Gallery

The image gallery contains the 2D views of objects from various poses. These views represent the objects already seen by human beings. This aspect of the framework models the recollection ability of human beings to retrieve exemplars from memory on seeing an object, having studied its appearance from different poses.8 A subset of this database is used to train the system for generic object recognition; the rest of the samples are used for testing.

2.2. Appearance Based Representation and Classification

The human brain extracts a set of statistical features or cues from images of 3-D objects to represent or recognize them. Based on this hypothesis, we propose a method which uses second and higher order statistics9 to represent objects in a high dimensional space. We use a combination of two linear subspace analysis based classifiers, 2D PCA3 and ICA,10 for this task. In general, any linear subspace technique (1D PCA, 2D PCA, ICA) can be used for search space reduction using appearance cues. Both 2D PCA and ICA are shown to capture the appearance manifold2 of the object. Also, since 2D PCA and ICA are found to provide complementary information for the recognition task, we propose the following measure

for appearance-based object recognition:

Dcomb = avg(D2DPCA, DICA)    (1)

where D2DPCA and DICA are the Euclidean distances between the test and training features in the 2D PCA eigenspace and the ICA space respectively. Since each classifier uses its own representation of input patterns, the matching scores (D2DPCA and DICA) are normalized using max normalization11 before computing Dcomb. This can be considered equivalent to the sum rule of the fusion strategy proposed by Kittler.11
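As an illustration, here is a minimal sketch of this fusion step in Python, assuming the per-classifier Euclidean distances to all gallery samples have already been computed; the function names and the exact max-normalization convention are ours, not the paper's:

```python
import numpy as np

def max_normalize(d):
    # Scale a vector of matching distances by its maximum so that both
    # classifiers' scores fall into a comparable [0, 1] range.
    return d / np.max(d)

def d_comb(d_2dpca, d_ica):
    # Eq. (1): average of the max-normalized 2D PCA and ICA distances.
    return 0.5 * (max_normalize(d_2dpca) + max_normalize(d_ica))

# Illustrative distances of one test image to four gallery samples.
d_2dpca = np.array([0.9, 2.5, 1.1, 4.0])  # distances in 2D PCA eigenspace
d_ica = np.array([1.2, 3.0, 0.8, 3.5])    # distances in ICA space
print(d_comb(d_2dpca, d_ica))             # smaller value = closer appearance
```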

2.3. Shape Perception

Since psychological findings indicate that shape dominates other cues in human object recognition, we incorporate a shape perception stage in our framework. Distance Transform (DT) based features have been preferred over moments6 and shape context5 due to reduced computational cost and robustness against noise and discontinuities in edge maps.12 Let b be a bitmap with feature pixels of value 1 on background pixels of value 0. Consider a second bitmap b' and let d be the DT of b'. The cross-correlation between the distance transform d and the bitmap b is given by R(b, d).a If two samples (test and target

a R(b, d) = (Σx,y bx,y dx,y) / (Σx,y bx,y), i.e., the average DT value over the feature pixels of b.


shapes) are similar, we obtain a small value of correlation, indicating a higher degree of match.13 Instead of using R(b, d) alone as the distance measure between two bitmaps b and b', we use the average of the cross correlation of the DT of b with bitmap b'12 and that of the DT of b' with the bitmap b. Thus the distance measure for choosing the best sample based on the shape of the object is given as

DTcorr = avg(R(T, D(bi)), R(bi, D(T)))    (2)

where bi is the edge map of the i-th training sample, T is the edge map of the test image and D(.) is the distance transform function. This criterion works better than just using R as the shape similarity measure, as it is unbiased towards T or bi.
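The following sketch realizes Eq. (2) with SciPy's Euclidean distance transform; the helper names are ours, and the DT convention (distance of every pixel to its nearest edge pixel) is one common reading consistent with the description above:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def dt(edge_map):
    # Distance transform: each pixel's distance to the nearest edge pixel
    # (edge pixels have value 1 in the bitmap, so they get distance 0).
    return distance_transform_edt(edge_map == 0)

def R(b, d):
    # Average DT value over the feature pixels of b (footnoted formula):
    # small values indicate a good match between the two edge maps.
    return (b * d).sum() / b.sum()

def dt_corr(T, b_i):
    # Eq. (2): symmetric, unbiased DT-based shape distance.
    return 0.5 * (R(T, dt(b_i)) + R(b_i, dt(T)))
```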

2.4. Fusion of 3D Appearance and 2D Shape Cues

For each object to be stored in the database, a large set of images from different poses of the object is obtained. The set of images is normalized with respect to scale and projected into the universal linear subspace constructed using 2D PCA and ICA from the set of all object images. Each object is then defined by a manifold in the universal linear subspace. The manifold captures the appearance representation of the object, keeping the images of consecutive poses, which are visually similar in appearance, close to each other. Given a test image, it is first projected onto the universal linear subspaces (constructed separately using 2D PCA and ICA) and a few rank-ordered samples closest in appearance to the test sample are selected based on Dcomb (Eq. 1). Shape matching is then performed using distance transform based correlation. The object with the minimum value of a fusion of (a) the appearance based ICA and 2D PCA distance (Eq. 1) and (b) the shape cue using distance transform based matching (Eq. 2) is selected as the best match. We have evaluated various measures suggested by Kittler11 for combining (fusing) the decisions of the linear subspace classifiers and distance transform based correlation on the COIL-100 database. The experimental results presented in Sec. 3 indicate that the sum rule is the most robust classifier combination strategy to produce a fused expert opinion using 3D appearance and 2D shape cues. Thus, we propose the following criterion for generic object recognition using a combination of appearance and

shape cues:

Dλ = Dcomb + DTcorr    (3)

3. Experimental Results

To compare the performance of our proposed approach with the existing state-of-the-art techniques, we have used the COIL-100 database,14 which contains color images of 100 objects taken at pose intervals of 5 degrees (72 poses per object). COIL-100 has been previously used by2-4 to test the performance of their appearance based systems. Experiments were conducted with a part of the gallery chosen for training and the rest for testing. Different experiments were performed with training samples chosen for all objects in the gallery, obtained at increments of every 10, 15, 20, 25 and 30 degrees. The framework was trained (separately for each experiment) using each of these five training sets. The performance was analyzed using four test samples per object (400 test samples), selected at random from the rest of the gallery.

3.1. Appearance Based Recognition

Both 2D PCA3 and ICA10 are shown to capture the appearance representation of objects. Fig. 2 (b) and (c) show the linear subspace representations (manifolds) of an object from the COIL-100 database (Fig. 2 (a)) using the first 3 eigenvectors / independent components. These representations were computed using all 72 training samples of the object, acquired at every 5 degrees of pose. Each training sample represents a point in 3D, and the 72 points together yield a closed curve (the object is rotated a full 360 degrees in increments of 5 degrees). The manifold of the object captures its appearance representation by keeping the images of consecutive, visually similar poses close to each other in the manifold. For the task of search space reduction based on appearance cues, given a test image, we project it onto the universal linear subspace of the training samples and select a few rank-ordered samples closest to the test image.
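A sketch of this search-space reduction, under the assumption that the universal subspace basis (eigenvectors or independent components) is available as a matrix of row vectors; all names are illustrative:

```python
import numpy as np

def rank_candidates(test_vec, gallery_vecs, basis, k=10):
    # Project the flattened test image and gallery images onto the learned
    # linear subspace and return the k gallery samples nearest in appearance.
    t = basis @ test_vec               # test image in the subspace
    G = gallery_vecs @ basis.T         # gallery, one projected sample per row
    d = np.linalg.norm(G - t, axis=1)  # Euclidean distances in the subspace
    order = np.argsort(d)[:k]
    return order, d[order]
```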

3.2. Need for Shape Matching

Linear Subspace analysis techniques (2D PCA and ICA) give good results for objects having distinct


Fig. 2. a: An object from the COIL-100 database. b: Parametric eigenspace (2D PCA) for the object shown in (a) using the first three eigenvectors. c: ICA representation of the object in (a) using three independent components. Appearance is represented by a curve with a single parameter r (for rotation).


Fig. 3. a: Universal eigenspace (2D PCA) of two objects from the COIL-100 database with similar appearance properties; b: IC space of two objects showing an overlap. Overlap in both the 2D PCA and ICA spaces hinders discrimination.

appearance and shape characteristics, but fail for objects similar in appearance with minor differences in shape. Fig. 3 (a) and (b) show manifolds of two objects from the COIL-100 database generated using the first three eigenvectors/ICs of 2D PCA and ICA respectively. The linear subspace techniques show an overlap in the eigenspace/IC space (i.e., both methods fail to discriminate). We hence propose the use of 2D shape features to discriminate such objects and verify the results of 2D PCA/ICA.

3.3. Fusion of 3D Appearance and Shape Cues

Using linear subspace analysis we first select a set of 10 rank-ordered samples based on their distances in the eigenspace/IC space. Increasing the number of rank-ordered samples does not alter the performance of the system by much, but increases the computational complexity. The number of rank-ordered samples selected is empirically set to approximately 10% of the total number of objects in the database. These samples are then matched with the test object based on shape features, and the object with minimum distance obtained by fusion of appearance and shape cues is returned as the best match. We have analyzed the performance of various fusion measures (Sum, Min, Max and Product) suggested by Kittler11 for combining appearance and shape based classifiers. Table 1 shows the comparison of recognition accuracies of the various fusion measures evaluated on the COIL-100 database.


Fig. 4. Recognition rate for noisy images, with number of eigenvectors for 2D PCA = 10 and number of independent components = 10.

The table suggests that the sum rule (Dλ) works the best for fusion of appearance and shape cues. Note that the proposed criterion (Eq. 3) works better than simply using appearance cues (D2DPCA, DICA or Dcomb) for recognition.

The proposed approach gives a peak recognition accuracy of 98.25% using Dλ when tested with 400 samples on the entire COIL-100 database (pose increment of 10 degrees) containing 100 objects. The method provides a recognition rate of 91.375% with Dλ even when a sparse database (15%) was used for training. Most methods in the literature use only a subset of the 100 objects (typically 20 to 30) from the COIL-100 database for experiments. Table 2 shows a comparison of recognition rates of the techniques proposed by2-4 with our proposed framework. Our results provide better performance than those reported in.2-4 We have not used the popular leave-one-out validation scheme (in biometry) for testing, as our methodology is designed for recognition rather than verification.

We have also conducted experiments to compare the performance of the appearance based methods (D2DPCA, DICA and Dcomb) with our proposed approach (Dλ) for noisy images. Training samples were kept unaltered and the entire database was used for training. All the images used for testing were uniformly perturbed by noise of a certain degree: salt and pepper noise of a specific percentage (strength) was introduced to all the test images of the objects and the recognition accuracy was observed. The proposed method was again a winner among all previously published work and exhibited (see Fig. 4) the highest degree of tolerance with increasing strength of noise.

4. Conclusion

We presented a robust framework for generic object recognition using a combination of appearance and shape features. Analysis of the errors in object and pose recognition demonstrates the proposed approach to be very efficient. We use two linear subspace analysis techniques (2D PCA and ICA) to reduce the search space to a few objects based on appearance, and then select the closest match using a sum of distances in the linear subspace and distance transform based shape matching. Both these techniques are very efficient, making our system feasible for real-world applications. The proposed method outperforms the recognition accuracy of the existing schemes using only 1D PCA, 2D PCA and SVM for object recognition. The approach also works reasonably well when a sparse dataset is used for training. A fusion of classifiers based on appearance and shape features performs better than only appearance based classifiers even in noisy environments. There is, however, scope for analysis of the performance of the proposed technique for generic object recognition in the presence of illumination variance, occlusion and clutter.

Table 1. Comparison of recognition accuracies of D2DPCA, DICA, Dcomb, and the Sum (Dλ), Min, Max and Product rules for fusion of appearance and shape cues on the COIL-100 database, with number of independent components = 10 and number of eigenvectors for 2D PCA = 10, using 400 test samples (4 test samples per object).

Method      Pose Increment
            10       15       20       25       30
D2DPCA      96.375   93.625   93.25    89.5     88.75
DICA        97       94.5     93.75    90.25    90
Dcomb       97.125   94.5     93.75    91       90.625
Sum (Dλ)    98.25    95.875   93.75    91.5     91.375
Min         97       94.5     93.75    91.5     91
Max         96.625   93.625   93.5     89.375   89.5
Product     96.875   94       93.125   91.25    91

Table 2. Comparison of recognition rates of 1D PCA, 2D PCA, SVM, ICA and the proposed framework (with Dλ as distance measure) on the COIL-100 database with 10 degree pose increment (36 training samples per object for training).

Technique   # Objects   Test Samples   % Accuracy
1D PCA2     20b         720            100
            100c        400            95.125
2D PCA3     20b         720            100
            100c        400            96.375
            100c        3600           95.468
SVM4        32c         1152           96.03
ICA         100         400            97
            100         3600           96.639
Dλ          100         400            98.25
            100         800            98.118
            100         3600           97.694

b: results reported in literature. c: our implementation.

References

1. I. Biederman, Psychological Review 94, 115 (1987).
2. H. Murase and S. Nayar, International Journal of Computer Vision 14, 5 (January 1995).
3. P. Nagabhushan, D. Guru and B. Shekar, Pattern Recognition 39, 721 (April 2006).
4. M. Pontil and A. Verri, IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 637 (June 1998).
5. S. Belongie, J. Puzicha and J. Malik, IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 509 (April 2002).
6. H. Victor and M. F. Moura, Robust Reorientation of 2D Shapes Using the Orientation Indicator Index, in IEEE International Conference on Acoustics, Speech and Signal Processing, March 2005.
7. D. Forsyth and J. Ponce, Computer Vision: A Modern Approach (Pearson Education, 2003).
8. S. Edelman and H. Buelthoff, Vision Research 32, 2385 (1992).
9. N. Kanwisher, Nature Neuroscience 3, 759 (August 2000).
10. A. Hyvarinen, IEEE Trans. on Neural Networks 10, 626 (1999).
11. J. Kittler, P. Duin and J. Matas, IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 226 (March 1998).
12. M. Sanjay, S. Das and B. Yegnanarayana, Robust Template Matching for noisy bitmap images invariant to translation and rotation, in Indian Conference on Computer Vision, Graphics and Image Processing, New Delhi, India, December 1998.
13. D. Paglieroni, G. Ford and E. Tsujimoto, IEEE Pattern Analysis and Machine Vision 16, 740 (July 1994).
14. S. A. Nene, S. K. Nayar and H. Murase, COIL-100 Database (1996), http://www1.cs.columbia.edu/CAVE/research/softlib/coil-100.html.


Constructing Analyzable Models by Region Based Technique for Object Category Recognition

Yasunori Kamiya, Yoshikazu Yano and Shigeru Okuma

Graduate School of Engineering Nagoya University,

Furo-cho, Chikusa-ku, Nagoya City, Japan. E-mail: {kamiya, yoshi, okuma}@okuma.nuee.nagoya-u.ac.jp

One of the main challenges of object recognition is how to construct a model of an object with various appearances. For example, humans or cars are likely to have many different shapes and colors. Techniques for constructing such models have been proposed for object category recognition, but they do not consider analyzing the details of the constructed model, so it is difficult to directly modify the model to improve recognition performance. In this paper, we focus on an object's regions that can be labeled specifically and have meaning, and propose a technique for constructing analyzable models from images. In the experiment, we analyze the constructed model and show that the recognition rate rises when the model is modified.

Keywords: Region Based Technique; Analyzability of Model; Object Category Recognition

1. Introduction

This paper discusses the analyzability of constructed models for object recognition. It is very important to be able to analyze the model expressing an object class and get precise information about it when we want to find the cause of a bad recognition result, evaluate the performance of the model, or modify the model to construct a model of another, similar-looking object.

We especially target object category recognition, where we define an object class as a collection of objects that share many meaningful regions that can be labeled specifically. The objects contained in an object category differ in appearance from each other. For example, cars vary in shape, color, size, and in the details of headlights, body, grill, and windows. We are interested in the problem of constructing models for objects with many appearances.

Some of the research1-3 tackles this problem. The main idea of this research is to focus on combinations of local features expressed by small patches (e.g., 10×10 pixels). Patches represent the edges and textures of the objects and are located at the characteristic places of the objects. A part is built from patches that look alike, and the model expressing the object category is constructed from a number of parts.

The patch-based technique1-3 is not well suited to analyzing the model, because a model constructed from patches that have no concrete meaning makes it hard to understand why the model has the structure it does. The region-based technique4 is better, because this approach focuses on an object's regions that can be labeled specifically; for example, among the regions of a human there are the face, arms, body, and feet. A part in the region-based technique is constructed from regions of the same kind, and the model is constructed from a number of such parts. This model can express the structure of the object with these parts, so the model of the region-based technique is easier to understand in detail than the model of the patch-based technique. But this approach4 requires the regions to be manually defined and manually classified into groups of the same sort of regions for training each part.

When a recognition system is built for a task that deals with a lot of object categories, it is laborious to construct the category models manually. Automatic construction is important to realistically construct models for a lot of object categories. We propose a method that can automatically construct an analyzable region-based model from images. We target objects whose regions can be separated specifically by color and texture, and we do not discuss appearance variation caused by object posture.

2. Approach

2.1. Analyzability of model

Analyzing the model from its parts is a common approach in the patch-based technique, where the parts are analyzed from the patches that construct them.


Fig. 1. Difference between patches and regions

Correspondingly, in the region-based technique, we analyze the model from the regions that construct its parts.

Fig. 1 shows the differences between patches and regions: the left figures show examples of learning images and the positions of patches in the patch-based technique; the patch groups that construct each part of the patch-based technique are shown at A-H in the middle figures; the region groups that construct each part of the region-based technique are shown at a-g in the right figures.

In the patch-based technique, it is difficult to see the meaning of a patch, because the technique focuses on edges and textures, which are small bordering areas of the object or the like, and patches are easily confused with patches from other locations or of other kinds. In short, patches carry only minor information. The region-based technique, in contrast, makes it easy to analyze the model, because a region is an area that can have concrete meaning, can express a shape that differs for each sort of region, and regions of the same sort are gathered together.

Therefore, the behavior and construction of a region-based model are easier to understand than those of a patch-based one. The cause of a bad recognition result can be found, such as when important sorts of parts (e.g., head, shirt) were not constructed. The performance of the model can be easily evaluated, or the model can be modified to construct a model of another, similar-looking object. For example, we can improve a few parts by retraining them to raise recognition performance, or we can construct the model of another object by partially replacing parts with the parts of the other object.


2.2. Construct model

The model of the proposed method is composed of the following items, each of which is explained in the subsections below.

• The parts constructed from each region group.
• The relative positions between all parts.
• The combinations of parts.

2.2.1. Construct parts

To automatically extract regions from learning images that show targets without background, we use JSEG,5 an image segmentation method based on color and texture. Fig. 2 shows examples of regions extracted from an image of a man using this method.


Fig. 2. Examples of extracted regions

Suitable features for expressing regions differ for each sort of region. For example, most faces are flesh colored, while the shape of the face differs from person to person. Thus one of the suitable features for faces is color; shape, on the other hand, is not well suited


to expressing faces. The color and design of shirts, in contrast, vary widely, while the shape of most shirts is similar; thus shape is one of the suitable features for shirts.

All of the extracted regions are classified into region groups based on suitable features. First, all regions are plotted in the feature space of all candidate features. Differences in suitable features appear as differences in distribution tendency in the feature space: regions are clustered along a suitable feature axis and spread along an unsuitable feature axis, because the values of a suitable feature are similar while the values of an unsuitable feature vary. To classify all regions into region groups based on these differences in distribution tendency, we use a clustering method based on PCA, which is included in a technique for reducing the dimensionality of databases.6 Each part is expressed by some eigenvectors calculated by PCA in the clustering method.

In the proposed method, the regions can be accurately divided into correct region groups, because various sorts of features can be used and suitable features are selected. The parts are then expressed more precisely. This is an advantage that the patch-based technique does not have.

Table 1 shows the candidate features used in this paper. The dimension of the feature space constructed from these features is 20. The shape feature is 10 dimensional: the absolute values of the Fourier coefficients calculated by P-type Fourier descriptors.7

Table 1. Candidates of all features expressing regions

color and texture features: average color (R, G, B), sum of edge values (vertical, horizontal)
geometric features: area, aspect ratio, width, height, boundary length, shape

2.2.2. Relative position and Combination

The model of the proposed method stores the relative positions between all parts, because the representation of each relative position varies according to the sort of each part. For example, with a human, the relative position of face and breast is almost the same in every case, but the relative position of face and shoe differs, because height, and the distance between the two shoes, differ from case to case. To describe each individual relative position, we use the Nearest Neighbor method. The model also stores the relative position between parts of the same sort (e.g., between hand and hand).

describing the individual relative position, we used Nearest Neighbor method. Also, the model has a relative position between same sort parts(e.g. between hand and hand).

The model of the proposed method also contains a set of part combinations, because an object category admits several combinations of parts depending on its sort. For example, with a human, if there are a short sleeved shirt part, a long sleeved shirt part, and a long pants part, then two combinations exist: one contains the short sleeved shirt part and the long pants part, and the other contains the long sleeved shirt part and the long pants part.

In the learning phase, the relative positions and combinations of parts are constructed from the regions in the learning images, where each region is assumed to stand for the part of its sort (i.e., the part constructed from that region). The relative positions of parts are built from the relative positions between all regions of all learning images. The combinations of parts are built from the combinations of regions in the learning images, treating the parts appearing on one learning image as one combination of parts. All combinations from the learning images are added to the model, and the overlapping combinations are then removed.

2.3. Recognition

The object recognition process can be divided into two smaller processes. The first is to extract candidate object areas from the input image. The second is to recognize the candidate object areas. In this paper, we address recognition of candidate object areas; the first process is the target of our next research.

The recognition process is as follows. First, JSEG is used to extract regions from the candidate object area of the input image, and the feature values of the extracted regions are calculated. Next, the similarity between these regions and the individual parts of the model is measured. We use DFFS (Distance From Feature Space)8 as the evaluation value: when the DFFS is lower than a threshold value, the region is judged as correct. This technique of expressing a class (part) by some eigenvectors and discriminating classes by the distance between the input vector (the feature values expressing an extracted region) and those eigenvectors is called the subspace method.9
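A minimal sketch of the DFFS test, assuming each part is stored as a mean feature vector plus its PCA eigenvectors (stacked as rows); the names are illustrative:

```python
import numpy as np

def dffs(x, mean, eigvecs):
    # Distance From Feature Space: the energy of a region's feature vector
    # that lies outside the part's eigenvector subspace.
    centered = x - mean
    inside = eigvecs.T @ (eigvecs @ centered)  # projection onto the subspace
    return np.linalg.norm(centered - inside)

# A region is judged correct for a part when dffs(...) is below the
# threshold (0.2 in the experiments of Sec. 3).
```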


Last, relative position and combination are judged. The relative positions and combinations of the regions that were judged as correct are compared to the relative positions and combinations of the corresponding parts of the model. All the combinations between the parts of the model and the regions judged as correct are scored by the following evaluation value:

(evaluation value) = (Nr / Np) × (Ns / (Nr(Nr − 1)/2))

The first factor is the rate of regions judged as correct: Np is the number of parts in one of the model's part combinations, and Nr is the number of regions in the corresponding combination of regions judged as correct. Since Nr is at most Np, this factor lies between 0 and 1. The second factor expresses the correct-relationship rate of the Nr regions judged as correct: its denominator, Nr(Nr − 1)/2, is the number of all relationships that can possibly exist among the Nr regions, and Ns is the number of correct relationships. To decide Ns, the Nearest Neighbor method between parts is used. The second factor also lies between 0 and 1, so the evaluation value ranges from 0 to 1.
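In code, scoring one candidate combination reduces to a few lines; a sketch under the reconstruction of the formula given above:

```python
def evaluation_value(n_p, n_r, n_s):
    # n_p: parts in the model's combination; n_r: regions judged correct
    # (n_r <= n_p); n_s: correct pairwise relationships among those regions.
    arg1 = n_r / n_p                      # matched-region rate, in [0, 1]
    pairs = n_r * (n_r - 1) / 2           # all possible relationships
    arg2 = n_s / pairs if pairs else 0.0  # correct-relationship rate
    return arg1 * arg2

print(evaluation_value(n_p=6, n_r=5, n_s=9))  # (5/6) * (9/10) = 0.75
```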

3. Experimental Results

We construct the model and show its analyzability. Next, we modify the model based on the result of analyzing it, and show the recognition results using the modified model.

The experiment was carried out as follows. The object targeted in this experiment is man. We prepared 160 images: 80 images for learning and 80 for verification. The width of these images is 189-215 pixels and the height 474-564 pixels, and they are color images. The posture of each man is the same: standing and facing forward. We extracted the man's area from these images by hand to serve as the candidate object area. We prepared 80 images of miscellaneous objects as the non-man images, and extracted areas of a man's shape from them. Fig. 3 shows some man images, and Fig. 4 shows some images of miscellaneous objects. We set the DFFS threshold to 0.2.

The number of constructed parts is 84 in all. Fig. 5 shows the region groups that constructed some main parts, and the outlier regions that failed in part construction. These parts are almost all constructed from regions of the same sort, and the meaning of these parts is easy to understand. In addition, using Fig. 5 we can analyze what type of image the constructed model is weak on. For example, this model has neither a red shirt part nor a white pants part, so it is not good for an image of a man who wears a red shirt or white pants. If required, we can consider adding these parts. Also, some sorts of regions are divided into many parts while others are hardly divided at all, because the distribution of appearance types of a region is sometimes widely biased in the learning images; in such cases the proposed method judges that it should make two or more parts.

The number of constructed combinations is 78 in all. Table 2 shows some main combinations over the parts of Fig. 5; the letters in Table 2 denote the parts of Fig. 5. Each combination comprises hair, face, shirt, hand, pants, and sock. We can verify these combinations because we can read off the meaning of each part. Also, when combinations that should exist are missing from the model owing to a lack of learning images (e.g., some combinations of shirt and pants), we can add these combinations to the model to raise performance. These are advantages of the region-based technique that the patch-based technique does not have.

The top of Table 3 shows the recognition result. The recognition rate for the man images is 91.3% and for the non-man images 98.9%; both rates are high. The bottom of Table 3 shows the result for the modified model, where we added extra combinations to the normal model: 2160 combinations, each consisting of main parts of the sorts hair, face, shirt, hand, pants, and sock. The recognition rate for man rose by 1.2% and that for non-man did not fall, so the total performance rose by 1.2% through modifying the model. This result shows the effectiveness of modifying the model.

4. Conclusion and future work

We have proposed a method that automatically constructs, from images, analyzable models for objects


with many appearances. We focus on the objects' regions that can be labeled specifically. The experiment shows that the constructed model has analyzability, and the effectiveness of modifying the model. As the next task, we will try other modifications and consider extracting candidate areas from images.

References

1. R. Fergus, P. Perona and A. Zisserman, Object class recognition by unsupervised scale-invariant learning, in Proc. CVPR, 2003.
2. S. Agarwal and D. Roth, Learning a sparse representation for object detection, in Proc. ECCV, 2002.
3. M. Weber, M. Welling and P. Perona, Unsupervised learning of models for recognition, in Proc. ECCV, 2000.
4. K. Sung and T. Poggio, IEEE Trans. PAMI 20-1, 39 (1998).
5. Y. Deng and B. S. Manjunath, IEEE Trans. PAMI 23-8, 800 (2001).
6. K. Chakrabarti and S. Mehrotra, Local dimensionality reduction: A new approach to indexing high dimensional spaces, in Proc. VLDB, 2000.
7. Y. Uesaka, IEICE Trans. J67-A-3, 166 (1984).
8. B. Moghaddam and A. Pentland, IEEE Trans. PAMI 19-7, 696 (1997).
9. E. Oja, Subspace Methods of Pattern Recognition (Research Studies Press Ltd., 1983).

Fig. 3. Examples of man images.

Fig. 5. Examples of the constructed region groups and the outlier regions that failed in part construction.

Table 2. Some of the combinations (each column is one combination; the letters denote the parts of Fig. 5)

A, A, A, A
B, B, B, B
D, C, E, C
F, F, F, F
G, H, G, G
I, I, I, I

Table 3. Result of recognition (rr: recognition rate)

target image   rr (normal)     rr (modified)
man            91.3% (73/80)   92.5% (74/80)
non-man        98.9% (79/80)   98.9% (79/80)

Fig. 4. Examples of non-man images.


DRILL: Detection and Representation of Isothetic Loosely Connected Components without Labeling

P. Bhowmick and A. Biswas

Computer Science and Technology Department Bengal Engineering and Science University, Shibpur

Howrah, India Emails: [email protected], [email protected]

B. B. Bhattacharya

Advanced Computing and Microelectronics Unit Indian Statistical Institute

Kolkata, India Emails: [email protected]

A novel algorithm to detect the isothetic λ-connected components present in a binary image is proposed in this paper. The concept of an isothetic λ-connected component (ILCC) is newly introduced here; it is defined as the smallest isothetic polygon that contains a set S of λ-connected components (LCCs). Two components are said to be λ-connected if and only if the distance between them does not exceed λ. Contrary to the existing algorithms that use labeling to detect the (1-)connected components, in our work a combinatorial technique is used to find the LCCs, and hence the ILCCs, which represent the underlying object(s) in an elegant way. Since a real-world object may possess disconnectedness, noisy information, etc., the existing algorithms on component labeling are liable to produce more components than sought for, whereas in the proposed method two or more components that are spatially not far off are reported in the same ILCC. Further, the derived set of ILCCs is representable in an efficient high level description, which facilitates subsequent applications where the ILCCs are used. Experimental results including CPU time reinforce the elegance and efficacy of the proposed algorithm.

Keywords: λ-connected component, connected component labeling, image segmentation, digital geometry, image processing.

1. Introduction

Detection of connected components in a binary image is an indispensable task in image processing and pattern recognition.1,2 The set of extracted (appropriately) connected components in an image is central to many automated image analysis applications, some of which are:
- biometrics: writer identification;3
- document image: newspaper layout analysis,4 skew detection,5 script identification;6
- image coding;7 etc.

Detection of the connected components is usually aided by a labeling process that transforms a binary image into a symbolic image so that each connected component is assigned a unique label. Various algorithms on labeling (and detection thereof) of connected components have been proposed so far, following one of these principles: (i) forward and backward passes in alternation to propagate the label equivalences along the image raster until no label change occurs8 (the number of passes increases enormously with image complexity); (ii) two pass algorithms where the first pass assigns temporary labels and prepares the label equivalences, which are resolved in the second pass using the equivalence table9-11 (the equivalence tables may grow abnormally large, leading to high run time); (iii) using hierarchical tree structures (e.g., bintree, quadtree, octree, etc.)12-14 (efficient for some images, but degenerating to an array representation in the worst case); (iv) using parallel processors (e.g., in the structure of mesh, pyramid, etc.)15-19 (not suitable for ordinary sequential processing).

Strength of Our Method: A connected component labeling algorithm considers a connected component to be a maximal subset S of (object) points in which the points are "compactly connected" (see Def. 2.2). In an application dealing with real world digitized images, there may be some (small enough) components that should not be considered as separate components but should be treated as "loosely connected" with some other (big enough)


Fig. 1. Three CCs that make two ILCCs for one value of λ and a single ILCC for a sufficiently large value of λ. (a) Three CCs, namely C1, C2, and C3, which are basically (λ = 1)-connected components. (b) The number of ILCCs reduces to 2 for λ = 2, since the ILCCs for λ = 2 are defined on grid size g = λ/2 = 1. (c) The number of ILCCs reduces to 1 for λ = 4 (i.e., g = 2).

component(s) so that the underlying object relationship is preserved. The concept of compact connectedness (Def. 2.2) has, therefore, to be relaxed, which has been incorporated efficiently in the proposed method. Further, we have not resorted to the traditional labeling procedure for detection of the connected components. Instead, we have derived the isothetic loosely connected components (ILCC: Def. 2.5) using the isothetic envelope(s)20,21 of the object points in an efficiently representable way. For a brief explanation, see Fig. 1.

2. Proposed Method

Given below are some definitions that are newly introduced in this paper for finding the λ-connected components in an image.

Definition 2.1 (isothetic distance). The isothetic distance between two points p(x, y) and p'(x', y') is given by d⊤(p, p') = max{|x − x'|, |y − y'|}, and that between two sets/components, C and C', is d⊥(C, C') = min{min{d⊤(p, p') : p' ∈ C'} : p ∈ C}.

Definition 2.2 (CC). C is a compactly connected or (λ = 1)-connected component (1-CC or CC) if and only if for each point p(x, y) ∈ C, there exists another point p'(x', y') ∈ C such that d⊤(p, p') = 1.

Definition 2.3 (LCC). Two CCs, namely C and C', are λ-connected components (LCCs) if and only if d⊥(C, C') ≤ λ.

Definition 2.4 (set of LCCs). If S is a set of CCs, then it is a set of LCCs if and only if S is maximal in size, and for each C ∈ S, there exists some other C' ∈ S such that C and C' are LCCs.

Definition 2.5 (ILCC). If S is a set of LCCs, then the smallest isothetic polygon (whose edges are axis-parallel and vertices lie on the defining grid G) that contains all the components of S is said to be an isothetic λ-connected component (ILCC) of S.c

Definition 2.6 (relation LC). Two points pi and pj on the object are said to follow the relation LC (λ-connectedness), denoted by pi =λ pj, if and only if there exists an ordered sequence/path, namely (pi, pi+1, ..., pj−1, pj), of object points, such that d⊤(pk, pk+1) ≤ λ for k = i, i + 1, ..., j − 1.
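These definitions translate directly into code; below is a brute-force sketch (the paper's actual detection is combinatorial and far more efficient), with illustrative names:

```python
def d_top(p, q):
    # Isothetic (Chebyshev) distance between two points (Def. 2.1).
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

def d_perp(C1, C2):
    # Isothetic distance between two components: the minimum point-pair
    # distance (Def. 2.1); O(|C1||C2|), for illustration only.
    return min(d_top(p, q) for p in C1 for q in C2)

def are_lcc(C1, C2, lam):
    # Def. 2.3: two CCs are lambda-connected iff their distance <= lambda.
    return d_perp(C1, C2) <= lam

# Two small components 3 apart: they become LCCs once lambda reaches 3.
C1, C2 = [(0, 0), (1, 0)], [(4, 0), (5, 0)]
print(d_perp(C1, C2), are_lcc(C1, C2, lam=3))  # 3 True
```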

Theorem 2.1 (relation LC). The relation LC is an equivalence relation.

Proof. Let C1, C2, and C3 be three CCs such that, w.l.o.g., C1 =λ C2 and C2 =λ C3. Then, by Def. 2.6, for each point in C1, there exists a (λ-connected) path to each point in C2, and likewise for each point in C2 to each point in C3, whence transitivity is satisfied. The properties of reflexivity and symmetry are easily met, and thus follows the proof. □

2.1. Detection of ILCC

Let p be a grid point. Let S = {SLT, SRT, SLB, SRB}, where SLT, SRT, SLB, and SRB are the four adjacent g × g square cells with common vertex p such that g = ⌊λ/2⌋, as shown in Fig. 2. Let S1 be the subset of S

cFor each S, there is a unique ILCC, which implies that S and ILCC are identical in number in any given image.


such that each cell of S1 contains at least one object point, and let S0 be the subset of the remaining cells (each containing no object point) of S, so that S0 ∩ S1 = ∅ and S0 ∪ S1 = S.


Fig. 2. Four square cells with common vertex p.

Let eL, eR, eT, and eB be the respective left, right, top, and bottom edges incident at p. Then we have the following theorem on the edges that lie on the isothetic polygon enclosing the object points.

Theorem 2.2 (Polygon Edges). If E0 and E1 are the sets of edges of the cells in S0 and S1 respectively, then the set of edges of the isothetic polygon incident at p is given by E0 ∩ E1.

Proof. Let e be an edge present in both E0 and E1. Then e is an edge of some cell in S0 and of some other cell in S1. Hence the object lies on one side (cell) of e but not on the other side (cell), and is thereby partially enclosed (at p) by e. However, if e ∈ E0 and e ∉ E1, then e is not a polygon edge but an edge that lies completely outside the polygon; and if e ∉ E0 and e ∈ E1, then e lies completely inside the polygon. Hence the proof. □

From Theorem 2.2, it may be concluded that p is a vertex of the polygon if and only if E0 ∩ E1 either contains all four edges eL, eR, eT, eB (the case where the polygon intersects itself), or contains two perpendicular edges. Note that if the two edges are unidirectional, then p is an ordinary edge point on the polygon.
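The vertex test of Theorem 2.2 amounts to a parity check on the four adjacent cells; a sketch, where the cell-occupancy encoding is our own:

```python
def classify_grid_point(occ):
    # occ maps each of the four g x g cells around p ('LT', 'RT', 'LB',
    # 'RB') to True if it contains at least one object point.
    # An incident edge lies in E0 ∩ E1 exactly when its two adjacent
    # cells differ in occupancy.
    edges = {
        'L': occ['LT'] != occ['LB'],
        'R': occ['RT'] != occ['RB'],
        'T': occ['LT'] != occ['RT'],
        'B': occ['LB'] != occ['RB'],
    }
    on = {e for e, flag in edges.items() if flag}  # always even in size
    if len(on) == 4:
        return 'self-intersection'   # the polygon crosses itself at p
    if len(on) == 2:
        # two perpendicular edges -> vertex; two collinear -> edge point
        return 'edge-point' if on in ({'L', 'R'}, {'T', 'B'}) else 'vertex'
    return 'not on polygon'

print(classify_grid_point({'LT': True, 'RT': False, 'LB': False, 'RB': False}))
# -> 'vertex'
```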

Now, to find the ILCC, we need the following theorem.

Theorem 2.3 (ILCC). The ILCCs of an image are given by the set of isothetic polygons that enclose the object points.

Proof. Follows from Theorem 2.1 and Theorem 2.2; see Fig. 3. □

From Theorems 2.2 and 2.3, therefore, the set of ILCCs, with λ as an input parameter, is given by the set of isothetic polygons defined with grid size g = ⌊λ/2⌋. We check each grid point p for its vertex candidature using its four adjacent cells (Fig. 2), which finally gives the ILCC vertices. Changing the value of λ changes the ILCCs, as shown in Fig. 1.


Fig. 3. C1 and C2 are (parts of) two connected components, which would be λ-connected components (LCCs) (and would belong to the same ILCC, thereof) if and only if the isothetic distance d⊥(C1, C2) (Def. 2.1) does not exceed λ, which is true if and only if there exist two (horizontally/vertically/diagonally) adjacent unit squares (i.e., with either one or two grid points in common) defined by the grid lines (shown in thick black lines) in which C1 and C2 lie.

Representation of ILCC: The ILCCs obtained as above are, therefore, merely a set of (isothetic) polygons, which are representable in an efficient coded form as follows. While describing/traversing a polygon, any of its vertices may be considered as the start vertex, v0(x0, y0), and each of its other vertices indicates either a left turn (code = 0) or a right turn (code = 1). Hence, the coded form of each ILCC is given by

(x0, y0) · L0 · c1 · L1 · · · cn−1 · Ln−1,

where ci ∈ {0, 1} is the code of vertex vi, and Li is the length of the edge (vi, vi+1), i = 0, 1, 2, ..., n − 1, of the concerned ILCC. Such a representation has several advantages. For example, the perimeter, area, and other shape attributes of each ILCC can be obtained for subsequent applications. (The labeling procedure does not give the perimeter in such a straightforward way.)
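For instance, decoding the representation back into vertices, perimeter, and (via the shoelace formula) area is straightforward. In this sketch the start direction and the meaning of the turn codes (0 = left, 1 = right, in standard x-right/y-up coordinates) are assumptions, since the paper does not fix them:

```python
def decode_ilcc(x0, y0, lengths, turns, g=1):
    # Walk the coded boundary (x0, y0) . L0 . c1 . L1 ... c(n-1) . L(n-1).
    x, y, dx, dy = x0, y0, 1, 0          # assume traversal starts rightwards
    verts = [(x, y)]
    for L, c in zip(lengths, turns + [None]):  # no turn after the last edge
        x, y = x + dx * L * g, y + dy * L * g
        verts.append((x, y))
        if c is not None:                # rotate 90 degrees at each vertex
            dx, dy = (dy, -dx) if c == 1 else (-dy, dx)
    perimeter = g * sum(lengths)
    # Shoelace formula over the closed vertex chain gives the area.
    area = abs(sum(xa * yb - xb * ya
                   for (xa, ya), (xb, yb) in zip(verts, verts[1:]))) / 2
    return verts, perimeter, area

# A 2 x 1 rectangle: four edges, three coded (right) turns.
print(decode_ilcc(0, 0, lengths=[2, 1, 2, 1], turns=[1, 1, 1]))
# -> ([(0, 0), (2, 0), (2, -1), (0, -1), (0, 0)], 6, 2.0)
```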

3. Experiments and Results

We have implemented the algorithm DRILL in C under SunOS Release 5.7 Generic on a Sun Ultra 5/10, Sparc,


Fig. 4. A sample noisy logo image consisting of 102400 pixels, out of which 22588 pixels make the foreground/object (a). The number of CCs (i.e., λ = 1) for this image is as high as 473, which does not reflect the actual number of components that make up the image. (b) λ = 4, # ILCC = 9. (c) λ = 8, # ILCC = 7. (d) λ = 12, # ILCC = 1 with hole polygons. (e) λ = 34, # ILCC = 1 without any hole. For λ in the range of 6 to 8, we get 7 components (ILCCs), which is a better representation of the underlying image (c). For higher values of λ, the number of ILCCs gradually decreases, as shown in (d) and (e).

233 MHz, and tested it on various binary image files with varying size, shape, and content. The result of DRILL on a noisy image is shown in Fig. 4, and that on a document image in Fig. 5. In Table 1, the number of CCs versus that of ILCCs is given along with the CPU time to demonstrate the speed and efficiency of DRILL.



Fig. 5. A part of a typical document image of size 3.71 MB (a). Image (b) shows the ILCCs, 115 in number, for λ = 4, and image (c) shows the same, 6 in number, for λ = 28.

4. Conclusion

It is evident from the algorithm DRILL that a set of ILCCs extracted from a digital image is not only significantly smaller in size than the set of CCs extracted from the same image, but also carries a strong structural relationship of the underlying parts constituting the image. For a proper value of λ, each extracted ILCC considerably encapsulates the CCs that "appear to be connected" in the given image. Furthermore, the CPU time needed for ILCC extraction decreases drastically as λ is increased.


Table 1. Results of DRILL on some images.

image & size(a)     # CC    # ILCC & CPU time(b)
                            λ = 2         λ = 4        λ = 8
test-01 (1.00)      27254   462 (8.62)    105 (1.37)   28 (0.21)
test-02 (1.00)      40982   706 (14.62)   164 (2.32)   58 (0.37)
test-03 (1.00)      22685   378 (6.51)    42 (0.54)    16 (0.14)
logo-041 (0.10)     473     7 (0.06)      7 (0.04)     1 (0.01)
logo-205 (0.10)     731     12 (0.06)     9 (0.05)     5 (0.02)
docu-a12 (0.96)     4083    286 (6.54)    174 (0.95)   37 (0.30)
docu-a25 (0.96)     5425    308 (6.81)    168 (0.92)   42 (0.30)
docu-a35 (0.96)     4417    282 (6.20)    185 (1.01)   41 (0.32)

(a) size shown in ( ) in Mega-pixels. (b) CPU time shown in ( ) in millisecs.

References

1. A. Rosenfeld and A. C. Kak, Digital Picture Processing, 2nd ed. (Academic Press, New York, 1982).
2. J. Wang, TPAMI 20, 619 (1998).
3. L. Schomaker and M. Bulacu, TPAMI 26, 787 (2004).
4. P. E. Mitchell and H. Yan, Image Vis. Comput. 22, 307 (2004).
5. N. Liolios, N. Fakotakis and G. K. Kokkinakis, Improved document skew detection based on text line connected-component clustering, in ICIP, 2001.
6. L. Zhou, Y. Lu and C. L. Tan, Bangla/English script identification based on analysis of connected component profiles, in Document Anal. Sys., 2006.
7. J. Vass and X. Zhuang, Enhanced significance-linked connected component analysis for high performance error resilient wavelet image coding, in ICPR, 2000.
8. R. Haralick, Some neighborhood operations, in Real Time/Parallel Computing Image Analysis, eds. M. Onoe, J. K. Preston and A. Rosenfeld (Plenum Press, New York, 1981), pp. 11-35.
9. A. Rosenfeld and J. Pfaltz, J. ACM 13, 471 (1966).
10. Y. Shirai, Three-Dimensional Computer Vision (Springer-Verlag, 1987).
11. R. Lumia, L. Shapiro and O. Zuniga, CVGIP 22, 287 (1983).
12. H. Samet, J. ACM 28, 487 (1981).
13. H. Samet, TPAMI 7, 229 (1985).
14. J. Hecquard and R. Acharya, PR 24, 515 (1991).
15. P. Biswas, J. Mukherjee and B. Chatterji, PR 26, 1099 (1993).
16. A. Choudhary and R. Thakur, J. Parll. & Distrib. Computing 20, 78 (1994).
17. P. Bhattacharya, J. Sys. Archi. 42, 309 (1996).
18. M. Manohar and H. Ramapriyan, CVGIP 45, 133 (1989).
19. A. Moga and M. Gabbouj, TPAMI 19, 441 (1997).
20. P. Bhowmick, A. Biswas and B. B. Bhattacharya, LNCS 3776, 407 (2005).
21. A. Biswas, P. Bhowmick and B. B. Bhattacharya, LNCS 3540, 930 (2005).


Pattern Based Bootstrapping Method for Named Entity Recognition

Asif Ekbal and Sivaji Bandyopadhyay

Computer Science and Engineering Department Jadavpur University, Kolkata, India

Email: ekbal_asif12@yahoo.co.in, sivaji_cse_ju@yahoo.com

This paper reports on the development of a Named Entity Recognition (NER) system in Bengali. A pattern directed bootstrapping method has been used to develop the NER system from a tagged Bengali news corpus, developed from the web. Different tags of the tagged news corpus help to identify the seed data in the system. The training corpus is initially tagged against the different seed data and a lexical contextual seed pattern is generated for each tag. The entire training corpus is shallow parsed to identify the occurrences of these initial seed patterns, and further patterns are generated through bootstrapping. Patterns that occur in the entire training corpus above a certain threshold frequency are considered as the final set of patterns learnt from the training corpus. The test corpus is shallow parsed to identify the occurrences of these patterns and estimate the named entities. The system has been tested with four manually tagged Bengali news corpora (gold standard test sets) and has demonstrated the highest Recall, Precision and F-Score values of 63.3%, 84.8% and 73.2% respectively.

Keywords: Named Entity Recognition, Bootstrapping, Corpus, Supervised Learning

1. Introduction

Named Entity Recognition (NER) is an important tool in almost all Natural Language Processing (NLP) application areas such as Machine Translation, Information Retrieval, Question-Answering system and Automatic Summarization. Named Entity (NE) expressions such as the names of people, locations and organizations as well as date, time and monetary expressions are hard to analyze using traditional NLP because they belong to the open class of expressions, i.e., there is an infinite variety and new expressions are constantly being invented.

The problem of correct identification of NEs is specifically addressed and benchmarked by the developers of Information Extraction systems, such as the GATE system8 and the multipurpose MUSE system.9 Morphological and contextual clues for identifying NEs in English, Greek, Hindi, Romanian and Turkish have been reported in.10 The shared task of CoNLL-20031 was concerned with language independent NER. A statistical method for finding names in newswire articles in Chinese, English, French, Japanese, Portuguese and Spanish has been presented in.11 An unsupervised learning algorithm for automatic discovery of NEs in a resource-free language has been presented in.12 A framework to handle the NER task for long NEs with many labels has been described in.13 For learning generalized names in text, an algorithm, NOMEN, has been presented in.14 NOMEN uses a novel form of bootstrapping to grow sets of textual instances and their contextual patterns. A joint inference model has been presented in15 to improve Chinese name tagging by incorporating feedback from subsequent stages in an information extraction pipeline: name structure parsing, cross-document co-reference, semantic relation extraction and event extraction. It has been shown in16 that a simple two stage approach to handle non-local dependencies in NER can outperform existing approaches that handle non-local dependencies, while being much more computationally efficient. But in Indian languages, no work in this area has been carried out as yet.

The rest of the paper is organized as follows. Section 2 briefly describes the development of the tagged Bengali news corpus from the web. Section 3 deals with the NER task in Bengali. Section 4 presents the evaluation techniques and results. Finally, conclusions and future work are drawn in Section 5.

2. Development of the Tagged Bengali News Corpus from the Web

A web crawler has been developed that retrieves web pages in Hyper Text Markup Language (HTML) from the archive of a Bengali newspaper within a range of dates provided as input. The HTML files that contain news documents are identified, and the rest of the HTML files (e.g., advertisements, TV schedules, tenders, comics, weather etc.) are not considered further.

An HTML file consists of a set of tagged data


that includes Bengali and English texts. The HTML file is scanned from the beginning to look for tags like <font FACE = "Bengali Font Name"> ... </font>, where Bengali Font Name is one of the Bengali font faces defined in the news archive. The text enclosed within the font tags is the Bengali text, which is retrieved and stored in the database after appropriate tagging. The Bengali texts in the archive are written in dynamic fonts, and the Bengali pages are generated on the screen only when the system is connected to the web. Moreover, the newspaper uses graphemic coding, whereas orthographic coding is required for text processing tasks. Hence, Bengali texts written in dynamic fonts are not suitable for text processing activities. Font conversion tables have been developed that store each font code used by the Bengali newspaper and the corresponding ISCII (Indian Script Code for Information Interchange) code. This mapping is not always direct; in some cases we need to consider other font codes before and/or after the font code in the Bengali text from the news archive to derive the ISCII code. The ISCII codes are orthographic in nature and are useful for text processing tasks in Bengali as well as in other Indian languages. A separate ISCII to UTF-8 converter has been developed so that the corpus is stored in UTF-8 format also.
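A sketch of the font-tag scan in Python; the regular expression and function names are ours, and the actual crawler, the newspaper's font face names, and the font-code-to-ISCII conversion tables are specific to the archive and are not reproduced here:

```python
import re

FONT_RUN = re.compile(
    r'<font\s+FACE\s*=\s*"(?P<face>[^"]+)"[^>]*>(?P<text>.*?)</font>',
    re.IGNORECASE | re.DOTALL)

def bengali_runs(html, bengali_faces):
    # Return the text runs set in one of the known Bengali font faces;
    # each run would then be mapped through the font-code-to-ISCII table.
    return [m.group('text') for m in FONT_RUN.finditer(html)
            if m.group('face') in bengali_faces]
```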

A news corpus, whether in Bengali or any other language, has different parts like title, date, reporter, location, body etc. To identify these parts in a news corpus, the following tag-sets have been defined: agency (agency providing news), bd (Bengali date), body (body of the news document), date (date of the news document), day (day), ed (English date), header (header of the news document), location (the news location), p (paragraph), reporter (reporter name), t1 (1st headline of the title), t2 (2nd headline of the title), title (headline of the news document), table (information in tabular form), tc (table column), and tr (table row). At present the corpus contains approximately 34 million wordforms from a collection of 5 years of news documents.

3. Named Entity Recognition in Bengali

Bengali is one of the most widely used languages all over the world. It is the fifth most spoken language in the world, the second most spoken in India, and the national language of Bangladesh. Named Entity (NE) identification in Indian Languages (ILs) in general, and in Bengali in particular, is difficult and challenging. In English, an NE always appears with a capitalized letter, but there is no concept of capitalization in ILs. The system is based on supervised learning and uses several linguistic features. The location, reporter and agency tags of the tagged Bengali news corpus help to identify location names, person names and organization names respectively, and these identified NEs serve as the seed data in developing the NER system. The bd (Bengali date), day (Day) and ed (English date) tags are used to identify the different date expressions.

3.1. Creation of Seed Data

Three different seed lists of person, location and organization names have been created to train the NER system. The words extracted from the reporter, location and agency tags of the training corpus are treated as the initial seed data and put into the corresponding lists of person names, location names and organization names. In addition to these extracted words, the most frequently occurring person names, location names and organization names have been collected from the different domains of news and are kept in the appropriate seed lists. Linguistic features for identifying more NEs from the training corpus have been used in addition to the above seed lists. A list of clue words (kong, limited etc.) that often occur with organization names has been kept. These clue words are matched in the training corpus to create the organization clue word seed list: a pattern of three consecutive words in the left and right directions of the clue word is identified in the training corpus as a possible organization name and kept in the organization clue word seed list. The entries of this table are analyzed and the actual parts of the organization names are kept in the table. The system uses clue words like surname (e.g. mitra, dutta etc.), middle name (e.g. chandra, nath etc.), prefix word (e.g. shriman, shree, shrimati etc.) and suffix word (e.g. -babu, -da, -di etc.) for person names. A list of common words (e.g. neta, sansad, kheloar etc.) that often indicate the presence of person names has been kept. The system also considers the different affixes (e.g. -land, -pur, -lia etc.) that may occur with location names. The tagging algorithm also uses a list of words (e.g. kar, dhar, kabita etc.) that may appear as part of a named entity as well as among the common words. These linguistic clue words are kept in order to tag more and more NEs during training of the system. As a result, more potential patterns are generated in the lexical pattern generation phase.

3.2. Tagging with Seed List and Clue Words

The tagger places the left and right tags around each occurrence of the named entities of the seed list in the corpus.

For example, <person> sonia ghandhi </person>, <loc> kolkata </loc> and <org> jadavpur viswavidyalaya </org>.

After tagging the entire training corpus with the named entities from the seed lists, the algorithm starts tagging with the help of clue words like surname, middle name, prefix word and suffix word for person names and location names. The tagging algorithm also uses the list of words that may appear as part of a named entity as well as among the general words, and the organization clue word seed list. These different clue words are kept in order to tag more and more person, location and organization names in the training corpus. As a result, more potential patterns are generated in the lexical pattern generation phase.

Procedure (a sketch follows this list):
1. The training corpus is tagged with the person name, location name and organization name seed lists.
2. Match the training corpus with the organization clue word seed list, obtained from the organization clue words.
3. Match the entire training corpus with the list of surnames (person names).
4. If any match is found, then
   4.1. Extract the previous word from the corpus.
   4.2. Match it with the person name seed list.
      4.2.1. Tag the two consecutive words as a person name if a match is found.
      4.2.2. Match with the list of middle names if no match is found.
         4.2.2.1. Tag the three consecutive words as a person name when a match is found with the middle name list.
         4.2.2.2. Extract from the corpus the word one position to the left of the word obtained in Step 4.1 when no match is found with the middle name list.
            4.2.2.2.1. Check whether this word appears in the list of prefix words of person names. Tag the word obtained in Step 4.1 along with the surname as a person name when a match is found in the list of prefix words.
               4.2.2.2.1.1. Check whether the surname appears in the list of general words (which may be surnames or other words) when no match is found in the list of prefix words in Step 4.2.2.2.1.
                  4.2.2.2.1.1.1. The word is not analyzed further when it appears in this list.
                  4.2.2.2.1.1.2. The surname along with the previous word is tagged as a person name if the word does not appear in the list.
5. Match the entire training corpus with the list of prefix words. If any match is found, then extract the next word from the corpus if it is not already tagged, and tag the prefix and the next word as a person name.
6. Match the training corpus with the list of suffix words of person names. If a match is found, tag the word without the suffix as a person name.
7. Match the training corpus with the location clue word (suffix) list.
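The surname-driven branch of this procedure (steps 3 and 4) can be sketched in Python as below. This is our illustration, not the authors' code: the list contents are romanized stand-ins for the Bengali clue words, and only the main branches are covered.

# Hypothetical seed and clue-word lists (romanized for illustration).
person_seed   = {"sonia", "buddhadeb"}
surnames      = {"mitra", "dutta", "bhattacharjaa"}
middle_names  = {"chandra", "nath"}
prefix_words  = {"shriman", "shree", "shrimati"}
general_words = {"kar", "dhar", "kabita"}   # may be surnames or common words

def tag_person_names(tokens):
    """Steps 3-4 of the procedure: grow person names around surname matches."""
    tags = ["O"] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok not in surnames or i == 0:
            continue
        prev = tokens[i - 1]
        if prev in person_seed:                         # step 4.2.1
            tags[i - 1] = tags[i] = "PER"
        elif prev in middle_names and i >= 2:           # step 4.2.2.1
            tags[i - 2] = tags[i - 1] = tags[i] = "PER"
        elif i >= 2 and tokens[i - 2] in prefix_words:  # step 4.2.2.2.1
            tags[i - 1] = tags[i] = "PER"
        elif tok not in general_words:                  # step 4.2.2.2.1.1.2
            tags[i - 1] = tags[i] = "PER"
    return list(zip(tokens, tags))

print(tag_person_names("politburo neta buddhadeb bhattacharjaa jadavpur".split()))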

3.3. Lexical Seed Patterns Generation from the Training Corpus

For each tag T inserted in the training corpus in Section 3.2, the algorithm generates a lexical pattern p using a context window of width 4 around the left and right tags, i.e.,

p = [ l_{-2} l_{-1} <T> l_{+1} l_{+2} ]

where the l_{+-i} are the context words of p.

For example, politburo neta <person> buddhadeb bhattacharjaa </person> jadavpur bidhansabhai yields a pattern p generated by taking two words to the left of the left tag <person> and two words to the right of the right tag </person>, when "buddhadeb bhattacharjaa" is in the person name seed list and tagged as above in the training corpus. Any of the l_{+-i} may be a punctuation symbol (e.g. ;, ? etc.). In such cases, the width of the lexical pattern shrinks, and p may take any of the following forms:

l_{-2} l_{-1} <T> l_{+1}   (l_{+2} is a punctuation symbol)
l_{-1} <T> l_{+1}          (l_{-2} and l_{+2} are punctuation symbols)
l_{-2} l_{-1} <T>          (l_{+1} and l_{+2} are punctuation symbols)
l_{-1} <T>                 (l_{+1}, l_{+2} and l_{-2} are punctuation symbols)
<T> l_{+1}                 (l_{-2}, l_{-1} and l_{+2} are punctuation symbols)
<T> l_{+1} l_{+2}          (l_{-2} and l_{-1} are punctuation symbols)

These patterns are generalized by replacing the elements within the tags by the tags themselves. These different types of patterns form the set of potential seed patterns, denoted by P. All these patterns, derived from the different tags of the tagged training corpus, are stored in a Seed Pattern table with fields: pattern id (identifies any particular pattern), pattern type (Person name / Location name / Organization name / Miscellaneous name) and frequency (the number of times the pattern appears in the entire training corpus).
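A minimal Python sketch of the pattern generation step, under our own encoding choices (the punctuation set and tuple representation are assumptions; the paper's truncation at punctuation is approximated by dropping punctuation tokens):

PUNCT = {";", "?", ",", "."}

def make_pattern(left, tag, right):
    """Build p = [l-2 l-1 <T> l+1 l+2], shrunk where context is punctuation."""
    l2, l1 = ([None, None] + left)[-2:]
    r1, r2 = (right + [None, None])[:2]
    ctx_l = [w for w in (l2, l1) if w is not None and w not in PUNCT]
    ctx_r = [w for w in (r1, r2) if w is not None and w not in PUNCT]
    return tuple(ctx_l) + (f"<{tag}>",) + tuple(ctx_r)

print(make_pattern(["politburo", "neta"], "person",
                   ["jadavpur", "bidhansabhai"]))
# -> ('politburo', 'neta', '<person>', 'jadavpur', 'bidhansabhai')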

3.4. Generation of New Patterns through Bootstrapping

Every pattern p ∈ P is matched against the entire training corpus. At a place where the context of p matches, p predicts where one boundary of a name in the text would occur. The system considers all possible noun inflections during matching. At present there are 27 different inflections that may occur with the different Bengali noun words. Verb and adjective inflections have also been considered during pattern matching. There are 214 different verb inflections, and several inflections for the adjective, which may occur in four different forms based on the affixes attached either to an adjective word forming the different degree forms (e.g. comparative, superlative etc.) or to a noun word.

During pattern checking, the maximum length of a named entity is considered to be six words. Each named entity so obtained in the training corpus is manually checked for correctness. The training corpus is further tagged with these newly acquired named entities to identify further lexical patterns. The bootstrapping is applied on the training corpus until no new patterns can be generated. The patterns are added to the pattern set P with the 'type' and 'frequency' fields set properly, if they are not already in the pattern set P with the same 'type'.

Any particular pattern in the set of potential patterns P may occur many times but with different 'type' and with equal or different 'frequency' values. For each pattern of the set P, the probabilities of its occurrences as Person name, Location name and Organization name are calculated.

For candidate pattern acquisition, a particular threshold value of probability is chosen. If the probability for a particular pattern (along with the type) is less than this threshold value, then this pattern (only for that type) is discarded; otherwise it is added to a new pattern table. The same procedure is followed for all other patterns. All these patterns form the set of accepted patterns, denoted by Accept Pattern. A particular pattern may appear more than once with different types in this set.
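The thresholding step can be sketched as follows; the threshold value and data layout are our assumptions, not the paper's settings:

from collections import Counter, defaultdict

THRESHOLD = 0.3  # hypothetical probability cut-off

def accept_patterns(observations):
    """observations: iterable of (pattern, ne_type) occurrences."""
    counts = Counter(observations)
    totals = defaultdict(int)
    for (pat, _), c in counts.items():
        totals[pat] += c
    accept = {}
    for (pat, ne_type), c in counts.items():
        prob = c / totals[pat]          # P(type | pattern)
        if prob >= THRESHOLD:
            accept[(pat, ne_type)] = prob
    return accept

obs = [("p1", "PER")] * 8 + [("p1", "LOC")] * 2 + [("p2", "ORG")] * 3
print(accept_patterns(obs))  # p1/LOC falls below the cut-off and is dropped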

4. Evaluation and Results

The set of accepted patterns is applied on a test set. The process of pattern matching can be considered as a shallow parsing process. At a position where the context of any pattern matches, the system predicts one boundary of an NE, and the other boundary is obtained by matching the pattern with the test set accordingly. Only the patterns obtained through bootstrapping from the training corpus are used to identify NEs in the test corpus.

4.1. Training and Test Set

A supervised bootstrapping method has been followed to develop the NER system. The system is trained on a tagged Bengali news corpus. Some statistics of the training corpus are as follows: total number of news documents = 1819, total number of sentences in the corpus = 44432, average number of sentences in a document = 25, total number of wordforms in the corpus = 541171, average number of wordforms in a document = 298, total number of distinct wordforms in the corpus = 45626.

Four manually tagged test sets (Gold Test Sets) have been used to evaluate the Bengali NER system. Each test corpus has been collected from a particular news topic (i.e. international, national or business) and contains approximately 5000 sentences.

4.2. Evaluation Parameters

The Bengali NER system is evaluated in terms of Recall, Precision and F-Score. The three evaluation parameters are defined as follows:

Recall (R) = (No. of tagged NEs / Total no. of NEs present in the corpus) × 100%

Precision (P) = (No. of correctly tagged NEs / No. of tagged NEs) × 100%

F-Score (FS) = 2 × R × P / (R + P)

4.3. Evaluation Method

All the tags of the manually tagged test sets (Gold Test Sets) are removed before evaluating the NER system with the test sets. A test corpus may also be used in generating new patterns, i.e. it may be utilized in training the system after the system has been evaluated on it. The test sets have been ordered to make them available for inclusion in the training set. The four different test sets can be ordered in 24 different ways. Out of these 24 different combinations, a particular combination has been considered in the present work. It may be interesting to consider the other combinations and observe whether the results vary. Each pattern of the Accept Pattern set is matched against the first test corpus (Test Set 1) according to the pattern matching process described in Section 3.4, and the identified NEs are stored in the appropriate NE tables according to the category of the NE. A particular identified NE may be assigned more than one NE category. The same procedures described in Sections 3.2 to 3.4 are performed for this test set (Test Set 1). Now, the resultant Accept Pattern set is formed by taking the union of the initial Accept Pattern set and the Accept Pattern set of this test corpus. This resultant Accept Pattern set is used in evaluating the system with the next test corpus (Test Set 2). This process continues up to the final test set. So in each run, some new patterns may be added to the Accept Pattern set. As a result, the performance of the NER system gradually improves, since all the test sets have been collected from a particular news topic (national, international, sports etc.).

The NER system may identify data (from the test set) that are not actually NEs or parts of NEs. To handle this situation, a portion of the developed corpus has been used to build a knowledge base that contains the words that are not NEs, along with their frequencies of occurrence throughout that portion of the corpus. All the distinct wordforms are extracted from the corpus and added to a database, and the frequency of occurrence of each individual wordform is calculated. This database has been checked manually to exclude the different NEs, and the parts of NEs in the case of multiword NEs. Some erroneous data may exist due to printing mistakes, misspellings etc., and the frequencies of these entries are very low. A particular frequency value (here 2) has been chosen to delete those elements from the database, and thus the knowledge base is developed. The identified NEs of the test set are matched against this knowledge base, and the matched portion is searched in the list that contains the common words that may be NEs or may appear as parts of NEs. This is basically useful for deciding about the correct identification of a word or sequence of words as a named entity.

A particular pattern of the Accept Pattern set may assign more than one NE category to any identified NE of the test corpus. In order to solve this NE-classification disambiguation problem, the different linguistic features used as clue words in Section 3.1 are used to assign the actual categories to the identified NEs. Once the actual category of a particular NE is determined, it is removed from the other NE category tables.
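A minimal sketch of the knowledge-base construction just described; whether the cut-off is strict or inclusive is our assumption:

from collections import Counter

def build_non_ne_kb(wordforms, known_nes, cutoff=2):
    # Count wordform frequencies over the corpus portion, excluding the
    # manually identified NEs; drop rare entries (frequency <= cutoff),
    # which are likely printing mistakes or misspellings.
    freq = Counter(w for w in wordforms if w not in known_nes)
    return {w: c for w, c in freq.items() if c > cutoff}

kb = build_non_ne_kb(["ache", "ache", "ache", "achhe"], known_nes={"kolkata"})
print(kb)  # the one-off misspelling 'achhe' is dropped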

The Bengali date, Day and English date can be recognized from the bd, day and ed tags of the tagged news corpus. Some person names, location names and organization names can be identified from the reporter, location and agency tags in the corpus.

The system has been evaluated in two different ways. Recall, Precision and F-Score are computed as a whole (not for each individual NE category) in Evaluation Technique 1. In Evaluation Technique 2, the three evaluation parameters are computed for each individual NE category, i.e. for person name, location name, organization name and miscellaneous name. The Overall F-Score (O-FS) in Evaluation Technique 2 is computed as: (FS for Person name + FS for Location name + FS for Organization name + FS for Miscellaneous name) / 4.

4.4. Results and Discussion

The performance of the system on the four test sets collected from a particular news topic is presented in Tables 1 to 3. Table 1 gives information about the tagged and correctly tagged NEs (tagged by the proposed system) for each test set, and also shows the results for the three parameters using Evaluation Technique 1. Table 2 shows the results of the system on test sets 1 and 2 using Evaluation Technique 2. The results of the system on test sets 3 and 4 using Evaluation Technique 2 are shown in Table 3. The following abbreviations have been used: Person name (PN), Location name (LOC), Organization name (ORG) and Miscellaneous (MISC).

Improvement in the Recall, Precision and F-Score values for a particular test set occurs as the previous test sets in the order are included as part of the training data, i.e., new lexical patterns learnt from the previous test sets are included as part of the training data.


Table 1. Results with Evaluation Technique 1 (TSN: Test Set Number, TNE: Total no. of NEs in the test set, TTNE: Total no. of tagged NEs, CTNE: No. of correctly tagged NEs)

TSN   TNE    TTNE   CTNE   R      P      FS
1     6598   4037   3290   61.2   81.5   69.9
2     7338   4600   3823   62.7   83.1   71.5
3     7155   4421   3710   61.8   83.9   71.9
4     7371   4665   3956   63.3   84.8   73.2

Table 2. Results for Test Set 1 and Test Set 2 with Evaluation Technique 2

           Test Set 1                    Test Set 2
NE type    R      P      FS      O-FS    R      P      FS      O-FS
PN         71.6   79.2   75.0            72.8   80.3   76.37
LOC        66.2   78.2   71.7    66.65   67.9   77.2   72.3    68.43
ORG        64.7   77.2   70.3            66.3   76.3   70.95
MISC       33.2   98.3   49.6            37.2   99.1   54.09

Table 3. Results for Test Set 3 and Test Set 4 with Evaluation Technique 2

           Test Set 3                    Test Set 4
NE type    R      P      FS      O-FS    R      P      FS      O-FS
PN         73.3   81.2   77.04           73.9   82.9   78.14
LOC        67.3   78.2   72.34   69.01   68.8   79.7   73.85   69.87
ORG        67.2   76.5   71.27           68.7   76.9   72.57
MISC       38.7   97.4   55.4            38.1   98.3   54.91

In case of Evaluation Technique 2, the Overall F-Score values for all the test sets are less than the F-Score values obtained with Evaluation Technique 1. This is because Evaluation Technique 1 deals only with the identification of the NEs as a whole, and so the individual NE categories are not considered. If any particular NE is incorrectly classified, it is not counted as an error in Evaluation Technique 1; but in Evaluation Technique 2, any incorrectly classified NE is treated as an error. As a result, the Precision values of the individual NEs are reduced, which in turn affects the F-Score and Overall F-Score values. At present, the system can only identify the various date expressions, but cannot identify the other miscellaneous NEs like monetary expressions and time expressions. The F-Score values of miscellaneous NEs affect the performance of the system while evaluating with Evaluation Technique 1.

5. Conclusion and Future Works

In NER, patterns can further be generalized by replacing each l_{+-i} of p with its lexical category (each word of the pattern is replaced by its part-of-speech information). More morphological knowledge of the words may be helpful while matching the patterns against the corpus.

Currently, we are working to include an HMM based POS tagger and a rule based chunker into the system. Increasing the size of the knowledge base containing non-NEs and including it also at the time of training could further enhance the performance. More linguistic knowledge could be helpful in the NE-classification disambiguation problem and, as a result, the precision values of the different NE categories would increase. Observation of the results with the other orders of the test sets would be an interesting experiment.

References

1. E. F. Tjong Kim Sang and F. De Meulder, Introduction to the CoNLL-2003 shared task: Language independent named entity recognition, in Proceedings of CoNLL-2003, (Edmonton, Canada, 2003).

2. E. Black, F. Jelinek, J. Lafferty, R. Mercer and S. Roukos, Decision tree models applied to the labeling of text with parts-of-speech, in DARPA Workshop on Speech and Natural Language, (Harriman, NY, 1992).

3. A. Ratnaparkhi, A maximum entropy part-of-speech tagger, in Proc. of EMNLP'96, 1996.

4. J. Lafferty, A. McCallum and F. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, in Proceedings of the 18th International Conference on Machine Learning, 2001.

5. D. Jurafsky and J. H. Martin, Speech and Language Processing (Prentice-Hall, 2000).

6. A. J. Viterbi, IEEE Transactions on Information Theory 13, 260 (1967).

7. T. Brants, TnT - a statistical part-of-speech tagger, in Proceedings of the Sixth Conference on Applied Natural Language Processing (ANLP-2000), 2000, pp. 224-231.

8. H. Cunningham, Computing and the Humanities (2001).

9. D. Maynard, V. Tablan, H. Cunningham and Y. Wilks, Computing and the Humanities (2003).

10. S. Cucerzan and D. Yarowsky, Language independent named entity recognition combining morphological and contextual evidence, in Proceedings of the Joint SIGDAT Conference on EMNLP and VLC, 1999.

11. D. D. Palmer and D. S. Day, A statistical profile of the named entity task, in Proceedings of ANLP-97, 1997.

12. A. Klementiev and D. Roth, Weakly supervised named entity transliteration and discovery from multilingual comparable corpora, in Proceedings of the COLING-ACL 2006, (Sydney, Australia, 2006).

13. D. Okanohara, Y. Miyao, Y. Tsuruoka and J. Tsujii, Improving the scalability of semi-Markov conditional random fields for named entity recognition, in Proceedings of the COLING-ACL 2006, (Sydney, Australia, 2006).

14. R. Yangarber, W. Lin and R. Grishman, Unsupervised learning of generalized names, in Proceedings of the 19th International Conference on Computational Linguistics (COLING-2002), 2002.

15. H. Ji and R. Grishman, Analysis and repair of name tagging errors, in Proceedings of the COLING-ACL 2006, (Sydney, Australia, 2006).

16. V. Krishnan and C. D. Manning, An effective two-stage model for exploiting non-local dependencies in named entity recognition, in Proceedings of the COLING-ACL 2006, (Sydney, Australia, 2006).


SCOPE: Shape Complexity of Objects using Isothetic Polygonal Envelope

Arindam Biswas and Partha Bhowmick

Bengal Engineering and Science University, Shibpur Howrah, India

E-mail: [email protected], [email protected]

Bhargab B. Bhattacharya

Indian Statistical Institute Kolkata, India

E-mail: [email protected]

In this paper, we present a novel approach to measure the shape complexity of objects using the isothetic polygonal envelope. The novelty lies in the fact that to derive the shape complexity, neither exact contour extraction nor a complete enumeration of the object is required. The complexity measure is derived by merging the consecutive concavities and convexities of the isothetic envelope in multiple tiers to capture the spatial complexity, normalized, in each step, by the polygon perimeter. Since we extract the isothetic polygon against an imposed background grid, the measure is robust to noise in the object. As the method is inherently amenable to a multiscale approach by varying the grid size, the analysis of the complexity of different objects is shown. We have used different arbitrary shapes in experimentation, and the results conform to the structural complexity of the objects.

Keywords: Shape Complexity, Object Shape, Pattern Recognition, Digital Geometry, Isothetic Polygon.

1. Introduction

Shape complexity has been a very important measure in areas such as computer vision,1 satellite imagery,2 geographic information systems,3 and medical imaging.4 A shape complexity measure also plays a crucial role in designing computationally efficient shape classification algorithms.5,6

A number of prior works on shape complexity are present in the literature. G. Toussaint7 has proposed a method based on polygon triangulation. In another approach,8 the sinuosity of the polygon boundary is measured to depict the shape complexity. Shape complexity has also been calculated in terms of the entropy of the curvature of the object contour.9 In yet another approach, the shape context of each selected point on the boundary is calculated.10 In fact, a large number of works have been done using either the boundary of the silhouette image, disregarding the hole or internal boundaries, or a set of points on the extracted edge of the objects. Methods have also been proposed based on the Hausdorff distance,11 eigenvector or modal-matching based approaches,12 and Kolmogorov complexity.13 Since these methods are formulated in the Euclidean domain, they involve complex calculations with floating point operations, and are therefore computationally expensive.

On the contrary, we propose a novel algorithm to capture the complexity of an image in the digital plane using some digital geometric properties14,15 of an isothetic polygon. The algorithm is based on a recent work,16-18 which uses only comparison and addition operations in the integer domain, and is therefore very fast, efficient, and robust. The shape complexity measure in this algorithm is derived using the properties of the tight isothetic polygon that contains the object. Since a square is the simplest figure in the realm of 2D digital geometry, our shape complexity technique evaluates a square as the simplest possible object. That is, the shape complexity measure of an object (polygon) is zero if it is a square. Further, in our method, the complexity measure also takes into account the complexity due to the presence of internal contours in an object, which is not considered in some of the existing works.19,20

The paper is organized as follows. The proposed method is described in Section 2. Section 3 deals with the experimental results. In Section 4, we conclude with the discussion on future prospects of the work.

2. Proposed Work

An isothetic polygon is a polygon whose edges are axis parallel. The algorithm TIPS16 extracts the tight polygonal cover of any object imposed on a background grid. The grid size can be varied to achieve a desired degree of compactness of the isothetic cover.

Fig. 1. Isothetic polygonal cover P for object A, where P is a set of polygons consisting of exactly one outer polygon and hole polygons, if any. The object is tightly enclosed in the region obtained by subtracting the hole polygon(s) from the outer polygon.

2.1. Extraction of Isothetic Cover

Given an object A defined in the two-dimensional digital plane Z^2, and a set of uniformly spaced horizontal and vertical grid lines, G = (H, V), where H and V represent the sets of equispaced horizontal and vertical grid lines respectively, the isothetic polygonal envelope, P, is such that the following conditions are satisfied:

(1) the object A has exactly one outer polygon and may have some hole polygons;
(2) each point p of A lies inside P;
(3) each vertex of P is a grid point;
(4) the area of P is minimized.

The isothetic polygonal envelope P for a given object A is shown in Fig. 1.

2.2. Proposed Method

Once the isothetic polygon describing the object is extracted, it is encoded as a string, s, consisting of the different types of grid points (including the vertices) along the isothetic cover. There can be three types of such grid points on the polygonal envelope, namely 1, 2, and 3. If i denotes the type of a grid point on the polygon, then the internal angle of the polygon at that grid point is i × 90°. Clearly, types 1 and 3 denote the vertices with internal angles 90° and 270° respectively, while type 2, with internal angle 180°, indicates a simple edge point where the polygon edge propagates in the same direction. The three types of grid points are illustrated in Fig. 2. Thus, an occurrence of 3 in s indicates a concavity, whereas 1 signifies a convex corner of the polygon, and thereby of the object. A 2 indicates a straight edge of the polygon, contributing no significant information regarding the shape complexity.

Fig. 2. Grid point types 1, 2 and 3 on the polygonal envelope (shown partially).

In an isothetic polygon, it can be shown that each concavity (3) on the polygon contour matches with a convexity (1), leaving four 90° vertices,21 i.e. (1111). Thus each polygon, encoded in a string s, can be reduced to (1111) in the case of an outer polygon, and to (3333) in the case of a hole polygon (as shown in Fig. 1), by repeatedly removing consecutive occurrences of 3 followed by 1. For example, a square will have the polygon code s = (1111).

The string s is reduced by applying the following reduction rules:

(1) Edge Reduction Rule: 2+ → ε
(2) Vertex Reduction Rule: (31)+ → ε

The edge reduction rule does not contribute to the complexity of the polygon, so these reductions are not included in the total number of reductions. This is evident from Fig. 2, in which the top edge of the isothetic polygon consists of several consecutive 2s, indicating no significant variation on the contour of the object. On the contrary, on the right side of the polygon, the number of consecutive 2s is smaller and the occurrence of 31s is frequent, which indicates a complex nature of the associated part of the contour.

Fig. 3. The polygon is encoded by the string s = 12(31)312(31)1(31)21(31)1232(31)32(31)3(31)123 after the application of the edge reduction rule on s. The string s is reduced to 1111 subsequently.

Let there be m polygons in P, each polygon encoded as a string s_i. Then each s_i will be reduced to either (1111) or (3333), i.e. to a substring of length 4, in at most n iterations, say. The shape complexity (SC), given by

SC = \frac{\sum_{k=1}^{n} \sum_{i=1}^{m} r_i^{(k)}}{\sum_{k=1}^{n} \sum_{i=1}^{m} L_i^{(k)}}

where r_i^{(k)} denotes the number of reductions of string s_i and L_i^{(k)} denotes the length of the string s_i, both in the k-th iteration, is computed by the algorithm SCOPE as follows.

Algorithm SCOPE
1. Extract the isothetic polygonal cover P of A.
2. Encode P as strings s = (s_i), i = 1, ..., m.
3. Apply rule (1) on s to remove all 2s from s.
4. r ← 0, L ← 0.
5. Until each substring s_i has length 4:
   5.1. Apply rule (2) on s_i.
   5.2. r ← r + Σ_i r_i.
   5.3. L ← L + Σ_i L_i.
6. SC = r / L.

In the expression for the shape complexity SC, since r_i^{(k)} < L_i^{(k)} for every i and k, SC is always less than 1. It should also be mentioned that in the above algorithm we have assumed one outer polygon; however, for the hole polygons also, the number of reductions and the corresponding string lengths are taken into account while calculating the final shape complexity. It may be noted that while merging the concavities and convexities, the string s may be treated as circular.
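A minimal Python sketch of SCOPE, assuming valid isothetic polygon codes as input (the circular-string handling and the per-pair reduction count are our interpretation):

import re

def shape_complexity(codes):
    # codes: one string per polygon over {'1','2','3'}; valid codes reduce
    # to 1111 (outer polygon) or 3333 (hole polygon).
    strings = [re.sub("2+", "", s) for s in codes]   # rule (1), not counted
    r_total = l_total = 0
    while any(len(s) > 4 for s in strings):
        for i, s in enumerate(strings):
            if len(s) <= 4:
                continue
            reduced = re.sub("(31)+", "", s)         # rule (2)
            if len(reduced) == len(s):               # treat s as circular
                reduced = re.sub("(31)+", "", s[1:] + s[0])
                if len(reduced) == len(s):           # not a valid code
                    return r_total / max(l_total, 1)
            r_total += (len(s) - len(reduced)) // 2  # reductions this pass
            l_total += len(s)
            strings[i] = reduced
    return r_total / l_total if l_total else 0.0

print(shape_complexity(["1111"]))    # a square gives 0.0
print(shape_complexity(["111131"]))  # an L-shaped hexagon gives 1/6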

2.3. Demonstration

Fig. 3 shows an object, the corresponding isothetic polygon, and the way the string is reduced in each iteration. For example, in iteration 1, r_1^{(1)} = 8 and L_1^{(1)} = 36. It is also clear from the demonstration that the higher the complexity of an object, the larger the number of iterations needed for the reduction.

3. Results and Discussions

We have tested the algorithm on two sets of images: 1) test images, which are arbitrarily drawn objects, and 2) logo images, obtained from Prof. A. K. Jain on request. As shown in Fig. 4, the complexity of a square is 0, whereas that of a convoluted shape (f) is 0.74. It is evident from the shape of the isothetic polygon of the square image that there are no 270° vertices, so there are zero reductions, thereby assigning zero complexity to it. Also, for the logo images, as shown in Fig. 5, the shape complexity increases with increasing structural complexity. This method is robust compared to methods that use the shape contour, because contour extraction is prone to noise in the image. The complexity of the algorithm is output sensitive, and depends upon the number of grid points lying on the isothetic polygon corresponding to the object.

(a) 0.00  (b) 0.07  (c) 0.08  (d) 0.15  (e) 0.22  (f) 0.74
Fig. 4. The shape complexity values for a set of synthetic images.

(d) 0.130  (e) 0.150  (f) 0.167
Fig. 5. The shape complexity values for a set of logo images.

4. Conclusions and Future Directions

In this paper, we have presented a shape complexity measure of objects in the digital geometry domain without using any Euclidean measure. The computations being in the integer domain, apart from one division operation^a in the shape complexity expression, the method is very fast. As the variation of the grid size yields the desired compactness of the object, the method readily lends itself to multiscale treatment. Also, as a corollary to this work, a number of questions arise in the emerging domain of digital geometry, such as: how many distinct isothetic polygons can be drawn on a given grid plane, how difficult is it to generate the exhaustive set of isothetic polygons on an n × n grid, etc. Another interesting problem may be to generate the string(s) that represent the isothetic polygon(s) of a given complexity. We will present our observations on these areas in forthcoming papers.

^a The 0-1 scale of SCOPE needs the division, which can be avoided if we consider the upper limit of the measure as max Σ_k L^{(k)} for a given (normalized) background grid.


References

1. K. Siddiqi and B. B. Kimia, Parts of visual form: Computational aspects, in Proc. IEEE CVPR, 1993.
2. L. A. Oddo, Global shape entropy: A mathematically tractable approach to building extraction in aerial imagery, in Proc. 20th SPIE AIPR Workshop, 1992.
3. T. Brinkhoff, H.-P. Kriegel and R. Schneider, Measuring the complexity of polygonal objects, in Proc. Third ACM International Workshop on Advances in Geographical Information Systems, 1995.
4. R. M. Cesar and L. Costa, Review of Scientific Instruments 68, 2177 (1997).
5. A. Tsai, W. M. Wells, S. K. Warfield and A. S. Willsky, Medical Image Analysis 9, 491 (2005).
6. Z. Barutcuoglu and C. DeCoro, Hierarchical shape classification using Bayesian aggregation, in IEEE SMI'06, 2006.
7. G. Toussaint, Visual Computer 7, 280 (1991).
8. B. Chazelle and J. Incerpi, ACM Transactions on Graphics 3, 135 (1984).
9. D. L. Page, A. Koschan, S. R. Sukumar et al., Shape analysis algorithm based on information theory, in IEEE ICIP'03, 2003.
10. S. Belongie, J. Malik and J. Puzicha, IEEE Trans. PAMI 24, 509 (2002).
11. D. Huttenlocher, G. Klanderman and W. Rucklidge, IEEE Trans. PAMI 15, 850 (1993).
12. S. Sclaroff and A. Pentland, IEEE Trans. PAMI 17, 545 (1995).
13. Y. Chen and H. Sundaram, Estimating the complexity of 2D shapes, in Multimedia Signal Processing Workshop 2005, 2005.
14. R. Klette and A. Rosenfeld, Digital Geometry (Morgan Kaufmann, San Francisco, USA, 2005).
15. B. Yip and R. Klette, Pattern Recognition Letters 21, 1275 (2003).
16. A. Biswas, P. Bhowmick and B. B. Bhattacharya, TIPS: On finding a tight isothetic polygonal shape covering a 2D object, in SCIA 2005, 2005.
17. P. Bhowmick, A. Biswas and B. B. Bhattacharya, Isothetic polygons of a 2D object on generalized grid, in PReMI 2005, 2005.
18. A. Biswas, P. Bhowmick and B. B. Bhattacharya, MuSC: Multigrid shape codes and their applications to image retrieval, in CIS 2005, 2005.
19. D. Sharvit, J. Chan, H. Tek and B. Kimia, J. Visual Communication and Image Representation 9, 366 (1998).
20. Y. Gdalyahu and D. Weinshall, IEEE Trans. PAMI 21, 1312 (1999).
21. S. Sur-Kolay and B. B. Bhattacharya, Inherent nonslicibility of rectangular duals in VLSI floorplanning, in FSTTCS, 1988.


Segmental K-Means Algorithm Based Hidden Markov Model for Shape Recognition and its Applications

Tapan Kumar Bhowmik

IBM India Pvt. Ltd. Millennium City, Salt Lake, Kolkata - 700091, India.

E-mail: tbhowmik@in.ibm.com

Swapan Kumar Parui

Computer Vision and Pattern Recognition Unit Indian Statistical Institute, Kolkata-700108, India.

E-mail: [email protected]

Manika Kar and Utpal Roy

Dept. of Computer and System Sciences Visva-Bharati, Santiniketan

Birbhum-731235, India. E-mail: [email protected], [email protected]

In this paper, we propose a hidden Markov model (HMM) for object recognition on the basis of the shape of its boundary. The boundary can be looked upon as a sequence of certain elementary shapes. These elementary shapes are in fact the states in the model. An HMM is quite appropriate in applications where the shape of the boundary varies significantly from one sample to another within a class. For the task of learning the HMM parameters, the segmental K-means algorithm is used. This algorithm was originally developed for speech recognition, where the underlying model is not fully connected; here it is modified for a fully connected HMM. The state distribution is assumed to be multivariate Gaussian. The proposed scheme has been tested on two databases of handwritten numeral and character shapes of Bangla script. The results are quite encouraging.

1. Introduction

Shape based object recognition is an important problem in computer vision. Features, on the basis of which such recognition is done, belong to two broad groups, namely, external and internal. External features are those derived from the boundary of an object while internal features are based on the interior points of the object. However, the object boundary is very effective in several applications and different techniques have been proposed in the literature to make use of the information that the boundary contains.1

In the present paper, the object boundary is thought of as a sequence of certain elementary shapes which are not deterministic but have certain distributions. A natural recognition scheme in such a framework is a hidden Markov model (HMM) where the elementary shapes represent the states. In applications where the object boundary has a lot of variation due to some reason, an HMM is quite effective in the recognition tasks. In the present study the proposed HMM based recognition scheme has been tested on two databases of handwritten numeral and character patterns of Bangla script.

Though the initial applications of an HMM were in speech recognition,2 its performance in shape recognition has been found to be quite encouraging.3

In the present study, the states of the HMM are not determined a priori, but are determined during the parameter estimation process. The different numbers of states for the different character HMMs are identified automatically during learning, avoiding the complexity of EM based methods. Initially, the parameters of the HMM are estimated by the K-means clustering algorithm, and then the parameters are re-estimated through the segmental K-means algorithm.

2. Introduction to Hidden Markov Model

A discrete-time hidden Markov model (HMM) is a probabilistic model that describes a random sequence O = O_1, O_2, ..., O_T as an indirect observation of an underlying (hidden) random sequence Q = q_1, q_2, ..., q_T, where the hidden process is Markovian, even though the observed process may not be so. A discrete-time HMM λ = (π, A, B) is defined by the following elements. S = {s_1, s_2, ..., s_N} is the set of states, where N is the number of states in the model. We denote the individual symbols as V = {v_1, v_2, ..., v_M}, where M is the total number of distinct observation symbols per state. The initial state distribution is given by π = {π_i}, where π_i = P(q_1 = s_i), 1 ≤ i ≤ N, and P indicates probability. The state transition probability distribution is given by A = {a_ij}, where a_ij = P(q_{t+1} = s_j | q_t = s_i), 1 ≤ i, j ≤ N. The observation symbol probability distribution in state j is given by B = {b_j(k)}, where b_j(k) = P(v_k at t | q_t = s_j), 1 ≤ j ≤ N, 1 ≤ k ≤ M. O_t will denote the observation at time t.

3. HMM for Object Recognition

Let the boundary of an object be represented by a sequence of observations, defined as O = O_1, O_2, ..., O_T, where O_t is the observation vector at time t. The object recognition problem can then be regarded as that of computing argmax_i P(w_i | O), where w_i is the i-th object. Using Bayes rule, we have

P(w_i | O) = P(O | w_i) P(w_i) / Σ_j P(O | w_j) P(w_j).

Thus, for a given set of prior probabilities P(w_i), the most probable object depends only on the likelihood P(O | w_i). To reduce the complexity of estimating the joint conditional probability P(O_1, O_2, ..., O_T | w_i), we use a model λ_i corresponding to each object w_i, i.e. P(O | w_i) = P(O | λ_i).

Thus, in the classification stage, given an unknown sequence O = O_1, O_2, ..., O_T, the probability P(O | λ_i) is computed for each model λ_i, and O is classified into the class whose model shows the highest likelihood P(O | λ_i).

For a given λ, an efficient method to find P(O | λ) is the well known forward-backward algorithm described below.

Forward Algorithm:

Consider the forward variable α_t(i) defined as α_t(i) = P(O_1, O_2, ..., O_t, q_t = s_i | λ), i.e., the probability of the partial observation sequence up to time t and the state s_i at time t, given the model λ. α_t(i) can be computed inductively as follows:

(1) Initialization: α_1(i) = π_i b_i(O_1), for 1 ≤ i ≤ N.

(2) Induction: α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) a_ij ] b_j(O_{t+1}), for t = 1, 2, ..., T-1 and 1 ≤ j ≤ N.

(3) Termination: P(O | λ) = Σ_{i=1}^{N} α_T(i).

Similarly, we can compute the same probability by the backward algorithm,2 using the backward variable β_t(i) = P(O_{t+1}, O_{t+2}, ..., O_T | q_t = s_i, λ), the probability of the observation sequence from t+1 to T given the state s_i at time t and the model λ.
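A direct Python transcription of steps (1)-(3) in vectorized form (our sketch, not the authors' implementation); pi, A and b are assumed to be precomputed NumPy arrays, with b[:, t] holding b_j(O_t):

import numpy as np

def forward_likelihood(pi, A, b):
    N, T = b.shape
    alpha = pi * b[:, 0]                  # (1) initialization
    for t in range(1, T):                 # (2) induction
        alpha = (alpha @ A) * b[:, t]
    return alpha.sum()                    # (3) termination: P(O | lambda)

pi = np.array([0.5, 0.5])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
b = np.array([[0.9, 0.2, 0.8],            # state 1 emissions for O_1..O_3
              [0.1, 0.7, 0.3]])           # state 2 emissions
print(forward_likelihood(pi, A, b))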

4. Feature Extraction

For feature extraction, the outer boundary of an object image is traced from a starting point P_0, moving in the anti-clockwise direction until P_0 is reached again, forming a closed curve C. The point P_0 is selected as the north-west bottom-most object pixel. The curve is first computed in the form of a chain code using the directional codes given in Fig. 1. For example, the chain code of the curve C in Fig. 2 is 455443434332221118818777766765. Now, the lengths corresponding to these codes are not the same: the even codes represent a length of 1 while the odd codes represent a length of √2. We therefore define a derived chain code in which each code represents a nearly equal length. Since √2 is very close to 1.4, we repeat each even code 5 times and each odd code 7 times. For example, from the portion 455443 of the chain code in Fig. 2 we get 444445555555555555544444444443333333... as the derived chain code. It is clear that in the derived chain code each code has nearly the same length. Now, L equidistant points P_i (i = 1, 2, ..., L) on the curve are obtained as follows. P_0 is set to (0,0). The codes in the derived chain code are divided into L equal segments. When the number is not divisible by L, the segments are kept as equal as possible; for example, from 35 we get 6, 6, 6, 6, 6, 5 as the segment lengths. P_1 is obtained from the first segment, P_2 from the second segment, and so on.4 Let θ_i, i = 1, 2, ..., L be the angles that the lines P_{i-1}P_i make with the x-axis. Since C is a closed curve, 0° ≤ θ_i < 360°. For the closed curve C in Fig. 2, the θ_i (for L = 10) are 237.0°, 285.2°, 289.9°, 327.1°, 19.9°, 57.0°, 74.8°, 135.0°, 157.5°, 180.0°.
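The derived-chain-code construction can be sketched in Python as below. Only the 5/7 repetition counts and the equal-segment split come from the text; the direction-vector table for codes 1-8 is our assumption about the layout of Fig. 1, so the printed angles will differ from the paper's example if the assumed layout differs.

import math

# Assumed direction vectors: even codes axis-parallel, odd codes diagonal.
DIRS = {1: (1, 0), 2: (1, 1), 3: (0, 1), 4: (-1, 1),
        5: (-1, 0), 6: (-1, -1), 7: (0, -1), 8: (1, -1)}

def boundary_angles(chain, L=10):
    # Even codes repeated 5 times, odd 7 times, so each repetition covers
    # about 0.2 units of boundary (1/5 per axis step, sqrt(2)/7 per diagonal).
    derived = [c for c in chain for _ in range(5 if c % 2 == 0 else 7)]
    base, rem = divmod(len(derived), L)
    pts, x, y, k = [(0.0, 0.0)], 0.0, 0.0, 0
    for i in range(L):
        size = base + (1 if i < rem else 0)   # segments as equal as possible
        for c in derived[k:k + size]:
            step = 0.2 if c % 2 == 0 else 1.0 / 7.0   # per-component step
            x, y = x + DIRS[c][0] * step, y + DIRS[c][1] * step
        k += size
        pts.append((x, y))
    return [math.degrees(math.atan2(pts[i][1] - pts[i - 1][1],
                                    pts[i][0] - pts[i - 1][0])) % 360
            for i in range(1, L + 1)]

chain = [int(d) for d in "455443434332221118818777766765"]
print([round(a, 1) for a in boundary_angles(chain)])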


Fig. 1. Eight directional codes.

Fig. 2. A closed curve C indicating the boundary.

5. HMM Model Selection

There are several types of HMM models used in pattern recognition. One of the most popular is the left-right Bakis model.5 This model is used when the underlying state sequence index increases (or stays the same) with time. However, this model is not applicable to the present task. Here we use fully connected HMMs, in which every state of the model can be reached (in a single step) from every other state, as shown in Fig. 3.

Fig. 3. Fully connected HMM with 3 states (s_0 to s_2).

6. HMM Parameter Estimation

In this paper, for every sample image in a particular class, a feature vector of length L is extracted. This feature vector is divided into M equal segments, each of length D, so that L = M × D. In our experiment, for a given D, M is computed from C and H, where C is the length of the derived chain code of an image and H is the height of the image. That is, from every sample image in a particular class we get M elementary strokes, say O = O_1, O_2, ..., O_M, each of length D. The stroke frequency distribution of 1200 samples of the character image zero is shown in Fig. 4.

Fig. 4. Stroke frequency distribution.

For estimating the initial parameters of a particular class HMM, all the elementary stroke feature vectors of dimension D have been clustered by the K-means algorithm. The distance in the K-means algorithm is measured by

d_k = Σ_i min( |θ_i - θ_i^{(k)}|, 360° - |θ_i - θ_i^{(k)}| ),

where θ_i^{(k)} is the i-th component of the center of the k-th cluster.
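The point of this distance is that angles live on a circle, so 350° and 10° should be 20° apart; a small Python/NumPy sketch:

import numpy as np

def angular_distance(theta, center):
    diff = np.abs(np.asarray(theta, dtype=float) - np.asarray(center, dtype=float))
    return float(np.minimum(diff, 360.0 - diff).sum())

print(angular_distance([350.0, 10.0], [10.0, 350.0]))  # 20 + 20 = 40.0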

After that, the mean vector and covariance matrix of each cluster are estimated; these are the initial mean vectors and covariance matrices of the HMM. Treating each cluster as a state, the initial state probabilities and state transition probabilities are computed in the following way:

π_i = (No. of occurrences of O_1 ∈ s_i) / (No. of occurrences of O_1)    (1)

a_ij = (No. of occurrences of O_t ∈ s_i and O_{t+1} ∈ s_j, ∀t) / (No. of occurrences of O_t ∈ s_i, ∀t)    (2)

where 1 ≤ i, j ≤ N.

These initial parameters do not give the final HMM. For estimating the final HMM we apply the segmental K-means algorithm.6 The algorithm is given below.

The Segmental K-means Algorithm:

(1) For each sample with observation sequence O = O_1, O_2, ..., O_M, compute the observation symbol probability as follows. For 1 ≤ i ≤ N,

b_i(O_t) = exp{ -1/2 (O_t - μ_i)^T Σ_i^{-1} (O_t - μ_i) } / ( (2π)^{D/2} |Σ_i|^{1/2} ),

where each state is assumed to have a Gaussian distribution.

(2) For each observation sequence, find the corresponding optimal state sequence using the Viterbi algorithm.2 By assigning each observation O_t to the corresponding Viterbi state, we form a new cluster for each state. The mean vectors and covariance matrices of the new clusters are calculated in the following way. For 1 ≤ i ≤ N, compute

μ_i = (1/N_i) Σ_{O_t ∈ s_i} O_t,
Σ_i = (1/N_i) Σ_{O_t ∈ s_i} (O_t - μ_i)(O_t - μ_i)^T,

where N_i is the number of observations assigned to state s_i.

(3) From the optimum state sequences, we again calculate the initial state probabilities and state transition probabilities by equations (1) and (2). The above procedure is repeated until no observation is reassigned to a new state in Step 2.
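A sketch of the Viterbi-alignment core of one segmental K-means pass, under the Gaussian-state assumption above; the function and variable names are ours, and the re-estimation of means, covariances and transition counts (steps 2-3) would reuse the alignment this returns together with initial_probs() above.

import numpy as np

def log_gauss(o, mu, cov):
    # log N(o; mu, cov), computed stably via slogdet and solve
    d = o - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet
                   + len(mu) * np.log(2 * np.pi))

def viterbi_states(seq, log_pi, log_A, mus, covs):
    """Most likely state index for each observation in seq."""
    N, T = len(mus), len(seq)
    emit = lambda t: np.array([log_gauss(seq[t], mus[i], covs[i])
                               for i in range(N)])
    delta = log_pi + emit(0)
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A      # scores[i, j]: state i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + emit(t)
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]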

7. HMM Topology Selection

The HMM topology, defined in this context as the number of states, the number of mixtures per state and the transitions between states, directly influences the modeling capacity of the model. Its choice is crucial for achieving a high performing system. In the literature, various approaches are found, for example, state-splitting techniques,7 state-reducing techniques8 using likelihood as the optimization criterion, and Bayesian techniques.9 But there is no way to make sure that any particular approach yields the optimal model. In our experiment we have considered a single distribution per state. Let λ_0 be the HMM model produced by the segmental K-means algorithm for the character image zero. To choose the optimum number of states for λ_0, we have calculated the likelihood L_0 = Σ P(O | λ_0) over all character images of zero using the forward algorithm, for different numbers of states. The distribution of the likelihood value over the number of states for the character images of zero is shown in Fig. 5.

Fig. 5. Likelihood distribution.

The optimum number of states has been chosen based on the highest value of L_0.

8. Results and Discussions

Initial experiments have been carried out using both Bangla numerals (10 classes) and basic characters (41 classes). The number of different symbols in the Bangla basic character set is 50, but in terms of distinct shapes there are only 41 classes instead of 50. The number of optimum states and the recognition rate for Bangla numerals are shown in Table 1. The recognition rates for Bangla basic characters are found to be quite satisfactory: for the 41 basic characters, the recognition rate is in the range of 62.3% to 85.7%.

Table 1. Recognition rates for Bangla numerals

Character   Model   States   Accuracy (%)
0           λ_0     13       96.7
1           λ_1     12       89.1
2           λ_2     16       95.2
3           λ_3     6        88.8
4           λ_4     6        95.6
5           λ_5     16       94.3
6           λ_6     11       89.2
7           λ_7     11       95.7
8           λ_8     13       95.9
9           λ_9     18       92.1

References

1. S. Loncaric, Pattern Recognition 31, 983 (1998).
2. L. Rabiner, Proceedings of the IEEE 77, 257 (1989).
3. H. Xue and V. Govindaraju, IEEE Trans. on PAMI 28, 458 (2006).
4. S. K. Parui and D. D. Majumder, Pattern Recognition Letters 3 (1983).
5. R. Bakis, Continuous speech word recognition via centisecond acoustic states, in Proc. ASA Meeting (Washington, DC), 1976.
6. R. Dugad and U. B. Desai, A Tutorial on Hidden Markov Models, Technical Report No. SPANN-96.1.
7. H. Singer and M. Ostendorf, Maximum likelihood successive state splitting, in Proc. of ICASSP, 1996.
8. A. Stolcke and S. Omohundro, Hidden Markov model induction by Bayesian model merging, in Advances in NIPS 5, 1993.
9. D. Li, A. Biem and J. Subrahmonia, HMM topology optimization for handwriting recognition, in Proc. of ICASSP, 2001.


PART K

Speech and 1-D Signal Analysis



Automatic Continuous Speech Segmentation Using Level Crossing Rate

Nagesha and G. Hemantha Kumar

Department of Studies in Computer Science, University of Mysore, Mysore-570006, INDIA

E-mail: [email protected]

In this paper, a new algorithm to automatically segment a continuous speech signal into phonemes is presented. The proposed method is based on the Average Level Crossing Rate (ALCR) of the speech signal, which is motivated by auditory models.1 These models suggest that time information is well suited to handling non-stationary signals. Based on this assumption, we propose a non-uniform quantization function which dynamically assigns the levels depending on the importance of the given amplitude range. The parameters used to approximate the importance of a given amplitude segment are estimated by the incomplete beta function and logarithmic rules. Experiments conducted on the TIMIT database show good agreement of the obtained segmentation with the TIMIT labeling. The proposed method has a success rate of about 79% for a 20 ms tolerance using the logarithmic rule method.

Keywords: Speech segmentation; Level crossing

1. Introduction

Whenever an auditory nerve system is modeled, a problem of frequency and amplitude measurement arises. However, the Level Crossing approach has emerged as a new, theoretically powerful framework for frequency and amplitude analyses of speech signals, since it preserves magnitude and phase information. Several schemes have been investigated for segmentation of continuous speech signals based on either Zero Crossing or Level Crossing techniques.2-5

Among them, U-LCR (Uniform-Level Crossing Rate) and NU-LCR (Nonuniform-Level Crossing Rate), proposed by Anindya,2 are robust enough to segment speech signals in noisy environments. However, certain problems arise when segmenting consonants. The main type of consonants, called stop consonants, composed of /t/, /d/, /p/, /b/, /k/, and /g/, occur frequently in natural speech. A stop generally contains a weak closure and a burst. The segmentation performance on stop consonants is found to be poor with the U-LCR and NU-LCR schemes specifically. This is due to the fact that the noise robustness scheme ignores the amplitude range over which the noise PDF (Probability Density Function) is maximum. The noise PDF is high near the zero crossings. Hence, important information, which could have been used in segmenting stops, is lost. Thus, the noise robustness schemes NU-LCR and U-LCR are not ideal solutions for this type of situation.

Enhancing the speech signal reduces the noise level. This approach has the capability to process the stops and bursts as well. Another advantage of using a signal enhancement strategy is that the segmentation performance increases under noisy conditions.

From the viewpoint of speech signal processing, the formulation of multiple 'Level Crossing' can provide intensity information, which may be useful for speech segmentation. However, determining the number of levels properly is very important as it has a huge impact on the performance. Unfortunately, there is no theory available to determine those values. In this paper, we propose methods based on the Incomplete Beta Function and logarithmic rules to decide the number of levels depending on the signal's Cumulative Distribution Function (CDF).

These methods of estimating the levels based on incomplete beta functions (IBF) and logarithmic rules (LR) lead to an improved segmentation of speech signals. The validity of using these rules is experimentally proven by showing that speech segmentation performs better with the Nonuniform-Level Crossing Rate than with the Uniform-Level Crossing Rate. The proposed method uses a computationally efficient noise estimation algorithm for speech enhancement. The advantage of this speech enhancement method over others is that its computational complexity is very low, and speech enhancement is an integral part of the noise estimation algorithm.

This paper is organized as follows. Section 2 defines the Average Level Crossing Rate. Section 3 gives the analysis of the proposed level computation algorithms. Section 4 presents the proposed method in detail, and Section 5 presents our experimental results and analysis. Conclusions are drawn at the end.

2. Definition of Average Level Crossing Rate

Level Crossing Analysis represents an approach to the interpretation and characterization of time signals by relating frequency and amplitude information. The rate of level crossing of a signal is defined as the number of crossings per unit time of a level; this yields the frequency associated with that level. The expected crossing rate of level l by a signal s(t) is given by6,7

C_l = ∫_{-∞}^{∞} |x| p_{s,ṡ}(l, x) dx    (1)

where p_{s,ṡ}(l, x) is the joint probability density function of the signal s(t) and its derivative. The number of crossings during time T is

N_l(s; t) = Σ_{i=1}^{n} C_l(s; I_{ni})    (2)

where

C_l(s; I_{ni}) = 1 if ( s(iT/n) - l )( s((i-1)T/n) - l ) < 0, and 0 otherwise,

and the n subintervals are defined by

I_{ni} = ( (i-1)T/n, iT/n ], i = 1, 2, 3, ..., n.

Average Level Crossing aims at smoothing the crossing rate obtained for each member of a set of levels over a duration S, converting the two-dimensional level crossing profile into a one-dimensional profile which is an ensemble of all levels. We define the Average Level Crossing rate during time T as

ALCR(T) = Σ_{β=1}^{L} ∫_{a=T-S}^{T} C_{l_β}(s; I_{ni}) da    (3)

where L is the cardinality of the set of levels. Since the number of crossings is a non-negative integer, the observed Average Level Crossing rate obtained from Eq. (3) can be graphically represented as a smooth non-negative curve which can be used for labeling the phoneme boundaries.
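A discrete Python sketch of these quantities (ours; sample-to-sample sign changes stand in for the subinterval indicator of Eq. (2), and exact zero touches are not counted):

import numpy as np

def level_crossings(x, level):
    # C_l over a window: count sign changes of s - l between samples
    d = np.sign(np.asarray(x) - level)
    return int(np.sum(d[1:] * d[:-1] < 0))

def alcr(x, levels, S):
    # sum the per-level counts over a trailing window of S samples
    x = np.asarray(x)
    return np.array([sum(level_crossings(x[t - S:t], l) for l in levels)
                     for t in range(S, len(x) + 1)])

sig = np.sin(np.linspace(0, 4 * np.pi, 400))
print(alcr(sig, levels=[-0.5, 0.0, 0.5], S=100)[:5])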

3. Level computation rules

The lack of data to decide the exact number of levels for a given amplitude range creates problems in selecting the number of levels; in such cases, an expert has to assume the levels. For this reason, the flexible incomplete beta distribution, capable of attaining a variety of shapes, and the logarithmic curve profile can be used in level crossing applications. Because of its extreme flexibility, the incomplete beta distribution appears ideally suited for computing the levels for a specific amplitude region of a speech signal.

3.1. Incomplete Beta Function

The beta distribution is a continuous distribution defined over a finite range with both end points fixed at exact locations, and it belongs to a flexible family of distributions. A generalization of the incomplete beta function is defined by8

B_z(\alpha, \beta) = \int_0^z u^{\alpha-1} (1-u)^{\beta-1} \, du \qquad (4)

= z^{\alpha} \left[ \frac{1}{\alpha} + \frac{1-\beta}{\alpha+1} z + \cdots + \frac{(1-\beta)\cdots(n-\beta)}{n! \,(\alpha+n)} z^n + \cdots \right] \qquad (5)

The incomplete beta function is defined by

I(z, \alpha, \beta) = \frac{B_z(\alpha, \beta)}{B(\alpha, \beta)}, \qquad \alpha > 0, \; \beta > 0, \; 0 \le z \le 1 \qquad (6)

Eq. (6) has the limiting values I_0(\alpha, \beta) = 0 and I_1(\alpha, \beta) = 1. The shape of the incomplete beta function obtained from Eq. (6) depends on the choice of its two parameters \alpha and \beta. The parameters can be any real numbers greater than zero; depending on their values, the incomplete beta function will have the 'U', the 'J', the 'triangle' or the general 'bell' shape of a unimodal function. Estimating these parameters is a challenge, since they control the number of levels for a given amplitude range of a speech signal. Since the incomplete beta function is expected to describe the significance of the amplitude range, we select \alpha and \beta such that the incomplete beta curve resembles the behavior of the speech signal amplitude.

The subjective information needed to determine the two incomplete beta parameters \alpha and \beta, which describe a unique beta curve, is derived from the speech signal CDF using a maximum likelihood estimation approach.


Fig. 1. Incomplete beta function plot of a sample CDF for 64 levels.

The estimated parameters \alpha and \beta, along with a linear vector, are used to compute I(z, \alpha, \beta). The fitted incomplete beta curve acts as the look-up table for assigning the number of levels. The resultant curve has a rotated-S shape with high slope at the corners: the slope increases steadily at the beginning of the curve, then after a certain point it decreases and remains nearly constant, and it increases again near the end of the curve. The selection of the rotated-S shaped curve is based on the assumption that fewer levels are needed in the amplitude regions where the amplitude activity of the signal is low, and more levels are needed in the amplitude regions where the signal amplitude activity is high. Fig. 1 shows the incomplete beta function plot as a function of slopes and levels, where a substantial increase in levels is observed for higher slopes.
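As a rough illustration, such a slope-to-levels look-up table could be built with the regularized incomplete beta function, as in the Python sketch below; the parameter values, the 64-level ceiling and the normalization of slopes to [0, 1] are illustrative assumptions, not the authors' exact settings.

import numpy as np
from scipy.special import betainc  # regularized incomplete beta I(z; alpha, beta)

def ibf_level_table(alpha=0.5, beta=0.5, max_levels=64):
    """Map CDF slopes in [0, 90] degrees to level counts via I(z; alpha, beta).

    alpha and beta would come from a maximum likelihood fit to the speech
    signal's CDF; here they are free parameters of the sketch. Values below
    1 give the rotated-S profile with steep ends described above.
    """
    slopes = np.arange(91)                 # one table entry per degree of slope
    z = slopes / 90.0                      # normalize slope to [0, 1]
    curve = betainc(alpha, beta, z)        # monotone rotated-S curve in [0, 1]
    return np.maximum(1, np.round(curve * max_levels).astype(int))

table = ibf_level_table()
print(table[5], table[45], table[85])      # few levels at low slope, many at high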

3.2. Logarithmic rule

A logarithmic model is used because the ear responds logarithmically to acoustic power. The proposed method is based on the fact that the shape of the logarithmic curve is slowly varying; hence a presentation of slopes on a logarithmic scale is useful when the levels have to be increased steadily with increasing slopes. We have observed that the logarithmic scale fits the model well and the approximation is acceptable. Fewer levels are assigned for smaller slopes, and the number of levels increases with the slope. The adopted profile of the logarithmic curve is shown in Fig. 2 for 64 levels.


Fig. 2. Logarithmic rule for 64 levels.
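A minimal sketch of such a logarithmic allocation follows; the paper gives no explicit formula, so the log1p form and the 64-level ceiling are assumptions chosen only to reproduce the slowly varying profile.

import numpy as np

def log_level_table(max_levels=64):
    # Assign level counts that grow logarithmically with CDF slope (0-90 deg).
    slopes = np.arange(91)
    levels = max_levels * np.log1p(slopes) / np.log1p(90)
    return np.maximum(1, np.round(levels).astype(int))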

3.3. Adapting the rules for level computation

The dynamic amplitude range is converted into dynamic slopes so that, for the different slopes of the amplitude regions obtained from the CDF, we assign the number of levels using the incomplete beta function or the logarithmic rule. The maximum number of levels possible for a unit amplitude segment is fixed to 16, 32, 128 or 256; these values were chosen empirically after conducting several experiments with different threshold values and different sets of speech signals. By using incomplete beta functions, the problem of estimating the number of levels for different types of phonemes is resolved. Furthermore, these level allocation rules are well suited for deciding the number of levels, since they assign the number dynamically for each speech signal.

4. Segmentation algorithm

We briefly describe how the proposed schemes are used to segment the speech signal.

(1) If the input speech signal is noisy, enhance it.

(2) Find the PDF of the speech signal s[n] using the signal histogram.

(3) Estimate the number of levels for the amplitude range [-1, 1] using the proposed incomplete beta function or logarithmic rule based approach.


(4) Find the amplitude regions which are dynamic using the speech signal CDF.

(a) In order to locate the dynamic segments of amplitude, we find the derivative of the CDF of the speech signal. If the difference between the rates of change of a particular amplitude segment and the previous segment is less than a threshold, we append the current segment to the previous amplitude segment; otherwise, we create a new amplitude segment. This procedure can be interpreted as clustering amplitude regions with almost similar rates of change.

(b) In addition to locating the dynamic amplitude segments, we also fit a straight line in each region in order to approximately predict the rate of change for that region. The slope of the fitted line best describes the behavior of the amplitude segment.

(c) Select the number of levels for each segment using the look-up table formed by the incomplete beta function or the logarithmic rule. The look-up table gives the levels for every slope in the range [0, 90].

(d) Calculate the average level crossing for each level determined by the incomplete beta function or logarithmic rule, based on the proposed average level crossing approach.

(e) Smooth the average level crossing data using smoothing filters. Smoothing helps to locate the local minima precisely and can reduce the level of noise without biasing the values obtained. The local minima represent the segmentation points of the given continuous speech signal.
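As a rough illustration of the core of this procedure, the sketch below counts level crossings frame by frame and sums them over the chosen level set; the frame and hop lengths are assumed values, and the level set is whatever the rules of Section 3 produce.

import numpy as np

def level_crossings(frame, level):
    # Count sign changes of (frame - level), i.e. crossings of that level.
    signs = np.sign(frame - level)
    return int(np.count_nonzero(np.diff(signs)))

def average_level_crossing(s, levels, frame_len=256, hop=128):
    # One-dimensional ALCR profile: crossings summed over all levels per frame.
    profile = []
    for start in range(0, len(s) - frame_len + 1, hop):
        frame = s[start:start + frame_len]
        profile.append(sum(level_crossings(frame, l) for l in levels))
    return np.asarray(profile, dtype=float)

The local minima of the smoothed profile are then taken as phoneme boundary candidates.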

5. Experimental results and analysis

The performance of the proposed method was evaluated experimentally using a selection of 100 continuous speech signals from the TIMIT database. The total number of speakers was twenty, of which ten were male and the remaining ten were female. The noise was taken from the NOISEX'92 database. Additionally, the classification performance of the proposed method is compared with the TIMIT manual labeling itself for a more meaningful comparison. The proposed method was combined with a computationally efficient Energy and Zero Crossing based Speech Enhancement Algorithm.9 The speech enhancement algorithm estimates the background noise during the pauses of the speech signal. The ratio between energy and zero crossings is used to detect the pauses and is calculated as

r(m) = \frac{\sum_{n=1}^{N} x^2[n + Nm]}{\sum_{n=1}^{N} |sgn(x[n + Nm]) - sgn(x[(n+1) + Nm])|}

where x[n] is the discrete-time signal reconstructed using the multi-band spectral subtraction method, in which the 4 kHz bandwidth is divided into 8 equally spaced bands with 16 STDFT bins in each band, N = 256 is the number of samples per frame, m is the frame number, and

sgn(x[n]) = \begin{cases} 1 & x[n] \ge 0 \\ -1 & x[n] < 0 \end{cases}

The following parameters are used in the noise enhancement algorithm: S = 1.26, a = 0.1. For smoothing, the weighting coefficients used are 0.0357, 0.2411, 0.4464, 0.2411 and 0.0357. To fit a line to an amplitude segment in Step 4(b), we have used a robust line-fitting algorithm.

Sometimes the speech signal is over-segmented with false insertions; this can be avoided using an appropriate smoothing technique. For the purpose of this study, we have applied the Savitzky-Golay filter.8

The Savitzky-Golay filter is a particular type of low-pass filter, well adapted for data smoothing. Rather than having their properties defined in the Fourier domain and then translated to the time domain, Savitzky-Golay filters derive directly from a particular formulation of the data smoothing problem in the time domain. The idea of Savitzky-Golay filtering is to find filter coefficients c_n that preserve higher moments, approximating the underlying function within the moving window not by a constant but by a polynomial of higher order. The coefficient c_n is given by

c_n = \sum_{m=0}^{M} \{ (A^T A)^{-1} \}_{0m} n^m

where M is the degree of the polynomial a_0 + a_1 i + \cdots + a_M i^M fitted to the values f_{-n_L}, \ldots, f_{n_R}, and -n_L \le n \le n_R.
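With SciPy, for example, the smoothing and minima-picking stage could look like the sketch below; the window length and polynomial order are illustrative choices, not the paper's exact design.

import numpy as np
from scipy.signal import savgol_filter, argrelmin

def boundaries_from_alcr(alcr, window_length=5, polyorder=2):
    # Savitzky-Golay smoothing followed by local-minima picking.
    smooth = savgol_filter(alcr, window_length=window_length, polyorder=polyorder)
    minima = argrelmin(smooth)[0]   # frame indices of candidate boundaries
    return smooth, minima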

Experiments were conducted on clean and noisy signals from TIMIT and NOISEX'92. An interesting pattern of behavior found while analyzing the results is shown in Table 1. For small tolerances (5-10 ms) and clean signals, higher numbers of levels give better results.


Table 1. Percentage of segmentation errors for tolerance values (5, 10, 20, 40 ms) with various levels (16, 32, 128, 256). Noisy signal is generated by adding clean speech signal taken from TIMIT database with F-16 cockpit noise taken from NOISEX'92 database.

Levels  IBF(5ms)  LR(5ms)  IBF(10ms)  LR(10ms)  IBF(20ms)  LR(20ms)  IBF(40ms)  LR(40ms)

Clean signal
16      29.11     31.88    45.46      47.55     62.13      63.44     69.58      70.22
32      34.13     39.12    52.89      53.06     69.51      72.03     76.92      78.18
128     34.52     40.65    55.60      57.38     73.18      72.63     79.27      78.31
256     36.12     41.45    58.97      60.08     78.06      79.05     83.43      84.09

Noisy signal (SNR = 25 dB)
16      25.92     27.02    45.41      46.33     61.33      64.91     68.71      68.16
32      30.04     32.20    47.87      47.54     60.64      63.11     71.98      72.44
128     29.30     31.44    46.20      48.34     60.41      64.19     71.67      73.38
256     25.71     24.17    42.03      44.07     58.34      63.73     74.89      78.64


Fig. 3. Example of segmentation for the clean TIMIT speech signal SA1.WAV with incomplete beta function.

For higher tolerances (>35 ms, although not all data is shown) and clean signals, fewer levels produce the best results. However, when the signal is noisy, better results are obtained for small tolerances with the minimum number of levels. Although somewhat surprising, these results contradict the general assumption that more levels are best for phonetic segmentation. This phenomenon is the result of the sensitivity of higher numbers of levels to noise conditions: in noisy conditions more levels induce more false insertions and deletions.

The proposed methods estimate extra segmentation points compared to the manual TIMIT labeling of speech signals under various noisy conditions; segmentation points of very short duration are ignored by the proposed methods. The performances of the proposed algorithms under different noise levels are shown in Table 2. The experimental results in Table 2 indicate that the major drawback of the incomplete beta function is its extreme sensitivity to the near-zero-crossing locations compared to the logarithmic approach. For instance, segmentation of a speech signal with 25 dB of F-16 cockpit noise using the incomplete beta function produces more insertions than the logarithmic approach. This phenomenon is a direct consequence of the allocation of more levels to the near-zero-crossing locations in the incomplete beta function approach, and illustrates that under noisy conditions the performance of the segmentation algorithm decreases with more false insertions and deletions. Fig. 3 shows the segmentation points for a speech signal using the proposed method as well as the manual TIMIT labeling with the incomplete beta function rule.

The proposed method has a success rate of about 79% when the input speech signal is clean, with 20 ms tolerance, using the logarithmic approach. The quality of the enhanced speech contributes to the performance under noisy conditions. The performance also depends on the smoothing algorithm, which helps in labeling the local minima, because over-segmentation reduces the performance. Comparing the proposed method with the TIMIT phonetic labeling, the difference in segmentation is less than 45% above 10 dB SNR, and the segmentation error rate is 49% at 5 dB SNR. In addition, an overall 8% insertions and deletions have been observed. These results certify that the proposed method is robust to noise.


Table 2. Segmentation results for tolerance level 20 ms and for various noise levels (clean, 5 dB, 10 dB, 20 dB). (Clean speech signal taken from the TIMIT database is corrupted with F-16 cockpit noise taken from the NOISEX'92 database.)

SNR    IBF(16)  LR(16)  IBF(32)  LR(32)  IBF(128)  LR(128)  IBF(256)  LR(256)
Clean  62.13    63.44   69.51    72.03   73.18     72.63    78.06     79.05
5 dB   38.37    41.17   36.61    38.91   31.44     41.88    28.69     35.49
10 dB  56.11    60.45   52.23    54.19   52.10     58.13    50.33     54.27
20 dB  60.08    61.12   58.94    61.55   55.36     62.91    58.77     63.43

6. Conclusions

We introduced new level allocation methods for speech segmentation which are adaptive as well as robust to noise. Also, a new continuous speech segmentation algorithm based on the average level crossing rate has been introduced. The experiments conducted on the TIMIT database show that it can lead to a better characterization of time-domain speech signals. In the future, the focus of our research will be on the effect of the level values in the presence of additive noise.

References

1. O. Ghitza, IEEE Transactions on Speech and Audio Processing 2, 115 (1994).
2. S. Anindya and T. V. Sreenivas, Automatic speech segmentation using average level crossing rate information, in Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP'05), Mar 2005.
3. C. Panagiotakis and G. Tziritas, IEEE Transactions on Multimedia 7, 155 (2005).
4. F. Daaboul and J. Adoul, Parametric segmentation of speech into voiced-unvoiced-silence intervals, in Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP'77), May 1977.
5. R. James and W. Z. Victor, Multi-level acoustic segmentation of continuous speech, in Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP'88), April 1988.
6. D. Middleton, IEEE Transactions on Information Theory 6, 1367 (1988).
7. R. J. Mitchell and R. C. Gonzalez, Multilevel crossing rates for automated signal classification, in Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP'78), 1978.
8. W. H. Press, S. A. Teukolsky, W. T. Vetterling and B. P. Flannery, Numerical Recipes in C++, 2nd edn. (Cambridge University Press, 2002).
9. V. G. Reju and T. Y. Chow, A computationally efficient noise estimation algorithm for speech enhancement, in Proc. IEEE Int. Conference on Circuits and Systems, December 2004.


Automatic Gender Identification Through Speech Analysis

Anu Khosla* and Devendra Kumar Yadav

Scientific Analysis Group, DRDO, Metcalfe House, Delhi 110054

E-mail: [email protected]*

This work presents an automatic, language independent, gender identification technique based on the analysis of the speech signal. Two methods were considered: one involving a pitch frequency threshold and the other using artificial neural networks with Mel Frequency Cepstral Coefficients (MFCC) as features. The final classifier combines both of the above mentioned techniques. The pitch is determined by computing the sub-harmonic to harmonic ratio, and a probabilistic neural network is trained using mel-frequency cepstral coefficients. The classifier was tested on sentences in five regional Indian languages, namely Hindi, Bengali, Manipuri, Urdu and Kashmiri, apart from English. Scores of the order of 97% for the test sets were obtained for speech of 4 sec duration.

Keywords: MFCC, Pitch, Probabilistic Neural Network

1. Introduction

Gender Identification1 based on the voice of a speaker identifies whether a speech signal is uttered by a male or a female speaker. Automatically detecting the gender of a speaker has several potential applications. Generally the parameterization technique used for speaker dependent or independent recognition is the same. However, in the case of speaker independent speech recognition the performance of the recognizer improves if separate male and female acoustic phonetic models are adopted.2 Gender dependent models are more accurate than gender independent ones for applications like Automatic Speech and Speaker Recognition3 as they can improve the performance by limiting the search space.

In fact, Gender Identification is generally treated as a pre-processor for, or a by-product of, these problems; therefore there are very few published papers on Gender Identification. Pitch information4 is normally used for the problem of gender identification. However, pitch estimation5 relies considerably on the speech quality. A general audio classifier approach using MFCC features and Gaussian Mixture Models (GMM) as a classifier was followed in,6 with 73% classification accuracy, which is not promising. This paper describes a novel approach for voice-based Gender Identification independent of language, by combining a pitch based approach with an ANN based technique.

Section 2 describes the speech analysis required for gender identification, and the proposed classifier is described in Section 3. Section 4 gives the details of the training and testing of the classifier, and the results are discussed in Section 5.

2. Speech Analysis

This section describes the speech analysis7 needed to extract the features required for Gender Identification. The potential general features that can be used are the pitch and the MFCC. The pitch levels are what help a listener correctly identify a speaker's gender, and MFCCs are features which represent the physiological voice parameters in a very efficient way. The mel filter bank simulates the critical band filters of the hearing mechanism.

2.1. Pitch Estimation

The most obvious difference between the male and female voice is the fundamental frequency, or pitch. Due to the higher mass of a male's vocal folds, the average speaking fundamental frequency for males varies between 100-200 Hz, while the average for females varies between 200-300 Hz. Pitch, if accurately estimated, can therefore be a very good discriminating feature between genders.

Accurate pitch determination is one of the most difficult problems in speech analysis. Automatic pitch determination algorithms make various errors, such as pitch doubling and pitch halving. In the current algorithm, instead of looking for one single peak in the frequency domain, a Subharmonic-to-Harmonic Ratio (SHR) is computed to determine the pitch.8 First, the linear frequency scale is transformed into a logarithmic scale, and peaks are found in the spectrum.


The sum of the peak amplitudes within the pitch range is computed, as is the sum of the peaks at half the peak frequency, and the pitch is computed based on the ratio of these amplitudes. The sub-harmonics are then examined: if the SHR is less than a certain threshold value, the sub-harmonics are weak in amplitude and the peak frequency is taken as the pitch; otherwise half that frequency corresponds to the pitch. Based on the pitch value the decision is made for gender.

Gender Identification experiments were performed using pitch estimates for each frame of speech and then taking a majority logic decision for gender.
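The frame-level decision logic could be rendered as in the sketch below; the 195 Hz threshold is an assumed midpoint of the ambiguous 180-210 Hz region discussed in Section 3, and the SHR pitch tracker itself is abstracted away (unvoiced frames are assumed to carry a pitch of 0).

import numpy as np

def gender_from_pitch(frame_pitches_hz, threshold_hz=195.0):
    # Majority-logic decision over per-frame SHR pitch estimates.
    pitches = np.asarray(frame_pitches_hz)
    female_votes = int(np.sum(pitches > threshold_hz))
    male_votes = int(np.sum((pitches > 0) & (pitches <= threshold_hz)))
    return 'female' if female_votes > male_votes else 'male'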

2.2. ANN model

Artificial Neural Networks (ANNs)9 have gained prominence in the area of pattern recognition due to their relatively simple implementation, inherently parallel algorithm (making parallel implementation a natural progression), robustness to noise and self-learning ability. The Probabilistic Neural Network (PNN)10 is based on well-established statistical principles derived from the Bayes decision strategy. An advantage of the PNN is that it is guaranteed to approach the Bayes optimal decision surface provided that the class probability density functions are smooth and continuous. The only drawback of the PNN compared to other networks is the extra computation cost.11

The activation functions used are spherical Gaussian radial basis functions centered at each training vector. The likelihood of an unknown vector belonging to a given class can be expressed as

f_i(x) = \frac{1}{(2\pi)^{p/2} \sigma^p M_i} \sum_{j=1}^{M_i} \exp\left( -\frac{(x - x_{ij})^T (x - x_{ij})}{2\sigma^2} \right) \qquad (1)

where i is the class number, j is the pattern number, x_{ij} is the jth training vector from class i, x is the test vector, M_i is the number of training vectors in class i, p is the dimension of vector x, \sigma is the smoothing factor (the standard deviation), and f_i(x) is the sum of multivariate spherical Gaussians centered at each of the training vectors x_{ij} for the ith class probability density function (pdf) estimate. Classification decisions are consequently made in accordance with the Bayes decision rule, which is

d(x) = C_i \quad \text{if} \quad f_i(x) > f_k(x) \text{ for } k \ne i \qquad (2)

where C_i is the class.

For speech, the temporal order of the vectors is discounted and utterances are taken as a collection of vectors. Each of the vectors in the training set is then labeled with the class of the utterance, as in a standard PNN, belonging either to class 1 or to class 2. The likelihoods over all the vectors are summed, and the most probable class is the classification of the utterance. Classification of unknown utterances is then a generalization of the classification procedure of the classic PNN. Typically, a single unknown input vector has its class likelihood calculated from all the training vectors in the PNN.
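A minimal NumPy rendering of Eqs. (1)-(2) for reference; the value of sigma and the data layout are placeholders.

import numpy as np

def pnn_class_likelihood(x, train_vectors, sigma):
    # f_i(x): average of spherical Gaussians centered at each training vector.
    X = np.asarray(train_vectors)              # shape (M_i, p)
    p = X.shape[1]
    d2 = np.sum((X - x) ** 2, axis=1)          # squared Euclidean distances
    norm = (2 * np.pi) ** (p / 2) * sigma ** p * X.shape[0]
    return np.exp(-d2 / (2 * sigma ** 2)).sum() / norm

def pnn_classify(x, class_train_sets, sigma=0.5):
    # Bayes decision rule of Eq. (2): pick the class with the largest f_i(x).
    scores = [pnn_class_likelihood(x, Xi, sigma) for Xi in class_train_sets]
    return int(np.argmax(scores))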

3. Proposed Classifier

The proposed classifier combines the two techniques described above for gender identification, so that the errors of each are averted. The block diagram of the proposed classifier is given in Fig. 1.

Fig. 1. Block diagram of the proposed classifier (input speech -> feature extraction -> SHR based pitch extraction and trained PNN model -> classifier).

The input speech is divided into frames of 10 msec


and an FFT is performed. The peaks of the FFT are used to compute the SHR. To compute the MFCC, a mel-warped filter bank is applied in the frequency domain before taking the log and the DFT.

The pitch is computed for every frame, and then a majority logic decision is taken for the 4 sec segment. If the pitch is below a threshold value the speaker is considered male, and if it is above the speaker is considered female.

The MFCC feature vector is the input to the neural network with 80 neurons in the first layer. In PNN when an input is presented, the first layer computes distances from the input vector to the training input vectors, and produces a vector whose elements indicate how close the input is to a training input. The second layer sums these contributions for each class of inputs to produce as its net output a vector of probabilities. The output of the second layer picks the maximum of these probabilities, and produces a 1 for that class and a 0 for the other classes.

The classifier combines the inputs from both the ANN and the pitch estimate to classify the speaker as male or female. If the pitch value is greater than 210 Hz or below 180 Hz, the weight given to the pitch based classifier is higher, whereas for pitch values between 180-210 Hz the weight of the PNN output is higher.
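Read as code, the fusion rule might look like the following sketch; the numeric weights and the 195 Hz pitch decision point are illustrative assumptions, since the paper specifies only which classifier dominates in each pitch region.

def fuse_gender(pitch_hz, pnn_female_prob):
    # Weighted combination of the pitch-threshold and PNN outputs.
    pitch_female = 1.0 if pitch_hz > 195.0 else 0.0   # assumed decision point
    if pitch_hz > 210.0 or pitch_hz < 180.0:
        w_pitch, w_pnn = 0.8, 0.2   # pitch is reliable far from the boundary
    else:
        w_pitch, w_pnn = 0.2, 0.8   # ambiguous zone: trust the PNN more
    score = w_pitch * pitch_female + w_pnn * pnn_female_prob
    return 'female' if score >= 0.5 else 'male'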

4. Training and Testing of the Classifier

Initial classification was attempted using the pitch based and the ANN based approaches separately for different frame sizes. With the pitch based approach, 70% correct classification could be obtained with just 500 msec of speech, and the classification scores improved as the duration was increased.

Different ANN models were tried for classification. A multilayer feed-forward network with 12 neurons in the first layer and tansig as the activation function was tried, trained with an adaptive learning rate and the steepest descent algorithm. This model performed well for speech of 4 sec duration, but the probabilistic model performed better for the same duration.

The PNN gives 93.33% correct classification when two speakers were used for training the network. The result improves to 95.55% with four and 96.77% with six speakers used for training, whereas the multilayer feedforward network with back propagation gives 75% correct classification with the same number of speakers and the same duration.

The final version fused the pitch based approach with the probabilistic neural network based classifier. The classifier was trained using 4 sec of speech from six speakers for each class. The recording was done in a normal office environment at a 16 kHz sampling rate. The utterances were in the Hindi language.

Testing was done on 250 utterances of 4 sec each, spoken by 120 different speakers in the Hindi, English, Urdu, Manipuri, Kashmiri and Bengali languages. The results are shown in Table 1.

5. Results and Discussions

The pitch method results in a 90% accuracy rate for gender, whereas the PNN gives 96.77% with six speakers. When the pitch based classifier and the ANN based classifier were fused, 97.5% correct classification was achieved for Gender Identification. Table 1 shows the final results obtained.

Table 1. Confusion Matrix

Gender   Male    Female
Male     98.5%   3.5%
Female   1.5%    96.5%

It was observed that the results were poor for slow speakers. This is expected, as the pause regions do not provide sufficient information about the speaker. By using a pre-processor to remove the pause regions prior to classification, the results may improve further.

The performance of the classifier may deteriorate in highly noisy environment as accurate determination of pitch may not be possible and MFCC may also be affected. To cater for such situations the classifier would have to be trained accordingly.

Acknowledgments

We are thankful to our Director, Dr. P. K. Saxena and Divisional Head Dr. S. S. Bedi for encouraging us to carry out this work and allowing us to present this paper.

References

1. Parris E. S., Carey M. J.: Language Independent Gender Identification, Proceedings of IEEE ICASSP, pp. 685-688, 1996.


2. Rivarol Vergin, Azarshid Farhat and Douglas O'Shaughnessy: Proceedings of IEEE ICASSP, pp. 1081-1084, 1996.

3. L. Rabiner and B. Juang Fundamentals of Speech Recognition, Prentice Hall, 1992.

4. Xuejing Sun and Yi Xu: Perceived Pitch of Synthesized Voice with Alternate Cycles, Journal of Voice, Vol. 16, No. 4, pp. 443-459, 2002.

5. L. R. Rabiner and R. W. Schafer Digital Processing of Speech Signals, Prentice Hall, 1992.

6. Tzanetakis G., Cook P: Musical genre classification of audio signals IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, July 2002.

7. J. R. Deller, J. G. Proakis and J. H. L. Hansen Discrete-Time Processing of Speech Signals, Prentice Hall,1993.

8. D. O'Shaughnessy: Speech Communication: Human and Machine, Addison Wesley, 1987.

9. Simon Haykin: Neural Networks: A Comprehensive Foundation, Macmillan College Publishing Company, 1994.

10. Specht D. F: Probabilistic Neural Networks, IEEE Transactions on Neural Networks, Vol 3, pp. 109-118, 1990.

11. Yu Hen Hu and Jenq Neng Hwang Handbook of Neural Network Signal Processing, CRC Press


Error-Driven Robust Particle Swarm Optimization for Fuzzy Rule Extraction and Structure Estimation

Sumitra Mukhopadhyay

Institute of Radio Physics and Electronics University of Calcutta

E-mail: [email protected]

Ajit K. Mandal

ETCE Department Jadavpur University

Kolkata - 700032, India E-mail: [email protected]

In this paper we develop a Hierarchical Self-Organized Network that simultaneously evolves the structure and the parameters of a fuzzy rule base from input-output data. Error-driven Robust Particle Swarm Optimization (ERPSO) learning has been employed for fine-tuning the net parameters. A Multidimensional Cross-over Vector has been used to achieve a trade-off between the residual error and the number of fuzzy rules. Experiments conducted with standard benchmark problems show the effectiveness of the method, with a small number of rules along with a comparable estimation error.

Keywords: Error-driven RPSO; Multidimensional Cross-over Vector

1. Introduction

Function approximation is a method used for evaluating the relationship between input-output data pairs and for determining the best-fit model describing that relationship, in the fields of pattern recognition, diagnosis, expert system design and many disciplines of engineering. A neural network (NN) can be used in this regard, as it has the property of highly accurate approximation of the non-linear mapping embedded in input-output data pairs.

To design a NN from scratch, the data can be clustered by a set of hyperboxes, and fuzzy clustering classification is used to avoid misclassification.1,2 Fuzzy ID-3 based methods3 and orthogonal transforms together with clustering are also widely used. Zarandi et al.4 have shown that input selection, knowledge representation and approximate reasoning together in a systematic approach can be used as an efficient tool for rule extraction. But apart from extraction of rules, redundant rules can also be deleted to generate the optimal rule base.5

Apart from these developments, numerous combinations of learning schemes, such as hybrid learning, back propagation (BP) learning and evolutionary computation (EC) techniques, have been used for training the above networks. But the gradient descent technique in BP of errors suffers from a slow speed of learning and is susceptible to being trapped at a local minimum. A number of researchers have attempted to use the genetic algorithm (GA) along with fuzzy clustering or SVD-QR6 to find sets of weights, but the problem is not well suited to crossover. In contrast to GA, the PSO technique introduced by Kennedy and Eberhart7 can be used as an efficient learning technique for the adjustment of the weights and parameters of a feed forward neural network. The scheme has been used for two-objective algorithms based on PSO,8 and for automatic rule extraction and pruning.9 The above-mentioned PSO algorithm, extensively modified, can also be used along with a Radial Basis Function Network (RBFN) for the development of a Hierarchical Self Organized Neural Network.10-12 Here the node recruiting is done based on some initial accommodation boundary, and the boundary is modified at each iteration.

In this paper, a hierarchical algorithm is implemented to evolve the structure of an RBFN from scratch with an optimal number of nodes. A number of particles are scattered initially in the domain in search of the solution. During the learning phase, a particle leaving the flock is forced towards the target. The node recruiting method is directed in terms of a multi-dimensional crossover vector of the input


fuzzy set as the accommodation boundary for the corresponding node. Apart from that, the proposed RPSO10 has been extensively modified into Error-driven RPSO (ERPSO) by introducing a residual error function in the final stage of the learning phase. Since all the particles spread over the entire domain and evolve starting from a single node based on the minimization of the RMS of the residual error, and the best one is selected, it is expected that the number of fuzzy rules obtained will be globally optimal for a prescribed residual error and membership function (MF).

2. Network Architecture: Brief Review of RBFN

The weighted Fuzzy NN (FNN) considered here is an adaptive multi-input and single-output RBFN.11

The system is five layered, and layer 2 represents a bell-shaped MF of the form

\mu(x) = \frac{1}{1 + \left( \frac{x - c}{a} \right)^{2b}} \qquad (1)

The basic architecture of the RBFN presented above12 can be divided based on the inference scheme (Type I, II, III), and training of the neural network has been done using the RPSO technique.10

3. ERPSO As a Learning Rule

The PSO7 is an optimization technique that resembles the behavior of a flock of flying birds.

3.1. Brief Review of RPSO

The random terms in the learning equation of the existing PSO7 were eliminated in RPSO10 to accommodate ill-defined and time varying problems, where the original algorithm fails to converge to the optimal solution at all or within a finite number of iterations. The modified law of RPSO10 is given by:

v_{id} = w \cdot v_{id} + c_1 (p_{id} - x_{id}) + c_2 (p_{gd} - x_{id})

x_{id} = x_{id} + v_{id} \qquad (2)

3.2. Proposed Error Driven RPSO (ERPSO)

For some real-life problems with voluminous data, it has been observed that RPSO cannot reduce the error between the desired and the actual output below a particular threshold. Initially, during the learning phase, the error reduces monotonically, but after a certain number of iterations the error in consecutive cycles remains fixed at a very small value. Introducing a random term into the learning law is not a good solution in this regard, as the output of the system becomes unpredictable: sometimes the result improves, but at other times it becomes considerably poorer. As the system does not get enough driving force at the final stage of parameter correction, we propose to introduce a function of the error which dominates mostly in the final stage, as required. We thus aim to achieve a balance between an initial adaptive mode and a final error-driven mode of parameter learning.

v_{id} = w \cdot v_{id} + c_1 (p_{id} - x_{id}) + c_2 (p_{gd} - x_{id})

x_{id} = x_{id} + v_{id} \qquad (3)

where w = w_{max} - \frac{(w_{max} - w_{min})}{L_{max}} L + f(error), with L the current learning cycle, L_{max} the maximum number of learning cycles, and f(error) the error-driven term defined in Step 8 of Section 3.4.
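One ERPSO update per Eq. (3) might be sketched as follows; the inertia schedule is the reconstruction given above, and the c1, c2 values and error floor are illustrative assumptions.

import numpy as np

def erpso_step(x, v, pbest, gbest, L, L_max, err,
               w_max=0.8, w_min=0.1, c1=2.0, c2=2.0, e_low=1e-3):
    # Deterministic RPSO law plus the error-driven inertia term f(error).
    f_err = e_low if err > e_low else err        # f(error) as in Step 8 (assumed role)
    w = w_max - (w_max - w_min) * L / L_max + f_err
    v = w * v + c1 * (pbest - x) + c2 * (gbest - x)
    return x + v, v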

3.3. ERPSO Based Self Organized Learning

A hierarchical self-organized scheme, which is a modification of HiSSOL,11 is developed to generate and train a population of RBFNs. Initially, Q particles are initialized over the whole domain. The hidden nodes are recruited based on the sample data introduced and on the corresponding multi-dimensional accommodation boundary of each node. Each boundary is the crossover point of the fuzzy set corresponding to the node. In contrast to the boundary criteria of,10 whenever nodes are recruited in the proposed method it is ensured that the corresponding data point lies outside the crossover vectors of all the existing nodes of the net. The ERPSO algorithm also updates the boundary. As the network evolves from a single node that minimizes the RMS error, it is expected to lead to the optimal number of nodes, and hence the optimal number of extracted rules, for a convex data set with a prescribed residual error and MF.

3.4. Hierarchical ERPSO [HERPSO] Algorithm

Here the general hierarchical scheme of training the NN is outlined below. Let Q = the number of particles, Lmax = the number of learning cycles, Uq = the number of hidden nodes of the qth particle, n = the number of training data points, and p = the number of outputs.

Step 1: Initialize Q particles with no hidden node.
Step 2: Set the parameters, weights, error thresholds, gbest and the pbest of each particle.
Step 3: Set L = 1.
Step 4: Set q = 1.
Step 5: Set j = Uq. At the Lth learning cycle, the error of the qth particle is calculated.10 If the system is moving away from the solution, reinitialize the particle to its starting position. Determine pbest and the maximum error contribution among the n data points for the particle.
Step 6: If Eq > Ethreshold, then go to Step 7. Otherwise go to Step 8.
Step 7: If there is a hidden node such that the training pattern generating the maximum error falls inside its hypersphere, then go to Step 8. Otherwise, create a new hidden node, whose parameters are assigned as in,10 and go to Step 9.
Step 8: Modify the structure of the NN using the struct_mod subroutine10 if the particle size is different from the gbest particle size. If Error > Ethreshold_low, then f(error) = Ethreshold_low; otherwise f(error) = Error. The parameters are trained based on the ERPSO algorithm.
Step 9: Set Uq = j. If equal(q, Q) = TRUE, then go to Step 10; otherwise set q = q + 1 and go to Step 5.
Step 10: Compare the evaluation with the group's previous best, pbest[gbest]. Set L = L + 1.
Step 11: If Error < Ethreshold or L > Lmax, then stop and select the best particle from the population. Otherwise go to Step 4.

4. Analysis of Performance

In this section the effectiveness of the proposed learning technique is demonstrated using benchmark problems. In all the simulations, we use symmetric triangular MFs in the consequent part of the Type III RBF based system. 20 particles are used with a variable velocity bound. The c_i's are adapted according to the w of equation (3), excluding the error-driven part.

4.1. Nonlinear Static System

Initially we try our algorithm on a nonlinear static system with two inputs x_1, x_2 and a single output y:

y = (1 + x_1^{-2} + x_2^{-1.5})^2, where 1 ≤ x_1, x_2 ≤ 5.

From this system equation, 50 input-output data pairs are obtained. The non-linear characteristic and the rule base of the system are shown in Fig. 1 and Fig. 2. Table 1 gives a comparative study.

4.2. Box And Jenkins Gas Furnace

The second example is a chemical plant, the Box and Jenkins gas furnace.5 The gas furnace has one input u(t) and one output y(t), which are the input gas rate and the % CO2 in the outlet gas, respectively. There are 296 pairs of data points. We used the same setting as in other research, in which y(t-1) and u(t-4) are used to predict y(t); thus we have 292 data points. A bell-shaped membership function has been used, with the parameters initialized with random values, C_max = 2.14, C_min = 0.4, w_max = 0.8 and w_min = 0.1. Fig. 3 and Fig. 4 show the results of identification by the proposed HERPSO algorithm, and Table 2 shows a comparative analysis.

Table 1. Comparison of HERPSO with other model for Nonlinear Static System .

Model Name Input Rules P.I.5

HERPSO n , X 2 3 0.0157

Sugeno and Yasukawa17 X\,X2 6 0.079

4.3. Human Operation at a Chemical Plant

In this example we deal with a model of an operator's control of a chemical plant.17 The plant produces a polymer by the polymerization of some monomers. Since the start-up of the plant is very complicated, a human has to operate the plant manually. There are five input candidates which a human operator might refer to for his control, and one output, i.e., his control. We obtained 70 data points of the above six variables from actual plant operation. The reader is referred to the previous paper17 for the detailed definition of the system. Fig. 5 and Fig. 6 show the corresponding system output and fuzzy rule base for the system. The P.I. of the system is 0.0126.

Page 401: 01.AdvancesinPatternRecognition

382 Error-Driven Robust Particle Swarm Optimization for Fuzzy Rule Extraction and Structure Estimation

,-m"

i 1 ! '\ 1 I.'-

1 j | i i

i ' • • i > >

system is 0.0126.

i i in 1 >

• M

4

\ * . « >

" ~*v-

- -

• ?

< . . •

""*-—L_

1

_ T

* • » d *

" * r B t a ' J 3* *

,

r J C « •> > i

J _

! — r

. . i t .

j

— * .= *.-« BS

........

Fig. 1. Type III RBF based system : (a) growth of neuron (b)squared error during training (c)Prediction perfor-mance[desired (-), actual (—) ] .

1

OS

<3H

114

0 1

0,

# m t i

> * ..... ,

, 4 . . . •,,.

.

3 5

i r .

J M

*> 0

JO a

/ \ \ V \ -1

•> f

„. . . . . . . . . . . ^ J

s

3B>

7 &

U -1

a :

»

rCi

i l,

,' 1 I \ i '

1 £

Fig. 4. Two fuzzy rules for the Box and Jenkins gas furnace.

Fig. 2. Three fuzzy rules for Nonlinear Static System.

•mca

maa

MGO

0.

jfk^-J it l'.'fr * ^ •»J , „ . !.>-_ i ' j j a n „!> j» a.. us m

t ' / ^

i

; i

f f r T""

!

i H I ^3 < I * - ' r n . i i

<r» i i v.~

t3B SSQ 3W 3W

Fig. 3. Type III RBF based system : (a) growth of neuron (b)squared error during training (c)Prediction perfor-mance[desired (-), actual (—) ].

Fig. 5. Type III RBF based system : (a) growth of neuron (b)squared error during training (c)Prediction perfor-mancefdesired (-), actual {—~-) ].

C5

Ci

I I

,' 1

- J o L

'-\i : H 5 , i- =•

fK-lO 3 I "J

a- • J - l «

• • 'i _ > «

B "

1 '

v ~-~

'• a

1 1

. [ .

i i n r <<J r i •)

I-) I •< L •

and fuzzy rule base for the system. The P.I. of the Fig. 6. Three fuzzy rules for the human operated model.


Table 2. Comparison of HERPSO with other models for the Box and Jenkins Gas Furnace.

Model Name               Input                                          Rules   P.I.
HERPSO                   y(t-1), u(t-4)                                 2       0.0579
CPFIS5                   y(t-1), u(t-4)                                 2       0.163
Tong's Model13           y(t-1), u(t-4)                                 19      0.469
Pedrycz's Model14        y(t-1), u(t-4)                                 81      0.32
Xu's Model15             y(t-1), u(t-4)                                 25      0.328
Sugeno and Tanaka16      y(t-1), y(t-2), y(t-3), u(t), u(t-1), u(t-2)   2       0.068
Sugeno and Yasukawa17    y(t-1), u(t-4)                                 6       0.19
Wang, Langari18          ...                                            2       0.066
Wang, Langari19          ...                                            5       0.158
Lin and Cunningham20     ...                                            4       0.071
Kim et al.21             ...                                            2       0.055
Kim et al.22             ...                                            11      0.108
Zhang and Kandel23       ...                                            2       0.058

5. Conclusions

The concept has been verified using a few benchmark problems. The results are presented in Table 1 and Table 2, which reveal that the number of extracted fuzzy rules has been significantly reduced using ERPSO.

References

1. H. Kuo et al., A clustering assisted method for fuzzy rule extraction and pattern classification, in Procs. IEEE Int. Conf., 679 (1999).

2. H. Zhu et al., Feature Region Merging Based Fuzzy Rules Extraction for Pattern Classification, in Procs. IEEE Int. Conf. on Fuzzy Systems, 696 (2003).

3. M. Umano et al., Extraction of Quantified Fuzzy Rules from Numerical Data, in Procs. IEEE Int. Conference, 1062 (2000).

4. M. F. Zarandi et al., A systematic approach to fuzzy modeling for rule generation from numerical data, in Procs. IEE Int. Conference, 768 (2004).

5. T. K. Yin, A characteristic-point-based fuzzy inference system aimed to minimize the number of fuzzy rules, IEEE Trans. Fuzzy Syst. 12, 250 (2004).

6. C. Wong et al.,Fuzzy rule extraction by a hybrid method for pattern classification, in Procs. IEEE Int. Conference , 1798 (2001).

7. Y. Shi,Particle Swarm Optimization, Electronic data systems, IEEE Neural Network Society , (2004).

8. M. Ma et al.,Fuzzy rule extraction by two-objective particle swarm optimization and application for taste identification of tea, in Procs. Int. Conference on Machine Learning and Cybernetics , 5690 (2005).

9. C. Zhang et al.,Particle swarm optimization for evolving artificial neural network, in Procs. IEEE Int. Conf. , 2487 (2000).

10. S. Mukhopadhyay et al., Fuzzy rule extraction using Robust Particle Swarm Optimization, in Springer LNCS on ISNN-2006 3971, 762 (2006).

11. B. K. Cho et al., Radial basis function based adaptive fuzzy systems and their application to system identification and prediction, Fuzzy Sets and Syst. 83, 325 (1996).

12. S. Mukhopadhyay et al., Extraction of Optimal Number of Fuzzy Rules by Evolution, in IEEE-FUZZ-2006, (2006) (accepted).

13. R. M. Tong, The evaluation of fuzzy models derived from experimental data, Fuzzy Sets and Syst. 4, 1 (1980).

14. W. Pedrycz, An identification algorithm in fuzzy relational systems, Fuzzy Sets Syst. 13, 153 (1984).

15. C. W. Xu et al., Fuzzy model identification and self-learning for dynamic systems, IEEE Trans. Syst., Man, Cybern. 17, 683 (1987).

16. M. Sugeno et al., Successive identification of a fuzzy model and its applications to prediction of a complex system, Fuzzy Sets Syst. 42, 315 (1991).

17. M. Sugeno, A fuzzy-logic-based approach to qualitative modeling, IEEE Trans. Fuzzy syst 1, 7 (1993).

18. L. Wang et al.,Building sugeno-type models using fuzzy discretization and orthogonal parameter estimation techniques, IEEE Trans. Fuzzy Syst. 3, 454 (1995).

19. L. Wang et al.,Complex systems modeling via fuzzy logic, IEEE Trans. Syst,Man, Cybern. B 26, 100 (1996).

20. Y. Lin et al.,A new approach to fuzzy-neural system modeling, IEEE Trans. Fuzzy Syst. 3, 190 (1995).

21. E. Kim et al.,A new approach to fuzzy modeling, IEEE Trans. Fuzzy Syst. , 5 (1996).

22. S. Kim et al., A polynomial fuzzy neural network for identification and control, in Procs. Biennial Conference NAFIPS-1996, 768 (1996).

23. Y. Q. Zhang et al.,Compensatory Genetic FNN and Their Applications, in Procs. Singapore: World Scientific , 30 (1998).


H M M Based POS Tagger and Rule-Based Chunker for Bengali

Sivaji Bandyopadhyay and Asif Ekbal

Computer Science and Engineering Department Jadavpur University

Kolkata, India Email: [email protected], [email protected]

The present work describes a Part of Speech (POS) tagger based on the Hidden Markov Model and a rule-based chunker for Bengali. The POS tagger has been trained on a manually tagged corpus and has demonstrated 87.26% accuracy. The performance of the system has been compared with the popular language independent TnT tagger on the same training and test sets; the TnT tagger demonstrated 85.85% accuracy. The chunker for Bengali has been developed using a rule-based approach, since adequate training data was not available. The rule-based chunker has been tested on the same test set and demonstrated 97.52% accuracy in chunk boundary identification alone and 96.9% accuracy in chunk boundary identification together with chunk labelling.

Keywords: Hidden Markov Model (HMM), Part of Speech (POS), Chunker, Corpus, Viterbi algorithm

1. Introduction

Part of Speech (POS) tagging is the process of assigning a part of speech to each word in a corpus. The annotations attached to the words in the corpus are known as tags. Tags are also usually applied to punctuation markers. If punctuation marks are attached to a word, the tokenization procedure is to be invoked before the POS tagging process. A POS tagger for a particular language requires at least one tagset, the set of potential POS tags that can be assigned to the words in the language. There may be more than one tagset for a language. There are a small number of popular tagsets for English, many of which evolved from the 87-tag tagset used for the Brown corpus. Some of the best known tagsets for English are the Penn Treebank (45 tags), Lancaster UCREL C5 (61 tags) and Lancaster C7 (145 tags). But there is no such standard tagset for Indian languages. In this work we have used a tagset having 27 different tags, developed by the International Institute of Information Technology, Hyderabad, Indiaa

for Indian languages and which is still in the process of being standardized. Some of the tasks for which POS tagging has proved useful include machine translation, information extraction, information retrieval and higher-level syntactic processing.

The task of POS tagging can be considered a classification problem, since the issue is to assign a sequence of tags that is optimal for a set of given words. Most tagging algorithms can be broadly

a http://ltrc.iiit.net/nlpaLcontest06/iiit-tagset_ guidelines.pdf

classified into two categories, namely rule-based tagging and stochastic tagging. Rule-based taggers generally involve a large database of hand-written disambiguation rules which specify, for example, that an ambiguous word is a noun rather than a verb if it follows a determiner. Stochastic taggers generally resolve tagging ambiguities by using a training corpus to compute the probability of a given word having a given tag in a given context. Here, a stochastic POS tagger has been developed. There have been numerous attempts at English POS tagging employing machine learning techniques like transformation-based error-driven learning,1 decision trees,2 maximum entropy methods,3 conditional random fields4 etc.

Chunks are non-overlapping regions of text, usually consisting of a head word (such as a noun), adjacent modifiers and function words (such as adjectives and determiners). Chunking is a kind of shallow syntactic analysis where certain sequences of words in a sentence are identified as phrases of various types (noun phrase, verb phrase etc.). Earlier references to chunking and shallow parsing can be found in5 and6.

The chunking task can be formulated as a tagging task. By formulating the task of NP (Noun Phrase) chunking as a tagging task, a large number of machine learning techniques become available to solve the problem of chunking. The NP classification task can be extended to other types of chunks, and with some effort even to finding relations.7

Other than for Hindi,8,9 no significant published work has been found as yet on POS tagging and chunking in other Indian languages.


The rest of the paper is organized as follows: Section 2 describes the Bengali tagset used in this work. Section 3 deals with the Hidden Markov Model (HMM) based POS tagging. Section 4 deals with rule based chunking. Section 5 shows the results and some discussions on the results of POS tagger and chunker. Finally Section 6 concludes the paper.

2. Bengali Tagset

All the tags used in this tagset are broadly classified into three types. There are some tags that have been adopted with some minor changes in the Penn tagset. They are grouped into one group. The second category of tags is of those that are a modification over the Penn tagset. The last group is of all those tags that are not present in the Penn tagset. They have been designed to cater to some phenomena that are specific to Indian languages.

• Group 1: NN-Noun, NNP-Proper Noun, PRP-Pronoun, VAUX-Verb Auxilliary, JJ-Adjective, RB-Adverb, RP-Particle, CC-Conjunction, UH-Interjection, SYM-Special Symbol.

• Group 2: PREP-Postposition, QF-Quantifiers, QFNUM-Quantifiers Number, VFM-Verb Finite Main, VJJ-Verb Non-Finite Adjectival, VRB-Verb Non-finite Adverbial, VNN-Verb Non-Finite Nominal, QW-Question Words.

• Group 3: NLOC-Noun Location, INTF-Intensifier, NEG-Negative, NNC-Compound Nouns, NNPC-Compound Proper Nouns, NVB-Noun in Kriyamula, JVB-Adjective in Kriyamula, RBVB-Adverb in Kriyamula, INF-Verb infinitival, QF-Quotative (additional tag).

Some examples of the tags are as follows: PREP - All Indian languages have the phenomenon of postpositions (e.g. tomar / PRP janno / PREP). QF - All quantifiers like 'kom', 'beshi' etc. will be marked as QF (e.g. anek / QF lok). QFNUM - Any word denoting numbers (cardinal or ordinal) will be tagged as QFNUM (e.g. dashta / QFNUM lok). VJJ - Unlike the Penn tagset, all non-finite verbs which are used as adjectives will be marked as VJJ (e.g. jar lekha / VJJ bai). VRB - Unlike the Penn tagset, non-finite forms of verbs that are used as adverbs will be tagged with a different tag VRB (e.g. cheleta (khete khete) / VRB ghare gelo). VNN - This tagset will mark gerunds as VNN (e.g. ekbar bari jaoa / VNN darkar). NLOC - This is an entirely new tag introduced to cover some phenomena of Indian languages (e.g. tabiler opore / NLOC baiti rakha achee). INTF - This tag is not present in the Penn tagset. Words like 'khub', 'kom' etc. will be tagged as INTF. NNC - This tag has been introduced in order to identify un-hyphenated compound words as one unit (e.g. kendriyo / NNC sarkar / NN). NNPC - All words in a compound proper noun will be marked as NNPC excluding the last one (e.g. atal / NNPC behari / NNPC bajpayee / NNP). NVB - Nouns in kriyamulas are verbs formed by combining a noun with a verb (e.g. bichar / NVB kara). JVB - Adjectives in kriyamulas are verbs formed by combining an adjective with a verb (e.g. provabito / JVB kara).

RBVB - Adverbs in kriyamulas are verbs formed by combining an adverb with a verb. Many Indian languages like Hindi make use of this type of verb, but it is not present in Bengali (e.g. yaHa to jarUra / RBVB HE / VFM). INF - Any verb of infinitival sense will be marked as INF (e.g. maachh kinte / INF bazare gelo).

There are some phenomena in Bengali that need to be dealt with separately in a Bengali tagger; just changing or adding tags cannot handle them. Reduplication is such a phenomenon in Indian languages, where the same word is written twice for emphasis (e.g., choto choto ["small" "small" - very small], lal lal ["red" "red" - deep red]). There are two ways in which these words can be written: sometimes they are separated with a space, and sometimes with a hyphen. When these words are written with a space in between, the same tag is used for both words (e.g., dhire/RB dhire/RB, choto/JJ choto/JJ, gali/NN gali/NN ["slowly slowly", "small small", "lane lane"]). But when they are written with a hyphen, they are tagged as one word (e.g., choto-choto/JJ ["small-small"]). Exceptionally, if the hyphen is written with spaces around it, the words are marked with the same tag as in the earlier case (e.g., choto/JJ - choto/JJ). Like any other language, Bengali too has many loan words. Such foreign words will be tagged as per the syntactic function of the word in the sentence.


3. Hidden Markov Model Based POS Tagging

A POS tagger based on the Hidden Markov Model (HMM)10 assigns the best sequence of tags to an entire sentence. Generally, the most probable tag sequence is assigned to each sentence following the Viterbi algorithm.11 The task of Part of Speech (POS) tagging is to find the sequence of POS tags T = t_1, t_2, t_3, ..., t_n that is optimal for a word sequence W = w_1, w_2, w_3, ..., w_n. The tagging problem then becomes equivalent to searching for argmax_T P(T) * P(W|T), by the application of Bayes' law (P(W) is constant). The probability of the tag sequence, P(T), can be calculated under the Markov assumption, which states that the probability of a tag depends only on a small, fixed number of previous tags. We have used the trigram model, i.e., the probability of a tag depends on the two previous tags, and then we have

P(T) = P(t_1) x P(t_2|t_1) x P(t_3|t_1, t_2) x P(t_4|t_2, t_3) x ... x P(t_n|t_{n-2}, t_{n-1}).

An additional tag '$' (dummy tag) has been introduced in this work to represent the beginning of a sentence, so the previous equation can be slightly modified as

P(T) = P(t_1|$) x P(t_2|$, t_1) x P(t_3|t_1, t_2) x P(t_4|t_2, t_3) x ... x P(t_n|t_{n-2}, t_{n-1}).

Due to the sparse data problem, the linear interpolation method has been used to smooth the trigram probabilities as follows:

P'(t_n|t_{n-2}, t_{n-1}) = λ1 P(t_n) + λ2 P(t_n|t_{n-1}) + λ3 P(t_n|t_{n-2}, t_{n-1})

such that the λs sum to 1. The values of the λs have been calculated by the following method:12

(1) Set λ1 = λ2 = λ3 = 0.
(2) For each trigram (t1, t2, t3) with freq(t1, t2, t3) > 0, depending on the maximum of the following three values:

    case (freq(t1, t2, t3) - 1) / (freq(t1, t2) - 1): increment λ3 by freq(t1, t2, t3)
    case (freq(t2, t3) - 1) / (freq(t2) - 1): increment λ2 by freq(t1, t2, t3)
    case (freq(t3) - 1) / (N - 1): increment λ1 by freq(t1, t2, t3)

(3) Normalize λ1, λ2, λ3.

Here, N is the corpus size, i.e., the number of tokens present in the training corpus. If the denominator in one of the expressions is 0, then the result of that expression is defined to be 0. The -1 in both the numerator and the denominator is used to take unseen data into account.
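In code, this deleted-interpolation estimate reads roughly as follows (a sketch; the counts are assumed to be plain Counter tables built from the training corpus):

from collections import Counter

def estimate_lambdas(tri, bi, uni, N):
    # tri, bi, uni: Counters of trigram, bigram and unigram tag frequencies.
    l1 = l2 = l3 = 0.0
    for (t1, t2, t3), f in tri.items():
        c3 = (f - 1) / (bi[(t1, t2)] - 1) if bi[(t1, t2)] > 1 else 0.0
        c2 = (bi[(t2, t3)] - 1) / (uni[t2] - 1) if uni[t2] > 1 else 0.0
        c1 = (uni[t3] - 1) / (N - 1) if N > 1 else 0.0
        best = max(c1, c2, c3)
        if best == c3:
            l3 += f
        elif best == c2:
            l2 += f
        else:
            l1 += f
    total = l1 + l2 + l3
    return l1 / total, l2 / total, l3 / total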

By making the simplifying assumption that the relation between a word and its tag is independent of context, we can simplify P(W|T) to

P(W|T) ≈ P(w_1|t_1) x P(w_2|t_2) x ... x P(w_n|t_n).

The emission probabilities in the above equation can be calculated from the training set as

P(w_i|t_i) = freq(t_i, w_i) / freq(t_i).

3.1. Context Dependency

To make the Markov model more powerful, an additional context dependent feature has been introduced into the emission probability in this work: the probability of the current word depends on the tag of the previous word and the tag to be assigned to the current word. Now we calculate P(W|T) by the equation

P(W|T) ≈ P(w_1|$, t_1) x P(w_2|t_1, t_2) x ... x P(w_n|t_{n-1}, t_n).

So the emission probability can be calculated as

P(w_i|t_{i-1}, t_i) = freq(t_{i-1}, t_i, w_i) / freq(t_{i-1}, t_i).

Here also a smoothing technique is applied rather than using the emission probability directly. The smoothed emission probability is calculated as: $P'(w_i|t_{i-1},t_i) = \theta_1 P(w_i|t_i) + \theta_2 P(w_i|t_{i-1},t_i)$, where $\theta_1, \theta_2$ are two constants such that all $\theta$s sum to 1. The values of the $\theta$s should ideally be different for different words, but calculating the $\theta$s for every word takes considerable time; hence the $\theta$s are calculated once for the entire training corpus. In general, the values of the $\theta$s can be calculated by the following method, analogous to the $\lambda$s:

(1) set $\theta_1 = \theta_2 = 0$
(2) for each bi-gram $(t_1, t_2)$ with $freq(t_1,t_2) > 0$, depending on the maximum of the following two values:

• case $\frac{freq(t_1,t_2)-1}{freq(t_1)-1}$: increment $\theta_2$ by $freq(t_1,t_2)$

• case $\frac{freq(t_2)-1}{N-1}$: increment $\theta_1$ by $freq(t_1,t_2)$

(3) normalize $\theta_1, \theta_2$.
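A hedged sketch of the resulting smoothed emission probability follows (the count tables freq_* are assumed to be Counters built from the training corpus; all names are illustrative):

    def smoothed_emission(w, t_prev, t, freq_wt, freq_t,
                          freq_ttw, freq_tt, theta1, theta2):
        # P'(w_i|t_{i-1},t_i) = theta1*P(w_i|t_i) + theta2*P(w_i|t_{i-1},t_i)
        p1 = freq_wt[(w, t)] / freq_t[t] if freq_t[t] else 0.0
        p2 = (freq_ttw[(t_prev, t, w)] / freq_tt[(t_prev, t)]
              if freq_tt[(t_prev, t)] else 0.0)
        return theta1 * p1 + theta2 * p2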

Now, the emission probability and transition probability have been joined together to set up the modified Hidden Markov model as shown in Figure 1.


Fig. 1. Hidden Markov Model (Modified): states emit words with probability $P(w_i|t_{i-1},t_i)$ and transition with probability $P(t_i|t_{i-2},t_{i-1})$.

3.2. Viterbi Algorithm

Now it is known how to derive the probabilities needed for the Markov model, and how to calculate $P(T|W)$ for any particular $(T, W)$ pair. But what is really needed is to find the most likely $T$ for a particular $W$. The Viterbi algorithm11 allows us to find the best $T$ in linear time. The idea behind the algorithm is that, of all the state sequences, only the most probable ones need to be considered. The tri-gram model has been used in the present work. The pseudo code of the algorithm is shown below.

for i = 1 to Number_of_Words_in_Sentence
    for each state c in Tag_Set
        for each state b in Tag_Set
            for each state a in Tag_Set
                for the best state sequence ending in state a at time (i-2)
                and state b at time (i-1), compute the probability of that
                state sequence going to state c at time i
            end
        end
        determine the most probable state sequence ending in state c at time i
    end
end

So if every word can have $S$ possible tags, then the Viterbi algorithm runs in $O(S^3 \times |W|)$ time, i.e., in linear time with respect to the length of the sentence.
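The following is a minimal Python sketch of this tri-gram Viterbi search. The functions trans(a, b, c) ~ P(c|a,b) and emit(w, b, c) ~ P'(w|b,c) are assumed to be supplied (smoothed as described above), '$' is the dummy beginning-of-sentence tag, and all names are illustrative:

    import math

    def viterbi_trigram(words, tag_set, trans, emit):
        # best maps a state pair (b, c) to (log-probability, back-pointer)
        best = {('$', '$'): (0.0, None)}
        layers = []
        for w in words:
            nxt = {}
            for (a, b), (score, _) in best.items():
                for c in tag_set:
                    p = trans(a, b, c) * emit(w, b, c)
                    if p <= 0.0:
                        continue
                    cand = score + math.log(p)
                    if cand > nxt.get((b, c), (-math.inf, None))[0]:
                        nxt[(b, c)] = (cand, (a, b))
            layers.append(nxt)
            best = nxt
        # recover the most probable tag sequence via the back-pointers
        pair, (_, back) = max(best.items(), key=lambda kv: kv[1][0])
        seq = [pair[1]]
        for layer in reversed(layers[:-1]):
            pair, back = back, layer[back][1]
            seq.append(pair[1])
        return list(reversed(seq))

The three nested loops over (a, b, c) give exactly the $O(S^3 \times |W|)$ behaviour noted above.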

3.3. Handling the Unknown Words

Handling of unknown words is an important issue in POS tagging. For words which have not been seen in the training set, $P(w_i|t_i)$ is estimated based on features of the unknown words, such as whether the word contains a particular suffix. A list of suffixes has been prepared; at present we have 435 suffixes, many of which usually appear at the end of verb, noun and adjective words. A null suffix is also kept for those words that have none of the suffixes in the list. The probability distribution of a particular suffix with respect to specific POS tags is generated from all words in the training set that share the same suffix. Apart from suffix analysis, two other features have been included that tackle tokens of digits and symbols.
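A hedged sketch of this suffix-based estimate follows. The suffix list and the suffix/tag counts are assumed to come from the training set; picking the longest matching suffix is an illustrative choice, as the paper does not specify the matching order:

    def unknown_emission(word, tag, suffixes, suffix_tag_freq, suffix_freq):
        # approximate P(w_i|t_i) for an unseen word from its suffix;
        # '' is the null suffix kept for words matching no listed suffix
        sfx = next((s for s in sorted(suffixes, key=len, reverse=True)
                    if word.endswith(s)), '')
        if suffix_freq[sfx] == 0:
            return 0.0
        return suffix_tag_freq[(sfx, tag)] / suffix_freq[sfx]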

4. Chunker

In this work, the following chunk labels, suggested by IIIT, Hyderabad, India, have been used:

NP- NP stands for noun phrase. This chunk will include minimal noun phrases and prepositional phrases [e.g. ((bhalo)/JJ (chele)/NN)/NP]. VG- This stands for verb group. A verb group will include the main verb and its auxiliaries, if any [e.g. ((kinte)/INF (chai)/VFM)/VG].


JJP- The adjectival chunk will be marked as JJP. This phrase will consist of all adjectival chunks including the predicating adjective [e.g. ((cheleti)/NN (bhalo)/JJ)/JJP]. RBP- This chunk will include all pure adverbial phrases [e.g. cheleti (dhire/RB dhire/RB)/RBP chale galo].

CCP- A conjunctional chunk will be marked as CCP. It consists of conjunctions when they appear to join more than one sentence [e.g. tumi chale gale (aar/CC)/CCP tar parey to jato gandogol]. BLK- Tokens that do not fall into any of the above categories are tagged as BLK [e.g. amol (:/SYM)/BLK tomar nam ki?]

For the chunker, a rule-based approach has been used due to the unavailability of a large chunked corpus. The proposed chunking algorithm is divided into two phases:

• Chunk boundary identification
• Chunk labeling

4.1. Chunk Boundary Identification

To identify the chunks, it is necessary to find the positions where a chunk can end and a new chunk can begin; these positions are marked. The POS tag assigned to every token by the POS tagger is used to discover these positions. The chunk boundaries are identified by handcrafted linguistic rules that check whether two neighboring POS tags belong to the same chunk or not. If they do not, a chunk boundary is assigned between the words. The procedure of chunk boundary identification is given below (a condensed sketch follows the list):

For i = 2 to No_of_Words, position (i-1) will be a chunk boundary if none of the following cases is satisfied:
a) t_{i-1} ∈ {INTF} and t_i ∈ {INTF, JJ, PRP, QF, QFNUM, NN, NNC, NNP, NNPC}
b) t_{i-1} ∈ {JJ, QF, VJJ} and t_i ∈ {JJ, PRP, QF, QFNUM, NN, NNC, NNP, NNPC}
c) t_{i-1} ∈ {QFNUM} and t_i ∈ {PRP, QFNUM, NN, NNC, NNP, NNPC, VNN}
d) t_{i-1} ∈ {VFM, VAUX, VJJ, VNN} and t_i ∈ {NEG, RP}
e) t_{i-1} ∈ {NEG} and t_i ∈ {VFM, VAUX, VNN, VJJ, VRB, JVB, NVB}
f) t_{i-1} ∈ {NNC, NNPC} and t_i ∈ {NN, NNC, NNP, NNPC}
g) t_{i-1} ∈ {NN, NNP, PRP, VNN} and t_i ∈ {PREP, RP, NLOC}
h) t_{i-1} ∈ {RB} and t_i ∈ {RB}
i) t_{i-1} ∈ {JVB, NVB} and t_i ∈ {VFM, VAUX}
j) t_{i-1} ∈ {PREP, NEG, NLOC, RP} and t_i ∈ {RP}
End

The last word of the sentence is also marked as a chunk boundary.
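A condensed Python sketch of this procedure (RULES pairs up the tag sets of cases (a)-(j); tags is the POS sequence of one sentence; names are illustrative):

    RULES = [
        ({'INTF'}, {'INTF', 'JJ', 'PRP', 'QF', 'QFNUM', 'NN', 'NNC', 'NNP', 'NNPC'}),
        ({'JJ', 'QF', 'VJJ'}, {'JJ', 'PRP', 'QF', 'QFNUM', 'NN', 'NNC', 'NNP', 'NNPC'}),
        ({'QFNUM'}, {'PRP', 'QFNUM', 'NN', 'NNC', 'NNP', 'NNPC', 'VNN'}),
        ({'VFM', 'VAUX', 'VJJ', 'VNN'}, {'NEG', 'RP'}),
        ({'NEG'}, {'VFM', 'VAUX', 'VNN', 'VJJ', 'VRB', 'JVB', 'NVB'}),
        ({'NNC', 'NNPC'}, {'NN', 'NNC', 'NNP', 'NNPC'}),
        ({'NN', 'NNP', 'PRP', 'VNN'}, {'PREP', 'RP', 'NLOC'}),
        ({'RB'}, {'RB'}),
        ({'JVB', 'NVB'}, {'VFM', 'VAUX'}),
        ({'PREP', 'NEG', 'NLOC', 'RP'}, {'RP'}),
    ]

    def chunk_boundaries(tags):
        # return 0-based indices of the words that end a chunk
        ends = []
        for i in range(1, len(tags)):
            same_chunk = any(tags[i-1] in a and tags[i] in b for a, b in RULES)
            if not same_chunk:
                ends.append(i - 1)
        ends.append(len(tags) - 1)   # the last word always closes a chunk
        return ends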

4.2. Chunk Labelling

After chunk boundary identification, the chunks are labelled. The components (i.e., POSs of tokens) within a chunk help to assign the label on the chunk.

The following rules have been used to label the chunks (a sketch follows the list):
a) A chunk will be labeled as NP if it contains at least one noun.
b) If a chunk contains an adjective but no noun, it will be labeled as JJP.
c) A chunk will be labeled as VG if it contains at least one verb.
d) A chunk will be labeled as RBP if it contains at least one adverb.
e) A chunk will be labeled as CCP if it contains only a word tagged as CC or QT.
f) Any other chunk will be labeled as BLK.
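A sketch of these labelling rules, applied in the order (a)-(f). The exact membership of the noun and verb tag groups below is our assumption, drawn from the tagset used in Sec. 4.1:

    NOUN = {'NN', 'NNC', 'NNP', 'NNPC'}          # assumed noun tags
    VERB = {'VFM', 'VAUX', 'VNN', 'VJJ', 'VRB'}  # assumed verb tags

    def label_chunk(chunk_tags):
        if any(t in NOUN for t in chunk_tags):
            return 'NP'                # rule (a)
        if 'JJ' in chunk_tags:
            return 'JJP'               # rule (b): adjective but no noun
        if any(t in VERB for t in chunk_tags):
            return 'VG'                # rule (c)
        if 'RB' in chunk_tags:
            return 'RBP'               # rule (d)
        if chunk_tags in (['CC'], ['QT']):
            return 'CCP'               # rule (e): a lone CC/QT token
        return 'BLK'                   # rule (f)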

Table 1. Performance of the POS Tagger

Training Set No.   TRNT    TST    UTST            CTT    Accuracy (%)
1                  40956   5967   1078 (18.07%)   5097   85.42
2                  67539   5967   806 (13.5%)     5207   87.26

5. Results and Discussion

The POS tagger is initially trained with the help of a relatively small tagged training corpus. The POS tagger has been tested on a manually tagged corpus that is used as the gold standard to evaluate the POS tagger. The output of the POS tagger has been compared with the manually tagged version of the test set. The POS tagger is able to assign tags to unknown words, and hence the precision and recall figures of the POS tagger are the same and can be considered as the accuracy of the tagger. Table 1 shows the results of the POS tagger obtained by increasing the size of the training corpus and keeping the test corpus


same. The following abbreviations have been used in Table 1: number of tokens in the training set (TRNT), number of tokens in the test set (TST), number of unknown tokens in the test set (UTST), and number of tokens that are tagged correctly (CTT). It is evident from the table that, by increasing the size of the training set, the accuracy of the POS tagger can be improved.

The popular language-independent TnT tagger has been trained on the same training corpus (set 2) and then run on the same test corpus. It demonstrated an accuracy of 85.85%. The accuracy of our proposed POS tagger is higher than that of the TnT tagger, as more context information has been incorporated.

Error analysis of the POS tagger has been done with the help of a confusion matrix. A confusion matrix for an N-way classification task is an N-by-N matrix C, where the cell C(x, y) contains the number of times (in percentage, with respect to the total error) an item with correct classification x was classified by the model as y. The row labels indicate correct tags, the column labels indicate the tagger's hypothesized tags, and each cell indicates a percentage of the overall tagging error. For the test set, a part of the confusion matrix is shown in Table 2. Since nouns appear most frequently in a corpus, unknown words have a tendency to be assigned noun tags (NN, in most cases) through the probability calculation, which is reflected in Table 2. A close scrutiny of the confusion matrix shown in Table 2 suggests that some of the probable tagging errors facing the current POS tagger are NNC vs NN, JJ vs NN and NNP vs NN. A multiword extraction unit for Bengali would have taken care of the NNC vs NN problem. Similarly, a named entity recognizer would have taken care of the NNP vs NN problem. The problem of JJ vs NN is hard to resolve and probably requires the use of linguistic rules.

The rule-based chunking system is evaluated with the same manually tagged test set. After removing the chunk labels and chunk boundary markers (keeping the manually assigned POS tags intact) of the manually chunked corpus, we have applied our chunker and obtained the following results:

No. of chunks in the test set = 4643
No. of identified chunk boundaries = 4528
Accuracy of boundary identification = 97.52%
No. of identified chunks with proper label = 4499
Accuracy of chunk labeling = 96.9%

6. Conclusion and Future Works

In this paper, we have developed a POS tagger based on HMM and a rule-based chunker. The performance can be improved by incorporating more features for the unknown words and by enhancing the size of the training corpus. We would like to use the Maximum Entropy Markov Model (MEMM) and Conditional Random Fields (CRF) in future work.

Table 2. Confusion Matrix (rows: correct tags; columns: hypothesized tags)

       JJ     NN     NNC    NNP    VFM
JJ     -      7.70   0      0.12   0.34
NN     2.64   -      1.38   1.61   0.46
NNC    1.84   8.39   -      1.45   0
NNP    0.12   2.87   0      -      0
VFM    0.34   1.15   0      0      -

References

1. E. Brill, Computational Linguistics 21, 543 (1995).
2. E. Black, F. Jelinek, J. Lafferty, R. Mercer and S. Roukos, Decision tree models applied to labeling of text with part-of-speech, in DARPA Workshop on Speech and Natural Language, (Harriman, NY, 1992).

3. A. Ratnaparkhi, A maximum entropy part-of-speech tagger, in Proc. of EMNLP'96., 1996.

4. J. Lafferty, A. McCallum and F. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, in Proceedings of the 18th International Conference on Machine Learning, 2001.

5. S. P. Abney, Parsing by Chunks, in Principle-Based Parsing: Computation and Psycholinguistics, eds. R. C. Berwick, S. P. Abney and C. Tenny (Kluwer, Dordrecht, 1991), pp. 257-278.

6. L. A. Ramshaw and M. P. Marcus, Text chunking using transformation-based learning, in Proceedings of the third Annual workshop on very large corpora, 1995.

7. S. Buchholz, J. Veenstra and W. Daelemans, Cascaded grammatical relation assignment, in Proceedings of EMNLP/VLC-99, 1999.

8. S. Singh, K. Gupta, M. Shrivastava and P. Bhattacharyya, Morphological richness offsets resource demand - experiences in constructing a POS tagger for Hindi, in Proceedings of the COLING/ACL 2006, 2006, pp. 779-786.

9. A. Singh, S. Bendre and R. Sangal, HMM Based Chunker for Hindi, in Proceedings of the IJCNLP-05, 2005, pp. 126-131.

10. D. Jurafsky and J. H. Martin, Speech and Language Processing (Prentice-Hall, 2000).


11. A. J. Viterbi, IEEE Transactions on Information Theory 13, 260 (1967).
12. T. Brants, TnT - a statistical part-of-speech tagger, in Proceedings of the Sixth Conference on Applied Natural Language Processing (ANLP-2000), 2000, pp. 224-231.


Non-Contemporary Robustness in Text-Dependent Speaker-Recognition Using Multi-Session Templates in a One-Pass Dynamic-Programming Framework

V. Ramasubramanian, V. Praveen Kumar and S. Thiyagarajan

Siemens Corporate Technology - India Siemens Information Systems Ltd., Bangalore - 560100, India

{V.Ramasubramanian, V.Praveenkumar}@siemens.com, [email protected]

Non-parametric pattern recognition approaches enjoy the advantage of circumventing the difficulty of choosing an optimal parametric form for the unknown underlying class distributions. Further, the use of non-parametric training data has the added benefit of modeling the unknown class distribution in toto, thus implicitly accounting for every underlying variability in the class data. In keeping with this principle, we have proposed a variable-text text-dependent speaker-recognition system based on the one-pass dynamic programming algorithm using multiple templates of the words of a speaker, so as to represent the speaker distribution and intra-speaker variability effectively. In this paper, we show the advantage of using such multiple templates drawn from multiple sessions for achieving robustness to non-contemporary test data. This is an important problem in practical biometric applications, where the user's features vary significantly with time, in what is referred to as inter-session variability. We evaluate this multi-session template algorithm on non-contemporary test data from a multi-session database and show that a definite and significant performance improvement is achievable with the use of multi-session templates.

Keywords: Text-dependent speaker-recognition; Non-contemporary robustness; One-pass DP algorithm; Multiple templates; Inter-session variability

1. Introduction

One of the most important problems in any biometric system is the variability that the biometric signal (face, fingerprint, voice, iris, retinal scan, etc.) undergoes over time, which causes the system to perform poorly when used after a significant time gap from the training session. The main reason for the poor performance is the inherent mismatch between the models obtained for a person from the training data at a particular time and the 'test' signals at a later time. This is commonly referred to as testing on 'non-contemporary' data, implying that the test data comes from a time which is not contemporary, i.e., not the same time as the training data.

While this is the general scenario in all biometric applications, the use of voice as a biometric faces quite severe problems with such non-contemporary testing, as the voice characteristics of a person show significantly more variation over time than some other biometric signals such as iris, retinal scans or fingerprints. In the speaker-recognition literature, this is commonly referred to as 'session variability' and has attracted considerable attention from researchers in the past few years. In particular, there are several approaches to handling this problem for text-independent speaker-recognition, such as feature warping, mapping etc., which attempt to compensate for session

variability during feature extraction or score normalization. More recently, there have been attempts to use session GMMs1 and speaker-adaptation techniques,2 wherein a session-independent speaker model is combined with a session-dependent offset of the model means.

However, there have been practically no efforts to handle session variability in text-dependent speaker-recognition, though this mode of operation has proven to be more viable for realizing practical systems.3 In this paper, we propose a novel method for making a text-dependent speaker-recognition system robust to session variability. This method is based on the one-pass dynamic programming (DP) algorithm proposed by us recently for variable-text text-dependent speaker-recognition,4,5 where multiple templates of each word in the vocabulary (which makes up the passwords of the system) are used to handle intra-speaker variability. We have adapted this algorithm in the following way to deal with the problems of session variability.

For handling session variability, the multiple templates are drawn from multiple sessions so as to capture the inter-session variability. The one-pass DP algorithm provides for the selection of a suitable template from these multiple templates which best matches the input speech. By this, non-contemporary test data is provided with a choice of


templates, of which one would provide the optimal match through the multiple-template one-pass DP algorithm, thereby leading to improved speaker-recognition performance.

2. The multiple-template one-pass DP algorithm

The proposed multi-session template algorithm is based on the one-pass DP matching algorithm proposed by us recently.4,5 Here we briefly describe this system, which is shown in Fig. 1. Each speaker has a set of templates for each word in the vocabulary. For example, for the word 'nine', there are four templates, R91, R92, R93, R94. Given an input utterance, the feature extraction module converts the input speech utterance into a sequence of feature vectors. We used the mel-frequency cepstral coefficients (MFCCs) as the feature vector. This feature vector sequence corresponds to the input 'password' text (say, the digit string 915 in the figure) and is presented to the forced alignment module. At the same time, the corresponding concatenated set of multiple reference templates for '9', '1' and '5', along with the inter-word silence templates, are also presented to the forced alignment module.

Fig. 1. Variable-text speaker-recognition algorithm based on one-pass DP matching with multiple templates

The one-pass DP algorithm matches the feature-vector sequence O against the multiple-template and inter-word silence based word-model sequence for speaker $S_i$. The resulting match score $D_i = D(O, Txt|S_i)$ is the optimal distance between the input utterance O and the word-templates of speaker $S_i$ corresponding to the password text 'Txt'. This score is used in different ways in the three speaker-recognition systems, namely, closed-set speaker-identification, speaker-verification and open-set speaker-identification.5

Fig. 2 illustrates the use of multiple templates in the one-pass DP forced alignment between the input utterance (on the x-axis) and the word-templates (on the y-axis). The same example password '915', as in Fig. 1, is used. Even though multiple templates are used for all the words, only the multiple templates of the word '1' are shown on the y-axis for the sake of clarity. From the best warping path obtained by the one-pass DP algorithm in this example, it is seen that template 2 of word '1' (R12) has been chosen as the best matching template for that part (word '1') of the input utterance.

Fig. 2. One-pass DP matching between test utterance and multiple training templates corresponding to password text
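To make the template-selection idea concrete, the following Python sketch scores a test utterance against multi-session templates. It is a deliberate simplification of the actual system: each password word is aligned independently by plain DTW and the best template per word is chosen, whereas the paper's one-pass DP performs a single joint alignment over the concatenated templates with inter-word silence models. The per-word segmentation and MFCC arrays are assumed inputs:

    import numpy as np

    def dtw(x, y):
        # symmetric DTW distance between two MFCC sequences (frames x dims)
        n, m = len(x), len(y)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(x[i - 1] - y[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m] / (n + m)

    def match_score(word_segments, templates):
        # word_segments: list of (word, mfcc_segment) for the password text;
        # templates[word]: list of multi-session templates R_w1 ... R_wK
        return sum(min(dtw(seg, t) for t in templates[w])
                   for w, seg in word_segments)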

3. Multi-session templates for non-contemporary robustness

The central principle behind making the text-dependent speaker-recognition system robust to session variability (i.e., non-contemporary testing) is to use multiple templates drawn from multiple sessions spread over time. These templates are called 'multi-session templates'. For instance, let the multiple templates shown in Fig. 2 be drawn from 4 different sessions at various times. Let the test data be from a session which is non-contemporary to all these 'training' sessions. If the number of multi-templates is adequate and represents all possible variations that can


occur over time for a speaker, then the above one-pass DP matching will select the optimal template from the 4 templates which best matches the input test utterance, as reflected in the final optimal warping path of the one-pass DP match. Thus, by keeping multi-session templates, the algorithm is equipped to cope with non-contemporary test data.

3.1. Non-contemporary performance evaluation

In order to demonstrate the effectiveness of using multi-session templates for non-contemporary test data, the multi-template algorithm was evaluated on a multi-session database of 14 speakers (7 male and 7 female) collected over a period of 9 weeks at intervals of 1 week. The database has a structure similar to the TIDIGITS database, in that the vocabulary is the same (oh, 0-9) and the data consists of 3, 4, 5 and 7 digit strings, each with 11 utterances per speaker. The training templates were extracted automatically (also using the multiple-template one-pass DP algorithm5), and up to 5 training templates per word per speaker per session were made available. The test data per session consists of the 3, 4 and 5 digit strings, totalling 33 utterances per speaker.

The multi-session template algorithm was evaluated for both closed-set speaker-identification and speaker-verification. Specifically, for both the systems, the following experiments were conducted to bring out the effectiveness of the multi-session templates:

(1) Experiment 1 (Contemporary training and testing): 5 training templates were drawn (per word per speaker) from the same session as the test data for all the 9 test sessions.

(2) Experiment 2 (Non-contemporary testing): 5 training templates were drawn (per word per speaker) only from session 1, i.e., the training session is fixed as the first session and test was conducted on the 9 test sessions.

(3) Experiment 3 (Non-contemporary testing with Multi-session training templates): 5 training templates were drawn (per word per speaker) with one template from each of the first 5 sessions. Test sessions were the same, i.e., all the 9 sessions.

Expt. 1 is the best-case scenario, where the training and test sessions always match, and should offer the best performance. Expt. 2 is the worst-case scenario, where there is a maximum mismatch between training and test conditions, and should offer a performance profile which degrades with increasing sessions. Expt. 3 uses multi-session templates from the first 5 sessions and is thus equipped to handle non-contemporary data, thereby offering a better performance than Expt. 2.

Fig. 3 shows the closed-set speaker-identification (SID) performance (% SID accuracy) for the above 3 experiments. It can easily be seen that the expected pattern is observed: Expt. 1 has the best performance, with a practically constant SID accuracy over the session index. While Expt. 2 shows a severe fall in performance, Expt. 3 shows a significant improvement over Expt. 2, with increased robustness to session variability. This clearly validates that multi-session templates are indeed beneficial for handling non-contemporary testing or session variability.

Fig. 3. Closed-set speaker-identification performance (% SID accuracy vs. test session index, in weeks) with multi-session templates

Fig. 4 shows the speaker-verification performance in terms of the probability of false acceptance (P_fa) and the probability of false rejection (P_fr) at the EER (equal-error-rate) point for all three experiments. Here, a low EER point, with (P_fa, P_fr) values close to (0, 0), represents good performance. Thus, it can be clearly observed that, as in closed-set speaker-identification, the multi-session templates in Expt. 3 have helped improve the performance of the speaker-verification system with respect to the poor performance of Expt. 2, which shows a severe degradation with increasing session index.

Fig. 4. Speaker-verification performance (P_fa, P_fr) with multi-session templates

To obtain an insight into how the above multi-session-template based one-pass DP algorithm really provides an improved performance, we obtained the usage statistics of the individual templates of a multi-template set used in Expt. 3. We show this in Fig. 5. Here, we plot the number of times a template (from one of 5 templates) is selected by the one-pass DP algorithm (as described in Sec. 2) for various test sessions (Sessions 1, 3, 5, and 9). Since Expt. 3 uses 5 templates drawn from each of the first 5 sessions, it can be expected that Session 1 test data will most often select template index 1, Session 3 test data will most often select template index 3, and so on. This is exactly what can be observed in the bar-graph up to the session 5 test data. For the session 9 test data, all the 5 templates (from the first 5 sessions) are equally far, and hence are selected more or less equally. This pattern of template usage clearly validated the effectiveness of using multi-session templates in dealing with non-contemporary test data.

Fig. 5. Usage statistics of multi-session templates (selection counts per template index, for test sessions 1, 3, 5 and 9)

The key to achieving high performance with this method is to use multi-session templates from sufficiently well-spaced sessions so as to gain coverage of the variability that a speaker can exhibit over time. For instance, constant update of the templates based on the confidence of match during test sessions can serve as a promising way to maintain a set of multi-session templates that is adapted to the test sessions.

4. Conclusions

We have proposed a variable-text text-dependent speaker-recognition system based on one-pass dynamic programming that is robust to session variability using multiple templates drawn from multiple sessions. We have evaluated the multi-session template algorithm on non-contemporary test data from a multi-session database and shown that the proposed algorithm derives a definite and significant advantage from the use of multi-session templates.

References

1. D. Irony, H. Aronowitz and D. Burshtein. Modeling intra-speaker variability for speaker recognition. Proc. Interspeech'05, pages 2177-2180, Lisbon, Sep 2005.

2. B. Baker, R. Vogt and S. Sridharan. Modelling session variability in text-independent speaker verification. Proc. Interspeech'05, pages 3117-3120, Lisbon, Sep 2005.

3. V. Ramasubramanian and Amitav Das. Text-dependent speaker-recognition - A survey and state-of-the-art. Tutorial at ICASSP'06, Toulouse, France, May 2006.

4. V. Ramasubramanian, Amitav Das, and V. Praveen Kumar. Text-dependent speaker-recognition using one-pass dynamic programming algorithm. In Proc. ICASSP'06, pp. I-901-I-904, Toulouse, France, May 2006.


5. V. Ramasubramanian et al. Text-dependent speaker-recognition systems based on one-pass dynamic programming algorithm. In Proc. IEEE Odyssey 2006: The Speaker and Language Recognition Workshop, Puerto Rico, June 2006.


Some Experiments on Music Classification

Debrup Chakraborty

Centro De Investigacion Y De Estudios Avanzados Del I.P.N., Av. Instituto Politecnico Nacional 2508,

Col. San Pedro Zacatenco, Mexico, D.F., C.P. 07360 E-mail: [email protected]

In this paper we present some experiments on music classification. We consider four categories of music and extract various features from them. These features have previously been used in certain other kinds of audio classification and speech recognition studies. We study the usefulness of such features in the case of music classification. The current study uses three classification techniques: k-nearest neighbors, multilayer perceptrons and support vector machines. The results obtained by these methods are quite encouraging.

1. Introduction

With the rapid increase in the speed and capacity of computers and networks, we are every day being flooded with digital data of numerous types. Audio data in digital format is one type of data which has become very commonplace, and many kinds of applications are commercially available with the capability to handle digital audio data. There are numerous applications, like personal music jukeboxes, digital music libraries etc., which require proper categorization of music. Also, the internet is full of digital audio, and today's users, who are accustomed to searching, scanning and retrieving text data through popular search engines, may be frustrated by the inability to look inside audio objects. Conventional information retrieval techniques are generally meant for the search of text documents. The classic IR problem is to locate desired text documents using a search query consisting of a number of keywords. Audio, being an opaque collection of bytes with just very primitive fields attached to it, cannot be categorized using the classic IR techniques. The problem becomes more open-ended when one considers audio, such as music, which may contain no speech.

While text retrieval and content-based image retrieval have gained much importance in recent years, with a lot of research directed at them, activity in the field of audio retrieval and classification is relatively new. An important work has recently been done by Wold et al.,17 represented by their audio retrieval system called "Muscle Fish". This work distinguishes itself from previous audio retrieval work4-6 in its capability to do content-based retrieval. In the Muscle Fish system, various perceptual audio features like loudness, brightness, pitch and timbre are used to represent sound. A normalized Euclidean distance and a nearest neighbor (NN) rule2 are used to classify the query sound into one of the sound classes in the database. In another work, by Liu et al.,11 similar features and subband ratios are used. The separability of different classes is evaluated in terms of intra- and inter-class scatters to identify highly correlated features, and classification is done by using a neural network. Foote7 used 12 mel-frequency cepstral coefficients (MFCCs) plus energy as the audio features. A tree-structured vector quantizer is used to partition the feature vector space into a discrete number of regions or bins. Euclidean or cosine distances between histograms of sounds are compared, and the classification is done by using the NN rule. In,12 a filter bank consisting of 256 phase-compensated gammatone filters proposed by Cook3 is used to extract audio features. A very recent work by Guo and Li10

uses a combination of perceptual features like total power, subband powers, bandwidth, brightness and pitch and the MFCCs. They use a support vector machine16 as their classifier.

In this paper we present some experiments on music classification using different machine learning tools. This study is different from other studies like10 in the sense that we attempt to classify music into categories. For our study we have selected music from four different categories, namely Rabindra Sangeet (a form of vocal music initiated by Rabindranath Tagore, which is very popular among the Bengali-speaking people), Indian Classical Instrumental, Western Classical and English Rock. We used basically three tools: the k-NN classifier, the multilayer


perceptron and the support vector machine, for classification. The results obtained using all three tools are quite encouraging.

2. Data Collection and Feature Selection

We collected music files of four classes, namely Rock, Rabindrasangeet, Indian Classical and Western Classical. We collected this music from CDs. The normal attributes of CD music are 44.1 kHz, 128 kbps, PCM signed 16-bit stereo. We resampled the music data at 8000 Hz and saved it as PCM signed 16-bit, 128 kbps, mono wave files. Then we selected and extracted one-minute music clips from the resampled files and saved them, thereby creating our database. The whole idea of resampling at a lower rate is to drastically reduce the file size so that we face less computational overhead in the testing phases. In this way we collected 40 rock music clips, 48 rabindrasangeet clips, 50 Indian classical and 50 western classical music clips, giving a total of 188 music clips. This resampling brings down the quality of the music, but on the other hand the resampled frequency which we used is enough for a human listener to appreciate the music.

We first convert the wave file of each music clip into binary data representing the signal. As each music clip is of 60 seconds duration and the sampling rate is 8 kHz, each clip will contain 480000 samples. We divide the total signal into small frames, where each frame contains 256 samples. The frames are created such that two adjacent frames have a 25% overlap between them. We consider a frame length of 256 as it is a power of two, and thus the FFT algorithm can be used to calculate the discrete Fourier transform for each frame.
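A minimal sketch of this framing step follows (60 s at 8 kHz gives 480000 samples; a 256-sample frame with 25% overlap corresponds to a 192-sample hop; reading the wave file into a numpy array is assumed to be done elsewhere):

    import numpy as np

    def frame_spectra(signal, frame_len=256, overlap=0.25):
        hop = int(frame_len * (1 - overlap))   # 192-sample hop
        n_frames = 1 + (len(signal) - frame_len) // hop
        frames = np.stack([signal[i * hop : i * hop + frame_len]
                           for i in range(n_frames)])
        return np.fft.rfft(frames, axis=1)     # one spectrum per frame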

The various features can then be computed on each frame. The definitions of the different features which we use are given in the following subsections. In the discussion below, $F(\omega)$ denotes the Fourier transform of the resampled music signal. The features that we discuss next have previously been used in10 for audio classification of various kinds of sounds (but not music).

2.1. Perceptual Features

Total Spectrum Power

The spectrum power (SP) is defined as

$SP = \int_0^{\omega_0} |F(\omega)|^2 \, d\omega \qquad (1)$

where $|F(\omega)|^2$ is the power at the frequency $\omega$ and $\omega_0 = 4000$ Hz is the half sampling frequency. We use the logarithm of SP as the feature. Thus, the spectrum power for us is:

$P = \log\left(\int_0^{\omega_0} |F(\omega)|^2 \, d\omega\right) \qquad (2)$

Subband Powers The frequency spectrum is divided into four subbands with intervals

$[0, \tfrac{\omega_0}{8}], \; [\tfrac{\omega_0}{8}, \tfrac{\omega_0}{4}], \; [\tfrac{\omega_0}{4}, \tfrac{\omega_0}{2}], \; [\tfrac{\omega_0}{2}, \omega_0].$

The logarithmic subband power $P_j$ is used, where

$P_j = \log\left(\int_{L_j}^{H_j} |F(\omega)|^2 \, d\omega\right) \qquad (3)$

where $L_j$ and $H_j$ are the lower and upper bounds of subband $j$.

Brightness The brightness is the frequency centroid, defined as

$\omega_c = \frac{\int_0^{\omega_0} \omega \, |F(\omega)|^2 \, d\omega}{\int_0^{\omega_0} |F(\omega)|^2 \, d\omega} \qquad (4)$

Bandwidth Bandwidth B is the square root of the power-weighted average of the squared difference between the spectral components and the frequency centroid,

$B = \sqrt{\frac{\int_0^{\omega_0} (\omega - \omega_c)^2 \, |F(\omega)|^2 \, d\omega}{\int_0^{\omega_0} |F(\omega)|^2 \, d\omega}} \qquad (5)$
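A sketch of these per-frame perceptual features, Eqs. (2)-(5), computed from the discrete power spectrum of one frame (spec is one rfft output from the framing sketch above; the small additive constants guarding the logarithms are our addition):

    import numpy as np

    def perceptual_features(spec, fs=8000):
        power = np.abs(spec) ** 2
        freqs = np.linspace(0.0, fs / 2, len(power))        # 0 .. 4000 Hz
        total = power.sum() + 1e-12
        P = np.log(total)                                   # Eq. (2)
        w0 = fs / 2
        edges = [0, w0 / 8, w0 / 4, w0 / 2, w0 + 1]         # four subbands
        Pj = [np.log(power[(freqs >= lo) & (freqs < hi)].sum() + 1e-12)
              for lo, hi in zip(edges[:-1], edges[1:])]     # Eq. (3)
        wc = (freqs * power).sum() / total                  # Eq. (4)
        B = np.sqrt(((freqs - wc) ** 2 * power).sum() / total)  # Eq. (5)
        return [P, *Pj, wc, B]                              # 7 values per frame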

2.2. Mel Frequency Cepstral Coefficients

MFCCs have been the dominant features used for speech recognition for some time. Their success has been due to their ability to represent the speech amplitude spectrum in a compact form. Each step in the process of creating MFCC features is motivated by perceptual or computational considerations. We examine these steps in more detail in the following paragraphs.

The first step is to divide the speech signal into frames, usually by applying a windowing function at fixed intervals. The aim here is to model small


sections of the signal that are statistically stationary. We generate a cepstral feature vector for each frame.

The next step is to take the Discrete Fourier Transform (DFT) of each frame. We then retain only the logarithm of the amplitude spectrum. We discard phase information because perceptual studies have shown that the amplitude of the spectrum is much more important than the phase. We take the logarithm of the amplitude spectrum because the perceived loudness of a signal has been found to be approximately logarithmic.

The next step is to smooth the spectrum and emphasize perceptually meaningful frequencies. This is achieved by collecting all the spectral components into a few frequency bins, which is realized by passing the coefficients of the power spectrum through a triangular bandpass filter bank. The filter bank consists of K = 19 triangular filters. These filters are not equally spaced in frequency, as it has been found that the lower frequencies are perceptually more important than the higher frequencies. Therefore the filter spacing follows the so-called 'Mel' frequency scale. The mel scale is based on a mapping between actual frequency and perceived pitch, as the human auditory system apparently does not perceive pitch in a linear manner. The mapping is approximately linear below 1 kHz and logarithmic above. The original frequency $f$ and the Mel frequency $f_{Mel}$ are related by the following equation:

$f_{Mel} = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \qquad (6)$

The triangular filters in the filter bank have a constant mel-frequency interval, and cover the frequency range of 0 Hz to 4000 Hz.

Denoting the output of the filter bank by $S_k$ ($k = 1, 2, \ldots, K$), the MFCCs are calculated as

$c_n = \sqrt{\frac{2}{K}} \sum_{k=1}^{K} (\log S_k) \cos\left[\frac{n(k-0.5)\pi}{K}\right], \quad n = 1, 2, \ldots, L \qquad (7)$

where L is the order of the cepstrum.
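Given a mel filter bank built on the scale of Eq. (6) (assumed to be available as a K x n_bins matrix named filterbank; the construction itself is omitted), Eq. (7) reduces to a few lines:

    import numpy as np

    def mfcc(power_spectrum, filterbank, L=5):
        K = filterbank.shape[0]                   # K = 19 filters here
        S = filterbank @ power_spectrum           # filter-bank outputs S_k
        logS = np.log(S + 1e-12)                  # guarded log, our addition
        n = np.arange(1, L + 1)[:, None]          # orders n = 1..L
        k = np.arange(1, K + 1)[None, :]
        basis = np.cos(n * (k - 0.5) * np.pi / K)
        return np.sqrt(2.0 / K) * (basis @ logS)  # c_1 .. c_L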

3. Our Computed Features

We discuss the three sets of features here.

perp: This feature set contains the perceptual features. For each frame we calculate the total power, the 4 subband powers, the brightness and the bandwidth. Thus for each frame we get seven numerical values, which encode the various perceptual features. We take the mean and standard deviation of each of these features over all frames. This gives rise to 14 features. We call this feature set perp.

mfcc: We calculate the MFCCs for each frame. We consider the order of the cepstrum (L) to be 5. Thus for each frame we get five numerical values which characterize the MFCCs of five different orders. We calculate the mean and standard deviations of these 5 MFCCs over all frames, to get 10 features for each clip. We call this set of 10 features mfcc.

perpmfcc: We augment the feature sets perp and mfcc to get a new set of features called perpmfcc. Thus, perpmfcc contains all the features in perp and mfcc, i.e., 24 (14 + 10) features.

We use these three feature sets for development of the classifiers.
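Assembling the three clip-level sets from the per-frame values then amounts to taking means and standard deviations over frames (a sketch; perc is an n_frames x 7 array and ceps an n_frames x 5 array, names illustrative):

    import numpy as np

    def clip_features(perc, ceps):
        perp = np.concatenate([perc.mean(axis=0), perc.std(axis=0)])  # 14
        mfcc = np.concatenate([ceps.mean(axis=0), ceps.std(axis=0)])  # 10
        perpmfcc = np.concatenate([perp, mfcc])                       # 24
        return perp, mfcc, perpmfcc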

4. Results

4.1. Results Using k-NN Classification

Table 1 gives the results of classification obtained by using the k-nearest neighbor algorithm for the various feature sets, using different values of k. The results presented are of 10-fold cross validation. Table 1 reveals that the best performance is obtained by the mfcc features. The perpmfcc features do not perform as well as the mfcc features, and the perp features perform the worst. But it is to be noted that even using the perp features we get an average performance just below 60%, which is quite good. By using the mfcc features we get an average performance of 68.45%, and the best performance obtained by this classifier is 89.47%, which really is a high performance.

The performance degrades with the increase in k in most cases; we get the best results by using k = 3. The better performance shown by the mfcc features may also be due to the smaller dimensionality of this feature set. It is known that the k-NN algorithm does not perform well on high-dimensional data.

4.2. Results using MLP

Here we present the results of classification using various features and various network configurations of an MLP. We used MLPs with sigmoidal activation functions and trained them with the Levenberg-Marquardt algorithm, using the neural-network toolbox of MATLAB 6 for the implementation.


Table 1. Results using K-nearest neighbour (classification accuracy in %)

                perp                   mfcc                   perpmfcc
No. of NN   mean    std    best    mean    std    best    mean    std    best
3           57.92   8.20   73.68   68.45   13.99  89.47   59.50   9.60   73.68
5           53.00   9.73   68.42   64.76   9.58   78.94   54.05   10.24  68.42
7           58.85   12.74  78.94   64.30   13.72  84.21   59.44   11.94  73.68
9           59.44   12.20  73.68   62.66   15.16  89.47   60.49   11.74  73.68
11          58.39   8.44   68.42   62.47   16.63  78.94   60.49   7.19   68.42

Table 2. Results of classification using MLP (classification accuracy in %)

                  perp                   mfcc                   perpmfcc
Hidden Nodes  mean    std    best    mean    std    best    mean    std    best
5             69.17   5.43   76.31   68.19   8.15   78.23   70.79   7.12   77.89
10            70.51   6.61   80.00   68.91   8.34   81.05   74.16   5.89   81.57
15            73.72   8.07   84.21   68.14   8.59   78.42   74.93   4.98   81.57
20            74.48   8.41   84.73   69.21   9.77   82.10   77.39   7.38   84.21


Table 2 shows the classification performance using the MLP for various numbers of hidden nodes. The values in the table represent classification accuracy in percentage for 10-fold cross validation. For each validation set, we trained 20 MLPs and took the average performance of the best 10 among them as the accuracy for a single validation set. We did this for 10 independent validation sets, and the values in Table 2 are the averages over the 10 validation sets.

Table 2 shows that perpmfcc performs better than the other two feature sets for almost all configurations of the network, with the performance of the mfcc features almost comparable to that of perpmfcc. A gain in performance is also observed with an increase in the number of hidden nodes. But with the increase in the number of hidden nodes, the standard deviation of the performance also increases, which hints that the process gets more unstable as hidden nodes are added.

The results with the k-NN classifier suggested that mfcc produces the best characterization in terms of classification; the augmented version of mfcc, i.e., the perpmfcc features, could not increase the performance. This was quite counter-intuitive: the perceptual features do have some power of characterization, as the results reveal, so it was expected that a feature set which used both the perceptual and the MFCC features would give further better results. This was not seen in the case of k-NN, probably because of the inability of the k-NN algorithm to deal with high-dimensional data. The results that we obtain using the MLP support our intuition that perpmfcc has a better characterization than perp and mfcc have individually.

The MLP produces a better average performance than the k-NN method, but the best performance provided by k-NN is higher than that obtained using the MLP.

4.3. Results using SVM

Table 3 shows the results of the SVM classifier using a Gaussian kernel for the various features. We use the one-versus-all strategy for multiclass classification. Here too the best average performance is obtained by using the perpmfcc features, and the average performance is well above that of the other two classifiers discussed in the previous subsections.

Table 3. Results of classification using Support Vector Machine (%)

            mean      std      max
perp        74.0248   9.8731   84.2105
mfcc        68.5759   9.1983   84.2105
perpmfcc    76.6563   6.4517   84.2105
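A minimal sketch of this evaluation protocol using scikit-learn follows; the paper does not report the kernel width or regularization, so the hyperparameters below are illustrative assumptions:

    from sklearn.model_selection import cross_val_score
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    def evaluate_svm(X, y):
        # one-vs-all SVM with a Gaussian (RBF) kernel, 10-fold CV
        clf = make_pipeline(StandardScaler(),
                            OneVsRestClassifier(SVC(kernel='rbf', C=1.0)))
        scores = cross_val_score(clf, X, y, cv=10)
        return scores.mean(), scores.std(), scores.max()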

5. Conclusion and Discussions

In this paper we presented some experimental results involving classification of music. We restricted our


study to the classification of 4 music categories, namely Western Classical, Indian Classical Instrumental, Rock and Rabindra Sangeet. We performed our experiments on small music clips of 1 min. duration from the four categories. We have used two types of features, the perceptual features and the mel-frequency cepstral coefficients, for characterization of the clips. We used three types of classifiers: k-NN, MLP and SVM. Our studies show that all these classifiers achieve satisfactory classification rates on the features considered. The SVM performs the best among the three, followed by the MLP and the k-NN. Our studies show that the perceptual features and the MFCCs can characterize music, and these features can be used to design classification/retrieval systems. Also, any of the three classification strategies used here will serve as a good classifier.

There is plenty of scope for further experimentation on the issue of the best features. A possibly good set of features would be "chaos" based features, which may include the fractal dimension, the Lyapunov exponent, the embedding dimension,1 etc. Experiments should be performed to assess the feasibility of such features.

The current study involves calculation of features from uncompressed WAVE files. But the popular file format of today is MP3, which uses a lossy compression algorithm. A challenge would be to develop algorithms for feature extraction in the compressed domain.

References

1. H. D. I. Abarbanel, Analysis of Observed Chaotic Data, Springer, 1995.

2. J. C. Bezdek, J. Keller, R. Krishnapuram and N. R. Pal, Fuzzy Models and Algorithms for Pattern Recognition and Image Processing Kluwer, Massachusetts, 1999.

3. M. P. Cook, Modelling Auditory Processing and Organization, Cambridge, U.K., Cambridge University Press, 1993.

4. B. Feiten and S. Gunzel, "Automatic indexing of a sound database using self organizing neural-nets", Computer Music Journal, vol 18, no. 3, pp. 53-65, 1994.

5. B. Feiten and T. Ungvary, "Organizing sounds with neural networks", presented at the Proc. 1991 International Computer Music Conference, San Francisco, CA, 1991.

6. S. Foster, W. Schloss and A. J. Rockmore, "Towards an intelligent editor of digital audio: Signal Processing Methods", Computer Music Journal, vol. 6, no. 1, pp. 42-51, 1982.

7. J. Foote et al., "Content based retrieval of speech and audio", in Proc. SPIE Multimedia Storage and Archiving Systems II, vol. 3229, C. C. J. Kuo et al., Eds., pp. 138-147, 1997.

8. J. Foote, "An overview of audio information retrieval", ACM-Springer Multimedia Systems, vol 7, no 1, pp. 2-11, 1999.

9. W. Frakes and R. Baeza-Yates, Information Retrieval: Data Structures and Algorithms, Prentice Hall, New Jersey, 1992.

10. G. Guo and S. Z. Li,"Content based audio classification and retrieval by support vector machines", IEEE Transactions on Neural Networks, vol 14, no. 1, pp. 209-215, 2003.

11. Z. Liu, J. Huang, Y. Wang and T. Chen, "Audio feature extraction and analysis for scene classification", in IEEE Signal Processing Society Workshop on Multimedia Signal Processing, 1997.

12. S. Pfeiffer, S. Fischer and W. Effelsberg, "Automatic audio content analysis", University of Mannheim, Mannheim, Germany, Tech. Report 96-008, 1996.

13. L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, 1993.

14. S. V. Rice, "Audio and video retrieval based on audio content", White paper, Comparisonics, Grass Valley, CA, USA, 1998.

15. C. J. V. Rijsbergen, Information Retrieval, Butterworths, London, 1979.

16. V. N. Vapnik, Statistical Learning Theory, New York, Wiley, 1998.

17. E. Wold, T. Blum, D. Keislar and J. Wheaton, "Content based classification, search and retrieval of audio", IEEE Multimedia Magazine, vol 3, pp. 27-36, 1996.


Text Independent Identification of Regional Indian Accents in Spoken Hindi

Kamini Malhotra* and Anu Khosla

Scientific Analysis Group DRDO, Metcalfe House, Delhi 110054

E-mail: [email protected]*

In this paper an approach to text-independent identification of four regional Indian accents in spoken Hindi is proposed. The accents worked upon are Kashmiri, Manipuri, Bengali and neutral Hindi. A Gaussian Mixture Model (GMM) approach has been employed to avoid the need for speech segmentation during training. The results show that GMMs lend themselves to the accent identification task very well. In this approach only spectral features have been incorporated, in the form of Mel Frequency Cepstral Coefficients (MFCC). The approach has wide scope for expansion to incorporate other regional accents in a very simple way.

1. Introduction

Automatic accent classification is a recently emerging research topic in speech processing, evolving mainly as a tool for improving the performance of speech recognizers. A less dwelled-upon application, however, is the improvement of speaker identification systems. An accent classifier can be used effectively to narrow down the search within a speaker search data set and to determine the identity of the speaker more efficiently, as accent is a very prominent characteristic of a speaker.

The performance of an automatic speech recognizer deteriorates substantially if the input is accented speech on which the recognizer has not been trained. Until recently, the compensation required in speech recognizers to accommodate different accents of speakers has been viewed purely as a signal-processing problem of compensating for differences in the acoustic signals. However, the variation in the speech signal caused by different accents is fairly systematic and can be dealt with more powerfully using signal-specific processing. Different accents give rise to several differences in the acoustic realization of a speech utterance at various levels, starting from words, syllables and phonemes. These differences become obvious at the spectral and prosodic levels and can be well exploited for accent identification tasks.

Most studies in this area of automatic identification of non-native accents have been done for speech recognition performance enhancement. These studies suggest that to improve speech recognition of non-native speech, it is necessary to develop effective automatic detection methods for dialect and accent. John Hansen and Levent Arslan,1 at the Robust Speech Processing Laboratory of Duke University, made several important contributions to the study of foreign accent recognition. Their first efforts in the area involved adapting a source-generator based algorithm for stress detection to the problem of accent recognition.2 The algorithm uses prosodic features (including the fundamental frequency f0, df0/dt, energy, and smoothed energy) to build a 3-state Hidden Markov Model (HMM) codebook for each accent class. Later work showed that the second and third formant frequencies (f2 and f3) are also good sources of information for distinguishing accents. Most existing work on accent classification employs phoneme information. Miller and Trischitta used the average cepstral vectors of phonemes to distinguish different regional dialects in the US.3 Angkititrakul and Hansen4 introduced the Stochastic Trajectory Model (STM) for each phoneme to classify different foreign accents of English. Fung and Wai Kat proposed a phoneme-class HMM for fast accent identification, in which one HMM is shared by several phonemes within the same phoneme category.5 Lincoln et al. described a phonotactic model for the classification of accents.6

introduced the Stochastic Trajectory Model (STM) for each phoneme to classify different foreign accents of English. Fung and Wai Kat proposed a phoneme-class HMM for fast access identification, in which one HMM is shared by several phonemes within the same phoneme category.4 Lincoln, et al., described a phonotactic model for the classification of the accents.6

However, the phoneme approach suffers from drawbacks. Phoneme classification is itself a difficult task and requires training for different languages. The pure phoneme recognition rate (without the knowledge of grammar or vocabulary) is only about 60%. The phoneme-based approach is viable for speech-recognition kinds of applications, but for text-independent applications, like narrowing down a search for a speaker, GMM-based techniques become more appropriate.


Most of the studies cited above have tried to classify accented (Australian, Arabic, Chinese, German, Japanese etc.) English, but not much work has been reported in the context of regional Indian accents. In this paper, a GMM-based text-independent technique is proposed for the identification of Hindi spoken in four regional Indian accents: Kashmiri, Bengali, Manipuri and neutral Hindi. A GMM-based approach was chosen since the primary application of the classification is text-independent speaker characterization. Text independence does not constrain the spoken text to be pre-specified (i.e., there is no previous knowledge as to what the speaker is about to say) and hence precludes the use of phoneme-sequence specific approaches like Hidden Markov Models (HMM). The latter approaches are more suited to speech recognition applications, where there is strong prior knowledge of the text being spoken. Also, it has been shown7 that these complex models do not provide any advantage over GMMs for speaker characterization tasks. Apart from the above, a spin-off of using Gaussian mixtures to model the accents is the reduction in computational complexity, as the utterances are not required to be broken down into constituent phonemes. So the advantages of using GMMs are that they are computationally less expensive, are based on well-understood statistical models, and, for text-independent tasks, are insensitive to the temporal aspects of the speech.

In the rest of the paper, section 2 gives background on the manifestation of accent in the perceptual and acoustical domains. The same section also describes the proposed accent identification system and the use of GMMs in identification. Section 3 details the identification experiments and section 4 gives the results.

2. Accent Identification

Every individual develops a characteristic speaking style at an early age that depends largely on his language environment (i.e., the native language spoken), as well as the region where the language is spoken. Previous studies in language education have shown that one develops a speaking style while acquiring language skills up to the age of sixteen, which consists of phoneme production, articulation, tongue movement and other physiological phenomena related to

the vocal tract. In general, the speaker preserves this speaking style when speaking other languages. This gives rise to accent in a spoken utterance. Accent therefore can be defined as the relative prominence of a particular syllable or a word in pronunciation determined by a regional or social background of a speaker. The accent appears because speakers with foreign accents import some of the acoustic and phonological features from their first languages into the speech production process. Accented speakers will modify the articulation of the target language to a certain degree by substitutions and approximations from the phoneme set of their first language. These deviations from the phoneme set of the target language would be especially useful for the identification of the speaker accent.

The type of accent exhibited in foreign-language pronunciation depends on a number of speaker-related factors, such as (i) the age at which a speaker learns the second language, (ii) the nationality/region of the speaker's language instructor, and (iii) the amount of interactive contact the speaker has with native talkers of the foreign language. To achieve reasonable accuracy in accent identification it is necessary to understand how dialects vary. Language dialects vary in the following ways:
(i) Phonetic realization of vowels and consonants
(ii) Phonotactic distribution (e.g. rhotic in farm: /farm/ vs. /fa:m/)
(iii) Phonemic system (the number of phonemes used)
(iv) Lexical distribution
(v) Rhythmical characteristics:

• Syllable boundaries (e.g. sel#fish vs. self#ish)
• Pace (average number of syllables uttered per second)
• Lexical stress
• Intonation
• Voice quality (e.g. creaky voice vs. breathy voice)

From these studies it can be inferred that the domains which will contribute to the analysis and modeling of effective techniques for accent classification are the prosody structure and the spectral acoustic structure.


2.1. Accent Identification System

The following figure gives the block diagram of the proposed accent identifier.

Fig. 1. Proposed Accent ID System (training phase: speech region detection, MFCC extraction, one GMM per accent; testing phase: speech region detection, MFCC extraction, ML classifier, identified accent)

The system is designed to operate in text independent mode, so no specified text is required for the training phase. In the training phase GMMs are trained on features extracted from the data of the four accents. The input speech of each accent is processed to separate speech regions from pause regions; features are extracted from the speech regions only, as the pause regions carry no accent specific information about the speaker. Of the two kinds of features mentioned above, viz. spectral and prosodic, the former are used in this study. Mel Frequency Cepstral Coefficients are extracted from the training input to represent the spectral content. Gaussian mixture models are trained on these features and one GMM per accent is obtained. In the testing phase the same features are extracted from the test data and the likelihood of the test feature vectors under each GMM is calculated. The accent of the GMM showing the maximum likelihood is the identified accent of the input speech.

2.2. Mel Frequency Cepstral Coefficients

Mel Frequency Cepstral Coefficients (MFCCs) form a very popular feature set that parameterizes the speech spectrum for speaker characterization tasks. The representation is based on the nonlinear perception of the frequency of sounds, placing less emphasis on high frequencies and more emphasis on low frequencies. MFCCs are obtained by taking the inverse DCT of the Mel warped log spectrum of the windowed input.
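As an illustration, here is a minimal MFCC extraction sketch, assuming the librosa library (the paper does not name an implementation); n_mfcc = 12 matches the 12-dimensional features used in Section 3.

```python
import librosa

def extract_mfcc(wav_path, n_mfcc=12):
    # Load at the file's native sampling rate.
    y, sr = librosa.load(wav_path, sr=None)
    # librosa windows the signal, Mel-warps the log power spectrum and
    # applies a DCT to obtain the cepstral coefficients.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # shape: (num_frames, n_mfcc)
```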

2.3. Gaussian Mixture Models and Maximum Likelihood

Gaussian Mixture Models have proven to be a powerful tool for distinguishing acoustic sources with different general properties. Their major advantage lies in the fact that they do not rely on any segmentation of the speech signal, which makes them ideal for online applications.

In the accent identification task, a GMM λ_A is created for each accent A. Under the GMM assumption, the likelihood of a feature vector v_k under model λ_A is represented by a weighted sum of N Gaussian densities:

p(v_k \mid \lambda_A) = \sum_{i=1}^{N} w_i b_i(v_k)    (1)

where b_i(v_k) are the component mixture densities and w_i are the mixture weights. The accent model λ_A is expressed by

\lambda_A = \{w_i, \mu_i, \sigma_i\}    (2)

where i is the mixture index, μ_i is the mean of the ith density and σ_i is its variance.

During identification, an unknown speech utterance is represented by a sequence of feature vectors. The log-likelihood that the given sequence of feature vectors belongs to a model is then calculated for each accent model. The log-likelihood L_A is defined as

L_A = \sum_{k=1}^{K} \log p(v_k \mid \lambda_A)    (3)

where k is the MFCC frame index and K is the total number of frames in the utterance. Finally, a maximum-likelihood classifier hypothesizes \hat{A} as the accent of the unknown utterance, where

\hat{A} = \arg\max_{1 \le A \le 4} L_A    (4)
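A hedged sketch of the training and identification procedure defined by Eqs. (1)-(4), assuming scikit-learn's GaussianMixture (the paper does not name a toolkit); the 128 diagonal-covariance components match the configuration given in Section 3.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_accent_models(features_per_accent, n_components=128):
    # features_per_accent: dict mapping accent name -> (num_frames, 12) MFCC array.
    models = {}
    for accent, feats in features_per_accent.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", max_iter=100)
        gmm.fit(feats)  # EM training on the pooled MFCC frames of this accent
        models[accent] = gmm
    return models

def identify_accent(models, test_feats):
    # score_samples returns per-frame log p(v_k | lambda_A); summing over
    # frames gives L_A of Eq. (3), and the argmax implements Eq. (4).
    scores = {a: gmm.score_samples(test_feats).sum() for a, gmm in models.items()}
    return max(scores, key=scores.get)
```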

3. Identification

The training data was a set of Hindi utterances of 40 speakers from each region: Kashmir, Manipur, West Bengal and Uttar Pradesh. The data was collected by visiting the regions and selecting people who had their basic upbringing in that region, so as to ascertain the presence of the characteristic accent.

The following steps were used to implement the accent identifier system in accordance with the block diagram of Figure 1:
(1) Speech regions were identified in the input utterance using autocorrelation and energy values.
(2) Feature extraction: 12-dimensional MFCCs were obtained from the speech regions.
(3) Training was carried out using 10 speakers for each accent (both male and female), with 5 seconds of speech per speaker, giving a total training duration of 50 s per accent.
(4) One GMM with 128 Gaussians and a diagonal covariance matrix was obtained per accent using the EM (Expectation Maximization) algorithm.
(5) Testing was carried out on 5-second segments of input speech from 40 speakers of each accent, with classification based on the maximum likelihood criterion.

4. Results and discussion

Fig. 2. Bar chart of identification results for the four accented inputs.

The chart in Fig. 2 gives the identification scores. Kashmiri has the highest correct identification score (97%). The confusion is maximum between Bengali and Manipuri, as the two regions share the same script and are also geographically very close. The Kashmiri accent identification score tops the chart because the language is entirely different and the region is geographically separated from the other regions.

The results show that GMMs are suitable for the accent identification task, with a very simple implementation in text independent mode. The training data is a very crucial part of the system; the choice of subjects is also crucial, so as to ensure that the desired accents are present in their utterances.

The results can be improved by using gender-specific GMMs and by using prosodic features in conjunction with the spectral features. With training data for other regional accents, the approach can be extended without much increase in complexity.

Acknowledgments

We are thankful to our Director, Dr. P. K. Saxena and Divisional Head Dr. S. S. Bedi for encouraging us to carry out this work and allowing us to present this paper.

References

1. L. Arslan and J. Hansen: Language accent classification in American English, Speech Communication, 18:353-367, 1996.

2. L. Arslan and J. Hansen: Foreign Accent Classification Using Source Based Prosodic Features, Robust Speech Processing Laboratory, University of Colorado, USA.

3. D. R. Miller and J. Trischitta: Statistical Dialect Classification Based on Mean Phonetic Features, Proc. of ICSLP, vol. 4, Philadelphia, USA, pp. 2025-2027, 1996.

4. P. Angkititrakul and J. H. L. Hansen: Stochastic Trajectory Model Analysis for Accent Classification, Proc. of ICSLP, vol. 1, Denver, Colorado, USA, pp. 493-496, 2002.

5. P. Fung and W. K. Liu: Fast Accent Identification and Accented Speech Recognition, Proc. of ICASSP, Phoenix, Arizona, USA, 1999.

6. M. Lincoln et al.: A Comparison of Two Unsupervised Approaches to Accent Identification, Proc. of ICSLP, Sydney, Australia, December 1998.

7. F. Bimbot et al.: A Tutorial on Text Independent Speaker Verification, EURASIP Journal on Applied Signal Processing, 2004.

8. C. Teixeira, I. Trancoso and A. Serralheiro: Recognition of non-native accents, Proc. Eurospeech'97, Rhodes, Greece, pp. 2375-2378, Sept 1997.

9. Xiaofan Lin and Steven Simske: Phoneme-less Hierarchical Accent Classification (web).

10. K. Kumpf and R. King: Automatic accent classification of foreign accented Australian speech, Proc. ICSLP'96, Philadelphia, PA, pp. 1740-1743, Oct 1996.


PART L

Texture Analysis


An Efficient Approach for Texture Classification with Multi-Resolution Features by Combining Region and Edge Information Using a Modified CSNN

Lalit Gupta and Sukhendu Das

Visualization and Perception Lab, Department of Computer Science and Engineering,

IIT Madras Chennai-600036, India.

E-mail: [email protected], [email protected]

In this paper, we propose an efficient approach for texture segmentation by integrating region and edge information. The algorithm uses a constraint satisfaction neural network (CSNN) for texture segmentation with additional edge constraints. Initial class probabilities (segmented map) and edge maps are obtained from the image using two stages of multi-channel, multi-resolution filters. The complementary information of the segmented map and the edge map are iteratively updated using a modified CSNN to satisfy a set of constraints to obtain superior segmentation results. The proposed methodology is tested on simulated as well as natural textures and provides satisfactory results.

Keywords: Texture, Dynamic window, Constraint Satisfaction Neural Network (CSNN), Fuzzy C-Means (FCM), Discrete Wavelet Transform (DWT), Discrete Cosine Transform (DCT)

1. Introduction

Texture plays an important role in low-level image analysis. It is the fundamental characteristic of natural images that, in addition to color, plays an important role in human visual perception and provides information for image understanding and scene interpretation.1 Texture segmentation deals with the identification of regions where distinct textures exist.2 Image segmentation methods are based on two basic properties of pixels in relation to their local neighborhood: discontinuity and similarity. Pixel discontinuity gives rise to boundary-based methods, whereas pixel similarity gives rise to region-based methods. It is often difficult to obtain satisfactory results using only one of these methods in the segmentation of complex pictures such as textures.

Salotti and Garbay3 noted that it is possible to improve the results by using the complementary nature of edge-based and region-based information. A large amount of work on the fusion of edge and region information4 has been reported in the literature, but texture properties have rarely been considered, with a few exceptions.1 Munoz et al.4 presented a review of strategies for image segmentation combining region and boundary information. That paper notes that not many algorithms have been presented for combining edge and region information for texture segmentation, and points out the advantages and disadvantages of the various approaches to combining edge and region information for segmentation.

for segmentation. Munoz et al.1 proposed a strategy for texture segmentation which integrates region and boundary information. The algorithm uses the contours of the image in order to initialize, in an unsupervised way, a set of active regions. Then, regions compete for the pixels maximizing an energy function which takes into account both region and boundary information. The results are shown on three images composed of four texture regions each.

The work presented in this paper uses a sequential combination of features from discrete wavelet transform and discrete cosine transform for region based texture segmentation. This representation gives more discrimination ability in feature space. This region based information is combined with edge information using a modified CSNN. In addition to constraints based on region, edge constraints are also employed into the network. The rest of the paper is organized as follows: Section 2 describes the algorithm used to extract region based information. Section 3 gives the background of CSNN. Section 4 describes the proposed algorithm. Section 5 discusses experimental results and Section 6 concludes the paper.

2. Region segmentation map using multi-resolution features

The initial region-based segmentation map is obtained using the methodology described in Fig. 1. First the image is filtered using a sequential combination of Discrete Wavelet Transform (DWT) and Discrete Cosine Transform (DCT). The same DCT coefficients as used by Ng et al.5 are used for filtering. The filter responses are post-processed by a set of local energy functions, which rectify the filter responses and remove local fluctuations. These non-linear functions consist of two stages: (1) taking the magnitude, followed by (2) smoothing with a large Gaussian function. The feature vectors computed from the local energy estimates are local means, which represent local texture characteristics. These feature vectors are provided to an FCM classifier to segment the texture patterns in the image.
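As a concrete illustration, the following is a simplified sketch of this pipeline under stated assumptions: pywt and scikit-learn are used (neither is named in the paper), the DCT filtering stage is omitted for brevity, and k-means stands in for the FCM classifier.

```python
import numpy as np
import pywt
from scipy.ndimage import gaussian_filter, uniform_filter
from sklearn.cluster import KMeans

def region_segmentation(image, n_classes=4, sigma=8, win=17):
    # One-level 2-D DWT (Haar assumed) gives four subband "channels".
    cA, (cH, cV, cD) = pywt.dwt2(image.astype(float), "haar")
    feats = []
    for band in (cA, cH, cV, cD):
        # Local energy: magnitude followed by smoothing with a large Gaussian.
        energy = gaussian_filter(np.abs(band), sigma)
        # Local mean of the energy estimate as the texture feature.
        feats.append(uniform_filter(energy, win))
    X = np.stack([f.ravel() for f in feats], axis=1)
    # k-means stands in here for the FCM classifier used in the paper.
    labels = KMeans(n_clusters=n_classes, n_init=10).fit_predict(X)
    return labels.reshape(cA.shape)
```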

Fig. 1. Stages of processing for initial texture classification using features extracted with DWT and DCT: input image → filtering (discrete wavelet transform followed by discrete cosine transform) → non-linearity (absolute value) → smoothing using a Gaussian filter (local energy) → feature vectors → classification → segmented map.

3. Constraint Satisfaction Neural Networks

The Constraint Satisfaction Neural Network (CSNN), as proposed in Ref. 6, consists of K layers of I×J neurons, each neuron corresponding to a pixel, where K is the number of segments (classes). Neurons in each layer hold the probability that the pixel belongs to the segment represented by the corresponding layer. As shown in Fig. 2, each neuron synapses to N neighboring neurons in 3D. The connectivity of neurons within the same layer and across layers is shown in Fig. 2(b)-(c). Synaptic weights are set to update the probabilities towards convergence by inhibitory and excitatory actions.7 The label probabilities are updated after every iteration in a winner-take-all style. The final segmented map is obtained by taking the label corresponding to the winner layer for each pixel in the image. Kohonen's self-organizing maps were used by Lin et al.6 to obtain the initial pixel class probabilities. The kth class probability at pixel (i,j), O_{ijk}, at iteration step t, is updated by a small amount δ if it is a winner class; otherwise it is decremented by δ. H_{ijk} represents the contribution from the neighborhood and participates in adjusting the activation level of the neuron; it is given as:

H_{ijk} = \sum_{(q,r) \in N_{ij}} \sum_{l} W_{ij,qr,k,l} O_{qrl}    (1)

where O_{qrl} represents the output of the (q,r)th neuron in the lth layer, and N_{ij} denotes the neighborhood of the (i,j)th neuron. The weight between the kth layer's (i,j)th neuron and the lth layer's (q,r)th neuron is computed as

W_{ij,qr,k,l} = \frac{1}{P}\left(1 - \frac{2\,|O_{ijk} - O_{qrl}|}{K}\right)    (2)

with K and P representing the total number of layers (the number of segments) and the size of the 2D neighborhood centered at (i,j) respectively. Weights in the CSNN can be interpreted as constraints. Weights are determined based on the heuristic that a neuron excites neighboring neurons representing labels of similar intensities and inhibits neurons representing labels of different intensities.6 The kth layer's (i,j)th neuron is updated in each iteration using a nonlinear updating rule, as follows:

O_{ijk}^{t+1} = \frac{Pos(O_{ijk}^{t} + \Delta O_{ijk}^{t})}{\sum_{l=1}^{K} Pos(O_{ijl}^{t} + \Delta O_{ijl}^{t})}    (3)

where

\Delta O_{ijk}^{t} = \begin{cases} \delta & \text{if } H_{ijk} = \max_{l}(H_{ijl}) \\ -\delta & \text{otherwise} \end{cases}    (4)

Pos(X) = \begin{cases} X & \text{if } X > 0 \\ 0 & \text{otherwise} \end{cases}    (5)
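To make the iteration concrete, here is a schematic NumPy sketch of one update sweep, under simplifying assumptions not made in the paper: a fixed 3×3 neighborhood with wrap-around borders, and a single layer-pair weight matrix W in place of the per-neuron weights.

```python
import numpy as np

def csnn_sweep(O, W, delta=0.04):
    # O: (K, I, J) layer probabilities; W: (K, K) simplified layer-pair weights
    # (the rank-based weights of Eq. (6) and the dynamic window are omitted).
    K = O.shape[0]
    H = np.zeros_like(O)
    # Eq. (1): evidence accumulated from the 3x3 neighborhood in every layer
    # (np.roll wraps at the borders, a simplification).
    for k in range(K):
        for l in range(K):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    H[k] += W[k, l] * np.roll(np.roll(O[l], di, 0), dj, 1)
    # Eq. (4): the winner layer at each pixel gains delta, the rest lose it.
    winner = H.argmax(axis=0)
    dO = np.where(np.arange(K)[:, None, None] == winner, delta, -delta)
    # Eqs. (3) and (5): clip at zero and renormalize over the K layers.
    O = np.maximum(O + dO, 0.0)
    return O / (O.sum(axis=0, keepdims=True) + 1e-12)
```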

We map the problem addressed in this paper as a constraint satisfaction problem (CSP) and model it using the Constraint Satisfaction Neural Network (CSNN) proposed by Lin et al.6 We have used a similar CSNN structure with some modifications to accommodate our set of constraints, and call our network the Constraint Satisfaction Neural Network for Complementary Information Integration (CSNN-CII). CSNN-CII uses a dynamic window, which helps to iteratively integrate edge and region information for texture classification. The concept of the dynamic window is discussed in the following section.

Fig. 2. (a) 3-D CSNN; (b) the top view of the CSNN network, showing connections within a single layer; (c) side view of the network, showing connections across layers.

3.1. Dynamic Window

The concept of the dynamic window is shown in Fig. 3. Fig. 3(a) shows an image (with edge or boundary information) to be processed for segmentation using the window shown in Fig. 3(b). Fig. 3(c) shows that only the left part of the window is used for processing (because the center pixel of the window lies on the left side of the edge). The window uses prior knowledge from the edge map to adjust the window size. We have used this concept to integrate edge information into the segmented map: the CSNN neurons use a window size based on the available edge map as an additional constraint for processing. The obvious advantage of using the dynamic window at region boundaries is that only the neurons which correspond to a single class are processed, and the neurons which might confuse the network are not used for computation. The size of the dynamic window (n×n) is taken as 17×17, which was obtained empirically.

Fig. 3. (a) Image with edge information; (b) dynamic window of size n×n; (c) only the shaded part of the window is used for processing.

4. CSNN for integration

We propose a modified CSNN (which we call CSNN-CII) to integrate edge and region information. CSNN-CII differs from CSNN as follows:

(1) Each neuron in CSNN-CII contains two fields: probability and rank. The rank field stores the rank of the probability, in decreasing order, for that neuron. For example, in the case of a three-class problem, let the probabilities that a pixel belongs to classes 1, 2 and 3 be 0.3, 0.6 and 0.1 respectively. In this case, the rank fields for layers 1, 2 and 3 will have the values 2, 1 and 3 respectively. The rank of the (i,j)th neuron in the kth layer is represented as R_{ijk}, where 0 < R_{ijk} < K+1.

(2) Weights in CSNN-CII are adjusted using the relation

W_{ij,qr,k,l} = \frac{1}{P}\left(1 - \frac{2\,|R_{ijk} - R_{qrl}|}{K}\right)    (6)

where R_{ijk} is the rank of the (i,j)th neuron in the kth layer. This method of weight calculation helps in initializing the neurons of the CSNN using the output of the Fuzzy C-Means (FCM) algorithm. In contrast, the method of Lin et al.6 uses an ad hoc fuzzification of an initial map. The rank term is significant because FCM does not preserve the topological relationships while assigning probabilities, which are required for initializing the CSNN.

(3) In addition to region based constraints, CSNN-CII also employs edge constraints using a dynamic window, which helps in obtaining superior classification results. CSNN-CII uses the dynamic window, or dynamic neighborhood, for computation: the number of neighbors to consider is determined using the edge information.

Nodes in CSNN-CII are connected in the same way as in CSNN. Within each layer the nodes are connected with excitatory connections to the neighborhood (17×17 dynamic window) of the pixel. Across layers, nodes are connected using excitatory and inhibitory connections to the neurons corresponding to the neighboring neurons, as shown in Fig. 2(c). CSNN-CII has the same update rule as CSNN, as described in Section 3. The output of any neuron, O_{ijk}, depends on its previous state and the current activation.

4.1. Proposed Algorithm for modified CSNN

The steps of the proposed algorithm are described below:

4.1.1. Initialize CSNN-CII neurons

Initialize the CSNN-CII neurons using the probabilities obtained from the fuzzy c-means segmentation results. Neurons in the layers hold the probability that the pixel belongs to the segment represented by the corresponding layer. The rank field is initialized based on the rank of the probability in decreasing order.

4.1.2. Update the segmented map

Iterate: update the probabilities and the edge map, and determine the segmented map. The accumulated evidence from the neighbors of the kth layer's (i,j)th neuron is computed as given in Eq. (1), with a maximum neighborhood size of 17×17 neurons in each layer (considering the dynamic window). Weights are computed as given in Eq. (6). The state of each node is updated using Eq. (3), with the value of δ taken as 0.04. The class label for each pixel in the segmented map, Y, is assigned as follows:

Y_{ij} = \arg\max_{k}(O_{ijk})    (7)

for i = 1, ..., I and j = 1, ..., J.

4.1.3. Update edge map

Let B be the edge map obtained using a lower threshold and E the edge map obtained using a higher threshold; here the edge maps are obtained using the method we proposed in Ref. 8. Let M_{ij} be the set of pixels in the neighborhood of pixel (i,j) in the output image Y, of size 2v+1, excluding edge pixels in E. In this case we have chosen 2v+1 = 9, and M_{ij} is obtained as:

M_{ij} = \{ Y_{qr} \mid i-v \le q \le i+v,\; j-v \le r \le j+v,\; E_{qr} = 0 \}    (8)

The edge map E is updated as:

E_{ij}^{t+1} = \begin{cases} 1 & \text{if } E_{ij}^{t} = 1 \\ 1 & \text{if } B_{ij} = 1 \text{ and } \min(M_{ij}) \ne \max(M_{ij}) \\ 0 & \text{otherwise} \end{cases}    (9)

The criterion min(M_{ij}) ≠ max(M_{ij}) ensures that, in the output image Y, when the neighbors of the current (i,j)th pixel carry more than one label, that pixel is considered to be an edge pixel. We confirm this by using the information from edge map B. Therefore, based on the information obtained from both the segmented map and edge map B, edge map E is updated.
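A schematic NumPy sketch of this edge-map update, assuming binary arrays B and E and a label image Y of equal shape (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def update_edge_map(E, B, Y, v=4):
    # v = 4 gives the 9x9 neighborhood (2v + 1 = 9) chosen above.
    E_new = np.zeros_like(E)
    I, J = Y.shape
    for i in range(I):
        for j in range(J):
            if E[i, j] == 1:
                E_new[i, j] = 1  # Eq. (9), first case: keep existing edges
                continue
            qs = slice(max(i - v, 0), min(i + v + 1, I))
            rs = slice(max(j - v, 0), min(j + v + 1, J))
            # M_ij: labels in the window, excluding pixels marked in E (Eq. (8)).
            m = Y[qs, rs][E[qs, rs] == 0]
            if B[i, j] == 1 and m.size and m.min() != m.max():
                E_new[i, j] = 1  # Eq. (9), second case: confirmed by B
    return E_new
```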

4.1.4. Convergence

Check the convergence condition, i.e., the number of pixels updated in Y at each stage of iteration. If there is any update, go to step 4.1.2; otherwise go to step 4.1.5.

4.1.5. Merge

Remove all the extra edge pixels in the edge map which are not on the boundary of the segmented map, and add edge pixels at the boundaries of the segmented map where they are missing. Finally, merge the edge map and the segmented map to get the final output. At this stage the edge pixels corresponding to texture region boundaries and the class-labeled pixels corresponding to segmented texture regions do not overlap; hence a simple fusion of the two yields the final result.

5. Experimental Results

The performance of the proposed method is tested on both synthetic and natural images. The initial segmented map is obtained using the algorithm discussed in Sec. 2 (see flowchart in Fig. 1). Although the output of fuzzy C-means is used for integration, the final segmented map is obtained from the winner label for each pixel.


Table 1. Experimental results of classification: (a) Input images; (b) Segmented maps; (c) Edge maps (ref:8); (d) Results using proposed methodology.


Table 2. Experimental results of classification: (a) Input images; (b) Segmented maps; (c) Edge maps (ref:8); (d) Results using proposed methodology.

Table 1 shows the results for three images, each composed of four texture regions, obtained by combining region and edge information. Row (a) in Table 1 shows the input images. The segmented maps and edge maps obtained using the algorithms described in Sec. 2 and Ref. 8, for the input images in row (a), are shown in rows (b) and (c) respectively. Row (d) in Table 1 shows the results obtained by combining the region (row (b)) and edge information (row (c)) using the proposed methodology. Table 2 shows similar results, with images having five textures and curvilinear boundaries. It can be observed that the results have improved significantly in all cases due to the merging process, although the initial edge map and region based information were unsatisfactory in most of the cases.

Fig. 4(a) shows a Synthetic Aperture Radar (SAR) image of a region near Kolar, India. Fig. 4(b) shows the segmented map for the SAR image. Fig. 4(c) shows the edge map obtained using a higher threshold. Fig. 4(d) shows the segmented output obtained by combining region and edge information using CSNN-CII. It can be observed that the segmented map has improved significantly by combining region and edge information using our proposed method.

6. Conclusion

In this paper, a method has been proposed to improve texture segmentation using CSNN, by combining edge and region information.


Fig. 4. (a) A SAR image of the Kolar region; (b) segmented map obtained using the method presented in Sec. 2; (c) edge map obtained with the method proposed by Gupta et al.;8 (d) final segmented map obtained by combining region and edge information using our proposed method.

This method uses the basic structure of the CSNN, but uses a dynamic window to simultaneously improve the region segmented map from the edge map and the edge map from the region information. This is the main contribution of the paper.

References

1. X. Munoz, J. Freixenet, J. Marti and X. Cufi, Active regions for unsupervised texture segmentation integrating region and boundary information, in 2nd International Workshop on Texture Analysis and Synthesis, (Copenhagen, 2002).

2. Forsyth and Ponce, Computer Vision (Pearson Education, Singapore, 2003).

3. M. Salotti and C. Garbay, A new paradigm for segmentation, in Proceedings: 11th IEEE International Conference on Pattern Recognition, ICPR'92, (The Hague, 1992).

4. X. Munoz, J. Freixenet, X. Cufi and J. Marti, Pattern Recognition Letters 24, 375 (2003).

5. I. Ng, T. Tan and J. Kittler, On linear transform and Gabor filter representation of texture, in Proceedings 11th IAPR International Conference on Pattern Recognition, (The Hague, Netherlands, 1992).

6. W.-C. Lin, E. C.-K. Tsao and C.-T. Chen, Pattern Recognition 25, 679 (1992).

7. F. Kurugollu and B. Sankur, MAP segmentation of color images using constraint satisfaction neural network, in International Conference on Image Processing, 1999.

8. L. Gupta and S. Das, Texture edge detection using multi-resolution features and self-organizing map, in Proceedings of 18th International Conference on Pattern Recognition, (Hong Kong, 2006).


Upper Bound in Model Order Selection of MRF with Application in Texture Synthesis

Arnab Sinha

Dept. of Electrical Engg. IIT Kanpur

Kanpur - 208016, UP, India E-mail: [email protected]

Sumana Gupta

Dept. of Electrical Engg. IIT Kanpur

Kanpur - 208016, UP, India E-mail: [email protected]

The upper bound of the Markov Random Field (MRF) model order in the application of texture synthesis can be estimated from the structure of the texton, which is the fundamental period present in the sample texture. In this paper we show the equivalency between the minimum conditional entropy of the conditional probability density and the maximum pseudo-likelihood estimation for different model orders. We conclude that for periodic or semi-periodic textures, the minimum entropy, i.e., the maximum pseudo-likelihood for a particular texture, can be guaranteed by upper bounding the model order such that the neighborhood set covers the fundamental structure described through the fundamental period of the sample texture. The linkage of the maximum pseudo-likelihood with the minimum conditional entropy, and of the fundamental period with the minimum conditional entropy, provides us with a concept of an upper bound on the neighborhood order in the case of semi-periodic structures, like texture.

Keywords: MRF, texture synthesis, model order

1. Introduction

Texture synthesis is an important subject of both theoretical and practical interest. While there exist a number of different approaches, we have chosen to work on texture synthesis through the nonparametric Markov random field (MRF), mainly for two reasons. First, the MRF assumes only conditional independence about the underlying structure. Second, the results obtained [1] bear promise for a broad range of textures. However, there are two issues of critical concern in the MRF approach. These are (i) the choice of the order of the MRF given the sample texture, and (ii) the synthesis of texture in real time. In this paper we address the first issue.

MRF order selection is a well-known and important problem in the domain of texture synthesis. The visual quality of the synthesized texture varies as the model order varies, as shown in Fig. 1. Kashyap and Chellappa [3] first proposed a method of order selection based on a model of linear combinations of gray levels plus Gaussian noise. Seymour and Ji [4] derived three Bayesian selection criteria for the parametric MRF. In 2006 Talata et al. [5] proposed an information criterion based on the pseudo-likelihood for estimating the order of the nonparametric MRF. However, the effect of order selection in a practical problem like texture synthesis has not been reported in the literature. The upper bound defined in [5] is expressed as \log^{1/(2d)} |S|, where d is equal to 2 for an image or texture and |·| denotes cardinality.

f\'"v • I T " ' -. [• W \ \ \ l i l I V H , • 3 • •"< • -

C . • " A

T..W •

• i :

• • ' } . - . 1 '

.a-; * t " '• i

?1 rV \ t 1*1

'*".'*

7v\ v,v

• i " <' \ r

1 1' -,

'l '

L. i \ .

(a) (b) (c)

Fig. 1. Effect of model order on texture synthesis: (a) The original D20 texture - Image region in the blue window is cropped as sample texture for synthesis purposes; (b) synthesized with 6 t h order MRF, (c) synthesized with 22"* (estimated upper bound) order MRF

This upper bound is dependent on the size of the sample (|S|) and not on the sample itself. Moreover, this upper bound is too small to characterize the sample texture for synthesis (for most real textures). As an example, consider a texture of sample size 128 × 128. For this example the upper bound would be approximately equal to two, which is not enough for texture synthesis, as seen from Fig. 1. The above mentioned approaches have one common disadvantage: they are all iterative methods and hence computationally expensive. In this work we have introduced the equivalence between the pseudo-likelihood estimate of the conditional probability density function and the minimum conditional entropy of the corresponding conditional density function, given the 2-D texture image, for the case of an MRF. From the minimum entropy point of view we have shown that an upper bound for the order in the case of a periodic signal can be extracted from its fundamental period. Therefore, we can maximize the pseudo-likelihood estimate through a neighbourhood structure which covers the fundamental period of the given sample. Generally, we assume that texture is a semi-periodic signal, i.e., it is not completely deterministic in the sense of periodicity; yet it has a fundamental period which is useful for describing the minimal structure or window required to specify the texture elements, i.e., textons. Once the texton structure is recognized, one can propose a model to describe the different structures within the textons for a given sample. This can be done easily through the MRF model. Therefore, the proposed approach can be applied to texture synthesis.

The paper is organized as follows. In section 2, the preliminary definitions and MRF model structure have been described for non-parametric MRF. In section 3, the proposed estimation approach is explained. Section 4 focuses on results which support the proposed idea of upper bound of the model order estimation. In section 5, we conclude the paper.

2. Nonparametric Markov Random Field Preliminaries

MRF models have been used for different applications in different branches of science and engineering. In the present case, the description of the MRF is taken from the viewpoint of lattice models. The lattice, X, is a collection of random variables (r.v. henceforth) at sites s = (i,j) ∈ S, where i,j = 0, 1, ..., M−1. The random variables X_s ∈ Λ, i.e., they belong to the same state space. The MRF assumption implies p(X_s = x_s | X_{(s)}) = p(X_s = x_s | x_r; r ∈ N_s): the conditional probability of the r.v. at s given all other sites, (s) = S − s, is equal to the conditional probability of that r.v. given a neighbor set N_s. The neighborhood system is defined by

• s ∉ N_s,
• s ∈ N_r ⇔ r ∈ N_s

Let us now consider the nonparametric MRF model as described in [1]. Define Q_{N+1} = {q_0, ..., q_N}, where q_0 ∈ Λ, q_{n_r} ∈ Λ, 0 < n_r ≤ N, N = |N_s|, s ∈ S, and let Q_N = {q_1, ..., q_N}. Then

F(Q_{N+1}) = \sum_{s \in S,\, N_s \subset S} \delta(x_s, q_0) \prod_{r \in N_s} \delta(x_r, q_{n_r})    (1)

Here F(Q_{N+1}) denotes the frequency of occurrence of the set of levels {q_0, q_1, ..., q_N} in the random field X, and δ is the Kronecker function. With F(Q_N) = \sum_{q_0 \in \Lambda} F(Q_{N+1}), the conditional probability function defined for the nonparametric MRF can be written as

P(x_s \mid x_r; r \in N_s) = \frac{F(Q_{N+1})}{F(Q_N)}

Therefore, the pseudo-likelihood (PL henceforth), as described in [6], can be written for the nonparametric MRF model as

PL = \prod_{Q_{N+1} \in \Lambda^{N+1}} P(q_0 \mid Q_N)^{F(Q_{N+1})}

3. Upper Bound of Model Order Selection

In the following subsection we show that there is an equivalency between the maximum pseudo-likelihood estimation (MPLE) and the minimum conditional entropy (MCE) for an MRF. In the next subsection, we also show that if the underlying field has a fundamental period (like a periodic signal), then we can find an upper bound for the MRF model order to describe the underlying signal.

3.1. Relation between MPLE and MCE

The log of the PL can be written as

L_{PL}(X) = \sum_{Q_{N+1} \in \Lambda^{N+1}} F(Q_{N+1}) \log\left(\frac{F(Q_{N+1})}{F(Q_N)}\right)

Again, the expression for the conditional entropy can be written as

H(q_0 \mid q_{n_r}; n_r \in N) = -\sum_{Q_{N+1} \in \Lambda^{N+1}} P(Q_{N+1}) \log\left(P(q_0 \mid q_{n_r}; n_r \in N)\right)    (2)

If we write the joint probability of Q_{N+1} ∈ Λ^{N+1} as

P(Q_{N+1}) = \frac{F(Q_{N+1})}{\sum_{Q_{N+1} \in \Lambda^{N+1}} F(Q_{N+1})}

and take the conditional probability as defined earlier, then we establish the equivalency between the log of the PL and the negative of the conditional entropy given by (2), since

\sum_{Q_{N+1} \in \Lambda^{N+1}} F(Q_{N+1}) = \text{constant}

That is, maximizing the PL implies minimizing the conditional entropy of the given system with respect to the data, and vice versa. In other words, as the PL increases we become more certain about the underlying structure in the pattern of the data, which in turn implies that the entropy should decrease.

Now we have to consider the effect of the model order on the conditional entropy and the MPLE. Let us consider two models with neighborhoods N^{(1)} and N^{(2)}, and, without loss of generality, let us assume that PL^{(1)} ≥ PL^{(2)}, i.e.,

L_{PL}^{(1)}(X) \ge L_{PL}^{(2)}(X)

\sum_{Q_{N^{(1)}+1} \in \Lambda^{N^{(1)}+1}} F(Q_{N^{(1)}+1}) \log\frac{F(Q_{N^{(1)}+1})}{F(Q_{N^{(1)}})} \ge \sum_{Q_{N^{(2)}+1} \in \Lambda^{N^{(2)}+1}} F(Q_{N^{(2)}+1}) \log\frac{F(Q_{N^{(2)}+1})}{F(Q_{N^{(2)}})}

Again,

\sum_{Q_{N^{(1)}+1} \in \Lambda^{N^{(1)}+1}} F(Q_{N^{(1)}+1}) = \sum_{Q_{N^{(2)}+1} \in \Lambda^{N^{(2)}+1}} F(Q_{N^{(2)}+1})

therefore, dividing both sides by this constant, the relation between the log pseudo-likelihoods becomes

\sum P(Q_{N^{(1)}+1}) \log\frac{F(Q_{N^{(1)}+1})}{F(Q_{N^{(1)}})} \ge \sum P(Q_{N^{(2)}+1}) \log\frac{F(Q_{N^{(2)}+1})}{F(Q_{N^{(2)}})}

\Rightarrow H(q_0^{(1)} \mid Q_N^{(1)}) \le H(q_0^{(2)} \mid Q_N^{(2)})

3.2. Upper bound in model order for a periodic signal

In the case of a one-dimensional periodic signal, if we choose the nonparametric non-causal MRF model order in such a way that the cardinality of the neighborhood set together with the corresponding site equals the fundamental period of the signal, the conditional entropy becomes equal to zero. That is, if for two sites in the signal the neighborhood set is the same, then the corresponding values at those sites are also the same. Let us assume that the period of a 1-D signal is (x_1, ..., x_n). If we consider a non-causal model, then the neighborhood set together with the corresponding site will be (x_1, ..., x_n), and the circular shifts of this set. Now, since it is a periodic signal, in the worst case all the variables (x_1, ..., x_n) may be equal to each other except at least one, which makes each circular shift of the above set distinct from the others. Therefore, we can conclude that for any two sites in the signal, if the neighborhoods are equal then the values at those sites are also equal (but the converse is not true).

In the two-dimensional case we can follow the same logical direction and conclude that if a field has a fundamental period, then the upper bound of the order for that particular field can be calculated from a realization of the field. A texture field can be decomposed into a sum of two mutually orthogonal components: a purely indeterministic field (that can be modeled through an MRF) and a deterministic component [7], where the deterministic component is due to directional information and periodicity. The deterministic field can also be modeled through an MRF with a suitable order (upper bounded by the process described above). Therefore, we can apply MRF theory as a unified model for the synthesis of textures.

4. Results

In Figs. 5 to 12 the original and synthesized results, respectively, are shown for different types of textures taken from the Brodatz album [8]. For texture synthesis, we have used the algorithm described in [1]. Charalampidis [2] proposed a method for describing the minimal structure of a sample texture. We have used his approach for recognizing the textons and then calculated the maximum period from the geometrical structure. The upper bound of the order is calculated as half of this maximum period, rounded to the nearest higher integer. The estimated upper bounds are given in Table 1.
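A hedged sketch of one way to compute such a bound, assuming the fundamental period is read off from the dominant non-DC peak of the 2-D FFT magnitude; this is a simplification of the texton analysis of Charalampidis [2], not the authors' exact procedure.

```python
import numpy as np

def estimate_order_upper_bound(texture):
    # FFT magnitude of the zero-mean sample; the DC bin is suppressed.
    f = np.abs(np.fft.fft2(texture - texture.mean()))
    f[0, 0] = 0.0
    ky, kx = np.unravel_index(f.argmax(), f.shape)
    # Map indices above the Nyquist frequency to their negative counterparts.
    ky = min(ky, texture.shape[0] - ky)
    kx = min(kx, texture.shape[1] - kx)
    freq = max(np.hypot(ky / texture.shape[0], kx / texture.shape[1]), 1e-9)
    period = 1.0 / freq                # fundamental period in pixels
    return int(np.ceil(period / 2.0))  # half the period, rounded up
```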

For performance evaluation we have calculated a neighborhood similarity measure between the original texture sample and the synthesized one. We have taken a rectangular window of size approximately equal to or greater than the fundamental period. We define d as the distance between two window sets from the original and the synthesized texture samples:

d(x_1, y_1, x_2, y_2) = \sum_{i,j} \left| W^{o}_{ij}(x_1, y_1) - W^{s}_{ij}(x_2, y_2) \right|

where W^{o}_{ij}(x,y) is the window set corresponding to the original texture sample at location (x,y), with (i,j) the relative location within the neighborhood; W^{s}_{ij}(x,y) is defined similarly for the synthesized texture sample. Now define d_{min}(x,y) = \min_{(x_0,y_0) \in S^{o}} d(x, y, x_0, y_0), where S^{o} is the set of lattice points in the original texture sample. We can then define the neighborhood similarity measure between two texture samples as

M = \frac{1}{|S^{o}|} \sum_{(x,y) \in S^{s}} d_{min}(x, y)

where S^{s} denotes the lattice of the synthesized sample. The similarity measures M between the original texture and each of the synthesized textures for different model orders are plotted in Figs. 2 to 4 for two types of textures: D-13 (stochastic) and D-84, D-80 (almost periodic). From the figures we can conclude that the similarity measure does not fluctuate much beyond the estimated upper bound.

5. Conclusion

In this work we have used the equivalency of the nonparametric maximum pseudo-likelihood estimate with the minimum conditional entropy, for the same and for different model orders, to estimate the upper bound of the MRF model order in the application of texture synthesis. This is done with the help of the linkage between the minimum conditional entropy and the fundamental periodicity of the given texture sample. Optimum MRF order selection for modeling an underlying source is a well-known problem in texture synthesis, due to the trade-off between the accuracy of the synthesised texture and the size of the neighborhood. The optimum selection of the neighborhood set is difficult (practically impossible) due to the computational complexity and time required to solve the problem. In this paper, we have tried to address this issue by estimating the upper bound of the model order through a fast algorithm like the FFT. From the theoretical analysis and the results shown in the earlier sections, we conclude that for periodic or semi-periodic textures (most real-world textures lie in this class), we can estimate the upper bound of the model order and then synthesise the texture without worrying much about the optimum order. This could lead to a fast and accurate result at the same time; indeed, the computational cost of estimating the optimum order for a given texture is much greater than the computational cost of synthesizing the texture with the estimated upper bound of the model order.

Fig. 2. Variation of the neighborhood similarity measure with respect to the model order for D-13 (stochastic); the measure remains approximately constant beyond the estimated upper bound of the model order.

Table 1. Estimated Upper Bounds (EUB) of different textures

Texture        EUB    Texture        EUB    Texture        EUB
(a) D-111      40     (b) D-13       50     (c) D-80       30
(d) D-33       29     (e) D-51       25     (f) D-104      14
(g) D-55       32     (h) D-84       19     (i) D-92       32

Page 436: 01.AdvancesinPatternRecognition

Arnab Sinha and Sumana Gupta 417

Fig. 3. Variation of the neighborhood similarity measure with respect to the model order for D-84 (almost periodic)

Fig. 4. Variation of the neighborhood similarity measure with respect to the model order for D-80 (almost periodic)

References

1. R. Paget and I. D. Longstaff, Texture synthesis via a noncausal nonparametric multiscale Markov random field, IEEE Transactions on Image Processing Vol. 7(6), 925-931 (1998).

2. D. Charalampidis, Texture synthesis: textons revisited, IEEE Transactions on Image Processing Vol. 15(3), 777-787 (2006).

3. R. L. Kashyap and R. Chellappa, Estimation and choice of neighbors in spatial-interaction models of images, IEEE Transactions on Information Theory Vol. 29(1), 60-72 (1983).

4. C. Ji and L. Seymour, A consistent model selection procedure for Markov random fields based on penalized pseudolikelihood, The Annals of Applied Probability Vol. 6, 423-443 (1996).

5. I. Csiszár and Z. Talata, Consistent estimation of the basic neighborhood of Markov random fields, Annals of Statistics Vol. 34(1), 123-145 (2006).

6. S. Geman and D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 6, 721-741 (1984).

7. J. M. Francos, A. Z. Meiri and B. Porat, A unified texture model based on a 2-D Wold-like decomposition, IEEE Transactions on Signal Processing Vol. 41(8), 2665-2678 (1993).

8. P. Brodatz, Textures: A Photographic Album for Artists and Designers, Dover Publications, New York, 1966.


Fig. 5. D111 texture synthesis: (a) original texture sample, (b) neighborhood order = 40

Fig. 6. D13 texture synthesis: (a) original texture sample, (b) neighborhood order=50


Fig. 7. D80 texture synthesis: (a) original texture sample, (b) neighborhood order=30

Fig. 10. D55 texture synthesis: (a) original texture sample, (b) neighborhood order=32


Fig. 8. D33 texture synthesis: (a) original texture sample, (b) neighborhood order=29


Fig. 11. D84 texture synthesis: (a) original texture sample, (b) neighborhood order=19


Fig. 9. D104 texture synthesis: (a) original texture sample, (b) neighborhood order=14


Fig. 12. D92 texture synthesis: (a) original texture sample, (b) neighborhood order=32


Wavelet Features for Texture Classification and Their Use in Script Identification

P. S. Hiremath and Shivashankar S.

Dept. of P. G. Studies and Research in Computer Science, Gulbarga University

Gulbarga, Karnataka, India. E-mail: [email protected],[email protected]

The problem of determining the script of a document image has a number of important applications in the field of document analysis, such as indexing and sorting of large collections of such images, or as a precursor to optical character recognition (OCR). This paper concerns the extraction of texture features and the use of such features for determining the script of a document image, based on the observation that text has a distinct visual texture. Texture features are extracted using wavelet decomposed images of an image and its complement, and their effectiveness is tested on 32 texture categories. These features are then used to solve the problem of script identification in document image processing. The scheme has been tested on 8 scripts and found to be robust to the skew generated in the process of scanning. The proposed system achieves an overall classification accuracy of 97.55% on a large testing set.

Keywords: wavelets; texture; document analysis; classification; script identification

1. Introduction

Texture classification is a fundamental problem in image analysis and computer vision. This process plays an important role in many applications such as biomedical image processing, automated visual inspection, content based image retrieval and remote sensing, and has been a focus of research for nearly four decades. Briefly stated, there is a finite number of training classes C_i, i = 1, 2, ..., n. A number of training samples of each class are available. Based on the information extracted from the training samples, a decision rule is designed which classifies a given sample of unknown class into one of the n classes.

To design an effective algorithm for texture classification, it is essential to find a set of texture features with good discriminating power. A number of texture classification techniques have been reported in the literature.1-3 Wavelet methods offer a computational advantage over other methods for texture classification.4,5

A very important area in the field of document analysis is optical character recognition (OCR), broadly defined as the process of recognizing either printed or handwritten text from document images and converting it into electronic form. Although a great number of OCR techniques have been developed over the years, almost all existing work on OCR makes the important implicit assumption that the script and/or language of the document to be processed is known. In practice this implies human intervention in identifying the script/language of each document. In our increasingly interconnected world, this is clearly inefficient and undesirable. For minimal human involvement, there should be an automatic mechanism whereby the language of the input document is first identified and the appropriate OCR module is then selected. Since each script often has a distinctive visual appearance, a block of text in a language from the script (regardless of content) may be considered as a different texture. Therefore, the problem of script identification may be tackled by means of texture analysis (in particular, texture classification).6,7

In this paper, we first present a general algorithm for texture feature extraction. The algorithm is based on the wavelet filter paradigm, using different combinations of the approximation and detail subbands of the wavelet decomposed image. In our algorithm, the decomposition is performed on the image and its complement. We use these features for classifying textures chosen from the Brodatz album.8

We then examine the usefulness of texture analysis in solving a practical problem in document analysis, namely, the identification of the script of a machine printed document. This allows the effectiveness of the proposed texture features to be tested in a practical application.

This paper is organized as follows: in section 2, the proposed scheme is discussed. Texture training and classification are explained in section 3. In section 4, experimental results of the proposed method are discussed in detail. Finally, section 5 concludes the discussion.

2. The proposed scheme

The proposed scheme is inspired by the observation that humans are capable of distinguishing between unfamiliar scripts just based on a visual inspection of a sample. We consider script identification as a process of texture analysis and classification. In general, a texture is a complex visual pattern composed of subpatterns. Although subpatterns can lack a good mathematical model, it is well established that a texture can be analysed completely only if its subpatterns are well defined. Script patterns can be considered to be textures formed by several oriented linear subpatterns. It is useful to study the extent of classification possible and the accuracy attainable.

2.1. Proposed feature extraction algorithm:

(1) Input the image X and its complement X̄ = 1 − X.
(2) Applying the DWT on X using the Haar wavelet yields the approximation coefficient (A) and the detail coefficients: horizontal (H), vertical (V) and diagonal (D).
(3) Consider the pair of images (A, H). For a pixel x in A and the corresponding pixel y in H, their 8-nearest neighbours are as shown in Fig. 1.

Fig. 1. The 8-nearest neighbors of x and y in A and H respectively.

(4) Construct two histograms H_1 and H_2 for A based on the max-min composition rule stated below. Let α = max(min(x, h_i), min(y, a_i)). Then x ∈ H_1 if α = min(x, h_i), and x ∈ H_2 if α = min(y, a_i). This yields 16 histograms, 2 for each neighbour i = 1, 2, ..., 8.

(5) Construct the cumulative histogram for each of the histograms H_1 and H_2 obtained in step (4) and normalize. The points on the cumulative frequency curve are the sample points.

(6) From the sample points of each cumulative histogram obtained in step(5), compute the following features:

(a) Slope of the regression line fitted across the sample points.

(b) Mean of the sample points.
(c) Mean deviation of the sample points.

(7) Repeat steps 3-6 for the pairs of images (A,V), (A,D) and (A, abs(V-H-D)) for feature extraction.

(8) Repeat steps 2-7 for the complementary image X̄.

The above process generates 384 features (2 (images) × 4 (combinations) × 3 (features) × 16 (histograms)), which constitute a feature vector f. The vector f is used for texture classification. If the image size is n×n and m nearest neighbors are considered, the complexity of the above algorithm is O(mn²).
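For illustration, here is a minimal sketch of steps (2)-(4) for the pair (A, H), assuming the pywt library and wrap-around neighbour shifts (both assumptions, not from the paper).

```python
import numpy as np
import pywt

def maxmin_histograms(X, bins=256):
    A, (H, V, D) = pywt.dwt2(X, "haar")  # step (2): Haar DWT of X
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]  # the 8 nearest neighbours
    hists = []
    for di, dj in offsets:
        a_n = np.roll(np.roll(A, di, 0), dj, 1)  # neighbour a_i of x in A
        h_n = np.roll(np.roll(H, di, 0), dj, 1)  # neighbour h_i of y in H
        m1, m2 = np.minimum(A, h_n), np.minimum(H, a_n)
        alpha = np.maximum(m1, m2)               # max-min composition rule
        # x goes to H1 where alpha = min(x, h_i), to H2 where alpha = min(y, a_i)
        h1 = np.histogram(A[alpha == m1], bins, (A.min(), A.max()))[0]
        h2 = np.histogram(A[alpha == m2], bins, (A.min(), A.max()))[0]
        hists += [h1, h2]                        # two histograms per neighbour
    return hists                                 # 16 histograms for (A, H)
```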

3. Texture training and classification

Each texture image is subdivided into 16 equal-sized blocks, out of which 8 randomly chosen blocks are used as training samples and the remaining blocks are used as test samples for that texture class.

3.1. Training

In the texture training phase, the texture features are extracted from the 8 randomly selected samples belonging to each texture class, using the proposed feature extraction algorithm. These features are stored in the feature library and are subsequently used for texture classification.

3.2. Classification

In the texture classification phase, the texture features are extracted from the test sample x using the proposed feature extraction algorithm, and then compared with the corresponding feature values of all the textures stored in the feature library using the distance formula

D(M) = \sqrt{\sum_{j=1}^{N} \left[ f_j(x) - f_j(M) \right]^2}

where N is the number of features in f, f_j(x) represents the jth texture feature of the test sample x, and f_j(M) represents the jth feature of the Mth texture class in the feature library. The test texture x is then classified using the K-nearest neighbors (K-NN) classifier.

In the K-NN classifier, the class of the test sample is decided by the majority class among its k nearest neighbors. A neighbor is deemed nearest if it has the smallest distance in the feature space. In order to avoid a tied vote, it is preferable to choose k to be an odd number. The experiments are performed choosing k = 3.
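A small illustrative sketch of this distance computation and the 3-NN vote, assuming a feature library stored as (feature vector, class label) pairs; all names are hypothetical.

```python
import numpy as np
from collections import Counter

def classify(f_test, library, k=3):
    # D(M) = sqrt(sum_j [f_j(x) - f_j(M)]^2): Euclidean distance in feature space.
    dists = [(np.sqrt(np.sum((f_test - f) ** 2)), label) for f, label in library]
    dists.sort(key=lambda t: t[0])
    # Majority class among the k nearest neighbours (k odd avoids tied votes).
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```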

4. Experimental results

4.1. Experimental data

In order to assess the discrimination capability of the wavelet based feature sets, we have performed experimental tests using the same data presented in Ref. 9. The test defines 32 categories from selected images of the Brodatz album8 collection, as shown in Fig. 2. For the 32-category problem, texture samples are obtained from a 256×256 image with 256 gray levels. Each image is subdivided into 16 blocks of 64×64 pixels and each block is transformed into a new block by 90° rotation. This produces 1024 blocks. Half of the data set is selected by randomly choosing 8 blocks and the corresponding transformed blocks. This data is used to define classes, and the remaining blocks are used to evaluate the classification process.

Fig. 2. Texture images from Brodatz album

4.2. Computation of features

The features are computed for each texture block separately by considering 3×3 neighborhoods. All 384 features constituting a feature vector are used to achieve better classification accuracy. Texture classification is done using the K-NN classifier, in which all training feature vectors are stored in the library, i.e., the library contains 512 feature vectors, one for each training texture sample.

4.3. Classification results

From Table 1, it is observed that when classification is carried out by the K-NN classifier, an average classification rate (i.e., percentage of correct classification) of 96.84 is achieved, which is higher than the classification rate of 94.91 obtained in Ref. 10. The proposed method is simple and computationally inexpensive, since its complexity is O(mn²).

Table 1. Average classification accuracies in percentage over 10 experiments for the 32 texture categories.

Sl. No  Texture      Montiel et al.  Proposed
1       Bark         100             98.75
2       Beachsand    85              97.5
3       Beans        100             100
4       Burlap       100             100
5       D10          61.56           98.75
6       D11          95.63           100
7       D4           96.25           100
8       D5           96.88           100
9       D51          100             100
10      D52          100             100
11      D6           100             100
12      D95          100             100
13      Fieldstone   76.25           100
14      Grass        99.06           92.5
15      Ice          63.13           96.25
16      Image09      100             100
17      Image15      96.56           83.75
18      Image17      92.50           100
19      Image19      100             100
20      Paper        99.38           100
21      Peb54        97.81           95
22      Pigskin      92.50           96.25
23      Pressedcl    100             100
24      Raffia       100             100
25      Raffia2      100             100
26      Reptile      100             98.75
27      Ricepaper    100             73.75
28      Seafan       93.13           100
29      Straw2       100             100
30      Tree         92.19           81.25
31      Water        100             97.5
32      Woodgrain    100             88.75

Mean success rate    94.91           96.84

One application is discussed here to demonstrate the proposed features. Different scripts have distinctive visual appearances.


Table 2. Percentage of correct classification of script samples using the proposed texture features.

Sl. No  Language    Classification rate (%)
1       Kannada     99.6
2       English     100
3       Tamil       97.2
4       Urdu        100
5       Telugu      99.6
6       Bengali     96.8
7       Hindi       99.6
8       Malayalam   97.6

Overall             97.55


Fig. 3. Scripts used for the identification problem (left to right, row 1: Bengali, English, Hindi, Kannada; row 2: Malayalam, Tamil, Telugu, Urdu).

Thus a block of text may be regarded as a distinct texture pattern. This observation motivates us to utilize a texture classification algorithm for script identification. A texture based approach does not require connected component analysis; in this sense a texture based approach may be called a global approach. The proposed texture features are tested on the script identification problem. English and 7 Indian scripts (namely, Kannada, Tamil, Urdu, Telugu, Bengali, Hindi and Malayalam) are used in the experiments. Fig. 3 shows the scripts used. The scripts are digitized at 150 dpi. As in the texture classification problem, half of the number of samples are used for training and the other half for testing. The script samples used for this experiment have 128 × 128 pixels. These are binarized using global thresholding. The performance of the proposed texture features for script identification is given in Table 2. An overall correct classification rate of 97.55% for script identification is achieved by using the proposed texture features. These results demonstrate the efficiency of the feature set in script identification.

5. Conclusions

Wavelet texture analysis is rapidly finding its way into real-world applications, and the methods are being investigated in a growing number of areas. A set of texture features using wavelets is described in this paper. The effectiveness of the features has been investigated in texture classification experiments. The script identification problem is illustrated as a practical application of the proposed texture features. Experimental results have clearly shown the potential of such a global approach.

Acknowledgement

The authors are grateful to the referees for their critical review and helpful comments.

References

1. R. M. Haralick et al., Textural features for image classification, IEEE Trans. Systems, Man and Cybernetics 3(6), 610-621 (1973).

2. T. R. Reed and J. M. Hans du Buf, A review of recent texture segmentation and feature extraction techniques, CVGIP: Image Understanding 57(3), 359-372 (1993).

3. A. Bovik et al., Multichannel texture analysis using localized spatial filters, IEEE Transactions on Pattern Analysis and Machine Intelligence 12(1), 55-73 (1990).

4. M. Unser and M. Eden, Multiresolution feature extraction and selection for texture segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 11, 717-728 (1989).

5. A. Laine and J. Fan, Texture classification by wavelet packet signatures, IEEE Transactions on Pattern Analysis and Machine Intelligence 15(11), 1186-1190 (1993).

6. T. N. Tan, Rotation invariant texture features and their use in automatic script identification, IEEE Transactions on Pattern Analysis and Machine Intelligence 20(7), 751-756 (1998).

7. A. Busch et al., Texture for script identification, IEEE Transactions on Pattern Analysis and Machine Intelligence 27(11), 1720-1732 (2005).

8. P. Brodatz, Textures: A Photographic Album for Artists and Designers, Dover, New York (1966).

9. K. Valkealahti and E. Oja, Reduced multidimensional co-occurrence histograms in texture classification, IEEE Transactions on Pattern Analysis and Machine Intelligence 20(1), 90-94 (1998).

10. E. Montiel et al., Texture classification via conditional histograms, Pattern Recognition Letters 26, 1740-1751 (2005).


AUTHOR INDEX

Abhilash, R., 263
Aradhya, V. N. M., 40, 140
Arya, K. V., 186
Atif, J., 15
Bandhyopadhyay, S., 81
Bandyopadhyay, S., 349, 384
Bapi, R. S., 244
Basnet, R. B., 299
Bhagvati, C., 244
Bhattacharya, B. B., 343, 356
Bhattacharya, C., 203
Bhattacharya, U., 101, 129
Bhowmick, P., 343, 356
Bhowmik, T. K., 361
Biswas, A., 343, 356
Bloch, I., 15, 198
Cao, H., 135
Chaira, T., 226
Chakraborty, B., 35, 88
Chakraborty, D., 396
Chakraborty, G., 88
Chalasani, T. K., 309
Chanda, B., 73, 107, 180
Chaudhuri, B. B., 73
Chaudhury, S., 268
Choras, R. S., 256
Choudhury, D., 66
Chowdhury, S. R., 107
Colliot, O., 198
Das, A. K., 107, 239
Das, S., 215, 263, 332, 407
Datta, K., 274
Deekshatulu, B. L., 244
Deepti, P., 263
Devarakota, P. R., 290
Dey, L., 268
Dhara, B. C., 180
Dutta, V. P., 203
Eajaz, S., 51
Ekbal, A., 349, 384
Florescu, I., 147
Ghosh, A., 193, 231
Ghosh, A. K., 281
Ghosh, S., 193
Ghosh, S. K., 129
Gilby, J., 208
Govindaraju, V., 135
Goyal, R., 268
Greig, A., 208
Guha, P., 152
Gupta, L., 407
Gupta, P., 46, 186
Gupta, S., 268, 413
Hiremath, P. S., 56, 419
Hodgetts, M., 208
Hudelot, C., 15
Jawahar, C. V., 285, 309
Kadappa, V. K., 93
Kalra, M., 332
Kamberov, G., 147
Kamiya, Y., 338
Kar, M., 361
Khosla, A., 375, 401
Khotanlou, H., 198
Kimura, F., 123
Kothari, M., 193, 315
Kumar, G. H., 40, 140, 369
Kumar, P. S., 152
Kumar, R., 285
Kumar, V. P., 391
Kundu, M. K., 164
Kundu, S., 327
Lakshmanan, V., 26
Lekshmi, S., 170
Lingayat, N. S., 295
Madhavan, C. E. V., 157
Maity, S. P., 164
Majumdar, A. K., 249
Majumdar, J., 170
Majumder, P., 274
Malhotra, K., 401
Manabe, Y., 35
Mandal, A. K., 379
Mandal, S., 107, 239
Mardia, K. V., 3
Meher, S. K., 231
Mehta, S. S., 295
Merchant, F., 220
Mirbach, B., 290
Mishra, G., 112
Mitra, M., 274
Mukerjee, A., 152
Mukherjee, D. P., 175
Mukherjee, J., 249
Mukhopadhyay, S., 379
Mukkamala, S., 299
Nagesha, 369
Namboodiri, A. M., 309
Nandi, P. K., 164
Negi, A., 93
Noushath, S., 40, 140
Ogata, N., 88
Okuma, S., 338
Ortega, K., 26
Ottersten, B., 290
Pal, T., 117
Pal, U., 117, 123
Pallavi, V., 249
Parui, S. K., 101, 129, 321, 361
Prabhakar, C. J., 56
Prakash, S., 215
Pujari, A. K., 244
Purkait, R., 46
Ramasubramanian, V., 391
Rao, P. N., 244
Ray, A. K., 226
Ray, N., 175
Ribeiro, B., 299
Roy, A., 321
Roy, K., 117
Roy, U., 321, 361
Saha, S., 81
Saha, S. K., 239
Salvetti, O., 226
Sana, A., 46
Saxena, A., 315
Saxena, P. K., 112
Shah, S., 220
Shankar, B. U., 231
Sharma, N., 117, 123
Shaw, B., 101
Shivakumara, P., 40
Shivashankar, S., 419
Shobana, L., 51
Shubham, K., 268
Sinha, A., 413
Stolkin, R., 147, 208
Sung, A. H., 299
Sural, S., 249
Tabbone, S., 304
Terrades, O. R., 304
Thiyagarajan, S., 391
Tripathi, P., 73
Udupa, N., 157
Valveny, E., 304
Vanathy, B., 170
Vieira, A. S., 299
Yadav, D. K., 375
Yadav, P., 112
Yang, J., 62
Yang, L., 62
Yano, Y., 338
Yekkala, A. K., 51, 157
Zhang, Y., 62


World Scientific, www.worldscientific.com

ADVANCES IN PATTERN RECOGNITION

This proceedings volume of the 6th International Conference on Advances in Pattern Recognition (ICAPR 2007) is the latest in the ICAPR series, containing seventy-one papers (one Plenary Lecture, two Invited Lectures and sixty-eight Contributory Papers) on the state of the art in different facets of pattern recognition. The ICAPR conferences have carved out a unique position within the pattern recognition community. As in previous years, there was an overwhelming response to the call for papers, and only a subset of the submitted papers was selected for final publication as an anthology of research papers deliberating on open problems ranging from image and video processing, document analysis and multimedia object retrieval to more advanced topics such as biometrics and speech and signal analysis. Some of the papers focus on both theory- and application-driven basic research in the field of pattern recognition.

ISBN-13 978-981-270-553-2 ISBN-10 981-270-553-8