VISUAL LANGUAGE MODEL FOR SCENE RECOGNITION

TRONG-TON PHAM1,2, LOÏC MAISONNASSE1, PHILIPPE MULHEM1, ERIC GAUSSIER1

1. Computer Science Laboratory Grenoble-LIG, BP 51, 38041 Grenoble Cedex, France

2. IPAL CNRS, I2R, 1 Fusionopolis Way, 138632 Singapore

In this paper, we describe an attempt to use a graph language modeling approach for the retrieval of images. Because photographic images are 2D data, we take this specificity into account by representing relationships between image regions. Such a representation implies the use of complex graph representations. Traditional sub-graph matching methods used during retrieval do not scale up, whereas language models have been quite successful for years now in text retrieval. That is why we propose in this paper an extension of language models to graphs, and evaluate the interest of this approach for image indexing and retrieval. The results obtained show the potential of the approach when images or groups of images are considered as queries or documents.

1. Introduction

After almost 20 years of research, still image retrieval remains a great challenge for computer scientists.

Problems that arise when considering image indexing and retrieval are related to the semantic gap and to the way image content can be represented. Our aim in this paper is not to focus on symbolic representations of images, but to study the use of an extension of language models [8] in the context of graph representations of image content. Graph representations of images are able to capture some of the complexity of image content, and the language modeling of documents has proved successful for text representation. We therefore expect to benefit from both approaches so as to provide accurate image retrieval systems.

One important aspect of this paper is related to the fact that images are usually not unrelated to other images. For instance, digital cameras now record the date and time of shooting, and this information may be used to group images [12]. In addition, geolocalization information may be used [4]. Groups of images may also be of interest for image indexing and retrieval. In this case, mismatches may occur between the groups extracted during indexing and the groups used as queries. We show that the use of language modeling extends smoothly to multiple-image queries and multiple-image documents, and that such an approach is robust with respect to the differences between these groups of images.

The paper is organized as follows. In Section 2, we present works related to our proposal, discussing previous works on complex representations and describing existing works on non-unigram language models. In Section 3 we propose the visual language model (VLM) used to describe image content. Section 4 presents experiments on a corpus of more than 3,500 images and discusses the results. We conclude in Section 5.

2. Related works

Previous works have considered the use of spatial relationships between image regions. For instance, image descriptions expressed by 2D Strings, used in the VisualSEEk system [15], reflect sequences of occurrences of objects along one or several reading directions. Retrieval on 2D strings is complex, as substring matching is a costly operation. Heuristics to speed up the process exist, however, such as the one described in [1], which divides the processing time by a factor of 10. Several works have considered relationships between image regions in probabilistic models, e.g. through the use of 2D HMMs as described in [5] or [18]. However, these works focused on image annotation and did not consider the relations during the retrieval process. Relationships between elements of an image may also be expressed explicitly through naming conventions, as in [10], where the relations are used for indexing. Other attempts have concentrated on conceptual graphs [9] for indexing and retrieval. However, taking explicit relationships into account may generate complex graph representations, and the retrieval process is then likely to suffer from the complexity of the graph matching process [10]. Part of our concern here is to be able to represent images using graphs without suffering from the burden of computationally expensive matching procedures. One solution is to take advantage of existing approaches in the field of Information Retrieval.

Language models for information retrieval date from the end of the 1990s [13]. In such works, the relevance status value of a document for a query is estimated through the probability that the document generates the query. Such models can use unigrams (terms taken one by one) or n-grams (sequences of n terms) [16][17], but they cannot easily integrate explicit relationships between terms. Similarly, language models have been used for image indexing and retrieval [3] without incorporating relationships between image regions. Some works have proposed to extend language models with relationships, like [2, 7, 8], but they focus on text.

In this paper, we extend the work of [8] so as to handle image representations that use relations. In [8], neither concepts nor relations are weighted. We extend this approach to weight concepts and relations, and show the use of the resulting model on a corpus of images.

3. Language model for image graph representation

3.1. Modeling of images with visual graphs

Representing the image content as a graph allows one to capture the complexity of images. This complexity comes from the inner nature of images, e.g. in the form of spatial relationships, but also from the difficulty, for a computer system, of analyzing image content.

We consider that images are represented by a set of weighted concepts, connected through directed associations. The concepts aim at characterizing the content of the image, whereas the associations (like spatial relations) express the relations between concepts. As the same concept and the same association can occur several times in an image (for example, when concepts correspond to regions of the image, the concepts “sky” and “sea” may occur several times, and be associated several times through the relation “top of”), both concepts and associations are weighted according to the number of times they occur in the image. We assume here that the concepts characterize non-overlapping regions, and we denote the set of concepts by C. The set of weighted concepts used to represent a given image is denoted WC, and is defined on C × ℕ. As the weights capture the number of occurrences of the concepts in the image, each concept of C appears at most once in WC.

Each association between any two concepts c and c’ is directed (as spatial relations are and as is done for example in Acemedia [12]) and is represented by a triplet <L(c,c’),l,n(c,c’,l)>. L(c,c’) represents the fact that there exists, in the image, a spatial relation between the two concepts (in which case L(c,c’)=1) or not (in which case L(c,c’)=0). l represents the label of the association between the two concepts, and n(c,c’,l) represents the number of times the two concepts are connected in the image with label l. We consider here that the possible labels expressing a spatial relation between two concepts are top_of and left_of. The converse relations are implicitly captured as the associations we consider are directed. In the case where L(c,c’)=0, i.e. there is no spatial relation between the concepts, then l=∅ by definition.

The choice to model the absence of spatial relations between concepts (which corresponds in our setting to L(c,c’)=0 and l=∅) is motivated by two considerations. The first one comes from the fact that the absence of a relation may represent valuable information for matching images (such information is used, for example, in several probabilistic models of IR, such as the Binary Independence model [14] and the original language model [13]). The second motivation comes from the fact that each concept corresponds to at most one node of the image graph. Thus, by making explicit the absence of a relation between two concepts, we are able to quantify to which extent two concepts occurring more than once in the image are really connected through a given spatial relation. The above considerations also imply that several different triplets can be associated with two given concepts.

Finally, the graph representing an image is defined as G=<WC, WE>, where WC represents the set of weighted concepts defined previously, and WE the set of triplets <L(c,c’),l,n(c,c’,l)> defined for each concept pair in WC. L is a mapping from C×C to {0,1}, l is an element of {top_of, left_of, ∅}, and n(c,c’,l) is the number of times the particular association holds in the image.
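To make this representation concrete, here is a minimal sketch in Python of how such a graph could be stored and built from concept occurrences; the data layout, the helper name and the way unrelated occurrence pairs are counted are our illustrative assumptions, not a specification from the paper.

```python
from collections import Counter
from itertools import product

LABELS = ("top_of", "left_of")  # directed spatial relation labels used in the text

def build_image_graph(occurrences, labeled_pairs):
    """Sketch of G = <WC, WE>.

    occurrences   : list of concept identifiers, one entry per occurrence in the image
    labeled_pairs : dict label -> list of (c, c') pairs, one entry per observed
                    occurrence of that directed spatial relation
    """
    # WC: each concept with its weight n(c)
    wc = Counter(occurrences)

    # WE: weighted triplets <L(c,c')=1, l, n(c,c',l)> for observed relations
    we = Counter()
    for label, pairs in labeled_pairs.items():
        for c, c2 in pairs:
            we[(1, label, c, c2)] += 1

    # <L(c,c')=0, None, n(c,c',empty)>: explicit "no relation" triplets.
    # Counting unrelated pairs as (all occurrence pairs - related pairs) is an
    # assumption made for this sketch; the paper does not detail this step.
    for c, c2 in product(wc, wc):
        related = sum(we[(1, lab, c, c2)] for lab in LABELS)
        unrelated = wc[c] * wc[c2] - related
        if unrelated > 0:
            we[(0, None, c, c2)] = unrelated
    return wc, we
```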

3.2. A language model for visual graphs

Our matching function is based on the standard language modeling approach ([13]), extended so as to take into account the elements defined above. To differentiate the images taken into account during the matching process, we will refer to one of the images (or potentially to one of the sets of images, see below) as the document, and to the other one as the query. In the remainder we assume that several images, and thus several graphs, are used in the document; we denote this set of graphs by SG. The probability for a query graph Gq (=<WCq, WEq>) to be generated by the document model Md is then defined by:

$$P(G^q \mid M_d) = P(WC^q \mid M_d)\,P(WE^q \mid M_d) \qquad [*1]$$

For the probability of generating query concepts from the document model (first element of the right-hand side of equation [*1]), we rely on the concept independence hypothesis, standard in information retrieval. The number of occurrences of the concepts (i.e. the weights considered previously) is naturally integrated through the use of a multinomial model, leading to:

$$P(WC^q \mid M_d) \propto \prod_{cq_i \in WC^q} P(cq_i \mid M_d)^{n(cq_i)} \qquad [*2]$$

where n(cqi) denotes the number of times concept cqi occurs in the graph representation of the query. The quantity P(cqi|Md) is estimated through maximum likelihood (as is standard in the language modeling approach to IR), using Jelinek-Mercer smoothing:

$$P(cq_i \mid M_d) = (1-\lambda_u)\,\frac{F_d(cq_i)}{F_d(*)} + \lambda_u\,\frac{F_C(cq_i)}{F_C(*)} \qquad [*3]$$

where Fd(cqi) represents the sum of the weights of cqi over all graphs of SG, and Fd(*) the sum of all concept weights in SG. The functions FC are defined similarly, but over the whole collection (i.e. over the union of all the graphs of all the documents of the collection). The parameter λu is the Jelinek-Mercer smoothing parameter. It plays a role similar to that of an IDF factor, and helps take reliable information into account when the information from a given document is scarce.
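As an illustration of equations [*2] and [*3], the following Python sketch computes the smoothed concept probability and the concept part of the query log-likelihood; the function names, the dictionary-of-counts layout and the default value of λu are ours, not the paper's.

```python
import math
from collections import Counter

def concept_prob(c, doc_counts, coll_counts, lambda_u=0.2):
    """P(c | M_d) with Jelinek-Mercer smoothing, as in equation [*3].

    doc_counts  : Counter giving F_d(.), concept weights summed over the graphs of SG
    coll_counts : Counter giving F_C(.), concept weights over the whole collection
    lambda_u    : smoothing parameter (the 0.2 default is an arbitrary choice here)
    """
    f_d_all, f_c_all = sum(doc_counts.values()), sum(coll_counts.values())
    p_doc = doc_counts[c] / f_d_all if f_d_all else 0.0
    p_coll = coll_counts[c] / f_c_all if f_c_all else 0.0
    return (1.0 - lambda_u) * p_doc + lambda_u * p_coll

def log_concept_likelihood(query_counts, doc_counts, coll_counts, lambda_u=0.2):
    """log P(WC_q | M_d) up to a constant (equation [*2]):
    sum over query concepts of n(cq_i) * log P(cq_i | M_d)."""
    return sum(n * math.log(concept_prob(c, doc_counts, coll_counts, lambda_u))
               for c, n in query_counts.items())

# Toy usage: "v3" is unseen in the document, but smoothing keeps its probability > 0.
doc = Counter({"v12": 5, "v87": 2})
coll = Counter({"v12": 40, "v87": 10, "v3": 50})
print(log_concept_likelihood(Counter({"v12": 1, "v3": 2}), doc, coll))
```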

We follow a similar process for the associations, leading to:

$$P(WE^q \mid M_d) \propto \prod_{\text{triplets of } WE^q} P(L(cq,cq')=x,\,l \mid WC, M_d)^{n(cq,cq',l)} \qquad [*4]$$

where the product is taken over the set of triplets. The quantity P(L(cq,cq’)=x,l|WC,Md) can be decomposed as the probability of generating a particular type of association (x=0 or 1) and then as the probability of using a particular label l to annotate the association. This amounts to:

$$P(L(cq,cq')=x,\,l \mid WC, M_d) = P(L(cq,cq')=x \mid WC, M_d) \times P(l \mid L(cq,cq')=x, M_d)$$

The two quantities appearing in the right-hand side of the above equation are then directly estimated through maximum likelihood. For the first quantity, we have:

$$P(L(cq,cq')=x \mid WC, M_d) = (1-\lambda_r)\,\frac{x\,F_d(cq,cq',R) + (1-x)\,F_d(cq,cq',\neg R)}{F_d(cq,cq',R) + F_d(cq,cq',\neg R)} \;+\; \lambda_r\,\frac{x\,F_C(cq,cq',R) + (1-x)\,F_C(cq,cq',\neg R)}{F_C(cq,cq',R) + F_C(cq,cq',\neg R)}$$

where Fd(cq,cq’,R) represents the number of times the concepts cq and cq’ are related through a spatial relation in SG, whereas Fd(cq,cq’,¬R) represents the number of times they are not related through a spatial relation in SG. Again, smoothing relies on the whole collection (through the associated functions FC).

The second quantity is estimated in the same way. Note however that, for x=0, the only possible label is ∅, so that the probability of this label given x=0 is 1. For x=1, two labels are possible: l=top_of, l=left_of. The probability in this latter case is given by:

$$P(l \mid L(cq,cq')=1, M_d) = (1-\lambda_e)\,\frac{F_d(cq,cq',l)}{F_d(cq,cq',*)} + \lambda_e\,\frac{F_C(cq,cq',l)}{F_C(cq,cq',*)}$$

The definitions of the functions Fd and FC are similar to the ones above but concern labels.
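The association part of the model can be sketched as follows (Python; the dictionary layout of the counts, the guard against zero counts and the default smoothing values are our illustrative choices):

```python
import math

def rel_presence_prob(x, cq, cq2, F_d, F_C, lambda_r=0.2):
    """P(L(cq,cq')=x | WC, M_d): F_d and F_C map a concept pair (cq, cq2) to a
    pair (n_related, n_not_related) of counts, on the document and collection."""
    def part(F):
        rel, norel = F.get((cq, cq2), (0, 0))
        total = rel + norel
        return (x * rel + (1 - x) * norel) / total if total else 0.0
    return (1 - lambda_r) * part(F_d) + lambda_r * part(F_C)

def label_prob(label, cq, cq2, L_d, L_C, lambda_e=0.2):
    """P(l | L(cq,cq')=1, M_d): L_d and L_C map a concept pair to a dict
    {label: count} of labeled relation occurrences."""
    def part(L):
        counts = L.get((cq, cq2), {})
        total = sum(counts.values())
        return counts.get(label, 0) / total if total else 0.0
    return (1 - lambda_e) * part(L_d) + lambda_e * part(L_C)

def log_edge_likelihood(query_triplets, F_d, L_d, F_C, L_C):
    """log P(WE_q | M_d) up to a constant (equation [*4]): each query triplet
    (cq, cq2, x, label, n) contributes n * log P(L(cq,cq')=x, label | WC, M_d)."""
    score = 0.0
    for cq, cq2, x, label, n in query_triplets:
        p = rel_presence_prob(x, cq, cq2, F_d, F_C)
        if x == 1:  # for x = 0 the only possible label is the empty one, with probability 1
            p *= label_prob(label, cq, cq2, L_d, L_C)
        score += n * math.log(max(p, 1e-12))  # floor avoids log(0) for unseen pairs
    return score
```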

As we mentioned previously, we want to be able to assess the similarity of two images, but also of two groups of images. The above definitions encompass the case where the document comprises several images. For the case where the query also consists of a set of images, {Gq}, we define the probability of generating the query from the document model as the product of the probabilities of generating each graph in the set:

$$P(\{G^q\} \mid M_d) = \prod_{G^q \in \{G^q\}} P(G^q \mid M_d)$$

The model we have just presented is inspired by the model defined in [9]. It differs however from it in that it considers weights on each concept and association. As the weights are integers, we relied on multinomial distributions for the underlying generative process. The consideration of real-valued weights would lead to consider continuous distributions instead of the multinomial one. We illustrate now the behavior of our model for image classification.
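Scoring a multi-image query then simply sums the per-graph log scores, since the probabilities multiply; the short sketch below assumes a score_one(graph, doc_model) function returning log P(Gq|Md), built from the pieces above.

```python
def log_query_set_score(query_graphs, doc_model, score_one):
    """log P({G_q} | M_d) as a sum of per-graph log-likelihoods."""
    return sum(score_one(g, doc_model) for g in query_graphs)
```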

4. Experiments

4.1. The STOIC Collection

The Singapore Tourist Object Identification Collection (STOIC-101) is a collection of 3,849 images of 101 popular tourist locations (mainly outdoor) in Singapore. The main application built on this collection is Snap2Tell [6], which is used for tourist information access.

The STOIC-101 collection is divided into two sets: a training set of 3,189 images and a test set of 660 images, i.e. an average per scene of 31.7 training images and 6.53 test images. In the test set, the number of images per scene ranges from 1 to 21, and the ratio between the test set size and the training set size per scene varies widely, from 12% up to 60%. A retrieval system on such a collection must handle this variation.

For that purpose, the recognition system has to deal with two particular scenarios:

1. the user query can be one image or multiple images of the same scene;

2. the recognition system can index the STOIC collection in terms of individual images or in terms of groups of images of the same scene.

As reflected by these scenarios, Table 1 summarizes the experiments we have performed on this collection with our visual language model. When training by image (I), we estimate one VLM per image; each model is stored and matched against the query model. To generate a visual language model for a group of images, we consider all the training images corresponding to the same scene as belonging to the same group, and estimate one VLM per scene (S). The same process is applied to model the query, either by image (I) or by scene (S).

                        Query by IMAGE (I)   Query by SCENE (S)
Training by IMAGE (I)          I-I                  I-S
Training by SCENE (S)          S-I                  S-S

Table 1: Summary of experiments on the STOIC-101 collection

4.2. Extraction of features and spatial relationships for images

Each image is divided into 5×5 patches, and the visual features of each patch are extracted. Our objective is not to find the best features for image representation, but to compare different ways of modeling the image; we therefore use only HSV color features throughout the experiments, because of their efficiency of extraction (see [6]). Each color channel (hue, saturation, value) is quantized into 8 bins, which yields a 512-dimensional (8×8×8) feature vector for each patch.
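One possible implementation of this step is sketched below in Python with NumPy and Pillow; the 5×5 grid and the 8 bins per channel come from the text, while the normalization and the function name are our choices.

```python
import numpy as np
from PIL import Image

def patch_histograms(path, grid=5, bins=8):
    """Split an image into grid x grid patches and return one joint HSV
    histogram (bins**3 = 512 dimensions for bins=8) per patch."""
    rgb = Image.open(path).convert("RGB")
    hsv = np.asarray(rgb.convert("HSV"), dtype=np.float64)  # shape (H, W, 3), values in [0, 255]
    h, w, _ = hsv.shape
    feats = []
    for i in range(grid):
        for j in range(grid):
            patch = hsv[i * h // grid:(i + 1) * h // grid,
                        j * w // grid:(j + 1) * w // grid]
            # quantize each channel into `bins` levels and build the joint histogram
            q = np.minimum((patch * bins / 256.0).astype(int), bins - 1)
            idx = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]
            hist = np.bincount(idx.ravel(), minlength=bins ** 3).astype(np.float64)
            feats.append(hist / hist.sum())  # L1-normalization is our choice
    return np.vstack(feats)  # shape (grid * grid, bins**3)
```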

Visual terms (called visterms in the following) and their relationships are built as follows:

- Unsupervised learning with k-means clustering groups similar feature vectors extracted from patches into the same cluster, i.e. a visterm from a vocabulary V. The number of clusters was set experimentally to 500, giving 500 visterms.

- The image representation is composed of the visterms appearing in the image and the set of spatial relationships R extracted between them.

For our experiments, we chose to incorporate three spatial relationships between visterms: left_of, top_of and position_x_y. The left_of (resp. top_of) relation captures the fact that some visterms are horizontally (resp. vertically) closer than others. The position_x_y relation (with 1≤x≤5 and 1≤y≤5) denotes the location of one occurrence of a visterm in the 5×5 grid. Based on this representation, we build the final graph of an image, where the nodes in WC consist of the visterms associated with their frequency in the image. For a relation between two concepts, the relation weight corresponds to the number of times the two visterms appear related in the image, while the 'non relation' weight corresponds to the number of non-related concept pairs. Label weights are likewise obtained from frequencies within the image.
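A sketch of how the visterm vocabulary and the spatial triplets could be produced, using scikit-learn's KMeans; the 500-visterm vocabulary and the 5×5 grid come from the text, whereas restricting left_of and top_of to adjacent patches is an assumption made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(train_patch_features, k=500, seed=0):
    """Cluster all training patch features into k visterms."""
    return KMeans(n_clusters=k, random_state=seed, n_init=10).fit(train_patch_features)

def image_visterm_grid(kmeans, patch_features, grid=5):
    """Assign each of the grid * grid patches of one image to a visterm id
    (patch_features in row-major order, as produced by patch_histograms)."""
    return kmeans.predict(patch_features).reshape(grid, grid)

def spatial_triplets(visterm_grid):
    """Enumerate relation occurrences: position_x_y for every patch, plus
    left_of / top_of between neighboring patches (our neighborhood choice)."""
    grid = visterm_grid.shape[0]
    triplets = []
    for y in range(grid):       # y: row index (top to bottom)
        for x in range(grid):   # x: column index (left to right)
            v = int(visterm_grid[y, x])
            triplets.append((v, None, f"position_{x + 1}_{y + 1}"))
            if x + 1 < grid:
                triplets.append((v, int(visterm_grid[y, x + 1]), "left_of"))
            if y + 1 < grid:
                triplets.append((v, int(visterm_grid[y + 1, x]), "top_of"))
    return triplets
```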

4.3. Experimental results

The performance of scene recognition was evaluated with the accuracy rate per image and the accuracy rate per scene, formulated as follows:

$$\text{Accuracy per image} = \frac{\sum_{i=1}^{N_I} TP_i}{N_I}, \qquad \text{Accuracy per scene} = \frac{\sum_{s=1}^{N_S} TP_s}{N_S}$$

where TPi, TPs refer to the number of true positives of images and the number of true positives of scenes respectively. NI is the total number of query images (i.e. 660 images) and NS is the total number of query scenes (i.e. 101 scenes).
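Both measures are straightforward to compute from the predicted and true scene labels of the test queries, as in the following sketch (the per-scene measure is implemented here as the mean of the per-scene accuracy rates, which is our reading of the definition above):

```python
from collections import defaultdict

def accuracies(true_scenes, predicted_scenes):
    """Return (accuracy per image, accuracy per scene) for a list of test queries."""
    assert len(true_scenes) == len(predicted_scenes)
    per_image = sum(t == p for t, p in zip(true_scenes, predicted_scenes)) / len(true_scenes)

    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(true_scenes, predicted_scenes):
        totals[t] += 1
        hits[t] += int(t == p)
    per_scene = sum(hits[s] / totals[s] for s in totals) / len(totals)
    return per_image, per_scene
```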

To evaluate the importance of the spatial relationships, we measured the performance of our VLM against the unigram model (i.e. the VLM without relationships) under the same configuration. Table 4 shows the improvement of the VLM with embedded spatial relationships over the unigram VLM. The best improvement, of 63.8%, is obtained when the model is trained by image and queried by scene. We also notice that, with the unigram model, the recognition rates are rather modest, between 0.465 and 0.484 (except in the last case). When using relations, the results improve by 13.8% to 63.8%. This confirms our intuition that adding appropriate relations, in an appropriate manner, to the image representation can significantly improve the performance of the final recognition system.

Train by   Query by   VLM Unigram   VLM
I          I          0.484         0.551 (+13.8%)
I          S          0.465         0.762 (+63.8%)
S          I          0.478         0.603 (+26.1%)
S          S          0.891         0.920 (+3.2%)

Table 4: Comparison of the unigram VLM vs. the VLM with spatial relationships

5. Conclusion

We have introduced in this paper a new model for matching graphs derived from images. The graphs we have considered capture spatial relations between concepts associated with regions of an image. On the formal side, our model fits within the language modeling approach to information retrieval and extends previous proposals based on graphs. On the practical side, the consideration of regions and associated concepts allows us to gain generality in the description of images, a generality which may be beneficial when the usage of the system slightly differs from its training environment. This is likely to happen with image collections, as users may, for example, use one or several images to represent a scene and query a given collection for further information, e.g. in the form of categorical data. Depending on the training (based on a single image or a set of images for each category), the results of the system will vary.

The experiments we have conducted aim at assessing the validity of our approach with respect to these elements. In particular, we showed that integrating spatial relations to represent images led to a significant improvement in the results. The model we have proposed is able to adequately match images and sets of images represented by graphs.

In the future, we plan on using the graph model defined here with different divergence measures. The framework we have retained is based on the Kullback-Leibler divergence. However, the Jeffrey divergence has also been used with success on image collections, and could be used to replace the Kullback-Leibler one.

References

[1] Y. Chang, H. Ann and W. Yeh. A unique-ID-based matrix strategy for efficient iconic indexing of symbolic pictures. Pattern Recognition. Vol. 33, Issue 8, August 2000, pp. 1263-1276.

[2] J. Gao, J.-Y. Nie, G. Wu, and G. Cao. Dependence language model for information retrieval. In SIGIR ’04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 170–177, New York, NY, USA, 2004. ACM.

[3] Jeon, J., Lavrenko, V., Manmatha, R.: Automatic image annotation and retrieval using cross-media relevance models. SIGIR. (2003) 119-126.

[4] Kennedy, L., Naaman, M., Ahern, S., Nair, R., and Rattenbury, T. 2007. How flickr helps us make sense of the world: context and content in community-contributed media collections. MULTIMEDIA '07. ACM, New York, NY, 631-640.

[5] J. Li and J. Z. Wang, Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 25, no. 9, pp. 1075-1088, September 2003.

[6] J. Lim, Y. Li, Y. You and J. Chevallet, Scene Recognition with Camera Phones for Tourist Information Access, in ICME 2007, International Conference on Multimedia & Expo, Beijing, China, 2007.

[7] L. Maisonnasse, E. Gaussier, J. Chevallet, Revisiting the Dependence Language Model for Information Retrieval, SIGIR 2007, 2007.

[8] L. Maisonnasse, E. Gaussier, J. Chevallet, Multiplying Concept Sources for Graph Modeling, Lecture Notes in Computer Science, to appear, 2008.

[9] P. Mulhem and E. Debanne, A Framework for Mixed Symbolic-based and Feature-based Query by Example Image Retrieval, International Journal for Information Technology, vol. 12, no. 1, pp. 74-98, 2006.

[10] I. Ounis, Marius Pasca: RELIEF: Combining Expressiveness and Rapidity into a Single System. SIGIR 1998: 266-274.

[11] Th. Papadopoulos, V. Mezaris, I. Kompatsiaris, and M. G. Strintzis, Combining Global and Local Information for Knowledge-Assisted Image Analysis and Classification, EURASIP J. on Advances in Sign. Proc. , 2007.

[12] John C. Platt, Mary Czerwinski, Brent A. Field, PhotoTOC: Automatic Clustering for Browsing Personal Photographs, Proc. Fourth IEEE Pacific Rim Conference on Multimedia, (2003).

[13] J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In SIGIR ’98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pp. 275–281, New York, NY, USA, 1998. ACM.

[14] Robertson, S. E. and Sparck Jones, K. 1988. Relevance weighting of search terms. In Document Retrieval Systems, P. Willett, Ed. Taylor Graham Series In Foundations Of Information Science, vol. 3. Taylor Graham Publishing, London, UK, 143-160.

[15] Smith, J. R. and Chang, S. 1996. VisualSEEk: a fully automated content-based image query system. MULTIMEDIA '96, 87-98.

[16] F. Song and W. B. Croft. A general language model for information retrieval. CIKM’99, pp. 316–321, New York, NY, USA, 1999. ACM Press.

[17] M. Srikanth and R. Srikanth. Biterm language models for document retrieval. In Research and Development in Information Retrieval, pp. 425–426. ACM, New York, NY, USA, 2002.

[18] Yuan, J., Li, J., and Zhang, B. 2007. Exploiting spatial context constraints for automatic image region annotation. MULTIMEDIA '07. ACM, New York, NY.