Keyword-Based File Sorting for Information Retrieval

Balmain Beckford

Senior Thesis

Department of Computer Science

Minnesota State University, Mankato

December 27, 2010

Abstract

Keyword-based file sorting is the aggregation of related files into clusters based

on a similarity evaluation between files and the representatives within the clusters.

Keywords are the discriminating features of a file. These discriminating features are

based on the frequency of the keywords along with their weight value. Information

retrieval (IR) is a field of computer science that deals with processing text found in

files in order to more efficiently retrieve files that are the closest match to a user’s

query. Files can be indexed based on the words they contain as well as words that

match these words’ concept from a thesaurus. Keyword-based IR by itself displays

certain deficiencies in retrieving files. This thesis shows how incorporating the proper

indexing and clustering techniques will improve the quality and performance of IR

based on keywords.

1 Introduction

Keyword-based information retrieval is a technology that has been around for quite some time

but is still very useful in various search applications today. For effectiveness and simplicity,

most keyword-based information retrieval systems rely on extracting keywords from tags

associated with the file it is trying to retrieve. These tags can be generated manually by a

set of users or collaboratively through social tagging systems [27, 2]. Popular applications

today that utilize collaborative social tagging are YouTube, which allows users to upload and

create tags for videos, Flickr, which allows users to upload and create tags for photos [27], and

Amazon, which allows users to upload and create tags for almost any item the user wants to

sell. Other applications that utilize collaborative tagging are blogs and wikis [27, 35]. This

method of tagging has been found to be more intuitive than other tagging alternatives, because it

is easier for users to select a previously used tag that matches their item [27]. An advantage

of collaborative tagging is its ability to adapt to new vocabulary and word trends [27]. A

challenge that may arise from collaborative tagging is having a wide variety of tags or tag

redundancy and ambiguity due to unsupervised tagging [27]. Ambiguity may affect the

information retrieval process by causing files to be retrieved that were mistakenly placed in

the wrong category.

Keyword-based information retrieval can also be used in other areas to retrieve files

or documents. It can be used in software development to determine the traceability of a

particular piece of code [9, 32, 29]. Traceability is the ability to describe and follow the

requirement of a particular software design in both a forward and backward direction [9].

This is from the inception stage where the design of the software is conceived and throughout

the entire lifecycle of the design process. Information is retrieved from keywords associated

with objects in the software [9]. An object is a real world entity that is modeled by software.

In this field the similarity between files can be calculated by using a similarity index.

In addition, information retrieval based on keywords has been used in a number of other

specialized fields. Content-based information retrieval has been used in the medical field to

retrieve medical images based on the tags and description related to that image [10, 20, 24].

There has also been success in using this technology in journalism. It has been used in

the design of systems that retrieve multimedia news content from a database [26, 1, 16]. A

task that is usually done prior to or during the retrieval process in most keyword-based file

retrieval systems is sorting these files into categories based on similarities or relatedness.

This is to improve the accuracy of the retrieval system by helping it to return results that

are most relevant to a user’s query. This can be accomplished by using various clustering

methods like k-means clustering, hierarchical clustering and clustering by committee (CBC)

[31, 28, 27, 37, 33]. There is also a novel clustering algorithm called domain similarity clustering

by committee (DSCBC) which is an extension of CBC [31, 33]. It is an approach that handles

the ambiguities that arise when dealing with adjectives and nouns that are polysemous.

Polysemous words are words that have different meanings or word senses when used in

different contexts and hence belong in different categories given a particular context.

Google¹ has implemented a number of techniques available to users to improve their

search results. One of their techniques is allowing users to enter commands within their

searches. One example is placing a tilde before a word whose synonyms should also be

included in the search. For example, the search query ~fast food

will display results that have words that are related to “fast food” as opposed to fast, meaning

quick, and any type of food. Results may include “junk, burgers, or french fries.” An n-gram

system is used to compare the probability of which word is most likely to appear with the

search phrase “fast food”. This allows users’ queries to be easily processed and more effective

in returning the related results.

Google has also updated their search algorithm to use location-based semantic retrieval

¹ http://www.google.com/

to give results to the user about a particular query which is location sensitive. For example,

if a user searches for West Indian food and does not specify the location they may get results

that are highly ranked but are not relevant to the user’s query. For example, West Indian

cuisines from New York or Jamaica may show up in the search results when the user lives

in Mankato, Minnesota. Here, semantic information is used to cluster the documents with

the location the query is done from in order to return more appropriate results.

Clustering is used to save time for users. Recently, Facebook has implemented keyword-

based clustering, where it clusters similar postings within the news feed that were posted by

different users at different times. This saves space in displaying similar postings in the news

feed and avoids redundancy. Also, website aggregator digg.com gathers news stories from

different websites based on keywords found within their content or description. Users can

retrieve news stories given a search term. However, digg.com searches mainly by keywords

and not with semantics. Therefore, the retrieval process could be improved by incorporating

semantics in the retrieval process.

In terms of web searching, the precision of the results can be limited by a number of

factors. Bahatia and Khalid explain several factors that may hinder the user from

getting the correct results [6, 8]. The first is the large amount of information that is present

on the web. Because of it, keyword-based web search usually has low recall, and it has

low precision because of the difficulty of judging relatedness.

Additionally, it is hard for search engines to return information based on a user’s personal

preferences or the context of the words used in a query [6, 8]. In other words, it is hard to tell

what is on the user’s mind.

Although keyword-based sorting and information retrieval have been used for a long time,

there are improvements that can be made to the precision of the information retrieved.

There have been attempts to reduce the effects of ambiguity by giving users the

option of categorizing their tags [27, 4, 12]. An example of this is in Flickr, where a user who

chooses to enter the word “apple” can choose from a number of categories like computers,

fruits or city. The primary goal of this thesis is to apply a combination of information

retrieval and clustering techniques to improve the accuracy of a query with little user involvement

in pre-determining the categories or context of the query. The working hypothesis is that approaches

to keyword-based file sorting that include strategies for handling polysemous words, which may induce

ambiguity, will improve the accuracy of results from a query.

This thesis discusses some of the areas in which keyword-based file sorting and information

retrieval have been used. It will describe some of the tools that are useful in the

process of information retrieval as well as the specific tools used to generate results in our

experiments. It closes with a discussion of the results and suggestions of possible future

work.

2 Related Work

Important to this thesis is the use of algorithms to extract keywords from tags associated

with files and use them to cluster these files based on relatedness. Latent semantic indexing

(LSI) is an approach that compares documents using a vector representation of each document

[36, 13, 11, 5]. Further improvements were introduced by others to make LSI more

efficient by giving it a stronger statistical foundation [17]. This gave rise to a novel approach

to indexing called probabilistic latent semantic indexing (PLSI). PLSI generates higher

performance gains in indexing documents because it has the capability of handling the ambiguity

that arises while indexing files that have keywords that are synonyms or polysemes.

Keyword-based information retrieval has been used to trace requirements throughout software

development [9]. Cleland-Huang et al. [9] also show how a threshold is used to determine

whether a document should be retrieved. Weights are added to keywords, and a keyword

whose weight is above the threshold makes its document eligible for retrieval.

IR is evaluated using precision and recall, defined as:

$$\text{precision} = \frac{\text{number of relevant documents retrieved}}{\text{total number of documents retrieved}}$$

and

$$\text{recall} = \frac{\text{number of relevant documents retrieved}}{\text{number of relevant documents}}.$$

Precision is the fraction of the documents retrieved for a user’s query that are relevant.

Recall is the fraction of all relevant documents that are retrieved for the user’s query.
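
The two measures can be illustrated with a minimal Python sketch. The function name `precision_recall` and the file identifiers are hypothetical and serve only as an example; sets are used so that duplicate results are counted once.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: iterable of document ids returned by the system
    relevant:  iterable of document ids judged relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 files retrieved, 3 of them relevant, 6 relevant in total.
p, r = precision_recall(["f1", "f2", "f3", "f9"],
                        ["f1", "f2", "f3", "f4", "f5", "f6"])
print(p, r)  # 0.75 0.5
```

In this example three of the four retrieved files are relevant (precision 0.75), but only three of the six relevant files were found (recall 0.5).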

A rule of thumb used when extracting keywords is that words that show up less frequently

in documents carry more weight as distinguishing features of a file when determining

whether that file matches the query of a user. Research shows that if

users interactively disambiguate their queries by placing them into categories based

on topics, categorization and the retrieval of information improve [28]. Hierarchical

clustering has also been used to obtain categories based on keywords [28].

Clustering by domain specific similarity has been proven to handle ambiguity [31, 30].

More specifically, it handles ambiguity that arises from words, especially adjectives, that

have multiple word senses. For example, the word “Hot” which is an adjective can be used

to describe temperature and also could be used to describe something that is trendy. An

extension of CBC called domain similarity clustering by committee (DSCBC) was introduced

and is used to handle polysemous words, specifically adjectives. Table 1 shows results of

DSCBC and CBC [31]. CBC has more words in committees and also has words like “face”

in a category that is not really appropriate.

Mobile devices also incorporate keyword-based searching [22]. Unlike traditional navigational

search that was used in the mobile devices to exactly match the words users input to

produce results, semantic search tries to understand the user’s input by providing more of

what the user might be looking for. This improves the quality and quantity of the search by

Table 1: Examples of committees created from CBC and DSCBC

Algorithm   Committees
CBC         taste, smell, scent, aroma
            beauty, appearance, look, light, color, thing
            amount, number, time, information, system
            tone, attitude, voice word
            day, sound, face, light, word, thing
DSCBC       smell, aroma
            appearance, look
            quality, amount, number
            attitude, countenance, nature
            day, shade, room, light, face, color

returning more relevant results [22].

Keyword-based and semantic searching methods are both used to retrieve information

[22, 15]. Keyword-based searching attempts to capture the essence of a document

by the words or phrases that are present in it. This can be done either by manually analyzing

the documents or by automatic subject indexing [22]. In previous research, the

Apache Lucene library, a high-performance text search engine library that supports

AND, OR and NOT operators as well as fuzzy, proximity and wildcard searches, was used to retrieve

information from a database [22]. Retrieval is performed through an index searcher object. Semantic search for

mobile devices uses a “five sense” multimedia ontology, which reflects real-world information

connected semantically to locations [22]. When a word is searched, the words belonging to a

similar class have a similar semantic relationship with each other. This is implemented in [22] through

term mapping, a query graph and SPARQL (SPARQL Protocol and RDF Query Language).

Case-based reasoning (CBR) systems incorporate information retrieval techniques to display

the results of a user’s query. These systems mainly use two techniques to produce query

results: keyword (syntactical) IR and semantic retrieval. In keyword IR, information

is retrieved simply on the basis of spotting keywords, while semantic IR offers

a more comprehensive display of results in terms of relatedness to other documents. This

allows a CBR system to more effectively return results that match a user’s query and puts less

pressure on the user to enter specific keywords in his/her search.

In the area of medicine, indexing of medical images is an ongoing task. Latent semantic

indexing has been used to index images in a system in order to retrieve them efficiently

[7]. The authors use probability to handle words that may belong to more than one

cluster.

Learning and semantics have been very important in the area of multimedia information

retrieval (MIR) according to [26]. Previous work has emphasized incorporating classification

into MIR in order to improve the results given to users. Journalism has employed IR techniques

to retrieve multimedia content from databases. This content is indexed and clustered

using semantics and various clustering techniques to improve the accuracy of search results.

The method that is usually used in indexing media content is referred to as topic labeling

or topic clustering [16]. Both these methods cluster documents based on terms found within

the tags or description of a file. Topic clustering will be used in this project to cluster files

based on their tags.

3 Steps in Information Retrieval

This section explains the important steps that are involved in the information retrieval

process, how keywords are extracted, how documents are clustered based on relatedness and

techniques used to cluster documents.

Figure 1 shows the flow of data through a simple information retrieval system [26]. A

corpus of data is collected and then stored in a database. When a user makes a query, their

query is compared against the data within the database. If the information in the database

is relevant to the user’s query then the results are returned back to the user.

Figure 1: Diagram of an Information Retrieval System

3.1 Extracting keywords

Starting with the keyword-based document clustering algorithm described in [19], we modified

the algorithm to retrieve keywords from the tags associated with each file to be clustered.

The document vector is dependent upon the term frequency and document frequency. Since

we are extracting keywords from tags, it will be an easier process due to the fact that there

are fewer words and these terms can be considered as preprocessed to describe the files.

The document vectors represent a document, or in this case a file, and 〈term,weight〉 pairs

are the unique elements of the document vector. The weight value of a term is a ranking

value which can be used to determine whether the term is a keyword or a stopword in the

document. The weighting function w(t) can be calculated as follows:

$$w(t) = \begin{cases} 0 & \text{if } t \text{ is a stopword} \\ 1 & \text{if } t \text{ is a keyword} \\ a & \text{otherwise, where } 0 \le a \le 1 \end{cases}$$

When adding weights to terms, there are two criteria that must be fulfilled to represent a

document:

1. A discriminative value that distinguishes or characterizes the document from others

2. A measure between a keyword and a stopword

There are two approaches that can be used to add weights to terms. These approaches are

frequency-based term weighting (FBW) and keyword-based term weighting (KBW). FBW

is a statistical measure of terms in an inter-document relationship [19]. This approach is

efficient in distinguishing and characterizing a document from others which makes it useful

in document clustering for information retrieval purposes. However, we cannot rely on the frequency

of a term alone to characterize a document, because the only evaluation measure available

in frequency-based weighting is frequency statistics [19].

KBW is an approach that is based on keyword importance factors in a document. It

analyzes the content of a document to retrieve keywords from it. It calculates the weight

value of each keyword from keyword-weighting factors, and the terms are ordered by a keyword

ranking score. The ranking score is found from the keyword analysis results of the document.

KBW can be used with FBW to efficiently weight keywords and eliminate the deficiencies of

FBW.

The keyword ranking of a term depends on the document type and the location

and role of the word in the sentence or paragraph. Thematic words are the representative

terms for a document [19]. These words can be extracted from text by analyzing its contents.

In the case of this project, our keywords are found within XML tags, but in more common

cases they could be found in the bodies of documents or within the titles of documents. In these more

common cases of extracting keywords, keywords can be classified by different features [19].

They can be classified by word-level features, which are the part-of-speech information. They can also be

classified by sentence-level features, which are the type of phrase and the sentence location and type.

Different weights are given to terms that occur in different parts of a sentence. For example,

a term in the subjective clause in the English language may carry more weight than a term

in an auxiliary clause.

3.2 Topic Clustering and Topic Labeling

Topic clustering is the grouping of items into topics or subjects. It can be done by using

commonly appearing words or phrases to sort items into categories. This is done by first

finding the phrase or words that appear in each item while ignoring stop words. After these

key phrases are found, they are used to put items into related clusters.

Topic clustering is used in search engines to display results from searches [3]. It is

particularly useful when the item searched for belongs to a number of different categories.

Then, topic clustering can be used to display the results that are related to each individual

category. Typical approaches for clustering are hierarchical clustering, which is distance-

based clustering that can be agglomerative or divisive, and k-means clustering, which is also

distance-based and creates clusters by selecting a central file and building the cluster around

it based on relatedness [27].

Topic labeling is very similar to topic clustering. The difference is that topic labeling

usually uses previously defined data to compare the items against. Topic labeling is frequently

applied by using the k-nearest neighbor strategy. The data is split into training and

evaluation data. The training data has been manually assigned topics and is then used in

the evaluation phase to compare to the evaluation data, or incoming data, for similarities to

the set of topics in the training set [16, 1]. New documents or files are placed into the topic

cluster of their most frequent neighbors.
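
The k-nearest neighbor strategy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Jaccard overlap between keyword sets is used as the similarity measure, and the names `jaccard`, `knn_label` and the training data are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Overlap between two keyword sets, from 0.0 (disjoint) to 1.0 (identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def knn_label(tags, training, k=3):
    """Label a file with the most frequent topic among its k nearest neighbours.

    training: list of (keyword_set, topic) pairs that were labelled manually
    """
    # Rank the training files from most to least similar to the new file.
    ranked = sorted(training, key=lambda pair: jaccard(tags, pair[0]), reverse=True)
    votes = Counter(topic for _, topic in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, manually labelled training data.
TRAINING = [
    ({"funny", "bloopers", "laugh"}, "Comedy"),
    ({"funny", "prank"}, "Comedy"),
    ({"football", "goal", "win"}, "Sports"),
    ({"election", "senate"}, "Politics"),
]
print(knn_label({"funny", "sports", "bloopers"}, TRAINING))  # Comedy
```

With k = 3 the two closest neighbours are both comedy files, so the new file is placed in the Comedy topic even though one of its tags is "sports".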

3.3 Clustering the file

After the keywords are extracted from the tags of files they are placed into clusters based

on a similar clustering algorithm we modified from [19]. Let $C$ be the set of all clusters.

If $n$ represents the number of clusters in the set $C$, then $C$ contains the clusters

$C_1, C_2, C_3, \ldots, C_n$:

$$C = \{C_1, C_2, C_3, \ldots, C_n\} \tag{1}$$

Each cluster $C_i$ has to be initialized by a file $f$ that is not assigned to the existing clusters

[19]. File $f$ is considered to be the seed file of $C_i$. Every time a new cluster is created, a sequence

of steps is taken to expand and reduce the cluster so that it reaches a stable state from the start

state. In each evolution step for cluster $C_i$, $C_i^j$ is the $j$th state of the initialized cluster $C_i$.

$$C_i^j : \text{the } j\text{th state of a cluster } C_i \tag{2}$$

The characteristic vector of a cluster is a set of 〈keyword, weight〉 pairs that represents

the cluster. If $K_F$ is the keyword set of a file $F$ and $K_{C_i}$ is the keyword set of a cluster $C_i$,

then $K_{C_i}^j$ is the keyword set of the $j$th state of the cluster $C_i$. Given the keyword sets for each file, cluster $C_i$

is created by the self-expanding algorithm.

3.4 Initializing the cluster

According to Kang, the first step of the clustering algorithm is the creation and initialization

of a new cluster [19]. A file $F$ is randomly selected from the documents that are not yet

assigned to a cluster. It is assigned to a new cluster $C_i^0$, which is the initial state of cluster

$C_i$.

$$C_i^0 = \{F\} \tag{3}$$

Because this file $F$ is the first file in the new cluster, it is called the initialization file or

the seed file. The keyword set $K_F$ of a file $F$ is the set of keywords $k_1, k_2, \ldots, k_n$ that are extracted

from file $F$. The initial state of the keyword set, $K_{C_i}^0$, is initialized by $K_F$. Algorithm 1 shows

the steps for keyword-based clustering.

$$K_{C_i}^0 = K_F \tag{4}$$

$$K_F = \{k \mid k \text{ is a keyword that is extracted from } F\}$$

Algorithm 1 Keyword-based clustering algorithm

    C_i^0 = {F}
    K_{C_i}^0 = K_F
    C_i^1 = {F_x | document F_x where k ∈ K_{F_x}}  ∀ k such that k ∈ K_{C_i}^0
    j = 1
    do {
        K_{C_i}^j = ∪ K_{F_x} where F_x ∈ C_i^j
        C_i^{j+1} = C_i^j
        ∀ F_x ∈ C_i^j begin
            s = sim(F_x, K_{C_i}^j)
            if (s ≤ threshold) C_i^{j+1} = C_i^{j+1} − F_x
        end for
        j = j + 1
    } while (isDeleteDocument())
    C_i = C_i^j
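
The structure of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch and not the implementation used for the experiments. Several details are assumptions made for the example: the characteristic vector weights each keyword by the share of cluster files that carry it, `sim` is the mean of those weights over a file's keywords, the seed file is never removed, the seed is chosen deterministically instead of randomly, and the file names, tags and threshold are hypothetical.

```python
def characteristic_vector(cluster, keywords):
    """Weight each cluster keyword by the share of cluster files that carry it."""
    weights = {}
    for f in cluster:
        for k in keywords[f]:
            weights[k] = weights.get(k, 0.0) + 1.0 / len(cluster)
    return weights


def sim(file_keywords, weights):
    """Mean weight of the file's keywords in the cluster's characteristic vector."""
    if not file_keywords:
        return 0.0
    return sum(weights.get(k, 0.0) for k in file_keywords) / len(file_keywords)


def build_cluster(seed, keywords, unassigned, threshold):
    """Expand a cluster from a seed file, then reduce it until no file is deleted."""
    seed_keywords = keywords[seed]  # K^0 = K_F
    # C^1: every unassigned file that shares a keyword with the seed file.
    cluster = {f for f in unassigned if keywords[f] & seed_keywords}
    cluster.add(seed)
    while True:
        weights = characteristic_vector(cluster, keywords)  # built from K^j
        removed = {f for f in cluster
                   if f != seed and sim(keywords[f], weights) <= threshold}
        if not removed:  # stable state: nothing was deleted in this pass
            return cluster
        cluster -= removed


def cluster_files(keywords, threshold):
    """Create clusters one after another until every file is clustered."""
    unassigned, clusters = set(keywords), []
    while unassigned:
        seed = min(unassigned)  # deterministic stand-in for the random choice
        cluster = build_cluster(seed, keywords, unassigned, threshold)
        clusters.append(cluster)
        unassigned -= cluster
    return clusters


# Hypothetical keyword sets extracted from the tags of four files.
KEYWORDS = {
    "a.xml": {"funny", "sports", "bloopers"},
    "b.xml": {"funny", "prank"},
    "c.xml": {"election", "senate"},
    "d.xml": {"sports", "football", "goal", "win"},
}
clusters = cluster_files(KEYWORDS, threshold=0.45)
print(clusters)
```

In this example d.xml first joins the cluster seeded by a.xml because both carry the keyword "sports", but it is removed in the reduction step because most of its keywords are not shared with the other files, and it later forms a cluster of its own.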

3.5 Adding files to a Cluster

After the cluster is initialized as $C_i^0$, it can be expanded by adding more files that are related

to the seed file. This is done by adding more files to the cluster and keywords to the keyword

set. The files that contain any keyword of $K_{C_i}^0$, which holds the keywords extracted from

the seed file, form the cluster $C_i^1$, which is the next state of the cluster $C_i$. This expands the

cluster as follows:

$$C_i^1 = \{F_x \mid k \in F_x,\ k \in K_{C_i}^0\} \tag{5}$$

The cluster is expanded by a number of iterations consisting of keyword expansion and

cluster expansion. Additional files are added to the cluster by a similarity evaluation between

the keyword set and the file. When a new file $f$ is added to the cluster, its keywords are

also added to the set of keywords $K$ of that cluster. When the first expansion of the cluster

is performed, keywords from the keyword set of the seed file are used. When the second expansion

is performed, the new keyword set is used, which now consists of the keywords from

the seed file and the keywords from the added files. Therefore, the $i$th expansion of the cluster is

performed by using the $(i-1)$th state of the keyword set.

The total number of iterations is determined by the size of the dataset. If a cluster is

expanded from $C_i^0$ to $C_i^1$, the keyword set $K_{C_i}^0$ is also expanded to a new keyword set $K_{C_i}^1$

that consists of the keywords appearing in any of the files of cluster $C_i^1$. The keyword set $K_{C_i}^j$ of $C_i^j$ is the union of the

keyword sets of all the files in $C_i^j$.

The keyword set $K_{C_i}^j$ of the cluster $C_i^j$ is used to calculate the characteristic vector of each

cluster. The characteristic vector consists of the weight values calculated from the term frequency

(TF) and inverse document frequency (IDF) of the keywords, and it is used to calculate the

similarity measure between a file and the cluster.
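
One common way to realize such a similarity measure is to build TF-IDF vectors and compare them with cosine similarity. The sketch below assumes this combination; the logarithmic IDF variant, the function names `tf_idf` and `cosine`, and the tag lists are assumptions for illustration rather than the exact formulas of [19].

```python
import math
from collections import Counter

# Hypothetical tag lists for three files.
FILES = {
    "a.xml": ["funny", "sports", "bloopers"],
    "b.xml": ["funny", "prank"],
    "c.xml": ["election", "senate"],
}
# Number of files in which each keyword occurs.
DOC_FREQ = Counter(k for tags in FILES.values() for k in set(tags))


def tf_idf(names):
    """TF-IDF vector for one file or for a cluster given as a list of file names."""
    counts = Counter(k for name in names for k in FILES[name])
    return {k: tf * math.log(len(FILES) / DOC_FREQ[k]) for k, tf in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) \
        * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


cluster_vector = tf_idf(["a.xml", "b.xml"])  # characteristic vector of a cluster
print(cosine(tf_idf(["a.xml"]), cluster_vector))  # high: a.xml shares its keywords
print(cosine(tf_idf(["c.xml"]), cluster_vector))  # 0.0: no shared keywords
```

A file would then be kept in the cluster only if this value exceeds the chosen threshold.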

3.6 Reducing and completing the cluster

This phase of the algorithm produces a complete cluster by removing files that do not

belong to the cluster. For the cluster $C_i^j$, the similarity of each file to the cluster is computed,

and files with a low similarity, which are not related to $C_i^j$, are removed.

This filters out the files that are unrelated to the cluster, and the cluster $C_i^{j+1}$ is generated as

the next state of the cluster $C_i^j$. Ultimately the cluster $C_i$ is completed. The next cluster

$C_{i+1}$ is created through the same process. Clustering is terminated when all the documents are

clustered or no more clusters are created.

<video>
  <title>Funny Sports Bloopers</title>
  <category>Comedy</category>
  <tags>Funny, Sports, Bloopers</tags>
  <id>1796OXXdVzs</id>
</video>

Figure 2: Example of XML file used to create the dataset.

<video>
  <tags>Funny, Sports, Bloopers</tags>
</video>

Figure 3: Example of the contents of the XML file

4 Tools for Information Retrieval

This section describes the dataset collected for this project, how words are processed for

information retrieval, indexing and the Lemur toolkit used to implement this project. The

significance of each tool to this project is also mentioned in this section.

4.1 Dataset

The dataset used for this project is information gathered by a YouTube video miner created

to extract video data from the site in XML format [23]. This information was placed in a file that

was parsed to identify and extract the tags associated with the videos using Perl’s built-in

XML parser. An example of an extracted tag file is shown in Figure 2. A Perl script was

used to separate each XML video representation into individual files. The contents of the file

consisted of only the tags from a video in XML format as shown in Figure 3. The categories

used were comedy, music, politics, gardening, auto and sports. The dataset consisted of a

total of 300 XML video files with 50 in each category as shown in Table 2.
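
The tag extraction step can be illustrated with a short sketch. The project used a Perl script; the example below uses Python's standard `xml.etree.ElementTree` module instead, only to stay in one language across the examples, and it is not the script used to build the dataset. The function name `extract_tags` and the lower-casing of tags are assumptions.

```python
import xml.etree.ElementTree as ET

# One video record in the format shown in Figure 2.
SAMPLE = """<video>
<title>Funny Sports Bloopers</title>
<category>Comedy</category>
<tags>Funny, Sports, Bloopers</tags>
<id>1796OXXdVzs</id>
</video>"""


def extract_tags(xml_text):
    """Return the lower-cased tags of one <video> record."""
    video = ET.fromstring(xml_text)
    raw = video.findtext("tags", default="")
    # Tags are comma separated; drop surrounding spaces and empty entries.
    return [tag.strip().lower() for tag in raw.split(",") if tag.strip()]


print(extract_tags(SAMPLE))  # ['funny', 'sports', 'bloopers']
```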

Table 2: XML video count for each category

Topic       XML count
Comedy      50
Music       50
Politics    50
Gardening   50
Auto        50
Sports      50

Table 3: Examples of related words

Topic       Related words
Comedy      fun, laugh, humor, hilarious
Sports      sport, football, goal, win, lose
Politics    election, democrat, republican, politician

4.2 Processing Words

This section describes synonyms, related words, parts-of-speech tagging, stop word removal

and stemming used to process words for this project.

4.2.1 Synonyms and Related Words

Synonyms and related words were used as a test set to compare to the useful keywords that

were taken from tags as a way to initiate the clustering process. These words were placed

into different categories based on the controlled categories for this experiment. This is to

provide a domain specific concept thesaurus that will be used to determine the concepts that

certain keywords relate to. Table 3 shows examples of related words within different topics.

Table 4: Sample of stopwords used in the stopword removal process

Type          Words
Pronouns      I, it
Determiners   a, an, that, the, this
Prepositions  about, by, for, from, of, on, to
Verbs         are, is

4.2.2 Part-of-Speech Tagging

Part-of-speech tagging was used to extract nouns and adjectives from tags in order to select

useful words for categorizing tags. Nouns and adjectives are used because they are more

useful as distinguishing features of a document. For example, a noun or an adjective will carry

more information than a determiner or a preposition in a sentence. The Brill tagger² and the

Stanford tagger³ are popular part-of-speech tagging tools that are useful for processing the

tags associated with each video. The Stanford tagger was used for this experiment because

of its simplicity in generating part-of-speech tags.

4.2.3 Stop Words

Stop words are words that occur frequently in a document but have no significant value as

distinguishing features of a document. They are usually prepositions, determiners, pronouns

and simple verbs. The removal of these words from the document or tagset saves processing

time during document indexing. Table 4 shows some common stop words.
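
Stop word removal amounts to filtering the words against a fixed list. A minimal sketch, assuming the small sample list from Table 4 (a real list would be longer) and a hypothetical function name `remove_stop_words`:

```python
# Stop words taken from the sample in Table 4.
STOP_WORDS = {"i", "it", "a", "an", "that", "the", "this",
              "about", "by", "for", "from", "of", "on", "to", "are", "is"}


def remove_stop_words(words):
    """Keep only the words that are not in the stop word list."""
    return [w for w in words if w.lower() not in STOP_WORDS]


print(remove_stop_words("the best of funny sports bloopers".split()))
# ['best', 'funny', 'sports', 'bloopers']
```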

4.2.4 Stemming

Stemming is the process of finding the stem word by reducing a word to its base form.

This saves processing time by reducing the number of words to be processed and allows

² http://www.ling.gu.se/~lager/mogul/brill-tagger/index.html
³ http://nlp.stanford.edu/software/tagger.shtml

<auto><doc1><2>
<ball><doc2><1>
<car><doc1><2>
...

Figure 4: Example of inverted file structure. Field 1 is the word, field 2 is the document the word is in, and field 3 is the number of times the word is seen in the document.

the document to be converted into an inverted file format for efficient storage as shown

in Figure 4. The first position in the inverted file represents a term that can be found

within that document. The second position is the name of the document and the third

position represents the frequency of the term within that document. The Porter stemmer⁴ is a

common tool used to automatically find stem words [34]. Another stemmer is the Krovetz

stemmer⁵, which uses a dictionary to find the stem words. The Krovetz stemmer usually

produces a larger output file [21]. It also avoids some of the errors that the Porter stemmer

might produce by outputting meaningless stem words. The Lemur toolkit also has a built-in

stemmer application with the option of choosing between the Porter and the Krovetz

stemmer.
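
The inverted file format of Figure 4 can be produced from already stemmed documents with a few lines of Python. This is a sketch for illustration only; the function name `inverted_file` and the document contents are hypothetical, and the stemming itself is assumed to have been done beforehand.

```python
from collections import Counter


def inverted_file(documents):
    """Return sorted (term, document, frequency) triples.

    documents: dict mapping document name -> list of (already stemmed) terms
    """
    entries = []
    for name, terms in documents.items():
        for term, count in Counter(terms).items():
            entries.append((term, name, count))
    return sorted(entries)


DOCS = {"doc1": ["auto", "car", "auto", "car"], "doc2": ["ball"]}
for term, name, count in inverted_file(DOCS):
    print(f"<{term}><{name}><{count}>")
# <auto><doc1><2>
# <ball><doc2><1>
# <car><doc1><2>
```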

4.3 Indexing

Indexing is the process of collecting and storing data for efficient retrieval. Indexing allows

for the retrieval of relevant documents based on a search query. If documents were not

indexed, all the documents in a corpus would have to be searched in order to return the

desired result to the user.

⁴ http://tartarus.org/~martin/PorterStemmer/
⁵ http://www.comp.lancs.ac.uk/computing/research/stemming/general/krovetz.htm

4.3.1 Latent Semantic Indexing

Latent semantic indexing (LSI) uses vector semantics to represent the relationship between

files. LSI has been shown to have better performance than term matching approaches [18, 11,

5]. It outperforms other approaches mainly when a higher recall is needed, text descriptions

are short, or when text is noisy, i.e. when unwanted words are present [18].

Adding weights to elements in an index matrix is typically done by term frequency-inverse

document frequency (TF-IDF). Each element in the matrix is proportional to the number of

times a term appears in a document and is scaled down for terms that appear in many

documents, meaning that terms that occur in fewer documents are usually more

important and as a result carry more weight.

After the documents are represented in vector form we are able to do the following:

• Check the relatedness of documents $j$ and $q$ in the concept space by comparing vectors

$\hat{d}_j$ and $\hat{d}_q$. This can be done by cosine similarity and gives us the clustering of documents.

• Compare two different terms $i$ and $p$ by comparing the vectors $\hat{t}_i$ and $\hat{t}_p$. This gives

a clustering of the terms in the concept space.

• View user’s query as a mini-document and be able to compare it to other documents

in the concept space. For this to be possible we must first translate the user’s query

into the concept space, which involves transforming it to a vector representation.

Even though LSI performs better than other similar approaches, there are still some

drawbacks that may occur when using LSI. The first drawback is that it is difficult to debug and

analyze because its concept space cannot be easily understood by humans. The second problem

is the performance cost of doing singular value decomposition (SVD). The cost

is $O(N^2 k^3)$, where $N$ is the number of documents and $k$ represents the number of dimensions

in the concept space. Because $N$ will continuously grow, SVD becomes almost impractical for

a large dynamic dataset [18]. Another drawback in terms of performance arises when new

documents are added, because SVD has to be performed again.

4.3.2 Probabilistic Latent Semantic Indexing

Probabilistic latent semantic indexing (PLSI), also known as probabilistic latent semantic
analysis (PLSA), is a statistical technique for analyzing co-occurrence data. It is an
extension of LSI that adds a more solid probabilistic model to indexing documents. Whereas
LSI uses singular value decomposition [36, 14], PLSA is based on a mixture decomposition
derived from a latent class model, which gives it its solid statistical foundation [17, 14].

PLSA models the probability of each co-occurrence of a word and a document (w, d) as a
mixture of conditionally independent multinomial distributions:

$$P(w, d) = \sum_{c} P(c)\, P(d \mid c)\, P(w \mid c) = P(d) \sum_{c} P(c \mid d)\, P(w \mid c). \tag{6}$$

The first form, $\sum_{c} P(c)\, P(d \mid c)\, P(w \mid c)$, is a symmetric formulation in which the
conditional probabilities $P(d \mid c)$ and $P(w \mid c)$ are used to generate w and d from the
latent class c. The second form, $P(d) \sum_{c} P(c \mid d)\, P(w \mid c)$, is an asymmetric
formulation in which, for each document d, a latent class is chosen conditioned on the
document according to $P(c \mid d)$, and a word is then generated from that latent class
according to $P(w \mid c)$.
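The two formulations in Equation 6 agree because Bayes' rule gives P(c)P(d|c) = P(d)P(c|d). A small numeric check can make this concrete. All probabilities below are made-up values for a toy model with two latent classes, two documents and three words; the function names are hypothetical.

```python
# Toy model: 2 latent classes, 2 documents, 3 words (all numbers made up).
p_c = [0.6, 0.4]                                   # P(c)
p_d_given_c = [[0.7, 0.3], [0.2, 0.8]]             # P(d|c), one row per class
p_w_given_c = [[0.5, 0.3, 0.2], [0.1, 0.2, 0.7]]   # P(w|c), one row per class

classes, n_docs, n_words = range(2), 2, 3

def joint_symmetric(w, d):
    """P(w, d) = sum_c P(c) P(d|c) P(w|c)."""
    return sum(p_c[c] * p_d_given_c[c][d] * p_w_given_c[c][w] for c in classes)

# Quantities needed by the asymmetric form, obtained via Bayes' rule.
p_d = [sum(p_c[c] * p_d_given_c[c][d] for c in classes) for d in range(n_docs)]
p_c_given_d = [
    [p_c[c] * p_d_given_c[c][d] / p_d[d] for c in classes]
    for d in range(n_docs)
]

def joint_asymmetric(w, d):
    """P(w, d) = P(d) sum_c P(c|d) P(w|c)."""
    return p_d[d] * sum(p_c_given_d[d][c] * p_w_given_c[c][w] for c in classes)

# Both forms give the same joint probability for every (w, d) pair.
for d in range(n_docs):
    for w in range(n_words):
        assert abs(joint_symmetric(w, d) - joint_asymmetric(w, d)) < 1e-12

# The joint distribution sums to one over all pairs.
total = sum(joint_symmetric(w, d) for d in range(n_docs) for w in range(n_words))
print(round(total, 10))
```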

The aspect model of PLSA suffers from overfitting because the number of parameters grows
linearly with the number of documents. PLSA is a generative model of the documents in the
collection it is estimated on, but it is not a generative model of new documents.
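The linear growth can be seen by counting free parameters in the asymmetric formulation. As a rough sketch, assume k latent classes, a vocabulary of V words and N documents (V is a symbol introduced here for illustration). Each class needs a distribution over words and each document needs a distribution over classes:

```latex
\underbrace{k\,(V - 1)}_{P(w \mid c)} \;+\; \underbrace{N\,(k - 1)}_{P(c \mid d)}
```

The first term is fixed once the vocabulary and the number of classes are chosen, while the second term grows with every document added to the collection.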

4.4 Lemur Toolkit

The Lemur toolkit⁶ is a natural language processing toolkit that was created for the purpose

of doing information retrieval experiments [25]. It has a number of built-in applications that

are useful for creating an index of data and testing algorithms. This toolkit was used to

test indexing techniques like PLSI along with clustering techniques like k-means clustering.

⁶ http://www.lemurproject.org/

The Lemur toolkit also has the capacity to do pre-processing tasks like word stemming

and filtering stop words. It has a built-in stop word removal application that uses its own

standard stop word list.

5 Methodology

After the XML video files were generated by the YouTube miner, they were copied to an
XML file by category. A Perl script was then used to separate each video block into an
individual XML file. This was the collection of videos to be indexed. The Lemur toolkit
was then used to run the function BuildIndex to generate an index from the collection of
files.
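The splitting step can be sketched as follows. The thesis used a Perl script whose details are not given, so this is a Python illustration under stated assumptions: the element name `video`, the sample category file and the output file naming are all hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def split_videos(combined_path, out_dir, record_tag="video"):
    """Write each <record_tag> element of combined_path to its own XML file.

    Returns the list of file paths written, in document order.
    The tag name and file naming scheme are assumptions for illustration.
    """
    os.makedirs(out_dir, exist_ok=True)
    root = ET.parse(combined_path).getroot()
    written = []
    for number, record in enumerate(root.iter(record_tag), start=1):
        out_path = os.path.join(out_dir, f"video_{number:05d}.xml")
        ET.ElementTree(record).write(out_path, encoding="utf-8", xml_declaration=True)
        written.append(out_path)
    return written

# Tiny made-up category file to exercise the function.
sample = """<category name="music">
  <video><title>First clip</title><tags>guitar live</tags></video>
  <video><title>Second clip</title><tags>piano cover</tags></video>
</category>"""

work_dir = tempfile.mkdtemp()
combined = os.path.join(work_dir, "music.xml")
with open(combined, "w", encoding="utf-8") as handle:
    handle.write(sample)

paths = split_videos(combined, os.path.join(work_dir, "split"))
print(len(paths))
```

Each output file then holds a single video record, which is the unit that the indexing step treats as one document.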

After the index was generated, a number of single-word queries were run on the index
using the Lemur function IndriRunQuery. The results were used as the baseline
for this experiment. The main focus of the rest of the project was to improve upon the
baseline performance, to see whether applying clustering techniques to indexed data
improves information retrieval.

k-means clustering was applied to the index and a number of single-word queries were
run again on the same index. PLSI was then applied to the index and another set of
single-word queries was run to compare the results. The results were tabulated based on the
precision and recall of each query.

This process was then repeated with queries having multiple words.
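For readers unfamiliar with the clustering step, the following is a minimal sketch of k-means (Lloyd's algorithm) on small numeric vectors. It is a generic illustration, not the Lemur toolkit's clustering application; the points, the choice of initial centroids and the function names are made up.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def k_means(points, initial_centroids, max_iterations=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = [list(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # No point changed cluster, so the result is stable.
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # An empty cluster keeps its previous centroid.
                centroids[i] = [sum(axis) / len(members) for axis in zip(*members)]
    return assignments, centroids

# Four made-up document vectors forming two obvious groups.
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
assignments, centroids = k_means(points, initial_centroids=[points[0], points[2]])
print(assignments)
```

The initial centroids are passed in explicitly so that the run is deterministic; in practice they are often chosen at random, and different starting points can lead to different clusterings.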

6 Results

The results of the experiment were evaluated using precision and recall, as defined in
Section 2, to compare the performance of keyword-based information retrieval before and
after k-means clustering and PLSI were applied. A total of 12 queries were made, 6
single-word queries and 6 multiple-word queries. The results were tabulated separately for
queries consisting of only one word and queries consisting of multiple words, and are
shown in Tables 5 and 6.

Table 5: Results for queries consisting of one word

Method               Precision   Recall
Regular index        0.75        0.66
k-means clustering   0.85        0.82
PLSI                 0.88        0.88
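As a reminder of how the two measures are computed for a single query, here is a minimal sketch. It assumes the standard set-based definitions (precision is the fraction of retrieved documents that are relevant, recall is the fraction of relevant documents that are retrieved); the document IDs are made up and do not come from the experiment.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Made-up document IDs for a single query.
retrieved = ["v1", "v2", "v3", "v4"]
relevant = ["v1", "v2", "v3", "v7", "v8", "v9"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Here three of the four retrieved documents are relevant, and three of the six relevant documents were found.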

When a query consisting of one word was run on the regular index, precision was 0.75 and
recall was 0.66. When k-means clustering was applied to the index, precision rose to
0.85 and recall rose to 0.82. When PLSI was applied, precision and recall both rose to 0.88.
Measured relative to the regular index, these are increases of 17.3% in precision and 33.3%
in recall. The best precision and recall were obtained when PLSI was applied.

When a query consisting of two words was run on the regular index, precision fell to 0.67
and recall rose to 0.82 compared with the one-word case. When k-means clustering was applied,
precision increased to 0.82 and recall decreased slightly to 0.80. When PLSI was applied,
precision increased to 0.86 and recall increased to 0.88. Measured relative to the regular
index, these are increases of 28.4% in precision and 7.3% in recall; measured relative to
k-means clustering, they are increases of 4.9% and 10% respectively. For two-word
queries, the best precision and recall were again obtained when PLSI was applied,
supporting the results from the one-word case.

Table 6: Results for queries consisting of multiple words

Method               Precision   Recall
Regular index        0.67        0.82
k-means clustering   0.82        0.80
PLSI                 0.86        0.88

7 Conclusion and Future Work

As television moves to the Internet with GoogleTV and sites like YouTube and Hulu,
the typical viewer's habits may change. Channel surfing, or searching by category
or program title on a local cable network, will soon be over. Television sets may
have access to a much broader search space on the Internet. There, viewers will be able to do
more specific searches using genre, actors, context or even specific scenes. The system will
then retrieve a result based on the context or word sense that satisfies the user.

Improved information retrieval can also benefit Internet radio, where users can get
more accurate results based on their listening preferences. Instead of using keyword-based
searches or the previous patterns of other users to determine the next song to play, such
services would use a more probabilistic approach to determine what the user wants.

In this project, the results supported the hypothesis: when clustering and indexing
techniques were applied to a collection of data, information retrieval improved. The
improvements were modest, which may be due to the small size of the dataset.

This project could be improved upon by using larger datasets or tags with larger sets of

keywords. Also, more tests could be run with queries having more than two words. Next,

I would incorporate these findings into designing a system that could index and cluster

multi-media files without predefined tags for more efficient information retrieval.

References

[1] Najaf Ali Shah. Topic-based clustering of news articles. Proceedings of ACM SE ’04

ACM Southeast Regional Conference 2004, pages 412–413, 2004.

[2] Emilia Apostolova, Sean Neilan, Gary An, Noriko Tomuro, and Steven Lytinen. Djan-

gology: A light-weight web-based tool for distributed collaborative text annotation.

Proceedings of the Seventh Conference on International Language Resources and Eval-

uation (LREC’10), 2010.

[3] Grigory Begelman, Philipp Keller, and Frank Smadja. Automated tag clustering:
Improving search and exploration in the tag space. Proceedings of the Collaborative
Web Tagging Workshop, 2006.

[4] Ron Bekkerman and James Allan. Using bigrams in text categorization. Center of

Intelligent Information Retrieval, UMass Amherst, IR-408:1–10, 2004.

[5] Jerome Bellegarda. Exploiting latent semantic information in statistical language mod-

eling. Proceedings of the IEEE, 88:1279–1296, 2000.

[6] MPS Bhatia and Akshi Kumar Khalid. Information retrieval and machine learning:

Supporting technologies for web mining research and practice. Webology, 5, 2008.

[7] Mike Brown, Christiane Fortsch, and Dieter WiBman. Combining information retrieval

and case-based reasoning for middle ground text retrieval problems. AAAI Technical

Report WS-98-12, pages 3–7, 1998.

[8] H. Chen and Susan Dumais. Bringing order to the web: Automatically categorizing

search results. Proceedings of the SIGCHI Conference on Human Factors in Computing

Systems, pages 145–152, 2000.

[9] Jane Cleland-Huang, Raffaella Settimi, and Oussama Benkhadra. Global-centric
traceability for managing non-functional requirements. Proceedings of the 27th International
Conference on Software Engineering, pages 362–371, 2004.

[10] Grace Dasovich, Robert Kim, Daniela S. Raicu, and Jacob Furst. A model for the

relationship between semantic and content based similarity using LIDC. Proceedings of

Medical Imaging 2010: Computer-Aided Diagnosis Conference, 2010.

[11] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and

Richard Harshman. Indexing by latent semantic analysis. Journal of the American

Society for Information Science, 41(6):391–407, 1990.

[12] Carlotta Domeniconi and Muna Al-Razgan. Weighted cluster ensembles: Methods and

analysis. ACM Transactions on Knowledge Discovery from Data, 2:3–40, 2009.

[13] Susan T. Dumais. Latent semantic indexing. Proceedings of the Text Retrieval Confer-

ence, 1995.

[14] Ayman Farahat and Francie Che. Improving probabilistic latent semantic analysis with

principal component analysis. Eleventh Conference of the European Chapter of the

Association for Computational Linguistics (EACL -2006), pages 105–112, 2006.

[15] William B. Frakes and Ricardo A. Baeza-Yates, editors. Information Retrieval: Data

Structures & Algorithms. Prentice-Hall, 1992.

[16] Alexander Hauptmann. Topic labeling of multilingual broadcast news in the Informedia
digital video library. Proceedings of the Ninth ACM International Conference on
Multimedia, pages 1–6, 1999.

[17] Thomas Hofmann. Probabilistic latent semantic indexing. Proceedings of the 22nd Annual
International ACM SIGIR Conference on Research and Development in Information
Retrieval, pages 50–57, 1999.

[18] Jason Hong. An overview of latent semantic indexing. Website, 2000. http://www.cs.

berkeley.edu/~jasonh/classes/sims240/sims-240-final-paper-lsi_files/sims.

[19] Seung-Shik Kang. Keyword-based document clustering. Proceedings of the Sixth Inter-

national Workshop on Information Retrieval with Asian Languages, 11:132–137, 2003.

[20] Robert Kim, Grace Dasovich, Runa Bhaumik, Richard Brock, Jacob D. Furst, and

Daniela S. Raicu. An investigation into the relationship between semantic and content

based similarity using LIDC. Proceedings of the International Conference on Multimedia

Information Retrieval, pages 185–192, 2010.

[21] Bob Krovetz. Viewing morphology as an inference process. Proceedings of 16th ACM

SIGIR Conference, 1993.

[22] Tae-Hoon Lee, Jung-Hyun Kim, Hyeong-Joon Kwon, and Kwang-Seok Hong. Keyword-

based semantic retrieval system using location information in a mobile environment.

Proceedings of the 2009 International Symposium on Web Information Systems and

Applications, 2009.

[23] Brian McMahan. Personal communication, October 2010. http://ytminer.braingineer.net.

[24] Prakash Nadkarni. An introduction to information retrieval: Applications in genomics.

The Pharmacogenomics Journal, 2:96–102, 2001.

[25] Paul Ogilvie and Jamie Callan. Experiments using the Lemur Toolkit. Proceedings of

the TREC, 2001.

[26] Michael S. Lew, Nicu Sebe, Chabane Djeraba, and Ramesh Jain. Content-based mul-

timedia information retrieval: State of the art and challenges. ACM Transactions on

Multimedia Computing, Communications and Applications, 2:1–19, 2006.

[27] Andriy Shepitsen, Jonathan Gemmell, Bamshad Mobasher, and Robin Burke. Personal-

ized recommendation in social tagging systems using hierarchical clustering. Proceedings

of the 2008 ACM Conference on Recommender Systems, 2008.

[28] Ahu Sieg, Bamshad Mobasher, Steve Lytinen, and Robin Burke. Using concept hierar-

chies to enhance user queries in web-based. Proceedings of the International Conference

on Artificial Intelligence and Applications, 2004.

[29] Vijayan Sugumaran and Veda C. Storey. A semantic-based approach to component

retrieval. The DATA BASE for Advances in Information Systems-Summer 2003, 34:8–

24, 2003.

[30] Noriko Tomuro and Steve Lytinen. Polysemy in lexical semantics–Automatic discovery
of polysemous senses and their regularities. NYU Symposium on Semantic Knowledge
Discovery, Organization and Use, 2008.

[31] Noriko Tomuro, Steven Lytinen, Kyoko Kazaki, and Hitoshi Isahara. Clustering using

feature domain similarity to discover word sense for adjectives. International Conference

on Semantic Computing, pages 370–377, 2007.

[32] Paolo Tonella, Christian Girardi, and Emanuele Pianta. An empirical study on keyword-

based web site clustering. Proceedings 12th IEEE International Workshop on Program

Comprehension (IWPC’04), pages 204–213, 2004.

[33] Hwee Tou Ng and Hian Beng Lee. Integrating multiple knowledge sources to
disambiguate word sense: An exemplar-based approach. Proceedings of the 34th Annual
Meeting on Association for Computational Linguistics, pages 40–47, 1996.

[34] Cornelis van Rijsbergen, S.E. Robertson, and M.F. Porter. New models in probabilistic

information retrieval. British Library Research and Development Report, (5587), 1980.

[35] Pu Wang, Carlotta Domeniconi, and Jian Hu. Cross-domain text classification using

wikipedia. IEEE Intelligent Informatics Bulletin, 9:5–17, 2008.

[36] Peter Wiemer-Hastings. Latent semantic analysis. Proceedings of the 16th International

Joint Conference on Artificial Intelligence, pages 1–14, 2004.

[37] Wei Xu, Xin Liu, and Yihong Gong. Document clustering based on non-negative matrix
factorization. Proceedings of the 26th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, pages 267–273, 2003.
