Keyword-Based File Sorting for Information Retrieval
Balmain Beckford
Senior Thesis
Department of Computer Science
Minnesota State University, Mankato
December 27, 2010
Abstract
Keyword-based file sorting is the aggregation of related files into clusters based
on a similarity evaluation between files and the representatives within the clusters.
Keywords are the discriminating features of a file. These discriminating features are
based on the frequency of the keywords along with their weight value. Information
retrieval (IR) is a field of computer science that deals with processing text found in
files in order to more efficiently retrieve files that are the closest match to a user’s
query. Files can be indexed based on the words they contain as well as words that
match these words’ concept from a thesaurus. Keyword-based IR by itself displays
certain deficiencies in retrieving files. This thesis shows how incorporating the proper
indexing and clustering techniques will improve the quality and performance of IR
based on keywords.
1 Introduction
Keyword-based information retrieval is a technology that has been around for quite some time
but is still very useful in various search applications today. For effectiveness and simplicity,
most keyword-based information retrieval systems rely on extracting keywords from tags
associated with the files they are trying to retrieve. These tags can be generated manually by a
set of users or collaboratively through social tagging systems [27, 2]. Popular applications
that use collaborative social tagging today include YouTube, which allows users to upload and
create tags for videos; Flickr, which allows users to upload and create tags for photos [27]; and
Amazon, which allows users to upload and create tags for almost any item they want to
sell. Other applications that use collaborative tagging are blogs and wikis [27, 35]. This
method of tagging has been found to be more intuitive than other tagging alternatives, because it
is easier for users to select a previously used tag that matches their item [27]. An advantage
of collaborative tagging is its ability to adapt to new vocabulary and word trends [27]. A
challenge that may arise from collaborative tagging is a wide variety of tags, along with tag
redundancy and ambiguity, due to unsupervised tagging [27]. Ambiguity may affect the
information retrieval process by causing the retrieval of files that were mistakenly placed in
the wrong category.
Keyword-based information retrieval can also be used in other areas to retrieve files
or documents. In software development, it can be used to determine the traceability of a
particular piece of code [9, 32, 29]. Traceability is the ability to describe and follow the
requirements of a particular software design in both a forward and backward direction [9],
from the inception stage, where the design of the software is conceived, through
the entire lifecycle of the design process. Information is retrieved from keywords associated
with objects in the software [9]. An object is a real-world entity that is modeled by software.
In this field, the similarity between files can be calculated by using a similarity index.
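The text does not say which similarity index is meant. As a minimal sketch, assuming cosine similarity over keyword-frequency vectors (one common choice, not necessarily the one used in [9]), the comparison between two files could look like this. The function name and the sample keywords are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(keywords_a, keywords_b):
    """Cosine of the angle between two keyword-frequency vectors."""
    freq_a, freq_b = Counter(keywords_a), Counter(keywords_b)
    # Dot product over the keywords that both files share.
    dot = sum(freq_a[word] * freq_b[word] for word in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(count * count for count in freq_a.values()))
    norm_b = math.sqrt(sum(count * count for count in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty keyword list is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical keyword lists for a requirement and a piece of code.
requirement = ["login", "user"]
source_file = ["login", "password"]
score = cosine_similarity(requirement, source_file)
```

A score near 1 means the two files use the same keywords in similar proportions, and a score of 0 means they share no keywords.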
In addition, information retrieval based on keywords has been used in a number of other
specialized fields. Content-based information retrieval has been used in the medical field to
retrieve medical images based on the tags and descriptions related to each image [10, 20, 24].
There has also been success in using this technology in journalism, where it has been used in
the design of systems that retrieve multimedia news content from a database [26, 1, 16]. A
task that is usually done prior to or during the retrieval process in most keyword-based file
retrieval systems is sorting the files into categories based on similarity or relatedness.
This improves the accuracy of the retrieval system by helping it to return the results that
are most relevant to a user’s query. It can be accomplished by using various clustering
methods such as k-means clustering, hierarchical clustering and clustering by committee (CBC)
[31, 28, 27, 37, 33]. There is also a novel clustering algorithm called domain similarity clustering
by committee (DSCBC), which is an extension of CBC [31, 33]. It is an approach that handles
the ambiguities that arise when dealing with adjectives and nouns that are polysemous.
Polysemous words are words that have different meanings or word senses when used in
different contexts and hence belong in different categories given a particular context.
Google¹ has implemented a number of techniques that users can apply to improve their
search results. One of these techniques allows users to enter commands within their
searches. One example is placing a tilde before a word whose synonyms should also be
included in the search. For example, the search query ~fast food
will display results that contain words related to “fast food”, as opposed to fast, meaning
quick, and any type of food. Results may include “junk, burgers, or french fries.” An n-gram
system is used to estimate which words are most likely to appear with the
search phrase “fast food”. This allows users’ queries to be processed easily and makes them more effective
in returning related results.
Google has also updated its search algorithm to use location-based semantic retrieval
to give results for queries that are location sensitive. For example,
if a user searches for West Indian food and does not specify a location, they may get results
that are highly ranked but not relevant to the query: West Indian
cuisine from New York or Jamaica may show up in the search results when the user lives
in Mankato, Minnesota. Here, semantic information is used to cluster the documents with
the location the query is made from in order to return more appropriate results.
¹ http://www.google.com/
Clustering is also used to save time for users. Recently, Facebook has implemented keyword-
based clustering, which groups similar postings in the news feed that were posted by
different users at different times. This saves space when displaying similar postings in the news
feed and avoids redundancy. Also, the website aggregator digg.com gathers news stories from
different websites based on keywords found within their content or description. Users can
retrieve news stories given a search term. However, digg.com searches mainly by keywords
and not by semantics, so its retrieval process could be improved by incorporating
semantics.
In terms of web searching, the precision of the results can be limited by a number of
factors. Bahatia and Khalid explain several factors that may hinder the user from
getting the correct results [6, 8]. First, the amount of information present
on the web is very large. Second, keyword-based web search usually has low recall because of
this volume of information, and low precision because of the difficulty of determining relatedness.
Additionally, it is hard for search engines to return information based on a user’s personal
preferences or the context of the words used in a query [6, 8]. In other words, it is hard to tell
what is on the user’s mind.
Although keyword-based sorting and information retrieval have been used for a long time,
the precision of the retrieved information can still be improved.
There have been attempts to reduce the effects of ambiguity by giving users the
option of categorizing their tags [27, 4, 12]. An example of this is Flickr, where a user who
chooses to enter the word “apple” can choose from a number of categories such as computers,
fruits or city. The primary goal of this thesis is to apply a combination of information
retrieval and clustering techniques to improve the accuracy of query results with little user involvement
in pre-determining the categories or context of the query. This implies that approaches
to keyword-based file sorting with strategies to handle polysemous words that may induce
ambiguity will improve the accuracy of results from a query.
This thesis discusses some of the areas in which keyword-based file sorting and information
retrieval have been used. It describes some of the tools that are useful in the
process of information retrieval as well as the specific tools used to generate results in our
experiments. It closes with a discussion of the results and suggestions for possible future
work.
2 Related Work
Important to this thesis is the use of algorithms to extract keywords from tags associated
with files and to use them to cluster these files based on relatedness. Latent semantic indexing
(LSI) is an approach that compares documents using a vector representation of each
document [36, 13, 11, 5]. Further improvements were introduced by others to make LSI more
efficient by giving it a stronger statistical foundation [17]. This gave rise to a novel approach
to indexing called probabilistic latent semantic indexing (PLSI). PLSI generates higher
performance gains in indexing documents because it can handle the ambiguity
that arises while indexing files whose keywords are synonyms or polysemes.
Keyword-based information retrieval has been used to trace requirements throughout software
development [9]. Cleland-Huang et al. [9] also show how a threshold is used to determine
whether a document should be retrieved. The threshold is applied by assigning weights to
keywords: a keyword whose weight is above the threshold is eligible for retrieval.
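The thresholding idea can be illustrated with a minimal sketch. This is one reading of the description above, not the procedure of Cleland-Huang et al. [9]. The function names, the weights and the threshold value are all hypothetical.

```python
def eligible_keywords(keyword_weights, threshold):
    """Return the keywords whose weight is strictly above the threshold."""
    return {word for word, weight in keyword_weights.items() if weight > threshold}


def retrieve(documents, query_terms, threshold):
    """Return the sorted names of documents with an eligible keyword in the query."""
    query = set(query_terms)
    return sorted(
        name
        for name, weights in documents.items()
        if eligible_keywords(weights, threshold) & query
    )


# Hypothetical documents, each mapping a keyword to an assumed weight.
documents = {
    "req1": {"login": 0.8, "user": 0.2},
    "req2": {"login": 0.1, "report": 0.9},
}
matches = retrieve(documents, ["login"], threshold=0.5)
```

With a threshold of 0.5, only `req1` is returned for the query `login`, because the weight of `login` in `req2` falls below the threshold.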
IR is evaluated using precision and recall, defined as:
\[
\text{precision} = \frac{\text{number of relevant documents retrieved}}{\text{total number of documents retrieved}}
\]
and
\[
\text{recall} = \frac{\text{number of relevant documents retrieved}}{\text{number of relevant documents}}.
\]
Precision is the fraction of the documents retrieved for a user’s query that are relevant.
Recall is the fraction of all relevant documents that are retrieved for the user’s query.
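The two measures can be computed directly from the set of retrieved documents and the set of relevant documents. The following is a minimal sketch. The function name and the document identifiers are hypothetical, and returning 0.0 for an empty set is an assumption made here to avoid division by zero.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four documents retrieved, three documents relevant.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d2", "d5"],
)
```

Here two of the four retrieved documents are relevant, so precision is 2/4, and two of the three relevant documents were retrieved, so recall is 2/3.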
A rule of thumb used when extracting keywords is that words that show up less frequently
in documents carry more weight as distinguishing features of a file
when determining whether that file matches a user’s query. Research shows that if
users interactively disambiguate their queries by placing them into categories based
on topics, categorization and the retrieval of information improve [28]. Hierarchical
clustering has also been used to derive categories based on keywords [28].
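The rule of thumb about infrequent words is commonly formalized as inverse document frequency (IDF). The sketch below assumes the plain form log(N / df), which is one standard variant and is not stated in the text. The function name and the sample corpus are hypothetical.

```python
import math


def inverse_document_frequency(documents):
    """Map each word to log(N / df), where df counts the documents containing it."""
    total = len(documents)
    doc_freq = {}
    for words in documents:
        for word in set(words):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(total / df) for word, df in doc_freq.items()}


# Hypothetical corpus: "apple" appears everywhere, the other words only once.
corpus = [["apple", "fruit"], ["apple", "computer"], ["apple", "city"]]
idf = inverse_document_frequency(corpus)
```

A word that occurs in every document gets a weight of zero, while a word that occurs in a single document gets the largest weight, which matches the rule of thumb.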
Clustering by domain-specific similarity has been shown to handle ambiguity [31, 30].
More specifically, it handles ambiguity that arises from words, especially adjectives, that
have multiple word senses. For example, the adjective “hot” can be used
to describe temperature and can also describe something that is trendy. An
extension of CBC called domain similarity clustering by committee (DSCBC) was introduced
to handle polysemous words, specifically adjectives. Table 1 shows results of
DSCBC and CBC [31]. CBC has more words in its committees and also places words like “face”
in a category that is not really appropriate.
Mobile devices also incorporate keyword-based searching [22]. Unlike the traditional
navigational search used on mobile devices, which produces results by exactly matching the
words users input, semantic search tries to understand the user’s input and provide more of
what the user might be looking for. This improves the quality and quantity of the search by
Table 1: Examples of committees created from CBC and DSCBC
Figure 4: Example of inverted file structure. Field 1 is the word, field 2 is the document the word is in, and field 3 is the number of times the word is seen in the document.
the document to be converted into an inverted file format for efficient storage as shown
in Figure 4. The first position in the inverted file represents a term that can be found
within that document. The second position is the name of the document and the third
position represents the frequency of the term within that document. The Porter stemmer⁴ is a
common tool used to automatically find stem words [34]. Another stemmer is the Krovetz
stemmer⁵, which uses a dictionary to find the stem words. The Krovetz stemmer usually
produces a larger output file [21]. It also avoids some of the errors that the Porter stemmer
makes when it outputs meaningless stem words. The Lemur toolkit also has a built-in
stemmer application with the option of choosing between the Porter and the Krovetz
stemmer.
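The three-field inverted file format described above can be sketched in a few lines. This is only an illustration and not the Lemur toolkit's implementation: it splits on whitespace, lowercases the text, and applies no stemming. The function name and the sample documents are hypothetical.

```python
from collections import Counter


def build_inverted_file(documents):
    """Return sorted (term, document name, frequency) records."""
    records = []
    for name, text in documents.items():
        # Count how often each lowercased, whitespace-separated term occurs.
        counts = Counter(text.lower().split())
        for term, frequency in counts.items():
            records.append((term, name, frequency))
    return sorted(records)


# Hypothetical two-document corpus.
inverted = build_inverted_file({
    "doc1.txt": "fast food fast",
    "doc2.txt": "food",
})
for term, name, frequency in inverted:
    print(term, name, frequency)
```

Each printed line holds the term, the document name and the term frequency, in that order. In a fuller pipeline a stemmer would be applied to each term before counting, so that variants of a word share one record.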
4.3 Indexing
Indexing is the process of collecting and storing data for efficient retrieval. Indexing allows
for the retrieval of relevant documents based on a search query. If documents were not
indexed, all the documents in a corpus would have to be searched in order to return the