
Improving Web Search Result Using

Cluster Analysis

A thesis Submitted by

Biswapratap SinghSahoo

in partial fulfillment for the award of the degree of

MASTER OF SCIENCE

IN

COMPUTER SCIENCE

Supervisor

Dr. R. C. Balabantaray

UTKAL UNIVERSITY: ODISHA

JUNE 2010


Copyright © by

Biswapratap SinghSahoo

June 2010


Contents

Declaration iii

Abstract iv

Dedication vi

Acknowledgment vii

List of Figures viii

Chapter 1 Introduction 1

1.1 Motivation 2

1.1.1 From organic to mineral memory 2

1.1.2 The problem of abundance 4

1.1.3 Information retrieval and Web search 5

1.1.4 Web search and Web crawling 7

1.1.5 Why the Web is so popular now? 8

1.1.6 Search Engine System Architecture 9

1.1.7 Overview of Information Retrieval 11

1.1.8 Evaluation in IR 12

1.1.9 Methods for IR 14

Chapter 2 Related works 16

2.1 Search Engine Tools 18

2.1.1 Web Crawlers 18

2.1.2 How the Web Crawler Works 18

2.1.3 Overview of data clustering 19

2.2 An example information retrieval problem 20

Chapter 3 Implementation Details 26

3.1 Determining the user terms 26

3.1.1 Tokenization 26

3.1.2 Processing Boolean queries 27


3.1.3 Schematic Representation of Our Approach 29

3.1.4 Methodology of Our Proposed Model 29

3.2 Our Proposed Model Tool 29

3.2.1 Cluster Processor 29

3.2.2. DB/Cluster 30

3.3 Working Methodology 30

Chapter 4 Future Work and Conclusions 32

4.1 Future Work 32

4.2 Conclusion 32

Appendices 34

References and Bibliography 40

Annexure 44

A Modern Approach to Search Engine Using Cluster Analysis,

Biswapratap SinghSahoo. National Seminar on Computer

Security: Issues and Challenges on 13th & 14th February 2010 held

at PJ College of Management & Technology, Bhubaneswar,

sponsored by All India Council for Technical Education, New Delhi.

Page No - 27


DECLARATION

I, Sri Biswapratap SinghSahoo, do hereby declare that this thesis entitled “Improving Web Search Result Using Cluster Analysis”, submitted to Utkal University, Bhubaneswar for the award of the degree of Master of Science in Computer Science, is an original piece of work done by me and has not been submitted for the award of any degree or diploma in any other University. Any help or source of information availed of in this connection is duly acknowledged.

Date:                                            Biswapratap SinghSahoo

Place:                                           Researcher


Abstract

The key factors for the success of the World Wide Web are its large size and the lack of centralized control over its contents. Both issues are also the most important sources of problems for locating information. The Web is a context in which traditional Information Retrieval methods are challenged, and given the volume of the Web and its speed of change, the coverage of modern search engines is relatively small. Moreover, the distribution of quality is highly skewed, and interesting pages are scarce in comparison with the rest of the content.

Search engines have changed the way people access and discover knowledge, allowing information on almost any subject to be quickly and easily retrieved within seconds. As increasingly more material becomes available electronically, the influence of search engines on our lives will continue to grow. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms, and they answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Drawing on an overview of current web search engine design, we propose a model based on cluster analysis. We introduce a new meta search engine, which dynamically groups the search results into clusters labeled by phrases extracted from the snippets.

There has been comparatively little research on cluster analysis using user terminology rather than document keywords. Until log files of web sites were made available, it was difficult to accumulate enough exact user searches to make a cluster analysis feasible. Another limitation in using searcher terms is that most users of the Internet employ short (one to two word) queries (Jansen et al., 1998). Wu et al. (2001) used queries as a basis for clustering documents selected by searchers in response to similar queries. This thesis reports on an experimental search engine based on a cluster analysis of user text for quality information.


To my parents

and to all my teachers

both formal and informal

Serendipity is too important to be left to chance.


Acknowledgements

What you are is a consequence of whom you interact with, but just saying “thanks everyone for everything” would be wasting this opportunity. I have been very lucky to interact with really great people, even if sometimes I am able to understand just a small fraction of what they have to teach me.

I am sincerely grateful for the support given by my advisor Dr. R. C. Balabantaray during this thesis. The comments received from my advisor during the review process were also very helpful and detailed. It is my pleasure and good fortune to express my profound sense of gratitude and indebtedness towards him for his inspiring guidance, unceasing encouragement, and above all the critical insight that has gone into the eventual fruition of this work. His blessings and inspiration helped me to stand in this new and most challenging field of Information Retrieval.

This thesis is just a step on a very long road. I want to thank the

professors I met during this study; especially I take this opportunity to

extend my thanks to Prof. S. Prasad and Er. S.K Naik for their

continuous encouragement throughout the entire course of the work.

I am also thankful to each and every staff member of Spintronic Technology & Advance Research, Bhubaneswar for their cooperation and help from time to time.

I would say at the end that I owe everything to my parents, but

that would imply that they also owe everything to their parents and so

on, creating an infinite recursion that is outside the context of this work.

Therefore, I thank Dr. Balabantaray for being with me even from before the beginning, sometimes giving everything he has and more, and I need no calculation to say that he has given me the best guidance – thank you.


List of Figures

Figure 1.1: Cyclic architecture for search engines, showing how different

components can use the information generated by the other

components.

Figure 1.2 Architecture of a simple Web Search Engine

Figure 2.1 A term-document incidence matrix. Matrix element (t, d) is 1

if the play in column d contains the word in row t, and is 0

otherwise.

Figure 2.2 Results from Shakespeare for the query Brutus AND Caesar

AND NOT Calpurnia.

Figure 2.3 The two parts of an inverted index.


Chapter 1 Introduction

The World Wide Web (WWW) has seen a tremendous increase in

size in the last two decades as well as the number of new users

inexperienced in the art of web search [1]. The amount of information

and resources available on WWW today has grown exponentially and

almost any kind of information is present if the user looks long enough.

In order to find relevant pages, a user has to browse through many WWW

sites that may contain the information. Users may either browse pages through entry points such as the popular portals (Google, Yahoo, MSN, AOL, etc.) or submit queries to a search engine to look for specific information. Beginning the search

from one of the entry points is not always the best approach, since there

is no particular organized structure for the WWW, and not all pages are

reachable from others. In the case of using a search engine, a user

submits a query, typically a list of keywords, and the search engine

returns a list of the web pages that may be relevant according to the

keywords. In order to achieve this, the search engine has to search its

already existing index of all web pages for the relevant ones. Such search

engines rely on massive collections of web pages that are acquired with

the help of web crawlers, which traverse the web by following hyperlinks

and storing downloaded pages in a large database that is later indexed

for efficient execution of user queries. Many researchers have looked at

web search technology over the last few years, but very little academic research has been done on the engines themselves. Search engines are constantly engaged

in the task of crawling through the WWW for the purpose of indexing.

When a user submits keywords for search, the search engine selects and

ranks the documents from its index. The task of ranking the documents,

according to some predefined criteria, falls under the responsibilities of

the ranking algorithms. A good search engine should present relevant


documents higher in the ranking, with less relevant documents following

them. A crawler for a large search engine has to address two issues.

First, it has to have a good crawling strategy, i.e., a strategy for deciding

which pages to download next. Second, it needs to have a highly

optimized system architecture that can download a large number of

pages per second from the WWW.

1.1 Motivation

1.1.1 From organic to mineral memory

As we mentioned before, finding relevant information from the mixed results is a time-consuming task. In this context we introduce a simple high-precision information retrieval system that clusters and re-ranks retrieval results with the intention of eliminating these shortcomings. The proposed architecture has some key features:

• Simple and high performance. Our experimental results (Section 4) show that it is almost 79 percent better than the best known standard Persian retrieval systems [1, 2, 18].

• Independent of the initial system architecture. It can be embedded in any underlying information retrieval system, which makes the proposed architecture a good candidate for web search engines.

• High precision. Relevant documents appear at the top of the result list.

We have three types of memory. The first one is organic, which is

the memory made of flesh and blood and the one administered by our brain. The second is mineral, and in this sense mankind has known two

kinds of mineral memory: millennia ago, this was the memory

represented by clay tablets and obelisks, pretty well known in this

country, on which people carved their texts. However, this second type is

also the electronic memory of today’s computers, based upon silicon. We


have also known another kind of memory, the vegetal one, the one

represented by the first papyruses, again well known in this country, and

then on books, made of paper.

The World Wide Web, a vast mineral memory, has become in a few years the largest cultural endeavor of all time, equivalent in importance

to the first Library of Alexandria. How was the ancient library created?

This is one version of the story:

“By decree of Ptolemy III of Egypt, all visitors to the city were

required to surrender all books and scrolls in their possession;

these writings were then swiftly copied by official scribes. The

originals were put into the Library, and the copies were delivered to

the previous owners. While encroaching on the rights of the

traveler or merchant, it also helped to create a reservoir of books in

the relatively new city.”

The main difference between the Library of Alexandria and the Web

is not that one was vegetal, made of scrolls and ink, and the other one is

mineral, made of cables and digital signals. The main difference is that

while in the Library books were copied by hand, most of the information

on the Web has been reviewed only once, by its author, at the time of

writing.

Also, modern mineral memory allows fast reproduction of the work,

with no human effort. The cost of disseminating content is lower due to

new technologies, and has been decreasing substantially from oral

tradition to writing, and then from printing and the press to electronic

communications. This has generated much more information than we

can handle.


1.1.2 The problem of abundance

The signal-to-noise ratio of the products of human culture is remarkably low: mass media, including the press, radio and cable networks, provide strong evidence of this phenomenon every day, as do smaller-scale activities such as browsing a book store or having a conversation. The average modern working day consists of dealing with

46 phone calls, 15 internal memos, 19 items of external post and 22 e-

mails.

We live in an era of information explosion, with information being

measured in exabytes (10^18 bytes): “Print, film, magnetic, and optical

storage media produced about 5 exabytes of new information in 2002.

We estimate that new stored information grew about 30% a year between

1999 and 2002. Information flows through electronic channels –

telephone, radio, TV, and the Internet – contained almost 18 exabytes of

new information in 2002, three and a half times more than is recorded in

storage media. The World Wide Web contains about 170 terabytes of

information on its surface.” At the dawn of the World Wide Web, finding

information was done mainly by scanning through lists of links collected

and sorted by humans according to some criteria. Automated Web search

engines were not needed when Web pages were counted only by

thousands, and most directories of the Web included a prominent button

to “add a new Web page”. Web site administrators were encouraged to

submit their sites. Today, URLs of new pages are no longer a scarce

resource, as there are thousands of millions of Web pages. The main

problem search engines have to deal with is the size and rate of change

of the Web, with no search engine indexing more than one third of the

publicly available Web. As the number of pages grows, it will be

increasingly important to focus on the most “valuable” pages, as no

search engine will be able to index the complete Web. Moreover, in


this thesis we state that the number of Web pages is essentially infinite;

this makes this area even more relevant.

1.1.3 Information retrieval and Web search

Information Retrieval (IR) is the area of computer science

concerned with retrieving information about a subject from a collection of

data objects. This is not the same as Data Retrieval, which in the context

of documents consists mainly in determining which documents of a

collection contain the keywords of a user query. Information Retrieval

deals with satisfying a user need:

“... the IR system must somehow ’interpret’ the contents of the

information items (documents) in a collection and rank them

according to a degree of relevance to the user query. This

‘interpretation’ of document content involves extracting syntactic

and semantic information from the document text ...”

Although there was an important body of Information Retrieval

techniques published before the invention of the World Wide Web, there

are unique characteristics of the Web that made them unsuitable or

insufficient. A survey by Arasu et al. on searching the Web notes that:

“IR algorithms were developed for relatively small and coherent

collections such as newspaper articles or book catalogs in a

(physical) library. The Web, on the other hand, is massive, much

less coherent, changes more rapidly, and is spread over

geographically distributed computers ...”

This idea is also present in a survey about Web search by Brooks,

which states that a distinction could be made between the “closed Web”,

which comprises high-quality controlled collections that a search engine can fully trust, and the “open Web”, which includes the vast majority of web pages and on which traditional IR concepts, techniques and methods are challenged.

One of the main challenges the open Web poses to search engines

is “search engine spamming”, i.e.: malicious attempts to get an

undeserved high ranking in the results. This has created a whole branch

of Information Retrieval called “adversarial IR”, which is related to

retrieving information from collections in which a subset of the collection

has been manipulated to influence the algorithms. For instance, the

vector space model for documents and the TF-IDF similarity measure

are useful for identifying which documents in a collection are relevant in

terms of a set of keywords provided by the user. However, this scheme

can be easily defeated in the “open Web” by just adding frequently-asked

query terms to Web pages.

A solution to this problem is to use the hypertext structure of the

Web, using links between pages as citations are used in academic

literature to find the most important papers in an area. Link analysis,

which is often not possible in traditional information repositories but is

quite natural on the Web, can be used to exploit links and extract useful

information from them, but this has to be done carefully, as in the case

of Pagerank:

“Unlike academic papers which are scrupulously reviewed, web

pages proliferate free of quality control or publishing costs. With a

simple program, huge numbers of pages can be created easily,

artificially inflating citation counts. Because the Web environment

contains profit seeking ventures, attention getting strategies evolve

in response to search engine algorithms. For this reason, any

evaluation strategy which counts replicable features of web pages

is prone to manipulation”.


The low cost of publishing in the “open Web” is a key part of its

success, but implies that searching for information on the Web will always be inherently more difficult than searching for information in traditional,

closed repositories.

1.1.4 Web search and Web crawling

The typical design of search engines is a “cascade”, in which a Web

crawler creates a collection which is indexed and searched. Most of the

designs of search engines consider the Web crawler as just a first stage

in Web search, with little feedback from the ranking algorithms to the

crawling process. This is a cascade model, in which operations are

executed in strict order: first crawling, then indexing, and then

searching. Our approach is to provide the crawler with access to all the

information about the collection to guide the crawling process effectively.

This can be taken one step further, as there are tools available for

dealing with all the possible interactions between the modules of a

search engine, as shown in Figure 1.1

Figure 1.1: Cyclic architecture for search engines, showing how different

components can use the information generated by the other components.


The typical cascade model is depicted with thick arrows. The indexing

module can help the Web crawler by providing information about the

ranking of pages, so the crawler can be more selective and try to collect

important pages first. The searching process, through log file analysis or

other techniques, is a source of optimizations for the index, and can also

help the crawler by determining the “active set” of pages which are

actually seen by users. Finally, the Web crawler could provide on-

demand crawling services for search engines. All of these interactions are

possible if we conceive the search engine as a whole from the very

beginning.

1.1.5 Why is the Web so popular now?

Commercial developers noticed the potential of the web as a

communications and marketing tool when graphical Web browsers broke

onto the Internet scene (Mosaic, the precursor to Netscape Navigator,

was the first popular web browser) making the Internet, and specifically

the Web, "user friendly." As more sites were developed, the more popular

the browser became as an interface for the Web, which spurred more

Web use, more Web development etc. Now graphical web browsers are

powerful, easy and fun to use and incorporate many "extra" features

such as news and mail readers. The nature of the Web itself invites user

interaction; web sites are composed of hypertext documents, which means

they are linked to one another. The user can choose his/her own path by

selecting predefined "links". Since hypertext documents are not organized

in an arrangement which requires the user to access the pages

sequentially, users really like the ability to choose what they will see next

and the chance to interact with the site contents.


1.1.6 Search Engine System Architecture

This section provides an overview of how the whole system of a

search engine works. The major functions of the search engine (crawling, indexing and searching) are also covered in detail in the later sections.

Before a search engine can tell you where a file or document is, it must

be found. To find information on the hundreds of millions of Web pages

that exist, a typical search engine employs special software robots, called

spiders, to build lists of the words found on Web sites. When a spider is

building its lists, the process is called Web crawling. A Web crawler is a

program, which automatically traverses the web by downloading

documents and following links from page to page. They are mainly used

by web search engines to gather data for indexing. Other possible

applications include page validation, structural analysis and

visualization; update notification, mirroring and personal web

assistants/agents etc. Web crawlers are also known as spiders, robots,

worms etc. Crawlers are automated programs that follow the links found

on the web pages. There is a URL Server that sends lists of URLs to be

fetched to the crawlers. The web pages that are fetched are then sent to

the store server. The store server then compresses and stores the web

pages into a repository. Every web page has an associated ID number

called a doc ID, which is assigned whenever a new URL is parsed out of a

web page. The indexer and the sorter perform the indexing function. The

indexer performs a number of functions. It reads the repository,

uncompresses the documents, and parses them. Each document is

converted into a set of word occurrences called hits. The hits record the

word, position in document, an approximation of font size, and

capitalization. The indexer distributes these hits into a set of "barrels",

creating a partially sorted forward index. The indexer performs another

important function. It parses out all the links in every web page and


stores important information about them in an anchors file. This file

contains enough information to determine where each link points from

and to, and the text of the link. The URL Resolver reads the anchors file

and converts relative URLs into absolute URLs and in turn into doc IDs. It

puts the anchor text into the forward index, associated with the doc ID

that the anchor points to. It also generates a database of links, which are

pairs of doc IDs. The links database is used to compute Page Ranks for

all the documents. The sorter takes the barrels, which are sorted by doc

ID and resorts them by word ID to generate the inverted index. This is

done in place so that little temporary space is needed for this operation.

The sorter also produces a list of word IDs and offsets into the inverted

index. A program called Dump Lexicon takes this list together with the

lexicon produced by the indexer and generates a new lexicon to be used

by the searcher. A lexicon lists all the terms occurring in the index along

with some term-level statistics (e.g., total number of documents in which

a term occurs) that are used by the ranking algorithms. The searcher is

run by a web server and uses the lexicon built by Dump Lexicon together

with the inverted index and the Page Ranks to answer queries. (Brin and

Page 1998)
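To make the indexing step above concrete, the following is a minimal sketch in Java (in the spirit of the appendix code) of how parsed documents can be turned into an inverted index that maps each term to the set of doc IDs containing it. It is an illustration only, not the actual pipeline described by Brin and Page, and the class and method names are hypothetical.

import java.util.*;

public class SimpleIndexer
{
    // term -> sorted set of doc IDs containing the term (the "postings")
    private final Map<String, TreeSet<Integer>> invertedIndex = new HashMap<>();

    // Tokenize a document very crudely and record each term under its doc ID.
    public void addDocument(int docId, String text)
    {
        for (String token : text.toLowerCase().split("[^a-z0-9]+"))
        {
            if (token.isEmpty()) continue;
            invertedIndex.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    // Return the postings for a term (empty if the term is unknown).
    public Set<Integer> postings(String term)
    {
        return invertedIndex.getOrDefault(term.toLowerCase(), new TreeSet<>());
    }

    public static void main(String[] args)
    {
        SimpleIndexer indexer = new SimpleIndexer();
        indexer.addDocument(1, "Brutus killed Caesar");
        indexer.addDocument(2, "Caesar was ambitious");
        System.out.println(indexer.postings("caesar"));   // prints [1, 2]
    }
}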

Figure 1.2 Architecture of a simple Web Search Engine


Figure 1.2 illustrates the architecture of a simple WWW search engine. In

general, a search engine usually consists of three major modules:

a) Information gathering

b) Data extraction and indexing

c) Document ranking

Retrieval systems generally treat each document as a unique unit when assigning a page rank. If a document is instead viewed as a combination of other related documents in the query area, we can obtain better results. The conjecture that relevant documents tend to cluster was made by [26]. Irrelevant documents may share many terms with relevant documents while being about completely different topics, so they may demonstrate some patterns of their own. On the other hand, an irrelevant cluster can be viewed as the retrieval result for a different query that shares many terms with the original query. Xu et al. believe that document clustering can make mistakes, and when this happens, it adds more noise to the query expansion process. But as we discuss, document clustering is a good tool for high-precision information retrieval systems. In this context we propose an architecture (Fig. 3.1) to cluster search results and re-rank them based on cluster analysis. Although our benchmark is in the Persian language, we believe that similar results should be exhibited on other benchmarks.

1.1.7 Overview of Information Retrieval

People have the ability to understand abstract meanings that are

conveyed by natural language. This is why reference librarians are

useful; they can talk to a library patron about her information needs and

then find the documents that are relevant. The challenge of information

retrieval is to mimic this interaction, replacing the librarian with an

automated system. This task is difficult because the machine

understanding of natural language is, in the general case, still an open


research problem. More formally, the field of Information Retrieval (IR) is

concerned with the retrieval of information content that is relevant to a

user’s information needs (Frakes 1992).

Information Retrieval is often regarded as synonymous with

document retrieval and text retrieval, though many IR systems also

retrieve pictures, audio or other types of non-textual information. The

word “document” is used here to include not just text documents, but

any clump of information. Document retrieval subsumes two related

activities: indexing and searching (Sparck Jones 1997). Indexing refers

to the way documents, i.e. information to be retrieved, and queries, i.e.

statements of a user’s information needs, are represented for retrieval

purposes. Searching refers to the process whereby queries are used to

produce a set of documents that are relevant to the query. Relevance

here means simply that the documents are about the same topic as the

query, as would be determined by a human judge. Relevance is an

inherently fuzzy concept, and documents can be more or less relevant to

a given query. This fuzziness puts IR in opposition to Data Retrieval,

which uses deductive and boolean logic to find documents that

completely match a query (van Rijsbergen 1979).

1.1.8 Evaluation in IR

Information retrieval algorithms are usually evaluated in terms of

relevance to a given query, which is an arduous task considering that

relevance judgements must be made by a human for each document

retrieved. The Text REtrieval Conference (TREC) provides a forum for

pooling resources to evaluate text retrieval algorithms. Document corpora

are chosen from naturally occurring collections such as the

Congressional Record and the Wall Street Journal. Queries are created by

searching corpora for topics of interest, and then selecting queries that


have a decent number of documents relevant to that topic. Queries and

corpora are distributed to participants, who use their algorithms to

return ranked lists of documents related to the given queries. These

documents are then evaluated for relevance by the same person who

wrote the query (Voorhees 1999).

This evaluation method is based on two assumptions. First, it

assumes that relevance to a query is the right criterion on which to judge

a retrieval system. Other factors such as the quality of the document

returned, whether the document was already known, the effort required

to find a document, and whether the query actually represented the

user’s true information needs are not considered. This assumption is

controversial in the field. One alternative that has been proposed is to

determine the overall utility of documents retrieved during normal tasks

(Cooper 1973).

Users would be asked how many dollars (or other units of utility)

each contact with a document was worth. The answer could be positive,

zero, or negative depending on the experience.

Utility would therefore be defined as any subjective value a document

gives the user, regardless of why the document is valuable. The second

assumption inherent in the evaluation method used in TREC is that

queries tested are representative of queries that will be performed during

actual use. This is not necessarily a valid assumption, since queries that

are not well represented by documents in the corpus are explicitly

removed from consideration. These two assumptions can be summarized

as follows: if a retrieval system returns no documents that meet a user’s

information needs, it is not considered the fault of the system so long as the

failure is due either to poor query construction or poor documents in the

corpus.


1.1.9 Methods for IR

There are many different methods for both indexing and retrieval,

and a full description is out of the scope of this thesis. However, a few

broad categories will be described to give a feel for the range of methods

that exist.

Vector-space model. The vector-space model represents queries

and documents as vectors, where indexing terms are regarded as the

coordinates of a multidimensional information space (Salton 1975).

Terms can be words from the document or query itself or picked from a

controlled list of topics. Relevance is represented by the distance of a

query vector to a document vector within this information space.
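As a hedged illustration of the vector-space idea (not tied to Salton's exact weighting scheme), the sketch below represents a query and a document as sparse term-weight maps and scores them by cosine similarity; the weights would normally come from a TF-IDF computation, and the example values are made up.

import java.util.*;

public class CosineSimilarity
{
    // Cosine similarity between two sparse term-weight vectors.
    public static double cosine(Map<String, Double> a, Map<String, Double> b)
    {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet())
        {
            normA += e.getValue() * e.getValue();
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;          // shared terms only
        }
        for (double w : b.values()) normB += w * w;
        if (normA == 0.0 || normB == 0.0) return 0.0;        // empty vector
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args)
    {
        Map<String, Double> query = Map.of("web", 1.0, "search", 1.0);
        Map<String, Double> doc   = Map.of("web", 0.7, "search", 0.5, "crawler", 0.3);
        System.out.println(cosine(query, doc));              // closer to 1 = more relevant
    }
}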

Probabilistic model. The probabilistic model views IR as the

attempt to rank documents in order of the probability that, given a

query, the document will be useful (van Rijsbergen 1979). These models

rely on relevance feedback: a list of documents that have already been

annotated by the user as relevant or non-relevant to the query. With this

information and the simplifying assumption that terms in a document

are independent, an assessment can be made about which terms make a

document more or less likely to be useful.

Natural language processing model. Most of the other

approaches described are tricks to retrieve relevant documents without

requiring the computer to understand the contents of a document in any

deep way. Natural Language Processing (NLP) does not shirk this job,

and attempts to parse naturally occurring language into representations

of abstract meanings. The conceptual models of queries and documents

can then be compared directly (Rau 1988).


Knowledge-based approaches. Sometimes knowledge about a

particular domain can be used to aid retrieval. For example, an expert

system might retrieve documents on diseases based on a list of

symptoms. Such a system would rely on knowledge from the medical

domain to make a diagnosis and retrieve the appropriate documents.

Other domains may have additional structure that can be leveraged. For

example, links between web pages have been used to identify authorities

on a particular topic (Chakrabarti 1999).

Data fusion. Data fusion is a meta-technique whereby several

algorithms, indexing methods and search methods are used to produce

different sets of relevant documents. The results are then combined in

some form of voting to produce an overall best set of documents (Lee 1995). The Savant system is an example of a

data fusion IR system.
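As a small illustration of the voting idea (a generic sketch, not the Savant system itself), the code below fuses two ranked result lists by giving each document a simple rank-based score in every list it appears in and summing; the document IDs are hypothetical.

import java.util.*;

public class DataFusion
{
    // Combine ranked lists: each document scores (listSize - rank) per list,
    // and the fused ranking sorts by the summed score.
    public static List<String> fuse(List<List<String>> rankedLists)
    {
        Map<String, Integer> scores = new HashMap<>();
        for (List<String> list : rankedLists)
        {
            for (int rank = 0; rank < list.size(); rank++)
            {
                scores.merge(list.get(rank), list.size() - rank, Integer::sum);
            }
        }
        List<String> fused = new ArrayList<>(scores.keySet());
        fused.sort((a, b) -> scores.get(b) - scores.get(a));   // highest score first
        return fused;
    }

    public static void main(String[] args)
    {
        List<String> run1 = Arrays.asList("d3", "d1", "d7");
        List<String> run2 = Arrays.asList("d1", "d3", "d9");
        System.out.println(fuse(Arrays.asList(run1, run2)));   // d1 and d3 rise to the top
    }
}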


Chapter 2 Related works

Using some kind of document clustering technique to improve retrieval results is not new, although we believe we are the first to

explicitly present and deal with the low-precision problem in terms of

clustering search results. Many research efforts such as [10] have been

made on how to solve the keyword barrier which exists because there is

no perfect correlation between matching words and intended meaning.

[9] presents TermRank, a variation of the PageRank algorithm based on a

relational graph representation of the content of web document

collections. Search result clustering has successfully served this purpose

in both commercial and scientific systems [30, 10, 23, 16, 25, 33]. The

proposed methods focus on separating search results into meaningful groups that the user can browse to view the retrieval results. One of the first

approaches to search results clustering called Suffix Tree Clustering

would group documents according to the common phrases [13]. STC has

two key features: the use of phrases and a simple cluster definition. This

is very important when attempting to describe the contents of a cluster.

[12] proposes a new approach for web search result clustering to improve

the performance of approaches that use the previous STC algorithms.

Search Results Clustering has a few interesting characteristics and one

of them is the fact that it is based only on document snippets. Of course, document snippets returned by search engines are usually very short and noisy. Another shortcoming of these systems is the cluster's name: a cluster's name must accurately and concisely describe the contents of the cluster, so that the user can quickly decide whether the cluster is interesting or not. This aspect of these systems is difficult and sometimes neglected [7, 12]. In this context our aim is to provide a very simple high-precision system based on the cluster hypothesis [16] without any user feedback. Document clustering can be performed, in advance, on the


collection as a whole (static clustering) [7, 15], but post-retrieval

document clustering (dynamic clustering) has been shown to produce

superior results [10, 8]. Tombros et al. [14] conducted a number of

experiments using five document collections and four hierarchic

clustering methods to show that if hierarchic clustering is applied to

search results (query-specific clustering), then it has the potential to

increase the retrieval effectiveness compared both to that of static

clustering and of conventional inverted file search. The actual effectiveness of hierarchic clustering is harder to gauge, because cluster-based retrieval strategies perform a ranking of clusters instead of individual documents in response to each query [13]. The generation of precision-

recall graphs is thus not possible in such systems, and in order to derive

an evaluation function for clustering systems some effectiveness function

was proposed by [13]. In this paper, firstly we want to propose a simple

architecture which uses local cluster analysis to improve the

effectiveness of retrieval and yet utilize traditional precision-recall

evaluation. Secondly, this paper is devoted to high-precision retrieval.

Thirdly, we use a larger Persian standard test collection, created based on TREC specifications, to validate the findings in a wider context.

Query expansion is another approach to improve the effectiveness of

information retrieval. These techniques can be categorized as either

global or local. While global techniques rely on analysis of a whole

collection to discover word relationships, local techniques emphasize

analysis of the top-ranked documents retrieved for a query [28]. Local techniques have been shown to be more effective than global techniques in general [29, 2]. In this paper we do not expand a query based on the information in the set of top-ranked documents retrieved for the query; instead we use a very simple and more efficient re-ranking approach to improve the effectiveness of the search results and build a high-precision system that places more relevant documents at the top of the result list, helping the user satisfy an information need efficiently.


2.1 Search Engine Tools

2.1.1 Web Crawlers

To find information from the hundreds of millions of Web pages

that exist, a typical search engine employs special software robots, called

spiders, to build lists of the words found on Web sites [6]. When a spider

is building its lists, the process is called Web crawling. A Web crawler is

a program, which automatically traverses the web by downloading

documents and following links from page to page [8]. They are mainly

used by web search engines to gather data for indexing. Web crawlers are

also known as spiders, robots, worms etc. Crawlers are automated

programs that follow the links found on the web pages [10].

There are a number of different scenarios in which crawlers

are used for data acquisition. A few examples, which differ in the crawling strategies used, are the breadth-first crawler, recrawling pages for updates, focused crawling, random walking and sampling, and crawling the “Hidden Web” [11].

2.1.2 How the Web Crawler Works

Following is the process by which Web crawlers work [3]:

1. Download the Web page.

2. Parse through the downloaded page and retrieve all the links.

3. For each link retrieved, repeat the process.

In the first step, a Web crawler takes a URL and downloads the

page from the Internet at the given URL. Oftentimes the downloaded page

is saved to a file on disk or put in a database. [3] In the second step, a Web crawler parses through the downloaded

page and retrieves the links to other pages. After the crawler has


retrieved the links from the page, each link is added to a list of links to

be crawled. [3] The third step of Web crawling repeats the process. All crawlers

work in a recursive or loop fashion, but there are two different ways to

handle it. Links can be crawled in a depth-first or breadth-first manner.

[3] Web pages and links between them can be modeled by a directed

graph called the web graph. Web pages are represented by vertices and

links are represented by directed edges [7].

Using depth-first search, an initial web page is selected, a link is followed to a second web page (if such a link exists), a link on the second web page is followed to a third web page, if there is such a link, and so on, until a page with no new links is found. Backtracking is used

to examine links at the previous level to look for new links and so on.

(Because of practical limitations, web spiders have limits to the depth

they search in depth first search.)

Using a breadth-first search, an initial web page is selected and a link on this page is followed to a second web page, then a second link on the initial page is followed (if it exists), and so on, until all links of the initial page have been followed. Then links on the pages one level down

are followed, page by page and so on.
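A minimal sketch of the breadth-first strategy just described is shown below; fetchPage and extractLinks are hypothetical placeholders for an HTTP client and an HTML link parser, and a real crawler would also respect robots.txt, politeness delays and depth limits.

import java.util.*;

public class BreadthFirstCrawler
{
    // Crawl up to maxPages pages starting from a seed URL, breadth first.
    public static Set<String> crawl(String seedUrl, int maxPages)
    {
        Set<String> visited = new LinkedHashSet<>();
        Queue<String> frontier = new ArrayDeque<>();
        frontier.add(seedUrl);
        while (!frontier.isEmpty() && visited.size() < maxPages)
        {
            String url = frontier.poll();
            if (!visited.add(url)) continue;           // already crawled
            String html = fetchPage(url);              // hypothetical HTTP fetch
            for (String link : extractLinks(html))     // hypothetical link parser
            {
                if (!visited.contains(link)) frontier.add(link);
            }
        }
        return visited;
    }

    // Placeholders standing in for a real HTTP client and HTML parser.
    private static String fetchPage(String url) { return ""; }
    private static List<String> extractLinks(String html) { return Collections.emptyList(); }
}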

2.1.3 Overview of data clustering

Data clustering, as a class of data mining techniques, aims to

partition a given data set into separate clusters, with each cluster

composed of the data objects with similar characteristics. Most existing

clustering methods can be broadly classified into two categories:

partitioning methods and hierarchical methods. Partitioning algorithms,

such as k-means, k-medoid and EM, attempt to partition a data set into k


clusters such that a previously given evaluation function can be

optimized. The basic idea of hierarchical clustering methods is to first

construct a hierarchy by decomposing the given data set, and then use

agglomerative or divisive operations to form clusters. In general, an

agglomeration-based hierarchical method starts with a disjoint set of

clusters, placing each data object into an individual cluster, and then

merges pairs of clusters until the number of clusters is reduced to a

given number k. On the other hand, the division-based hierarchical

method treats the whole data set as one cluster at the beginning, and

divides it iteratively until the number of clusters is increased to k. See

[11] for more information. Although [17, 20, 23, 31, 33] have developed

some special algorithms for clustering search results, we prefer to use traditional methods in this paper. We will show that our method

with basic clustering algorithms such as k-means and Principal Direction

Divisive Partitioning achieves significant improvement over the methods

based on similarity search ranking alone.
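To illustrate the partitioning family mentioned above, here is a compact k-means sketch over document vectors; it is a simplified, generic version and not the PDDP or hybrid configuration used in our experiments, and the vectors are assumed to have been produced by some prior term-weighting step.

import java.util.*;

public class KMeans
{
    // Assign each vector to the nearest of k centroids, then recompute the
    // centroids, repeating for a fixed number of iterations.
    public static int[] cluster(double[][] vectors, int k, int iterations)
    {
        int n = vectors.length, dim = vectors[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = vectors[c].clone();   // naive seeding
        int[] assignment = new int[n];
        for (int it = 0; it < iterations; it++)
        {
            // Assignment step: nearest centroid by squared Euclidean distance.
            for (int i = 0; i < n; i++)
            {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++)
                {
                    double dist = 0.0;
                    for (int d = 0; d < dim; d++)
                    {
                        double diff = vectors[i][d] - centroids[c][d];
                        dist += diff * diff;
                    }
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                assignment[i] = best;
            }
            // Update step: move each centroid to the mean of its members.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++)
            {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) sums[assignment[i]][d] += vectors[i][d];
            }
            for (int c = 0; c < k; c++)
                if (counts[c] > 0)
                    for (int d = 0; d < dim; d++) centroids[c][d] = sums[c][d] / counts[c];
        }
        return assignment;
    }
}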

2.2 An example information retrieval problem

A fat book which many people own is Shakespeare’s Collected

Works. Suppose you wanted to determine which plays of Shakespeare contain the words Brutus AND Caesar AND NOT Calpurnia. One way to

do that is to start at the beginning and to read through all the text,

noting for each play whether it contains Brutus and Caesar and

excluding it from consideration if it contains Calpurnia. The simplest

form of document retrieval is for a computer to do this sort of linear scan

through documents. This process is commonly referred to as grepping

through text, after the Unix command grep, which performs this

process. Grepping through text can be a very effective process, especially

given the speed of modern computers, and often allows useful


possibilities for wildcard pattern matching through the use of regular

expressions. With modern computers, for simple querying of modest

collections (the size of Shakespeare’s Collected Works is a bit under one

million words of text in total), you really need nothing more. But for

many purposes, you do need more:

1. To process large document collections quickly. The amount of

online data has grown at least as quickly as the speed of computers, and

we would now like to be able to search collections that total in the order

of billions to trillions of words.

2. To allow more flexible matching operations. For example, it is

impractical to perform the query Romans NEAR countrymen with grep,

where NEAR might be defined as “within 5 words” or “within the same

sentence”.

3. To allow ranked retrieval: in many cases you want the best answer to

an information need among many documents that contain certain words.

The way to avoid linearly scanning the texts for each query is to index the

documents in advance. Let us stick with Shakespeare’s Collected Works,

and use it to introduce the basics of the Boolean retrieval model.

Suppose we record for each document – here a play of Shakespeare’s –

whether it contains each word out of all the words Shakespeare used (Shakespeare used about 32,000 different words). The result is a binary

term-document incidence matrix, as in Figure 2.1. Terms are the indexed

units (further discussed in Section 2.2); they are usually words, and for

the moment you can think of them as words, but the information

retrieval literature normally speaks of terms because some of them, such

as perhaps I-9 or Hong Kong are not usually thought of as words. Now, depending on whether we look at the matrix rows or columns, we can have a vector for each term, which shows the documents it appears in, or a vector for each document, showing the terms that occur in it.


Figure 2.1: A term-document incidence matrix. Matrix element (t, d) is 1

if the play in column d contains the word in row t, and is 0 otherwise.

To answer the query Brutus AND Caesar AND NOT Calpurnia, we take

the vectors for Brutus, Caesar and Calpurnia, complement the last, and

then do a bitwise AND:

The answers for this query are thus Antony and Cleopatra and Hamlet

(Figure 2.2).
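A small, illustrative sketch of this bit-vector view is given below; the incidence rows are hard-coded values chosen purely so that the answer matches the example in the text, not taken from an actual index.

public class IncidenceQuery
{
    public static void main(String[] args)
    {
        // One bit per play: Antony and Cleopatra, Julius Caesar, The Tempest,
        // Hamlet, Othello, Macbeth (illustrative incidence values only).
        int brutus    = 0b110100;
        int caesar    = 0b110111;
        int calpurnia = 0b010000;

        // Brutus AND Caesar AND NOT Calpurnia: complement the last vector,
        // then take the bitwise AND of all three.
        int result = brutus & caesar & ~calpurnia;

        String[] plays = { "Antony and Cleopatra", "Julius Caesar", "The Tempest",
                           "Hamlet", "Othello", "Macbeth" };
        for (int i = 0; i < plays.length; i++)
        {
            // Bit i, counted from the left, corresponds to plays[i].
            if ((result & (1 << (plays.length - 1 - i))) != 0)
                System.out.println(plays[i]);   // prints Antony and Cleopatra, Hamlet
        }
    }
}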

The Boolean retrieval model is a model for information retrieval in which

we can pose any query which is in the form of a Boolean expression of

terms, that is, in which terms are combined with the operators AND, OR,

and NOT. The model views each document as just a set of words.

Figure 2.2: Results from Shakespeare for the query Brutus AND Caesar

AND NOT Calpurnia.


Let us now consider a more realistic scenario, simultaneously

using the opportunity to introduce some terminology and notation.

Suppose we have one million documents. By documents we mean whatever units we

have decided to build a retrieval system over. We will refer to the group of

documents over which we perform retrieval as the (document) collection.

It is sometimes also referred to as a corpus (a body of texts). Suppose

each document is about 1000 words long (2-3 book pages). If we assume

an average of 6 bytes per word including spaces and punctuation, then

this is a document collection about 6 GB in size. Typically, there might be about 500,000 distinct terms in these documents. There is nothing special

about the numbers we have chosen, and they might vary by an order of

magnitude or more, but they give us some idea of the dimensions of the

kinds of problems we need to handle.

Our goal is to develop a system to address the ad hoc retrieval

task. This is the most standard IR task. In it, a system aims to provide

documents from within the collection that are relevant to an arbitrary

user information need, communicated to the system by means of a one-

off, user-initiated query. An information need is the topic about which the

user desires to know more, and is differentiated from a query, which is

what the user conveys to the computer in an attempt to communicate

the information need. A document is relevant if it is one that the user

perceives as containing information of value with respect to their

personal information need. Our example above was rather artificial in

that the information need was defined in terms of particular words,

whereas usually a user is interested in a topic like “pipeline leaks” and

would like to find relevant documents regardless of whether they

precisely use those words or express the concept with other words such

as pipeline rupture. To assess the effectiveness of an IR system (i.e., the


quality of its search results), a user will usually want to know two key

statistics about the system's returned results for a query:

Precision: What fraction of the returned results are relevant to the

information need?

Recall: What fraction of the relevant documents in the collection

were returned by the system?
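As a small worked sketch of these two measures (with made-up document IDs), precision and recall can be computed directly from the returned set and the relevant set:

import java.util.*;

public class PrecisionRecall
{
    public static void main(String[] args)
    {
        // Hypothetical doc IDs: what the system returned vs. what is relevant.
        Set<Integer> returned = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5));
        Set<Integer> relevant = new HashSet<>(Arrays.asList(2, 4, 6, 8));

        Set<Integer> relevantReturned = new HashSet<>(returned);
        relevantReturned.retainAll(relevant);            // intersection

        double precision = (double) relevantReturned.size() / returned.size();
        double recall    = (double) relevantReturned.size() / relevant.size();

        System.out.println("Precision = " + precision);  // 2/5 = 0.4
        System.out.println("Recall    = " + recall);     // 2/4 = 0.5
    }
}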

Such a term-document matrix has half a trillion 0's and 1's, too many to fit in a

computer's memory. But the crucial observation is that the matrix is

extremely sparse, that is, it has few non-zero entries. Because each

document is 1000 words long, the matrix has no more than one billion

1's, so a minimum of 99.8% of the cells are zero. A much better

representation is to record only the things that do occur, that is, the 1 positions.

This idea is central to the first major concept in information

retrieval, the inverted index. The name is actually redundant: an index

always maps back from terms to the parts of a document where they

occur. Nevertheless, inverted index, or sometimes inverted file, has

become the standard term in information retrieval. We keep a dictionary

of terms (sometimes also referred to as a vocabulary or lexicon; here, we use dictionary for the data structure and vocabulary for the set

of terms). Then for each term, we have a list that records which

documents the term occurs in. Each item in the list - which records that

a term appeared in a document (and, later, often, the positions in the

document) - is conventionally called a posting. The list is then called a

postings list (or inverted list), and all the postings lists taken together are referred to

as the postings.


Figure 2.3 The two parts of an inverted index.


Chapter 3 Implementation Details

3.1 Determining the user terms

3.1.1 Tokenization

Given a character sequence and a defined document unit,

tokenization is the task of chopping it up into pieces, called tokens,

perhaps at the same time throwing away certain characters, such as

punctuation. For example, the character sequence to sleep perchance to dream would be chopped into the tokens to, sleep, perchance, to and dream.

These tokens are often loosely referred to as terms or words, but it is

sometimes important to make a type/token distinction. A token is an

instance of a sequence of characters in some particular document that

are grouped together as a useful semantic unit for processing. A type is

the class of all tokens containing the same character sequence. A term is

a (perhaps normalized) type that is included in the IR system’s

dictionary. The set of index terms could be entirely distinct from the

tokens, for instance, they could be semantic identifiers in a taxonomy,

but in practice in modern IR systems they are strongly related to the

tokens in the document. However, rather than being exactly the tokens

that appear in the document, they are usually derived from them by

various normalization processes. For example, if the document to be

indexed is to sleep perchance to dream, then there are 5 tokens, but only

4 types (since there are 2 instances of to). However, if to is omitted from

the index, then there will be only 3 terms: sleep, perchance, and dream.

The major question of the tokenization phase is what are the correct

tokens to use? In this example, it looks fairly trivial: you chop on

whitespace and throw away punctuation characters. This is a starting


point, but even for English there are a number of tricky cases. For

example, what do you do about the various uses of the apostrophe for

possession and contractions? Mr. O’Neill thinks that the boys’ stories

about Chile’s capital aren’t amusing.
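The sketch below shows one possible, deliberately naive tokenizer along the lines described here: it chops on whitespace, strips surrounding punctuation, lower-cases, and counts tokens versus types. Real tokenizers need extra rules for apostrophes, hyphens and similar tricky cases.

import java.util.*;

public class NaiveTokenizer
{
    // Split on whitespace, strip leading/trailing punctuation, lower-case.
    public static List<String> tokenize(String text)
    {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\s+"))
        {
            String t = raw.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "").toLowerCase();
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args)
    {
        List<String> tokens = tokenize("to sleep perchance to dream");
        Set<String> types = new HashSet<>(tokens);
        System.out.println(tokens.size() + " tokens, " + types.size() + " types");
        // prints "5 tokens, 4 types", matching the example in the text
    }
}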

3.1.2 Processing Boolean queries

How do we process a query using an inverted index and the basic

Boolean retrieval model? Consider processing the simple conjunctive

query:

1.1 Brutus AND Calpurnia

The intersection operation is the crucial one: we need to efficiently

intersect postings lists so as to be able to quickly find documents that

contain both terms. (This operation is sometimes referred to as merging

postings lists: this slightly counterintuitive name reflects using the term

merge algorithm for a general family of algorithms that combine multiple

sorted lists by interleaved advancing of pointers through each; here we

are merging the lists with a logical AND operation.)

There is a simple and effective method of intersecting postings lists using

the merge algorithm: we maintain pointers into both lists


and walk through the two postings lists simultaneously, in time linear in

the total number of postings entries. At each step, we compare the docID

pointed to by both pointers. If they are the same, we put that docID in

the results list, and advance both pointers. Otherwise we advance the

pointer pointing to the smaller docID. If the lengths of the postings lists

are x and y, the intersection takes O(x + y) operations. Formally, the

complexity of querying is Θ(N), where N is the number of documents in the collection. Our indexing methods gain us just a constant, not a difference in Θ time complexity compared to a linear scan, but in practice

the constant is huge. To use this algorithm, it is crucial that postings be

sorted by a single global ordering. Using a numeric sort by docID is one

simple way to achieve this. We can extend the intersection operation to

process more complicated queries like:

1.2 (Brutus OR Caesar) AND NOT Calpurnia

1.3 Brutus AND Caesar AND Calpurnia

1.4 (Calpurnia AND Brutus) AND Caesar

1.5 (madding OR crowd) AND (ignoble OR strife) AND (killed OR slain)
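A minimal version of the intersection (merge) algorithm described above, for two docID-sorted postings lists, is sketched below; the postings values are illustrative only, and the compound queries above would be handled by combining such pairwise operations.

import java.util.*;

public class PostingsIntersect
{
    // Intersect two postings lists sorted by docID in O(x + y) time.
    public static List<Integer> intersect(int[] p1, int[] p2)
    {
        List<Integer> answer = new ArrayList<>();
        int i = 0, j = 0;
        while (i < p1.length && j < p2.length)
        {
            if (p1[i] == p2[j]) { answer.add(p1[i]); i++; j++; }   // same docID
            else if (p1[i] < p2[j]) i++;                           // advance smaller
            else j++;
        }
        return answer;
    }

    public static void main(String[] args)
    {
        int[] brutus    = { 1, 2, 4, 11, 31, 45, 173, 174 };
        int[] calpurnia = { 2, 31, 54, 101 };
        System.out.println(intersect(brutus, calpurnia));   // [2, 31]
    }
}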


3.1.3 Schematic Representation of Our Approach

Figure 3.1: A Schematic Model of Our Approach

3.1.4 Methodology of Our Proposed Model

We have followed the existing process to get the DB/Indexes. Then

we will group or cluster the existing index_database by analyzing the

popularity of the page, the position and size of the search terms within

the page, and the proximity of the search terms to one another on the

page, and each cluster is associated with a set of keywords, which is

assumed to represent a concept, e.g. technology, science, arts, film,

medical, music, sex, photo and so on.

3.2 Our Proposed Model Tool

3.2.1 Cluster Processor

The Cluster Processor improves its performance automatically by learning relationships and associations within the stored data and forming the clusters; a statistical technique is used for identifying patterns and associations in complex data. It is somewhat difficult to accumulate enough exact user searches to form a cluster. The clustering process depends fully on fuzzy methods.

3.2.2 DB/Cluster

This is the second major module of our approach. It stores the patterns or clusters of data present on the web along with the corresponding URLs. The content inside the DB/Cluster is similar to the DB/Indexes, but the terms, strings or keywords that are related to a pattern or concept are grouped together in the same cluster, whereas in the DB/Indexes the index is sorted alphabetically by search term, with each index entry storing a list of documents in which the term appears and the location within the text where it occurs. The data structure used in the DB/Cluster allows rapid access to documents that contain user query terms.

3.3 Working Methodology

When the user gives any query string through the entry point of the search engine [12], the query engine filters those strings or keywords by analyzing them. This is also done through a learning process. Next, the Query Engine detects which clusters the searched string is associated with. The Query Engine then retrieves results from the relevant cluster only, without searching the entire DB/Indexes as in the previous architecture. In this way our methodology can give fast and relevant results.

One potential problem with this system is that a single string may be present in many clusters. For example, "Ferrari" is both a laptop model from Acer and a car model, so how will the Query Engine know which Ferrari the user is looking for? In this study we therefore store the frequency of each string, per cluster, in a file in the DB/Cluster, so that the Query Engine can compare the frequencies of the searched string across the matching clusters and return the results of the cluster with the higher number of occurrences. A small sketch of this comparison follows.
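The following is a minimal sketch of that frequency comparison, assuming per-cluster frequencies of the searched string are already available; the class and method names (ClusterDisambiguation, preferredCluster) and the frequency values are invented for illustration.

import java.util.*;

// Sketch: when a term occurs in several clusters, prefer the cluster
// in which the term occurs most frequently.
public class ClusterDisambiguation
{
    public static String preferredCluster(Map<String, Integer> termFrequencyPerCluster)
    {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : termFrequencyPerCluster.entrySet())
        {
            if (e.getValue() > bestCount)
            {
                bestCount = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args)
    {
        // Invented frequencies of the string "ferrari" in two clusters.
        Map<String, Integer> freq = new HashMap<String, Integer>();
        freq.put("technology", 120);
        freq.put("automobile", 5400);
        System.out.println(preferredCluster(freq)); // prints "automobile"
    }
}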


Chapter 4 Future Work and Conclusions

4.1 Future Work

There are several directions in which this research can proceed. In this thesis, we proposed a model for retrieval systems that is based on a simple document re-ranking method using local cluster analysis. Experimental results show that it is more effective than existing techniques. To exhibit the efficiency of the proposed architecture, we used a single clustering method (PDDP) to produce clusters that are tailored to the information need represented by the query. Afterwards, we utilized K-means with the PDDP clusters as the initial configuration (a hybrid approach) and showed that PDDP has the potential to improve results on its own.

In our approach, the context of a document is considered in the retrieved results through the combination of information search and local cluster analysis. First, the relevant cluster is tailored to the user's information need and improves the search results efficiently; second, it yields a high-precision system that places more relevant documents at the top of the result list. As was shown, even for the worst query, where the average precision decreased by 0.1982 percent, our system remains high-precision.

4.2 Conclusion

We will pursue this work in several directions. Firstly, the current methods for clustering search results are PDDP and hybrid K-means. Our experimental results have shown that PDDP is efficient for our purpose, but since the total size of the input in search-results clustering is small, we can afford more complex processing, which may achieve better results. Unlike previous clustering techniques that use a proximity measure between documents, a concept-driven approach tries to discover meaningful phrases that can become cluster descriptions and only then assigns documents to those phrases to form clusters. Using such concept-driven clustering approaches may be a useful direction for future work.

Secondly, we assumed that the search results contain two clusters (relevant and irrelevant). In some cases the irrelevant cluster can be split into further sub-clusters by semantic relations. Obtaining the optimal sub-clusters semantically could produce better results.

Thirdly, we re-ranked the results based on both clusters and then chose the better one manually. As mentioned before, we conjecture that the relevant cluster's centroid must be nearer to the query than the irrelevant cluster's centroid, so we can automatically choose the cluster whose centroid lies closer to the query as the relevant cluster; a small sketch of this choice is given below.
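A minimal sketch of that choice, assuming the query and the documents are already represented as term-weight vectors; the class name CentroidChoice, the cosine and centroid helpers and the toy vectors are our own illustrative assumptions.

// Sketch: pick the cluster whose centroid is nearer (by cosine similarity)
// to the query vector; that cluster is treated as the relevant one.
public class CentroidChoice
{
    static double cosine(double[] a, double[] b)
    {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++)
        {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    static double[] centroid(double[][] docs)
    {
        double[] c = new double[docs[0].length];
        for (double[] d : docs)
            for (int i = 0; i < d.length; i++)
                c[i] += d[i] / docs.length;
        return c;
    }

    public static void main(String[] args)
    {
        double[] query = {1.0, 0.8, 0.0};                         // toy term weights
        double[][] clusterA = {{0.9, 0.7, 0.1}, {1.0, 0.6, 0.0}}; // invented document vectors
        double[][] clusterB = {{0.0, 0.1, 1.0}, {0.1, 0.0, 0.9}};
        double simA = cosine(query, centroid(clusterA));
        double simB = cosine(query, centroid(clusterB));
        System.out.println(simA > simB ? "cluster A is relevant" : "cluster B is relevant");
    }
}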

Lastly, we evaluated the proposed architecture in ad hoc retrieval. As mentioned before, our approach is independent of the initial system architecture, so it can be embedded in any search engine. Web search engines are among the systems that most need high precision, so evaluating this approach on Web search engines would be a prominent piece of future work.


Appendices

import javax.swing.*;

import javax.swing.JScrollPane;

import java.awt.*;

import java.awt.event.*;

import java.util.*;

import java.io.*;

public class GDemo

{

public static void main(String args[])

{

SimpleFrame frame = new SimpleFrame();

frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);

frame.setVisible(true);

}

}

// Swing frame that reads a query such as "term1 AND term2" or "term1 OR term2"
// and lists the files that satisfy it.
class SimpleFrame extends JFrame implements ActionListener

{

public static HashMap<String,ArrayList> result= new HashMap<String, ArrayList>();

public static String token;

public static String op;

public static String searchstring;

public static final int DEFAULT_WIDTH = 600;

public static final int DEFAULT_HEIGHT = 400;

final JTextArea textArea;

final JTextField textField;

final JButton button;

public SimpleFrame()

{

setTitle("Information Retrival System");

setSize(DEFAULT_WIDTH, DEFAULT_HEIGHT);

textField = new JTextField(30);

Font f = new Font("Old Book Style", Font.BOLD, 12);


textField.setFont(f);

textArea = new JTextArea(20, 50);

JScrollPane scrollPane = new JScrollPane(textArea);

add(scrollPane,BorderLayout.CENTER);

textArea.setWrapStyleWord(true);

Font f1 = new Font("Old Book Style", Font.BOLD, 12);

textArea.setFont(f1);

JPanel panel = new JPanel();

JLabel label = new JLabel("Input Text: ");

panel.setLayout(new FlowLayout(FlowLayout.CENTER));

button = new JButton("Click Here");

panel.add(label);

panel.add(textField);

panel.add(button);

panel.add(textArea);

button.addActionListener(this);

Container cp=getContentPane();

cp.add(panel,BorderLayout.CENTER);

}//SimpleFrame()

// Handles the button click: parses the query, evaluates AND/OR over the
// per-term results, and displays the matching file paths.
public void actionPerformed(ActionEvent event)

{

Object sr=event.getSource();

if(sr==button)

{

textArea.setText("");

searchstring=textField.getText();

String tokens[]=searchstring.split(" ");

if(tokens.length > 2)

{

op=tokens[1];


//ArrayList list=searchText(tokens[1]);

result.put(tokens[0], searchText(tokens[0]));

result.put(tokens[2], searchText(tokens[2]));

if(op.equals("AND"))

{

HashSet<String> hs1= new

HashSet<String>(result.get(tokens[0]));

HashSet <String> hs2= new

HashSet<String>(result.get(tokens[2]));

hs1.retainAll(hs2);

//System.out.println("And "+hs1);

textArea.setText("");

//textArea.setText(hs1.toString());

for(String fileName: hs1)

textArea.append(fileName+"\n");

}

else if(op.equals("OR"))

{

HashSet<String> hs1= new

HashSet<String>(result.get(tokens[0]));

HashSet <String> hs2= new

HashSet<String>(result.get(tokens[2]));

hs1.addAll(hs2);

//System.out.println("OR" + hs1);

textArea.setText("");

//textArea.setText(hs1.toString());

for(String fileName: hs1)

textArea.append(fileName+"\n");

}

}

else

{

ArrayList list=searchText(searchstring);

textArea.setText("");

//textArea.setText(list.toString());

Iterator fileName=list.iterator();

while(fileName.hasNext())


{

//System.out.println(it.next());

textArea.append(fileName.next()+"\n");

}

}

//textArea.append(textField.getText()+"\n");

}

}//actionPerformed()

// Returns the paths of files in the test directory that contain the given term(s).
public ArrayList searchText(String args1)

{

//String args1=textField.getText();

//System.out.println("token="+args1);

String args[]=args1.split(" ");

for(int i=0;i<args.length;i++)

args[i]=args[i].toUpperCase();

ArrayList<String> filefound= new ArrayList<String>();

File f= new File("D:\\program\\Java\\Test"); // directory holding the documents to be searched

File[] files=f.listFiles();

for(File s: files)

{

for(int i=0;i<args.length;i++)

{

try

{

// avoid listing the same file twice when several terms match
if(search(s.getPath(),args[i]) && !filefound.contains(s.getPath()))
{
filefound.add(s.getPath());
}

}

catch(Exception e)
{
e.printStackTrace(); // report files that could not be read instead of ignoring them silently
}

}

}


textArea.append(filefound+"\n");

return filefound;

}//searchText()

// Tokenizes the file on spaces, commas and periods and reports whether it
// contains the given (upper-cased) token.
public boolean search(String file,String token) throws Exception

{

StringTokenizer st= null;

HashSet<String> set= new HashSet<String>();

BufferedReader br= new BufferedReader(new FileReader(file));

String line=null;

while((line=br.readLine())!=null)

{

st=new StringTokenizer(line," ,.");

while(st.hasMoreElements())

{

set.add((st.nextToken()).toUpperCase());

}

}

//System.out.println(set+"\n");

return set.contains(token);

}//search()

}//class SimpleFrame


Results


REFERENCES AND BIBLIOGRAPHY

1. Brin, Sergey and Page, Lawrence. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, April 1998.

2. A Novel Page Ranking Algorithm for Search Engines Using Implicit Feedback by

Shahram Rahimi, Bidyut Gupta, Kaushik Adya, Southern Illinois University, USA,

Engineering Letters, 13:3, EL_13_3_20 (Advance online publication: 4 November

2006)

3. Crawling the Web with Java by James Holmes, Chapter 6, Page: 2 & 3

4. Breadth-First Search Crawling Yields High-Quality Pages by Marc Najork and Janet

L. Wiener, Compaq Systems Research Center, USA

5. How Search Engines Work and a Web Crawler Application by Monica Peshave, Department of Computer Science, University of Illinois at Springfield, Springfield.

6. Search Engines for Intranets by K.T. Anuradha, National Centre for Science

Information (NCSI), Indian Institute of Science, Bangalore

7. Searching the Web by Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke and Sriram Raghavan, Computer Science Department, Stanford University.

8. Franklin, Curt. How Internet Search Engines Work, 2002. www.howstuffworks.com

9. Garcia-Molina, Hector. Searching the Web, August 2001

http://oak.cs.ucla.edu/~cho/papers/cho-toit01.pdf

10. Pant, Gautam, Padmini Srinivasan and Filippo Menczer: Crawling the Web, 2003.

http://dollar.biz.uiowa.edu/~pant/Papers/crawling.pdf

11. Retriever: Improving Web Search Engine Results Using Clustering by Anupam

Joshi, University of Maryland, USA and Zhihua Jiang, American Management

Systems, Inc., USA

12. Effective Web Crawling, PhD thesis by Carlos Castillo, Dept. of Computer Science, University of Chile, November 2004.

13. Design and Implementation of a High-Performance Distributed Web Crawler, Vladislav Shkapenyuk and Torsten Suel, CIS Department, Polytechnic University, Brooklyn, New York 11201.

14. R. Burke, K. Hammond, V. Kulyukin, S. Lytinen, N. Tomuro, and S. Schoenberg. Natural language processing in the FAQ Finder system: Results and prospects, 1997.

15. T. Calishain and R. Dornfest. Google Hacks: 100 Industrial-Strength Tips & Tools.

O’Reilly, ISBN 0596004478, 2003.

16. David Carmel, Eitan Farchi, Yael Petruschka, and Aya Soffer. Automatic query

refinement using lexical affinities with maximal information gain. In Proceedings of


the 25th annual international ACM SIGIR conference on Research and development in

information retrieval, pages 283–290. ACM Press, 2002.

17. Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data.

Morgan Kaufmann, 2002.

18. Soumen Chakrabarti, Martin van den Berg, and Byron Dom. Focused crawling: a

new approach to topic-specific Web resource discovery. Computer Networks

(Amsterdam, Netherlands: 1999), 31(11–16):1623–1640, 1999.

19. Michael Chau, Hsinchun Chen, Jialun Qin, Yilu Zhou, Yi Qin, Wai-Ki Sung, and Daniel McDonald. Comparison of two approaches to building a vertical search tool:

A case study in the nanotechnology domain. In Proceedings Joint Conference on

Digital Libraries, Portland, OR., 2002.

20. M. Keen C.W. Cleverdon, J. Mills. Factors determining the performance of indexing

systems. Volume I - Design, Volume II - Test Results, ASLIB Cranfield Project,

Reprinted in Sparck Jones & Willett, Readings in Information Retrieval, 1966.

21. B. D. Davison, D. G. Deschenes, and D. B. Lewanda. Finding relevant website

queries. In Proceedings of the twelfth international World Wide Web conference, 2003.

22. Daniel Dreilinger and Adele E. Howe. Experiences with selecting search engines

using metasearch. ACM Transactions on Information Systems, 15(3):195–222, 1997.

23. Cynthia Dwork, Ravi Kumar, Moni Naor, and D. Sivakumar. Rank aggregation

methods for the web. In Proceedings of the tenth international conference on World

Wide Web, pages 613–622. ACM Press, 2001.

24. B. Efron. Bootstrap methods: Another look at the jackknife. Annals of Statistics,

7(1):1–26, 1979.

25. Tina Eliassi-Rad and Jude Shavlik. Intelligent Web agents that learn to retrieve and

extract information. Physica-Verlag GmbH, 2003.

26. Oren Etzioni. Moving up the information food chain: Deploying softbots on the world

wide web. In Proceedings of the Thirteenth National Conference on Artificial

Intelligence and the Eighth Innovative Applications of Artificial Intelligence Conference,

pages 1322–1326, Menlo Park, 4– 8 1996. AAAI Press / MIT Press.

27. Ronald Fagin, Ravi Kumar, Kevin S. McCurley, Jasmine Novak, D. Sivakumar, John

A. Tomlin, and David P. Williamson. Searching the workplace web. In WWW ’03:

Proceedings of the twelfth international conference on World Wide Web, pages 366–

375. ACM Press, 2003.

28. Ronald Fagin, Ravi Kumar, and D. Sivakumar. Efficient similarity search and

classification via rank aggregation. In Proceedings of the 2003 ACM SIGMOD

international conference on on Management of data, pages 301–312. ACM Press,

2003.


29. A. Finn, N. Kushmerick, and B. Smyth. Genre classification and domain transfer for

information filtering. In Proc. 24th European Colloquium on Information Retrieval

Research, Glasgow, pages 353–362, 2002.

30. Aidan Finn and Nicholas Kushmerick. Learning to classify documents according to

genre. In IJCAI-03 Workshop on Computational Approaches to Style Analysis and

Synthesis, 2003.

31. C. Lee Giles, Kurt Bollacker, and Steve Lawrence. CiteSeer: An automatic citation

indexing system. In Ian Witten, Rob Akscyn, and Frank M. Shipman III, editors,

Digital Libraries 98 – The Third ACM Conference on Digital Libraries, pages 89–98, Pittsburgh, PA, June 23–26, 1998. ACM Press.

32. Eric Glover, Gary Flake, Steve Lawrence, William P. Birmingham, Andries Kruger, C.

Lee Giles, and David Pennock. Improving category specific web search by learning

query modifications. In Symposium on Applications and the Internet, SAINT, pages

23–31, San Diego, CA, January 8–12 2001. IEEE Computer Society, Los Alamitos,

CA.

33. Eric J. Glover, Steve Lawrence, William P. Birmingham, and C. Lee Giles.

Architecture of a metasearch engine that supports user information needs. In

Proceedings of the eighth international conference on Information and knowledge

management, pages 210–216. ACM Press, 1999.

34. Ayse Goker. Capturing information need by learning user context. In Sixteenth

International Joint Conference in Artificial Intelligence: Learning About Users

Workshop, pages 21–27, 1999.

35. Ayse Goker, Stuart Watt, Hans I. Myrhaug, Nik Whitehead, Murat Yakici, Ralf

Bierig, Sree Kanth Nuti, and Hannah Cumming. User context learning for intelligent

information retrieval. In EUSAI ’04: Proceedings of the 2nd European Union

symposium on Ambient intelligence, pages 19–24. ACM Press, 2004.

36. Google Web APIs. http://www.google.com/apis/.

37. Luis Gravano, Chen-Chuan K. Chang, Hector Garcia-Molina, and Andreas Paepcke.

Starts: Stanford proposal for internet meta-searching. In Proceedings of the 1997

ACM SIGMOD international conference on Management of data, pages 207–218. ACM

Press, 1997.

38. Robert H. Guttmann and Pattie Maes. Agent-mediated integrative negotiation for

retail electronic commerce. Lecture Notes in Computer Science, pages 70–90, 1999.

39. Monika Henzinger, Bay-Wei Chang, Brian Milch, and Sergey Brin. Query-free news

search. In Twelfth international World Wide Web Conference (WWW-2003), Budapest,

Hungary, May 20-24 2003.


40. Adele E. Howe and Daniel Dreilinger. SAVVYSEARCH: A metasearch engine that

learns which search engines to query. AI Magazine, 18(2):19–25, 1997.

41. Jianying Hu, Ramanujan Kashi, and Gordon T. Wilfong. Document classification

using layout analysis. In DEXA Workshop, pages 556–560, 1999.

42. David Hull. Using statistical testing in the evaluation of retrieval experiments. In

SIGIR ’93: Proceedings of the 16th annual international ACM SIGIR conference on

Research and development in information retrieval, pages 329–338. ACM Press,

1993.

43. Thorsten Joachims. Text categorization with support vector machines: Learning with

many relevant features. In Proceedings of the 10th European Conference on Machine

Learning, pages 137–142. Springer-Verlag, 1998.

44. Thorsten Joachims. Text categorization with support vector machines: learning with

many relevant features. In Claire Nédellec and Céline Rouveirol, editors, Proceedings of ECML-98, 10th European Conference on Machine Learning, pages 137–142, Chemnitz, DE, 1998. Springer Verlag, Heidelberg, DE.

45. George H. John, Ron Kohavi, and Karl Pfleger. Irrelevant features and the subset

selection problem. In International Conference on Machine Learning, pages 121–129,

1994.