A Seminar Report
On
Working of web search engine
Submitted in partial fulfillment of the requirement
For the award of the degree
Of Bachelor of Engineering
In Information Technology
Submitted by: Sachin Sharma, B.E. Final Year
Guide: Dr. K.R. Chowdhary, Professor, CSE Dept.
Department of Computer Science and Engineering
M.B.M. Engineering College, Faculty of Engineering,
Jai Narain Vyas University, Jodhpur (Rajasthan) – 342001
Session 2008-09
CANDIDATE’S DECLARATION
I hereby declare that the work which is being presented in the seminar entitled
“Working of Web Search Engine”, in partial fulfillment of the requirement for the
award of the degree of Bachelor of Engineering in Information Technology, submitted in
the Department of Computer Science and Engineering, M.B.M. Engineering College,
Jodhpur (Rajasthan), is an authentic record of my own work carried out during the
period from February 2009 to May 2009 under the supervision of Dr. K.R. Chowdhary,
Professor, Department of Computer Science and Engineering, M.B.M. Engineering
College, Jodhpur (Rajasthan).
The matter embodied in this project has not been submitted by me for the award of any
other degree. I also declare that the matter of the seminar is not ‘reproduced as it is’ from
any source.
Date:
Place: Jodhpur (SACHIN SHARMA)
CERTIFICATE
This is to certify that the above statement made by the candidate is correct to the best of
my knowledge.
Dr. K.R. Chowdhary
Professor
Department of Computer Science and Engineering
M.B.M. Engineering College,
Jodhpur (Rajasthan) – 342001
Contents
1. Introduction
2. Types of search engine
3. General system architecture of web search engine
Exploring the content of web pages for automatic indexing is of fundamental importance
for efficient e-commerce and other applications of the Web. It enables users, including
customers and businesses, to locate the best sources for their use. Today’s search
engines use one of two approaches to indexing web pages. They either:
analyze the frequency of the words (after filtering out common or meaningless
words) appearing in the entire text of the target web page or in a part of it
(typically a title, an abstract or the first 300 words), or
use sophisticated algorithms that take into account associations of words in the
indexed web page.
In both cases only words appearing in the web page in question are used in the
analysis. Often, to increase the relevance of the selected terms to potential searches,
the indexing is refined by human processing.
To identify so-called “authority” or “expert” pages, some search engines use the
structure of the links between pages to identify pages that are often referenced by other
pages. The approach used in the Google search engine assigns each page a score that
depends on the frequency with which the page would be visited by a web surfer
following links at random.
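To make this idea concrete, the sketch below computes such link-based scores with a simplified version of the PageRank iteration. The tiny link graph, the damping factor and the iteration count are illustrative assumptions, not values taken from any real engine.

# Simplified link-based scoring in the spirit of PageRank (illustrative only).
def page_scores(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * score[page] / len(targets)
                for target in targets:
                    new[target] += share
        score = new
    return score

# Hypothetical four-page web: well-linked pages end up with higher scores.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
for page, value in sorted(page_scores(links).items(), key=lambda x: -x[1]):
    print(page, round(value, 3))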
1. Introduction
A search engine is an information retrieval system designed to help find information
stored on a computer system. Search engines help to minimize the time required to find
information and the amount of information which must be consulted, akin to other
techniques for managing information overload. The most public, visible form of a search
engine is a Web search engine which searches for information on the World Wide Web.
Engineering a web search engine is a challenging task. Search engines index
tens to hundreds of millions of web pages involving a comparable number of distinct
terms. They answer tens of millions of queries every day. Despite the importance of
large-scale search engines on the web, very little academic research has been
conducted on them. Furthermore, due to rapid advances in technology and web
proliferation, creating a web search engine today is very different from doing so three years ago.
There are differences in the ways various search engines work, but they all perform
three basic tasks:
They search the Internet or select pieces of the Internet based on important
words.
They keep an index of the words they find, and where they find them.
They allow users to look for words or combinations of words found in that index.
The most important measures for a search engine are search performance, quality of
the results, and the ability to crawl and index the web efficiently. The primary goal is to
provide high quality search results over a rapidly growing World Wide Web. Some of the
efficient and recommended search engines are Google, Yahoo and Teoma, which
share some common features and are standardized to some extent.
2. Types of search engine
Search engines provide an interface to a group of items that enables users to specify
criteria about an item of interest and have the engine find the matching items. The
criteria are referred to as a search query. In the case of text search engines, the search
query is typically expressed as a set of words that identify the desired concept that one
or more documents may contain. There are several styles of search query syntax that
vary in strictness. Whereas some text search engines require users to enter two or three words
separated by white space, other search engines may enable users to specify entire
documents, pictures, sounds, and various forms of natural language. Some search
engines apply improvements to search queries to increase the likelihood of providing a
quality set of items through a process known as query expansion.
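As a toy illustration of query expansion, the sketch below widens each query term with hand-picked related terms. The synonym table is an assumption made up for the example; real engines derive such relations from thesauri, query logs or co-occurrence statistics.

# Toy query expansion using a hand-made synonym table (illustrative only).
EXPANSIONS = {
    "car": ["automobile", "vehicle"],
    "cheap": ["inexpensive", "affordable"],
}

def expand_query(query):
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        expanded.extend(EXPANSIONS.get(term, []))
    return expanded

print(expand_query("cheap car rental"))
# ['cheap', 'inexpensive', 'affordable', 'car', 'automobile', 'vehicle', 'rental']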
3. General system architecture of web search engine
This section provides an overview of how the whole system of a search engine works.
The major functions of a search engine, namely crawling, indexing and searching, are
covered in detail in the later sub-sections.
Before a search engine can tell you where a file or document is, it must be found.
To find information on the hundreds of millions of Web pages that exist, a typical search
engine employs special software robots, called spiders, to build lists of the words found
on Websites. When a spider is building its lists, the process is called Web crawling. A
Web crawler is a program, which automatically traverses the web by downloading
documents and following links from page to page. They are mainly used by web search
engines to gather data for indexing. Other possible applications include page validation,
structural analysis and visualization; update notification, mirroring and personal web
assistants/agents etc. Web crawlers are also known as spiders, robots, worms etc.
Crawlers are automated programs that follow the links found on the web pages. There
is a URL Server that sends lists of URLs to be fetched to the crawlers. The web pages
that are fetched are then sent to the store server. The store server then compresses
and stores the web pages into a repository. Every web page has an associated ID
number called a doc ID, which is assigned whenever a new URL is parsed out of a web
page. The indexer and the sorter perform the indexing function.
The indexer performs a number of functions. It reads the repository,
uncompresses the documents, and parses them. Each document is converted into a set
of word occurrences called hits. The hits record the word, position in document, an
approximation of font size, and capitalization. The indexer distributes these hits into a
set of "barrels", creating a partially sorted forward index.
The indexer performs another important function. It parses out all the links in
every web page and stores important information about them in an anchors file. This file
contains enough information to determine where each link points from and to, and the
text of the link. The URL Resolver reads the anchors file and converts relative URLs into
absolute URLs and in turn into doc IDs. It puts the anchor text into the forward index,
associated with the doc ID that the anchor points to.
It also generates a database of links, which are pairs of doc IDs. The links
database is used to compute Page Ranks for all the documents. The sorter takes the
barrels, which are sorted by doc ID and resorts them by word ID to generate the
inverted index. This is done in place so that little temporary space is needed for this
operation. The sorter also produces a list of word IDs and offsets into the inverted index.
A program called Dump Lexicon takes this list together with the lexicon produced by the
indexer and generates a new lexicon to be used by the searcher.
A lexicon lists all the terms occurring in the index along with some term-level
statistics (e.g., total number of documents in which a term occurs) that are used by the
ranking algorithms. The searcher is run by a web server and uses the lexicon built by
Dump Lexicon together with the inverted index and the Page Ranks to answer queries.
3.1. Web crawling
Web crawlers are an essential component of search engines; running a web crawler is a
challenging task. There are tricky performance and reliability issues and even more
importantly, there are social issues. Crawling is the most fragile application since it
involves interacting with hundreds of thousands of web servers and various name
servers, which are all beyond the control of the system. Web crawling speed is
governed not only by the speed of one’s own Internet connection, but also by the speed
of the sites that are to be crawled. Especially when one is crawling pages from multiple
servers, the total crawling time can be significantly reduced if many downloads are
done in parallel. Despite the numerous applications for Web crawlers, at the core they
are all fundamentally the same. Following is the process by which Web crawlers work:
Download the Web page.
Parse through the downloaded page and retrieve all the links.
For each link retrieved, repeat the process.
The Web crawler can be used for crawling through a whole site on the Inter-/Intranet.
You specify a start-URL and the Crawler follows all links found in that HTML page. This
usually leads to more links, which will be followed again, and so on. A site can be seen
as a tree-structure, the root is the start-URL; all links in that root-HTML-page are direct
sons of the root. Subsequent links are then sons of the previous sons.
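A minimal sketch of this download-parse-follow loop, using only Python's standard library, is shown below. The start-URL and page limit are placeholders; a real crawler would also add politeness delays, robots.txt checks and robust error handling.

# Minimal breadth-first crawler sketch (standard library only).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    seen, queue = {seed}, deque([seed])
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                      # skip pages that fail to download
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        print("crawled:", url)
    return seen

# crawl("http://example.com/")            # hypothetical start-URL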
3.1.1. Types of crawling
Crawlers are basically of two types.
3.1.1.1. Focused crawling
A general-purpose Web crawler gathers as many pages as it can from a particular set of
URLs, whereas a focused crawler is designed to gather documents only on a specific
topic, thus reducing the amount of network traffic and downloads. The goal of the
focused crawler is to selectively seek out pages that are relevant to a pre-defined set of
topics. The topics are specified not using keywords, but using exemplary documents.
Rather than collecting and indexing all accessible web documents to be able to
answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to
find the links that are likely to be most relevant for the crawl, and avoids irrelevant
regions of the web. This leads to significant savings in hardware and network resources,
and helps keep the crawl more up-to-date. The focused crawler has three main
components: a classifier, which makes relevance judgments on crawled pages to decide
on link expansion; a distiller, which determines a measure of centrality of crawled
pages to determine visit priorities; and a crawler with dynamically reconfigurable priority
controls, which is governed by the classifier and distiller.
The most crucial evaluation of focused crawling is to measure the harvest ratio,
which is the rate at which relevant pages are acquired and irrelevant pages are
effectively filtered off from the crawl. This harvest ratio must be high, otherwise the
focused crawler would spend a lot of time merely eliminating irrelevant pages, and it
may be better to use an ordinary crawler instead.
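In symbols, the harvest ratio is usually defined as the fraction of downloaded pages that turn out to be relevant:

\[
\text{harvest ratio} = \frac{\text{number of relevant pages downloaded}}{\text{total number of pages downloaded}}
\]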
3.1.1.2. Distributed crawling
Indexing the web is a challenge due to its growing and dynamic nature. As the size of
the Web is growing it has become imperative to parallelize the crawling process in order
to finish downloading the pages in a reasonable amount of time. A single crawling
process, even if multithreading is used, will be insufficient for large-scale engines that
need to fetch large amounts of data rapidly. When a single centralized crawler is used
all the fetched data passes through a single physical link. Distributing the crawling
activity via multiple processes can help build a scalable, easily configurable system,
which is also fault tolerant. Splitting the load decreases hardware requirements and
at the same time increases the overall download speed and reliability. Each task is
performed in a fully distributed fashion, that is, no central coordinator exists.
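One common way to achieve such coordinator-free distribution is to partition the URL space by hashing, so that every process can decide locally which URLs it is responsible for. The sketch below assumes three crawler processes purely for illustration.

# URL partitioning by host-name hash (the number of crawlers is an assumption).
import hashlib
from urllib.parse import urlparse

NUM_CRAWLERS = 3

def responsible_crawler(url):
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_CRAWLERS

for url in ["http://example.com/a", "http://example.org/b", "http://example.com/c"]:
    print(url, "-> crawler", responsible_crawler(url))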
3.1.2. Robot exclusion protocol
Web sites also often have restricted areas that crawlers should not crawl. To address
these concerns, many Web sites adopted the Robot protocol, which establishes
guidelines that crawlers should follow. Over time, the protocol has become the unwritten
law of the Internet for Web crawlers. The Robot protocol specifies that Web sites
wishing to restrict certain areas or pages from crawling have a file called robots.txt
placed at the root of the Web site. The ethical crawlers will then skip the disallowed
areas. Following is an example robots.txt file and an explanation of its format:
# Robots.txt for http://somehost.com/
User-agent: *
Disallow: /cgi-bin/
Disallow: /registration # Disallow robots on registration page
Disallow: /login
The first line of the sample file has a comment on it, as denoted by the use of a hash (#)
character. Crawlers reading robots.txt files should ignore any comments. The next line
of the sample file specifies the User-agent to which the Disallow rules following it apply.
User-agent is a term used for the programs that access a Web site. Each browser has a
unique User-agent value that it sends along with each request to a Web server.
However, typically Web sites want to disallow all robots (or User-agents) access to
certain areas, so they use a value of asterisk (*) for the User-agent. This specifies that
the rules that follow apply to all User-agents. The lines following the User-
agent lines are called disallow statements. The disallow statements define the Web site
paths that crawlers are not allowed to access. For example, the first disallow statement
in the sample file tells crawlers not to crawl any links that begin with “/cgi-bin/”. Thus,
the following URLs are both off limits to crawlers according to that line.
http://somehost.com/cgi-bin
http://somehost.com/cgi-bin/register
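Python's standard library includes a parser for this protocol; the sketch below feeds it the sample robots.txt from above and checks two of the URLs, which is roughly what an ethical crawler does before every download.

# Checking URLs against the sample robots.txt with urllib.robotparser.
from urllib.robotparser import RobotFileParser

robots_txt = """\
# Robots.txt for http://somehost.com/
User-agent: *
Disallow: /cgi-bin/
Disallow: /registration
Disallow: /login
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

for url in ["http://somehost.com/cgi-bin/register", "http://somehost.com/index.html"]:
    print(url, "allowed:", parser.can_fetch("MyCrawler", url))
# The /cgi-bin/ URL is disallowed for all user agents; the home page is allowed.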
3.1.3. Resource Constraints
Crawlers consume resources: network bandwidth to download pages, memory to
maintain private data structures in support of their algorithms, CPU to evaluate and
select URLs, and disk storage to store the text and links of fetched pages as well as
other persistent data.
3.2. Web Indexing
Search engine indexing collects, parses, and stores data to facilitate fast and accurate
information retrieval. Index design incorporates interdisciplinary concepts from
linguistics, cognitive psychology, mathematics, informatics, physics and computer
science. An alternate name for the process in the context of search engines designed to
find web pages on the Internet is Web indexing.
The purpose of storing an index is to optimize speed and performance in finding
relevant documents for a search query. Without an index, the search engine would scan
every document in the corpus, which would require considerable time and computing
power. For example, while an index of 10,000 documents can be queried within
milliseconds, a sequential scan of every word in 10,000 large documents could take
hours. The additional computer storage required to store the index, as well as the
considerable increase in the time required for an update to take place, are traded off for
the time saved during information retrieval.
3.2.1. Index design factors
Major factors in designing a search engine's architecture include:
Merge factors : How data enters the index, or how words or subject features are
added to the index during text corpus traversal, and whether multiple indexers
can work asynchronously. The indexer must first check whether it is updating old
content or adding new content. Traversal typically correlates to the data
collection policy. Search engine index merging is similar in concept to the SQL
Merge command and other merge algorithms.
Storage techniques : How to store the index data, that is, whether information
should be data compressed or filtered.
Index size : How much computer storage is required to support the index.
Lookup speed : How quickly a word can be found in the inverted index. The
speed of finding an entry in a data structure, compared with how quickly it can be
updated or removed, is a central focus of computer science.
Maintenance : How the index is maintained over time.
Fault tolerance : How important it is for the service to be reliable. Issues include
dealing with index corruption, determining whether bad data can be treated in
isolation, dealing with bad hardware, partitioning, and schemes such as hash-
based or composite partitioning, as well as replication.
3.2.2. Index data structures
Search engine architectures vary in the way indexing is performed and in methods of
index storage to meet the various design factors. Types of indices include:
Suffix tree : It is figuratively structured like a tree, supports linear time lookup.
Built by storing the suffixes of words. The suffix tree is a type of trie. Tries
support extendable hashing, which is important for search engine indexing. Used
for searching for patterns in DNA sequences and clustering. A major drawback is
that the storage of a word in the tree may require more storage than storing the
word itself. An alternate representation is a suffix array, which is considered to
require less virtual memory and supports data compression such as the BWT
algorithm.
Trie : An ordered tree data structure that is used to store an associative array
where the keys are strings. Regarded as faster than a hash table but less space-
efficient.
Inverted index : Stores a list of occurrences of each atomic search criterion,
typically in the form of a hash table or binary tree
Citation index : Stores citations or hyperlinks between documents to support
citation analysis, a subject of Bibliometrics.
N-gram index : Stores sequences of length n of data to support other types of
retrieval or text mining; a small sketch follows this list.
Term document matrix : Used in latent semantic analysis, stores the occurrences
of words in documents in a two-dimensional sparse matrix.
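As a small illustration of the n-gram idea, the sketch below indexes a made-up vocabulary by character trigrams, which can be used, for example, to find index terms resembling a misspelled query word.

# Character-trigram index over an illustrative vocabulary.
def trigrams(term):
    padded = f"  {term} "          # pad so word boundaries form trigrams too
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def build_ngram_index(vocabulary):
    index = {}
    for term in vocabulary:
        for gram in trigrams(term):
            index.setdefault(gram, set()).add(term)
    return index

index = build_ngram_index(["search", "engine", "indexing", "crawler"])

query = "serach"                    # misspelled query word
candidates = set()
for gram in trigrams(query):
    candidates |= index.get(gram, set())
print(candidates)                   # {'search'} - it shares several trigrams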
3.2.3. Types of indexing
Indexing is basically of two types.
3.2.3.1. Inverted Index:
Many search engines incorporate an inverted index when evaluating a search query to
quickly locate documents containing the words in a query and then rank these
documents by relevance. Because the inverted index stores a list of the documents
containing each word, the search engine can use direct access to find the documents
associated with each word in the query in order to retrieve the matching documents
quickly. The following is a simplified illustration of an inverted index:
Word Documents
the Doc1, Doc3, Doc4, Doc5
cow Doc2, Doc3, Doc4
says Doc5
moo Doc7
This index can only determine whether a word exists within a particular document,
since it stores no information regarding the frequency and position of the word; it is
therefore considered to be a Boolean index. Such an index determines which
documents match a query but does not rank matched documents. In some designs the
index includes additional information such as the frequency of each word in each
document or the positions of a word in each document. Position information enables the
search algorithm to identify word proximity to support searching for phrases; frequency
can be used to help in ranking the relevance of documents to the query. Such topics are
the central research focus of information retrieval. The inverted index is a sparse matrix,
since not all words are present in each document. To reduce computer storage memory
requirements, it is stored differently from a two dimensional array. The index is similar to
the term document matrices employed by latent semantic analysis. The inverted index
can be considered a form of a hash table. In some cases the index is a form of a binary
tree, which requires additional storage but may reduce the lookup time. In larger indices
the architecture is typically a distributed hash table.
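Using the small inverted index from the table above, the sketch below answers a two-word Boolean query by intersecting the posting lists of the query words, which is the direct access described in this section.

# Boolean AND retrieval over the example inverted index.
inverted_index = {
    "the":  {"Doc1", "Doc3", "Doc4", "Doc5"},
    "cow":  {"Doc2", "Doc3", "Doc4"},
    "says": {"Doc5"},
    "moo":  {"Doc7"},
}

def boolean_and(query, index):
    postings = [index.get(word, set()) for word in query.lower().split()]
    if not postings:
        return set()
    result = postings[0]
    for posting in postings[1:]:
        result = result & posting
    return result

print(boolean_and("the cow", inverted_index))   # {'Doc3', 'Doc4'}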
The inverted index is filled via a merge or rebuild. A rebuild is similar to a merge
but first deletes the contents of the inverted index. The architecture may be designed to
support incremental indexing, where a merge identifies the document or documents to
be added or updated and then parses each document into words. For technical
accuracy, a merge conflates newly indexed documents, typically residing in virtual
memory, with the index cache residing on one or more computer hard drives.
After parsing, the indexer adds the referenced document to the document list for
the appropriate words. In a larger search engine, the process of finding each word in the
inverted index (in order to report that it occurred within a document) may be too time
consuming, and so this process is commonly split up into two parts, the development of
a forward index and a process which sorts the contents of the forward index into the
inverted index. The inverted index is so named because it is an inversion of the forward
index.
3.2.3.2. Forward Index:
The forward index stores a list of words for each document. The following is a simplified
form of the forward index:
Document Words
Doc1 the, cow, says, moo
Doc2 the, cat, and, the, hat
Doc3 the, dish, ran, away, with, the, spoon
The rationale behind developing a forward index is that as documents are parsed, it is
better to immediately store the words per document. The delineation enables
asynchronous system processing, which partially circumvents the inverted index update
bottleneck. The forward index is sorted to transform it to an inverted index. The forward
index is essentially a list of pairs consisting of a document and a word, collated by the
document. Converting the forward index to an inverted index is only a matter of sorting
the pairs by the words. In this regard, the inverted index is a word-sorted forward index.
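The sketch below carries out exactly this inversion on the forward-index table shown above: the forward index is flattened into (word, document) pairs, the pairs are sorted by word, and the sorted pairs are grouped back into posting lists.

# Converting the example forward index into an inverted index by sorting.
forward_index = {
    "Doc1": ["the", "cow", "says", "moo"],
    "Doc2": ["the", "cat", "and", "the", "hat"],
    "Doc3": ["the", "dish", "ran", "away", "with", "the", "spoon"],
}

# 1. Flatten into (word, document) pairs.
pairs = [(word, doc) for doc, words in forward_index.items() for word in words]

# 2. Sort the pairs by word - this is the inversion step.
pairs.sort()

# 3. Group documents under each word to obtain posting lists.
inverted_index = {}
for word, doc in pairs:
    postings = inverted_index.setdefault(word, [])
    if doc not in postings:
        postings.append(doc)

print(inverted_index["the"])        # ['Doc1', 'Doc2', 'Doc3']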
3.2.4. Latent Semantic Indexing (LSI)
3.2.4.1. What is LSI:
Regular keyword searches approach a document collection with a kind of accountant
mentality: a document contains a given word or it doesn't, without any middle ground.
We create a result set by looking through each document in turn for certain keywords
and phrases, tossing aside any documents that don't contain them, and ordering the
rest based on some ranking system. Each document stands alone in judgement before
the search algorithm - there is no interdependence of any kind between documents,
which are evaluated solely on their contents.
Latent semantic indexing adds an important step to the document indexing
process. In addition to recording which keywords a document contains, the method
examines the document collection as a whole, to see which other documents contain
some of those same words. LSI considers documents that have many words in common
to be semantically close, and ones with few words in common to be semantically
distant. This simple method correlates surprisingly well with how a human being, looking
at content, might classify a document collection. Although the LSI algorithm doesn't
understand anything about what the words mean, the patterns it notices can make it
seem astonishingly intelligent.
When you search an LSI-indexed database, the search engine looks at similarity values
it has calculated for every content word, and returns the documents that it thinks best fit
the query. Because two documents may be semantically very close even if they do not
share a particular keyword, LSI does not require an exact match to return useful results.
Where a plain keyword search will fail if there is no exact match, LSI will often return
relevant documents that don't contain the keyword at all.
For example, let's say we use LSI to index a collection of mathematical
articles. If the words n-dimensional, manifold and topology appear together in enough
articles, the search algorithm will notice that the three terms are semantically close. A
search for n-dimensional manifolds will therefore return a set of articles containing that
phrase (the same result we would get with a regular search), but also articles that
contain just the word topology. The search engine understands nothing about
mathematics, but examining a sufficient number of documents teaches it that the three
terms are related. It then uses that information to provide an expanded set of results
with better recall than a plain keyword search.
3.2.4.2 How LSI Works:
Natural language is full of redundancies, and not every word that appears in a
document carries semantic meaning. In fact, the most frequently used words in English
are words that don't carry content at all: functional words, conjunctions, prepositions,
auxiliary verbs and others. The first step in doing LSI is culling all those extraneous
words from a document, leaving only content words likely to have semantic meaning.
There are many ways to define a content word - here is one recipe for generating a list
of content words from a document collection:
Make a complete list of all the words that appear anywhere in the collection
Discard articles, prepositions, and conjunctions
Discard common verbs (know, see, do, be)
Discard pronouns
Discard common adjectives (big, late, high)
Discard frilly words (therefore, thus, however, albeit, etc.)
Discard any words that appear in every document
Discard any words that appear in only one document
This process condenses our documents into sets of content words that we can then use
to index our collection.
Using our list of content words and documents, we can now generate a term-
document matrix. This is a fancy name for a very large grid, with documents listed
along the horizontal axis, and content words along the vertical axis. For each content
word in our list, we go across the appropriate row and put an 'X' in the column for any
document where that word appears. If the word does not appear, we leave that column
blank.
Doing this for every word and document in our collection gives us a mostly empty
grid with a sparse scattering of X-es. This grid displays everything that we know about
our document collection. We can list all the content words in any given document by
looking for X-es in the appropriate column, or we can find all the documents containing
a certain content word by looking across the appropriate row.
Notice that our arrangement is binary - a square in our grid either contains an X, or it
doesn't. This big grid is the visual equivalent of a generic keyword search, which looks
for exact matches between documents and keywords. If we replace blanks and X-es
with zeroes and ones, we get a numerical matrix containing the same information.
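The sketch below builds such a binary term-document matrix for three tiny, made-up documents; each row is a content word, each column a document, and a 1 marks an occurrence.

# Binary term-document matrix for three invented mini-documents.
documents = {
    "doc1": ["topology", "manifold", "dimension"],
    "doc2": ["manifold", "surface"],
    "doc3": ["topology", "surface", "dimension"],
}

terms = sorted({word for words in documents.values() for word in words})
doc_names = sorted(documents)

matrix = [[1 if term in documents[doc] else 0 for doc in doc_names]
          for term in terms]

print("          ", doc_names)
for term, row in zip(terms, matrix):
    print(f"{term:10}", row)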
The key step in LSI is decomposing this matrix using a technique called singular value
decomposition. The mathematics of this transformation is beyond the scope of this
article.
Imagine that you are curious about what people typically order for breakfast
down at your local diner, and you want to display this information in visual form. You
decide to examine all the breakfast orders from a busy weekend day, and record how
many times the words bacon, eggs and coffee occur in each order.
You can graph the results of your survey by setting up a chart with three orthogonal
axes - one for each keyword. The choice of direction is arbitrary - perhaps a bacon axis
in the x direction, an eggs axis in the y direction, and the all-important coffee axis in the
z direction. To plot a particular breakfast order, you count the occurrence of each
keyword, and then take the appropriate number of steps along the axis for that word.
When you are finished, you get a cloud of points in three-dimensional space,
representing all of that day's breakfast orders.
If you draw a line from the origin of the graph to each of these points, you obtain
a set of vectors in 'bacon-eggs-and-coffee' space. The size and direction of each vector
tells you how many of the three key items were in any particular order, and the set of all
the vectors taken together tells you something about the kind of breakfast people favor
on a Saturday morning.
What your graph shows is called a term space. Each breakfast order forms a vector in
that space, with its direction and magnitude determined by how many times the three
keywords appear in it. Each keyword corresponds to a separate spatial direction,
perpendicular to all the others. Because our example uses three keywords, the resulting
term space has three dimensions, making it possible for us to visualize it. It is easy to
see that this space could have any number of dimensions, depending on how many
keywords we chose to use. If we were to go back through the orders and also record
occurrences of sausage, muffin, and bagel, we would end up with a six-dimensional
term space, and six-dimensional document vectors.
Applying this procedure to a real document collection, where we note each use of a
content word, results in a term space with many thousands of dimensions. Each
document in our collection is a vector with as many components as there are content
words. Although we can't possibly visualize such a space, it is built in the exact same
way as the whimsical breakfast space we just described. Documents in such a space
that have many words in common will have vectors that are near to each other, while
documents with few shared words will have vectors that are far apart.
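To see this in code, the sketch below turns a few invented breakfast orders into vectors in the three-keyword term space described above and compares them using the cosine of the angle between them; orders that share keywords come out as more similar.

# Breakfast orders as vectors in a three-keyword term space (invented data).
import math

KEYWORDS = ["bacon", "eggs", "coffee"]

def order_vector(order_text):
    words = order_text.lower().split()
    return [words.count(keyword) for keyword in KEYWORDS]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

order1 = order_vector("two eggs with bacon and coffee")
order2 = order_vector("coffee and eggs please")
order3 = order_vector("just coffee")

print(order1, order2, order3)                                # [1, 1, 1] [0, 1, 1] [0, 0, 1]
print("order1 vs order2:", round(cosine_similarity(order1, order2), 3))   # 0.816
print("order1 vs order3:", round(cosine_similarity(order1, order3), 3))   # 0.577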
Latent semantic indexing works by projecting this large, multidimensional space down
into a smaller number of dimensions. In doing so, keywords that are semantically similar
will get squeezed together, and will no longer be completely distinct. This blurring of
boundaries is what allows LSI to go beyond straight keyword matching. To understand
how it takes place, we can use another analogy.
3.2.4.3. Singular Value Decomposition:
Imagine you keep tropical fish, and are proud of your prize aquarium - so proud that you
want to submit a picture of it to Modern Aquaria magazine, for fame and profit. To get
the best possible picture, you will want to choose a good angle from which to take the
photo. You want to make sure that as many of the fish as possible are visible in your
picture, without being hidden by other fish in the foreground. You also won't want the
fish all bunched together in a clump, but rather shot from an angle that shows them
nicely distributed in the water. Since your tank is transparent on all sides, you can take
a variety of pictures from above, below, and from all around the aquarium, and select
the best one.
In mathematical terms, you are looking for an optimal mapping of points in 3-space (the
fish) onto a plane (the film in your camera). 'Optimal' can mean many things - in this
case it means 'aesthetically pleasing'. But now imagine that your goal is to preserve the
relative distance between the fish as much as possible, so that fish on opposite sides of
the tank don't get superimposed in the photograph to look like they are right next to
each other. Here you would be doing exactly what the SVD algorithm tries to do with a
much higher-dimensional space.
Instead of mapping 3-space to 2-space, however, the SVD algorithm goes to much
greater extremes. A typical term space might have tens of thousands of dimensions,
and be projected down into fewer than 150. Nevertheless, the principle is exactly the
same. The SVD algorithm preserves as much information as possible about the relative
distances between the document vectors, while collapsing them down into a much
smaller set of dimensions. In this collapse, information is lost, and content words are
superimposed on one another.
Information loss sounds like a bad thing, but here it is a blessing. What we are losing is
noise from our original term-document matrix, revealing similarities that were latent in
the document collection. Similar things become more similar, while dissimilar things
remain distinct. This reductive mapping is what gives LSI its seemingly intelligent
behavior of being able to correlate semantically related terms. We are really exploiting a
property of natural language, namely that words with similar meaning tend to occur
together.
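A minimal sketch of this dimension-reduction step, using numpy's singular value decomposition on a tiny invented term-document matrix, is shown below. A real LSI index would start from thousands of rows and keep on the order of a hundred dimensions rather than the two kept here.

# Truncated SVD on a tiny invented term-document matrix.
import numpy as np

# rows = terms, columns = documents (binary occurrence matrix)
A = np.array([
    [1, 0, 1, 0],    # topology
    [1, 1, 0, 0],    # manifold
    [0, 1, 0, 1],    # surface
    [1, 0, 1, 1],    # dimension
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                       # number of dimensions to keep
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("singular values:", np.round(s, 2))
print("rank-2 approximation of the matrix:")
print(np.round(A_k, 2))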
While a discussion of the mathematics behind singular value decomposition is beyond
the scope of our article, it's worthwhile to follow the process of creating a term-
document matrix in some detail, to get a feel for what goes on behind the scenes. Here
we will process a sample wire story to demonstrate how real-life texts get converted into
the numerical representation we use as input for our SVD algorithm.
The first step in the chain is obtaining a set of documents in electronic form. This can be
the hardest thing about LSI - there are all too many interesting collections not yet
available online. In our experimental database, we download wire stories from an online
newspaper with an AP news feed. A script downloads each day's news stories to a local
disk, where they are stored as text files.
Let's imagine we have downloaded the following sample wire story, and want to incorporate it in our
collection:
O'Neill Criticizes Europe on Grants
PITTSBURGH (AP)
Treasury Secretary Paul O'Neill expressed irritation on Wednesday that European countries have refused to go along with a U.S. proposal to boost the amount of direct grants rich nations offer poor countries. The Bush administration is pushing a plan
to increase the amount of direct grants the World Bank provides the poorest nations to 50 percent of assistance, reducing use of loans to these nations.
The first thing we do is strip all formatting from the article,
including capitalization, punctuation, and extraneous markup
(like the dateline). LSI pays no attention to word order,
formatting, or capitalization, so we can safely discard that
information. Our cleaned-up wire story looks like this:
O’Neill criticizes Europe on grants treasury secretary Paul O’Neill expressed irritation Wednesday that European countries have refused to go along with a us proposal to boost the amount of direct grants rich nations offer poor countries the bush administration is pushing a plan to increase the amount of direct grants the world bank provides the poorest nations to 50 percent of assistance reducing use of loans to these nations
The next thing we want to do is pick out the content words in our article. These are the
words we consider semantically significant - everything else is clutter. We do this by
applying a stop list of commonly used English words that don't carry semantic meaning. Using a stop
list greatly reduces the amount of noise in our collection, as well as eliminating a large number of words that
would make the computation more difficult. Creating a stop list is something of an art - it depends very
much on the nature of the data collection.
Here is our sample story again, with the stop-list words still included:
O’Neill criticizes Europe on grants treasury secretary Paul O’Neill expressed irritation Wednesday that European countries have refused to go along with a US proposal to boost the amount of direct grants rich nations offer poor countries
the bush administration is pushing a plan to increase the amount of direct grants the world bank provides the poorest nations to 50 percent of assistance reducing use of loans to these nations
Removing these stop words leaves us with an abbreviated version of the article containing content words
only:
O’Neill criticizes Europe grants treasury secretary Paul O’Neill expressed irritation European countries refused US proposal boost direct grants rich nations poor countries bush administration pushing plan increase amount direct grants world bank poorest nations assistance loans nations
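In code, the stop-word step amounts to filtering the cleaned-up text against the stop list; the list below is a small illustrative subset, not the article's full stop list.

# Removing stop-list words from (part of) the cleaned-up wire story.
STOP_WORDS = {"on", "that", "have", "to", "go", "along", "with", "a", "the", "of"}

story = ("oneill criticizes europe on grants treasury secretary paul oneill "
         "expressed irritation that european countries have refused to go along "
         "with a us proposal to boost the amount of direct grants rich nations "
         "offer poor countries")

content_words = [word for word in story.split() if word not in STOP_WORDS]
print(content_words)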
However, one more important step remains before our document is ready for indexing.
Notice how many of our content words are plural nouns (grants, nations) and
inflected verbs (pushing, refused). It doesn't seem very useful to have each inflected
form of a content word be listed separately in our master word list - with all the possible
variants, the list would soon grow unwieldy. More troubling is that LSI might not
recognize that the different variant forms were actually the same word in disguise. We
solve this problem by using a stemmer.
3.2.4.4. Stemming:
While LSI itself knows nothing about language (we saw how it deals exclusively with a
mathematical vector space), some of the preparatory work needed to get documents
ready for indexing is very language-specific. We have already seen the need for a stop
list, which will vary entirely from language to language and to a lesser extent from
document collection to document collection. Stemming is similarly language-specific,
derived from the morphology of the language. For English documents, we use an
algorithm called the Porter stemmer to remove common endings from words, leaving
behind an invariant root form. Here is our sample story as it appears to the stemmer:
O’Neill criticizes Europe grants treasury secretary Paul O’Neill expressed irritation European countries refused US proposal boost direct grants rich nations poor countries bush administration pushing plan increase amount direct grants world bank poorest nations assistance loans nations
Note that at this point we have reduced the original natural-language news story to a
series of word stems. All of the information carried by punctuation, grammar, and style
is gone - all that remains is word order, and we will be doing away with even that by
transforming our text into a word list. It is striking that so much of the meaning of text
passages inheres in the number and choice of content words, and relatively little in the
way they are arranged. This is very counterintuitive, considering how important
grammar and writing style are to human perceptions of writing.
Having stripped, pruned, and stemmed our text, we are left with a flat list of words.
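The report's pipeline uses the Porter stemmer for this last step; the sketch below uses a deliberately simplified suffix-stripping stemmer as a stand-in, just to show how inflected content words collapse to common stems before indexing.

# A toy suffix-stripping stemmer standing in for the Porter stemmer.
SUFFIXES = ["ing", "izes", "ized", "ed", "es", "s"]      # illustrative subset

def naive_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

content_words = ["criticizes", "grants", "expressed", "refused",
                 "nations", "pushing", "loans", "assistance"]

print([naive_stem(word) for word in content_words])
# ['critic', 'grant', 'express', 'refus', 'nation', 'push', 'loan', 'assistance']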