Meta-Search Engine based on Query-Expansion Using Latent Semantic Analysis and Probabilistic Latent Semantic Analysis

DISSERTATION
SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF TECHNOLOGY IN INFORMATION TECHNOLOGY (SOFTWARE ENGINEERING)

Submitted by Anand Arun Atre (MS200504), M.Tech. IT (Software Engineering)
Under the Supervision of Dr. Sudip Sanyal, Associate Professor

INDIAN INSTITUTE OF INFORMATION TECHNOLOGY – ALLAHABAD (DEEMED UNIVERSITY)
DEOGHAT, JHALWA, ALLAHABAD - 211011 (U.P.), INDIA
DECLARATION .......... iv
Abstract .......... v
Table of Contents .......... vi
List of Tables .......... viii
List of Figures .......... ix
Introduction .......... 1
1.1 Overview .......... 1
1.2 Objective .......... 1
1.3 Motivation .......... 2
1.4 Problem Statement .......... 4
1.5 Contribution of Thesis .......... 5
1.6 Structure of Thesis .......... 5
1.7 Summary .......... 6
Literature Survey .......... 7
2.1 Current Trends in Meta-Search Engine .......... 7
2.2 Vector Space Model .......... 8
2.3 Latent Semantic Analysis (LSA) .......... 12
2.3.1 Concept of LSA .......... 12
2.3.2 Limitations of LSA .......... 19
2.3.3 Advantages and Applications of LSA .......... 20
2.4 Probabilistic Latent Semantic Analysis (PLSA) .......... 21
2.4.1 Concept of PLSA .......... 21
2.4.2 PLSA Algorithm .......... 23
2.4.3 Advantages and Applications of PLSA .......... 26
3.1 Basic Theme .......... 28
3.2 Architecture of Proposed MSE .......... 29
3.3 Implementation Details .......... 31
3.4 Features of Proposed System .......... 34
3.5 Summary .......... 35
Result and Analysis .......... 36
4.1 Result-Analysis of LSA .......... 36
4.1.1 Value of 'k' for Optimal Rank Approximation of Term-Document Matrix .......... 36
4.1.2 Comparison of "Tf-IDf Measure" to "Term-Count Measure" .......... 38
4.2 Result Analysis of PLSA .......... 40
4.2.1 Optimal Value for Number of Topics (a) .......... 41
4.2.2 Convergence .......... 43
4.2.3 Number of Iterations for Convergence .......... 46
4.2.4 PLSA Screenshots .......... 46
4.3 Convergence in Number of Unique Links after Some Iterations .......... 48
4.4 Comparison between LSA and PLSA Results .......... 48
4.5 Comparison with Dogpile Search-Engine .......... 49
Improvements from NER (Named-Entity Recognizer) .......... 52
5.1 Introduction .......... 52
5.2 Modified Architecture of Meta-Search Engine .......... 54
5.3 Modified High-level Design .......... 55
5.4 Results of NER .......... 57
5.5 Summary .......... 58
Fig 4.2 Convergence in Term-Topic Matrix computed by Absolute Measure

[Chart: negative maximum difference (log scale) against iteration number, for the Topic-Document matrix]

Fig 4.3 Convergence in Topic-Document Matrix computed by Absolute Measure
4.2.2.2 Average Measure
The average measure is computed by the following formula:

Max_{i,j} = | P_{i,j}^{n+1} - P_{i,j}^{n} | / ( ( | P_{i,j}^{n+1} | + | P_{i,j}^{n} | ) / 2 )

where P_{i,j}^{n} is the value in the i-th row and j-th column of the term-topic matrix or topic-document matrix after the n-th iteration.

The procedure is the same as previously explained, with the average measure used in place of the absolute measure. The following graphs again show convergent behavior under this measure. The experiment is performed for the query keyword "IIT" and for three topics, as in the previous experiment.
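As a concrete illustration, the per-cell computation of this measure can be sketched in Java as follows; the matrices here are hypothetical stand-ins for the term-topic or topic-document matrices after iterations n and n+1:

```java
// Sketch: maximum relative (average-measure) change between two EM iterations.
// pOld and pNew stand for the matrix after iterations n and n+1; the values
// fed to this method in practice come from the PLSA update step.
public class AverageMeasure {
    // Max over all cells of |new - old| / ((|new| + |old|) / 2).
    static double maxAverageMeasure(double[][] pOld, double[][] pNew) {
        double max = 0.0;
        for (int i = 0; i < pOld.length; i++) {
            for (int j = 0; j < pOld[i].length; j++) {
                double denom = (Math.abs(pNew[i][j]) + Math.abs(pOld[i][j])) / 2.0;
                if (denom == 0.0) continue;           // both cells zero: no change
                double rel = Math.abs(pNew[i][j] - pOld[i][j]) / denom;
                if (rel > max) max = rel;
            }
        }
        return max;
    }
}
```

For a single cell changing from 0.2 to 0.4, the measure is 0.2 / 0.3, i.e. the change relative to the average of the two values.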
[Chart: negative maximum difference (log scale) against iteration number, for the Term-Topic matrix]

Fig 4.4 Convergence in Term-Topic Matrix computed by Average Measure
[Chart: negative maximum difference (log scale) against iteration number, for the Topic-Document matrix]

Fig 4.5 Convergence in Topic-Document Matrix computed by Average Measure
4.2.3 Number of Iterations for Convergence
The number of iterations to run is another important issue, and it must be chosen well: too few iterations leave the matrices in a non-converged state, while too many over-tune the values (the probabilities in both matrices). The technique of "early stopping" handles both cases, and the algorithm is implemented so that it takes care of them automatically. At each iteration, the maximum difference between corresponding cells of the old and new matrices is computed; if this difference is small enough (say < 0.001), the iterations are stopped automatically.
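The early-stopping rule above can be sketched as a driver loop. The update step here is a hypothetical stand-in for one EM iteration of PLSA, and the threshold plays the role of the 0.001 bound mentioned in the text:

```java
import java.util.function.UnaryOperator;

// Sketch of early stopping for the EM iterations: stop as soon as the maximum
// absolute change between consecutive matrices falls below a threshold, with a
// hard cap on iterations as a safety net.
public class EarlyStopping {
    static double maxAbsDiff(double[][] a, double[][] b) {
        double max = 0.0;
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[i].length; j++)
                max = Math.max(max, Math.abs(a[i][j] - b[i][j]));
        return max;
    }

    // Returns the number of iterations actually run before the change between
    // consecutive matrices fell below the threshold (or the cap was hit).
    static int runUntilConverged(double[][] p, UnaryOperator<double[][]> step,
                                 double threshold, int maxIterations) {
        for (int iter = 1; iter <= maxIterations; iter++) {
            double[][] next = step.apply(p);
            if (maxAbsDiff(p, next) < threshold) return iter; // early stop
            p = next;
        }
        return maxIterations;
    }
}
```

With a toy update that halves every value, the loop stops as soon as successive matrices differ by less than the threshold rather than running to the cap.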
4.2.4 PLSA Screenshots
The following two screenshots of the user interface show that the results for a query are grouped according to their context. The queries are:
1. Thread
2. India Tourism
Fig 4.6 GUI representing results for query “Thread”
Fig 4.7 GUI representing results for query “India Tourism”
4.3 Convergence in number of unique links after some iterations
An important practical question is how many times a query should be expanded to obtain results that are both refined and sufficient. The following graph clearly shows that after a certain number of query-expansion iterations (around 5-6), the number of unique links (and hence unique documents) converges. The test was performed for the query "Thread", expanded gradually to "Thread Package", "Thread Package lang", "Thread Package lang Java", and so on. A web-page link and its respective sub-page link are each counted once and treated as unique.
[Chart: number of unique web-links against iterations of query expansion]

Fig 4.8 Behavior of number of unique web-links with iterations of Query Expansion
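The saturation behavior shown in the graph can be sketched by accumulating the links seen across expansion steps in a set; the URL lists below are illustrative stand-ins for the links returned at each step:

```java
import java.util.*;

// Sketch: tracking how the pool of unique web-links grows (and saturates) as
// the query is expanded. Once expansion stops surfacing new documents, the
// cumulative count stays flat, which is the convergence seen in Fig 4.8.
public class UniqueLinks {
    static List<Integer> cumulativeUniqueCounts(List<List<String>> linksPerIteration) {
        Set<String> seen = new LinkedHashSet<>();
        List<Integer> counts = new ArrayList<>();
        for (List<String> links : linksPerIteration) {
            seen.addAll(links);          // duplicates are absorbed by the set
            counts.add(seen.size());     // size stops growing once results converge
        }
        return counts;
    }
}
```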
4.4 Comparison between LSA and PLSA results
The results of both LSA and PLSA demonstrated in the previous sections are encouraging. In the case of LSA, the suggested next keywords generally belong to one dominating context, but they point in a very clear direction for search by query expansion. For example, for the query "India Tourism", expansion terms such as "Forts" and "Hill" steer the search towards a focused area of need.

PLSA, on the other hand, suggests next keywords by classifying them into the number of topics requested by the user. This feature makes it much easier to refine results: the user selects a keyword from a specific topic and uses it for query expansion. For the same query "India Tourism", the next keywords are grouped into topics. The first topic covers the sightseeing aspect of tourism and lists famous places to visit; another group contains hotel and restaurant names such as "Hyatt", "Marriott" and "Regency", which represent a different important aspect of tourism in India.

Since PLSA presents results in a more organized way, distributing terms according to their various logical aspects, it is better suited than LSA for query expansion in a meta-search engine. Its firm statistical foundation and its use of EM for convergence are two reasons for PLSA's commendable results.
4.5 Comparison with "Dogpile" Meta-Search Engine
A comparative study checked the top ten results of the meta-search engine "Dogpile" against the results of the proposed MSE after query expansion. For the query "Thread", Dogpile mixes results about two distinct concepts, "Dress" and "Java", whereas the results of the proposed MSE for the expanded queries "Thread Package Java" and "Thread Dress" are entirely confined to their respective concepts. The long web-links (URLs) in the results show that the proposed MSE reaches the sub-pages of web-sites, which further confirms that its results are more focused.
Fig 4.9 Top ten results of Meta-Search Engine “Dogpile” for Query
“Thread”
Fig 4.10 Top ten results of proposed MSE for Expanded Query “Thread
Package Java”
50
Fig 4.11 Top ten results of proposed MSE for Expanded Query “Thread
Dress”
Summary
In this chapter we reviewed the results obtained with the proposed meta-search engine and compared them with those of "Dogpile". Various experiments confirm that LSA and PLSA can indeed provide effective query expansion, and PLSA appears to outperform LSA. However, there are some shortcomings in the present version; we discuss them, and how they can be overcome, in the next chapter.
Chapter 5
Improvements from NER (Named-Entity Recognizer)
5.1 Introduction
This chapter presents a significant improvement to the results of the MSE, obtained through the use of a "Named Entity Recognizer". The chapter introduces the Named-Entity Recognizer, its role, and its influence in the context of the MSE. It then describes the changes to the previous architecture needed to incorporate this extra module, and finally presents results with illustrative examples.
The previous chapter presented the results of LSA and PLSA with clarifying examples. Those results are encouraging, but they fall short in one respect. For example, Table 4.1 contains terms like "Corbett", "Golden", "Royal" and "North". These terms are relevant to the search, yet they do not convey the real meaning, because each is only part of some collection of words; the words reveal their real meaning only when grouped together. Similarly, in Table 4.5, which shows the PLSA results for the query "Australian University", the terms "Australian", "International" and "University" appear independently although they are actually parts of the single entity "Australian International University". The same happens with the terms of "Jammu Kashmir" and "Taj Mahal". As a result, the query has to be expanded for a few more iterations to obtain refined results, and inferences drawn from the classified keywords may be erroneous.
This is not actually a problem with the information-retrieval techniques used in the proposed MSE. The reason for such partially correct results is the search engines' literal matching mechanism: if we fire a query "A B", search engines retrieve all pages that contain A, B, or both. Since our MSE is built on these search engines and parses each term individually, this type of error in the results is unavoidable.
The essence of the problem is that we treat each word as a separate term. This is incorrect in several cases where a group of words should be treated as a single term because it represents a single semantic unit. The problem is complicated by the fact that the individual words may also have acceptable semantics of their own. For example, in the single entity "Banaras Hindu University", each word has a distinct meaning of its own, quite different from the meaning the three words carry when taken together as a single unit. This is a well-known problem of natural language processing called "Named Entity Recognition". A named entity may be the name of a person, organization, institute or place, or any proper noun, such as Mr. Albert Einstein, Carnegie Mellon University or President of India. In short, a named entity is a small group of words that, taken together, denotes a single referent.
A “Named Entity Recognizer (NER)” is a system which can recognize all the
named entities in a given passage. Thus, we realize that if an NER is introduced in our
system then we will be able to find all the named entities in our corpus and thus treat
them as single terms. Some NER packages are already freely available for English and other languages, so rather than implementing our own NER module, it was felt better to use an available one. Since the design of our MSE supports easy extensibility, the new module can be incorporated without difficulty.
We expected two positive outcomes from adding NER. First, the number of terms in the term-document matrix is reduced, which increases the responsiveness of the system, particularly for PLSA: PLSA converges by an iterative procedure, and the complexity of each iteration depends on the number of terms. Second, the next probable query terms become more varied, because some of the most closely related terms are already grouped within their respective named entities. In effect, the whole procedure adds a notion of meaning to the next keywords.
Empirical results show that, on average, the number of terms was 3096 without NER and 3079 after adding the NER module, reflecting the presence of 17 named entities. With NER in place, these entities are treated as single terms and provide more meaningful next probable keywords for expansion.
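The vocabulary reduction described above comes from collapsing each multi-word entity into one term. A minimal sketch of that merging step follows; the entity list and token stream are illustrative, since in the real system they come from the NER output:

```java
import java.util.*;

// Sketch: treating each recognized named entity as a single term shrinks the
// vocabulary, because the entity's constituent words no longer appear as
// separate terms in the term-document matrix.
public class EntityTerms {
    static Set<String> vocabulary(List<String> tokens, Set<String> entities) {
        Set<String> vocab = new LinkedHashSet<>();
        int i = 0;
        while (i < tokens.size()) {
            boolean matched = false;
            // Greedily try to match a multi-word entity starting at position i.
            for (String entity : entities) {
                String[] words = entity.split(" ");
                if (i + words.length <= tokens.size()
                        && String.join(" ", tokens.subList(i, i + words.length)).equals(entity)) {
                    vocab.add(entity);   // one term for the whole entity
                    i += words.length;
                    matched = true;
                    break;
                }
            }
            if (!matched) vocab.add(tokens.get(i++));
        }
        return vocab;
    }
}
```

For the tokens "visit Taj Mahal in India" with "Taj Mahal" recognized as an entity, the vocabulary has four terms instead of five.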
5.2 Modified Architecture of Meta-Search Engine
In Figure 5.1 below we show the modified MSE. As can be seen, since our
architecture was highly modular, we could easily plug in the NER module.
[Architecture diagram, components: User Interface; Common Interface to Search-Engines; Search-Engines (Search Engine 1 to Search Engine n); Baseline Establishment (Naïve Algorithm); Page Retriever; Pre-Processing Unit; NER; Algorithms (LSA, PLSA); data flows: Query, Web-Pages, Processed Text-Corpora, Next Keywords and Ranked Links (URLs)]

Fig. 5.1 Architecture of Modified Meta-Search Engine
From the User Interface through the Page Retriever, every component is the same as presented in the previous chapter. The named entity recognition task has to be performed after the text has been extracted from the retrieved pages and before the pre-processing unit. Named entities must be recognized in all the text files and stored in the desired term-document format with a suitable term-weight factor. Moreover, the identified named entities must not be passed through the stop-word removal and stemming phases: in the named entity "Indian Institute of Information Technology", for example, removing "of" as a stop word would distort the whole entity. The NER is therefore placed just before the pre-processing unit. The remaining terms are stored in the term-document matrix, which then serves as input to the rest of the system, where LSA or PLSA is executed. As mentioned previously, these algorithms yield the next keywords for a given query.
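The requirement that named entities bypass stop-word removal can be sketched as follows; the stop-word set and entity set here are small illustrative samples:

```java
import java.util.*;

// Sketch: stop-word removal must not run inside protected named entities,
// otherwise "Indian Institute of Information Technology" would lose its "of"
// and the entity would be distorted.
public class ProtectedStopwords {
    static List<String> removeStopwords(List<String> terms, Set<String> stopwords,
                                        Set<String> namedEntities) {
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            // A named entity passes through untouched, even if it contains stop words.
            if (namedEntities.contains(term) || !stopwords.contains(term.toLowerCase())) {
                kept.add(term);
            }
        }
        return kept;
    }
}
```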
5.3 Modified High-level Design
[Package diagram: MetaSearchEngine; LSA/PLSA Algorithm; Reading and Stemming; Parser; SearchEngineInterface; GUI; BaseLine; NER]

Fig. 5.2 High-Level Design with NER (Package Diagram)
The package diagram in Fig. 5.2 shows the position of the NER package and its interactions with the remaining packages. The advantage of modular design is evident here: the NER module was added to the previous design with only a few changes, and the new idea works well.
The NER package uses the Named Entity Recognizer library developed by the Natural Language Processing Group at Stanford University. The library is freely available under the GNU license. It provides an implementation of a Conditional Random Field (CRF) sequence model coupled with a feature extractor for NER, and it recognizes three types of named entities: person, location and organization. The library also ships with other models and versions, with and without additional similarity features; these features improve performance but require a considerable amount of memory, so for the proposed MSE the classifier with the smallest memory requirement is used. The library is implemented in Java and is distributed as a .jar file called stanford-ner.jar [41]. The following example shows text data after named entity recognition.
Fig. 5.3 A text file before and after Named-Entity Recognition
From Fig 5.3 it is evident that named entities such as "Indian Institute of Information Technology", "Allahabad" and "Dr. M. D. Tiwari" are properly identified and enclosed in their respective tags.
Within the NER package, the class NamedEntity.java converts all the text files into the named-entity format. NEExtraction.java extracts the named entities and stores them in the term-document matrix, while the non-named-entity terms are written back to their files for further processing of the text. For proper and efficient use of NER, a small modification was made to the Jericho HTML Parser.
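The inline-XML output shown in Fig 5.3 can be reduced to single-term entities with a simple extraction step. The following sketch assumes tags of the form `<ORGANIZATION>…</ORGANIZATION>`, matching the three entity types the classifier recognizes; the input string is illustrative:

```java
import java.util.*;
import java.util.regex.*;

// Sketch: pulling named entities out of NER output in inline-XML form, e.g.
// "<ORGANIZATION>Indian Institute of Information Technology</ORGANIZATION>".
// Each extracted entity then becomes a single term in the term-document matrix.
public class EntityExtraction {
    static List<String> extractEntities(String tagged) {
        List<String> entities = new ArrayList<>();
        Matcher m = Pattern
            .compile("<(PERSON|LOCATION|ORGANIZATION)>(.*?)</\\1>")
            .matcher(tagged);
        while (m.find()) entities.add(m.group(2)); // entity text becomes one term
        return entities;
    }
}
```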
5.4 Results of NER
We performed the same experiments as in the previous chapter, this time using NER, and the results met our expectations. The user interface below displays one result for the query "India Tourism" after applying the Named-Entity Recognizer. The result contains named entities such as "Golden City", "Indian Wildlife", "Corbett National Park Tour", "Taj & Wildlife Tour", "Discover North India" and "Discover Forts and Palaces". It is instructive to compare these results with those obtained for the same query without NER in the previous chapter: there, it becomes quite clear that the term "Corbett" relates to the national park, "North" occurs in the context of "Discover North India", and the forts and palaces belong to "Discover Forts and Palaces".
These named entities can now be used for query expansion, yielding refined results within the next one or two iterations; other closely related keywords from different contexts remain available as well. However, it should be emphasized that
the improvement in the results would be less dramatic if the results of the original
query did not contain a significant number of named entities or if the named entities
were not very relevant to the original query.
Fig. 5.4 GUI after applying NER for query “India Tourism”
5.5 Summary
In this chapter we explored thoroughly the effect of introducing a new module
“Named Entity Recognizer”. We examined its importance, consequences and results.
From the results it is quite evident that the new module provides significant
improvements, particularly for those queries where the named entities are likely to
have a high relevance. This chapter also demonstrated the strength of the design of
our software because we could add the new module quite easily into the existing
system.
Chapter 6
Conclusion and Future Enhancements
6.1 Conclusion
In the present scenario, search engines are indispensable tools for extracting needed information from the Internet. Meta-search engines serve the same purpose with a wider span of coverage and advanced features such as maintaining user profiles and filtering results. The proposed MSE refines results through query expansion, with the next keywords suggested by the MSE itself, without using any thesaurus or dictionary. We can conclude that both algorithms, LSA and PLSA, work well for suggesting next keywords in the MSE.

The results and analysis demonstrate that PLSA outperforms LSA and presents all results in a well-classified, easily understandable format. Incorporating a Named Entity Recognizer into the MSE improves the results further. In conclusion, designing an MSE that uses LSA/PLSA for query expansion is a sound and fruitful approach.
6.2 Future Enhancements
The following points outline recommended future enhancements to the proposed MSE:
• The current meta-search engine uses only the results of Google, Yahoo and MSN. Other search engines, such as AltaVista and Ask Jeeves, can be added to the proposed MSE; this would increase its coverage span and could provide even more acceptable results. The design of the proposed MSE supports such easy modification.
• The APIs used to implement the MSE provide only a limited number of results from the respective search engines. If this limit could be raised, the results would improve.
• Parsers for additional file formats could be added, which would give this MSE an admirable feature.
• Web pages sometimes contain advertisements and images to a large extent. These are of no use to the algorithms or for query expansion, and a good provision can be made in the MSE to deal with such content effectively.
• Maintaining information about user profiles could be an extended feature of the proposed MSE. If (user, url) information were kept for each user, then the next time that user queries, the results could be filtered or categorized before being displayed on the user interface.
• Pronouns in context reduce the weight of the nouns they refer to. Consider the following passage:
"The Taj Mahal is one of the most famous historical monuments of India. It is one among the seven wonders of the world. It was built by Shahjahan."
"It" in sentences 2 and 3 refers to "Taj Mahal", so each occurrence should be counted towards "Taj Mahal", giving it a frequency count of three. In the present technique the count is only one, so the resultant weight of "Taj Mahal" is lower than it should be. This is the well-known NLP problem of anaphora resolution. To solve it, a module for anaphora resolution must be added; such a module fits easily into the design of the MSE, and even better results can be expected.
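The counting effect described in this point can be sketched with a naive string substitution. This is only an illustration of the frequency gap, not a real anaphora resolver, which would need an NLP module of its own:

```java
// Sketch of the frequency effect: replacing each anaphoric "It" with its
// antecedent raises the term count of "Taj Mahal" from 1 to 3 in the example
// passage. The substitution below blindly assumes every "It " refers back to
// the same antecedent, which real anaphora resolution must actually determine.
public class AnaphoraEffect {
    static int countOccurrences(String text, String term) {
        int count = 0, idx = 0;
        while ((idx = text.indexOf(term, idx)) != -1) { count++; idx += term.length(); }
        return count;
    }

    static String naiveResolve(String text, String antecedent) {
        return text.replace("It ", antecedent + " "); // naive: every "It " refers back
    }
}
```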
Appendix-A:
Search Engines’ API
A.1 Google SOAP Search API (beta)
Software developers can write their own programs that query a large number of web pages using the Google SOAP Search API. To provide this facility, Google uses the Simple Object Access Protocol (SOAP) and the Web Service Description Language (WSDL). The API's availability across languages and platforms such as Perl, Java and Visual Studio .NET lets developers choose their favourite environment [33].

Good example code and complete documentation ship with the developer's kit. A license key and a Google account are needed to access the API's services; with these, one is entitled to fire 1000 queries per day, and at most 10 links can be retrieved per query [33]. Google has since replaced this API with its newer "Google AJAX API". Sometimes a proxy does not allow SOAP requests to pass; to eliminate this problem, the user must have privileges so that his or her request can be passed through the proxy. Some essential classes of this API, and their methods used in the proposed MSE, are given in Table A.1 below:
Table A.1 Classes and Methods of Google SOAP Search API

Class GoogleSearch:
  public GoogleSearch()
    Construct a new instance of a GoogleSearch client.
  public void setKey(String key)
    Set the user key used for authorization by the Google SOAP server; a mandatory attribute for all requests.
  public void setQueryString(String q)
    Set the query string for this search.
  public byte[] doGetCachedPage(String url) throws GoogleSearchFault
    Retrieve a cached web page from Google. The key attribute must be set.
  public String doSpellingSuggestion(String phrase) throws GoogleSearchFault
    Ask Google to return a spelling suggestion for a word or phrase.
  public GoogleSearchResult doSearch() throws GoogleSearchFault
    Invoke the Google search. Note: the key and query attributes must already be set.

Class GoogleSearchResult:
  public GoogleSearchResult()
    Constructor.
  public String toString()
    Returns a nicely formatted representation of a Google search result.
A.2 Yahoo Search Web Service API
The Yahoo! Developer Network provides various web services that application developers can use to build new and customized applications. These services are based on REST (Representational State Transfer): Yahoo! Web Services operations are HTTP requests with URL-encoded parameters. The libraries and example code for accessing the Yahoo! Search Web Services are bundled as a Software Development Kit (SDK), which can be easily downloaded from their website [34]. The SDK includes code in Java, Lua, JavaScript, Perl and other languages, so the developer can easily choose a language and platform of his or her choice. Developers must register for an application ID, which is tied to the application, and this application ID must be associated with each Web Services request.
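Since the service is REST-based, a request is just an HTTP GET with URL-encoded parameters. The sketch below builds such a request URL; the endpoint shown is the historical V1 address and may differ, and "myAppId" is a placeholder for a registered application ID:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch: a Yahoo! Web Search request is a plain HTTP GET whose parameters
// (application ID, query, result count) are URL-encoded into the address.
public class YahooRequest {
    static String buildSearchUrl(String appId, String query, int results) {
        return "http://search.yahooapis.com/WebSearchService/V1/webSearch"
            + "?appid=" + URLEncoder.encode(appId, StandardCharsets.UTF_8)
            + "&query=" + URLEncoder.encode(query, StandardCharsets.UTF_8)
            + "&results=" + results;
    }
}
```

Fetching this URL with any HTTP client would then return the search results for parsing.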
Table A.2 Classes and Methods of Yahoo Search Web Service API

Class SearchClient:
  public SearchClient(String appId)
    Constructs a new SearchClient with the given application ID using the default settings.
  public WebSearchResults webSearch(WebSearchRequest request) throws IOException, SearchException
    Searches the Yahoo database for web results.

Class WebSearchRequest:
  public WebSearchRequest(String query)
    Constructs a new web search request.
  public void setResults(int results)
    The maximum number of results to return. May return fewer results if there aren't enough results in the database. At the time of writing, the default value is 10 and the maximum value is 50.

Interface WebSearchResult:
  String getTitle()
    The title of the web page.
  String getUrl()
    The URL for the web page.

Interface WebSearchResults:
  BigInteger getTotalResultsAvailable()
    The number of query matches in the database.
  BigInteger getTotalResultsReturned()
    The number of query matches returned. This may be lower than the number of results requested if there were fewer matches available.
A.3 MSN Search SDK (beta)
The MSN Search SDK beta lets a user program send queries to MSN Live Search and receive results. The documentation shipped with the SDK explains the essential concepts, guidelines and library for the MSN Search Web Service, and the SDK contains example code illustrating techniques for application development.

The SDK requires one of the Windows platforms (Windows 2000, Server 2003, XP or Vista) on a computer capable of sending requests via SOAP 1.1 and HTTP 1.1 and of parsing XML. Microsoft Visual Studio .NET 2003 or 2005 and the Microsoft .NET Framework must be installed on the deployment computer to build and run the applications. An application ID must accompany every request. For a given query, the top 10 MSN results can be received in the user program [35].
Appendix-B:
Parser’s API
B.1 Jericho HTML Parser
Jericho HTML Parser is a powerful Java library that analyses and manipulates parts of an HTML document [36]. It also contains functions that can manipulate high-level HTML forms. Since it is available as an open-source library, it can be used even in commercial applications. The library has the following major features that set it apart from other HTML parsers:
• It is not a tree-based parser; it works entirely by simple text search and efficient recognition of tags.
• Its memory and resource requirements are far lower than those of DOM-based parsers.
• Each parsed segment can be easily accessed, and modifications to selected segments can be performed efficiently.
• It provides an easy way to define and register custom tags so that the parser can recognize them.
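The text-search approach in the first bullet can be illustrated with a toy scanner that keeps only the text between tags. This is a sketch of the idea, not the library's actual implementation:

```java
// Sketch of text-search parsing: rather than building a DOM tree, scan for
// '<' and '>' characters and collect only the text that lies outside tags.
public class TagSearch {
    static String extractText(String html) {
        StringBuilder text = new StringBuilder();
        boolean insideTag = false;
        for (char c : html.toCharArray()) {
            if (c == '<') insideTag = true;        // tag starts: stop collecting
            else if (c == '>') insideTag = false;  // tag ends: resume collecting
            else if (!insideTag) text.append(c);
        }
        return text.toString().trim();
    }
}
```

A single pass over the characters suffices, which is why such parsers need far less memory than a DOM-based one.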
B.2 PDFBox 0.7.0
PDFBox creates new PDF documents, manipulates them and extracts content from them. It is available as open source and also comprises several utilities [37]. Some essential features of PDFBox are:
• Text extraction from PDF
• Merging of PDF documents
• Encryption/decryption of PDF documents
• Integration with the Lucene Search Engine
• Creation of a PDF from a text file
• Creation of images from PDF pages
For the proposed MSE, only the PDF-to-text extraction feature is used.
Appendix-C:
List of Stop Words
The following stop-word list is available on the LSI web site of the Computer Science Department of the University of Tennessee [42].
Table C.1 List of stop words.

a, a's, able, about, above, according, accordingly, across, actually, after, afterwards, again, against, ain't, all, allow, allows, almost, alone, along, already, also, although, always, am, among, amongst, an, and, another, any, anybody, anyhow, anyone, anything, anyway, anyways, anywhere, apart, appear, appreciate, appropriate, are, aren't, around, as, aside, ask, asking, associated, at, available, away, awfully, be, became, because, become, becomes, becoming, been, before, beforehand, behind, being, believe, below, beside, besides, best, better, between, beyond, both, brief, but, by, c'mon, c's, came, can, can't, cannot, cant, cause, causes, certain, certainly, changes, clearly, co, com, come, comes, concerning, consequently, consider, considering, contain, containing, contains, corresponding, could, couldn't, course, currently, definitely, described, despite, did, didn't, different, do, does, doesn't, doing, don't, done, down, downwards, during, each, edu, eg, eight, either, else, elsewhere, enough, entirely, especially, et, etc, even, ever, every, everybody, everyone, everything, everywhere, ex, exactly, example, except, far, few, fifth, first, five, followed, following, follows, for, former, formerly, forth, four, from, further, furthermore, get, gets, getting, given, gives, go, goes, going, gone, got, gotten, greetings, had, hadn't, happens, hardly, has, hasn't, have, haven't, having, he, he's, hello, help, hence, her, here, here's, hereafter, hereby, herein, hereupon, hers, herself, hi, him, himself, his, hither, hopefully, how, howbeit, however, i'd, i'll, i'm, i've, ie, if, ignored, immediate, in, inasmuch, inc, indeed, indicate, indicated, indicates, inner, insofar, instead, into, inward, is, isn't, it, it'd, it'll, it's, its, itself, just, keep, keeps, kept, know, knows, known, last, lately, later, latter, latterly, least, less, lest, let, let's, like, liked, likely, little, look, looking, looks, ltd, mainly, many, may, maybe, me, mean, meanwhile, merely, might, more, moreover, most, mostly, much, must, my, myself, name, namely, nd, near, nearly, necessary, need, needs, neither, never, nevertheless, new, next, nine, no, nobody, non, none, noone, nor, normally, not, nothing, novel, now, nowhere, obviously, of, off, often, oh, ok, okay, old, on, once, one, ones, only, onto, or, other, others, otherwise, ought, our, ours, ourselves, out, outside, over, overall, own, particular, particularly, per, perhaps, placed, please, plus, possible, presumably, probably, provides, que, quite, qv, rather, rd, re, really, reasonably, regarding, regardless, regards, relatively, respectively, right, said, same, saw, say, saying, says, second, secondly, see, seeing, seem, seemed, seeming, seems, seen, self, selves, sensible, sent, serious, seriously, seven, several, shall, she, should, shouldn't, since, six, so, some, somebody, somehow, someone, something, sometime, sometimes, somewhat, somewhere, soon, sorry, specified, specify, specifying, still, sub, such, sup, sure, t's, take, taken, tell, tends, th, than, thank, thanks, thanx, that, that's, thats, the, their, theirs, them, themselves, then, thence, there, there's, thereafter, thereby, therefore, therein, theres, they, they'd, they'll, they're, they've, think, third, this, thorough, thoroughly, those, though, three, through, throughout, thru, thus, to, together, too, took, toward, towards, tried, tries, truly, try, trying, twice, un, under, unfortunately, unless, unlikely, until, unto, up, upon, us, use, used, useful, uses, using, usually, uucp, value, various, very, via, viz, vs, want, wants, was, we, we'd, we'll, we're, we've, welcome, well, went, were, weren't, what, what's, whatever, when, whence, whenever, where, where's, whereafter, whereas, whereby, wherein, whereupon, wherever, whether, which, while, whither, who, whole, whom, whose, why, will, willing, wish, with, within, without, won't, wonder, would, wouldn't, yes, yet, you, you'd, you'll, you're, you've, your, yours, yourself, yourselves