8/3/2019 Final Report - Advanced Search Engine
Final Report
Advanced Keyword Search Engine
By
Nikhil Pratap
Sathishkumar Poornachandran
CSE 511
5/5/2011
I. ABSTRACT
We present an efficient advanced keyword search engine over an XML dataset, with results ranked
by the total number of reviews, the review dates and the rating of each product. The advanced
search supports the Boolean operators AND, OR & NOT, which let the user narrow the result set
to exactly the products wanted. We have also implemented wildcard search and phrase search to
enhance the user's searching experience.
II. INTRODUCTION & MOTIVATION
The existing problem with Amazon is that the user can query the search engine only by certain
predefined categories such as Electronics, Computers, Books and Cosmetics. There are no options
for users to run real-time queries based on the attributes of the products. In this paper, we
propose an advanced search query engine that allows users to enter more inputs, which are then
used to filter the search results by relevancy. Our advanced search also supports the operators
AND, OR and NOT, which Amazon treats as normal text strings. For example, the search query
Electronics = (computers not laptop) still fetches laptop information on Amazon, because the
word "not" is treated as just another string. This paper also introduces phrase search,
wildcard search and an intuitive ranking algorithm based on the number of reviews, the review
dates and the rating.
III. ARCHITECTURE
The main architecture of the system can be divided into two categories.
a) Offline Computation.
b) Online Computation.
a) Offline Computation
In the offline computation, we extract the data from the XML dataset with SAX and DOM parsers
and store it in the index file as an inverted index. For our scenario, we store the values as
fields such as title, brand, reviews and model. The illustration of the process is
given below.
b) Online Computation
In the online computation part, the user sends a query through the web interface and the
request is handled by a servlet. The servlet processes the request and passes the user query
to the query parser for evaluation. The query parser validates the query, for example by
removing unwanted words, parentheses and quotes, and sends it back to the servlet. The
servlet then sends the processed query to the indexer, where the actual search takes place.
The result sets are returned to the user via the same servlet. The
pictorial representation of the whole process is given below.
Introducing Inverted Index:
The indexer can retrieve search results efficiently because it searches an index instead of
searching the documents directly. This is the equivalent of finding the pages in a book that
relate to a keyword by looking in the index at the back of the book, as opposed to scanning
the words on every page.
This type of index is called an inverted index because it inverts a page-centric data structure
(page -> keywords) into a keyword-centric data structure (word -> pages).
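The word -> pages mapping can be illustrated with a minimal sketch. It is not the index used by the system; the class name InvertedIndexSketch, its methods and the sample product titles are hypothetical, and splitting on non-alphanumeric characters is a simplifying assumption.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index: maps each lower-cased word to the ids of the documents containing it. */
final class InvertedIndexSketch {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Indexes one document; tokenising on non-alphanumerics is a simplification. */
    void add(int docId, String text) {
        for (String word : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, w -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of the documents containing the word (empty set if none). */
    Set<Integer> lookup(String word) {
        return postings.getOrDefault(word.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Sony 12 Megapixel Camera");
        index.add(2, "Canon Digital SLR Camera");
        index.add(3, "Sony Laptop");
        System.out.println(index.lookup("camera")); // [1, 2]
        System.out.println(index.lookup("sony"));   // [1, 3]
    }
}
```

A lookup touches only the posting set of the queried word, which is why the cost does not grow with the total text size the way a scan over every document would.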
IV. FUNCTIONALITIES
The functionalities included in this project are as follows.
a) Boolean Search.
b) Ranking Result sets.
c) Phrase Search.
d) Wildcard Search.
a) BOOLEAN SEARCH
Boolean search allows users to combine search results using the operators AND, OR & NOT.
A clause is added to a BooleanQuery with the following method.
public void add(Query query, BooleanClause.Occur occur)
Here occur can be BooleanClause.Occur.MUST, BooleanClause.Occur.SHOULD or
BooleanClause.Occur.MUST_NOT. In the examples, booleanQuery is a BooleanQuery instance.
BooleanClause.Occur.MUST:
Every clause added with MUST has to match, so it is used to AND two or more sub-queries.
For example, (query1 AND query2 AND query3):
booleanQuery.add(query1, BooleanClause.Occur.MUST);
booleanQuery.add(query2, BooleanClause.Occur.MUST);
booleanQuery.add(query3, BooleanClause.Occur.MUST);
BooleanClause.Occur.SHOULD:
When only SHOULD clauses are present, at least one of them has to match, so it is used to OR
two or more sub-queries.
For example, (query1 OR query2 OR query3):
booleanQuery.add(query1, BooleanClause.Occur.SHOULD);
booleanQuery.add(query2, BooleanClause.Occur.SHOULD);
booleanQuery.add(query3, BooleanClause.Occur.SHOULD);
BooleanClause.Occur.MUST_NOT:
A clause added with MUST_NOT excludes the documents it matches, so it is used to express NOT.
It only removes documents, so it is combined with at least one MUST or SHOULD clause.
For example, (query1 NOT query2):
booleanQuery.add(query1, BooleanClause.Occur.MUST);
booleanQuery.add(query2, BooleanClause.Occur.MUST_NOT);
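The effect of the three operators on result sets can be sketched with plain set operations on document ids. This is only an illustration of the semantics, not how the search library evaluates clauses internally; the class BooleanOpsSketch, its method names and the sample ids are hypothetical.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

/** Boolean operators expressed as set operations on document-id sets. */
final class BooleanOpsSketch {

    /** AND: documents present in both result sets (intersection). */
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> out = new TreeSet<>(a);
        out.retainAll(b);
        return out;
    }

    /** OR: documents present in either result set (union). */
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> out = new TreeSet<>(a);
        out.addAll(b);
        return out;
    }

    /** NOT: documents in a that are absent from b (difference). */
    static Set<Integer> not(Set<Integer> a, Set<Integer> b) {
        Set<Integer> out = new TreeSet<>(a);
        out.removeAll(b);
        return out;
    }

    public static void main(String[] args) {
        Set<Integer> computers = new TreeSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> laptop = new TreeSet<>(Arrays.asList(2, 3, 4));
        System.out.println(and(computers, laptop)); // [2, 3]
        System.out.println(or(computers, laptop));  // [1, 2, 3, 4]
        System.out.println(not(computers, laptop)); // [1]
    }
}
```

The last call mirrors the (computers not laptop) query from the introduction: only documents that match the first term and not the second remain.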
b) Ranking Result sets:
We have implemented an algorithm that ranks results based on the number of reviews, the
review dates and the rating of each product. The motivation is that some existing algorithms
depend entirely on the average rating of the products. They do not take the number of reviews
or the review dates into account, which brings old products with very few reviews and high
ratings to the top. We will look at some ambiguous scenarios that affect the quality of
ranking.
Scenario 1:
Two products are being compared in the above picture. The first product has only one review,
with a rating of 5, so its average rating is 5. The second product has 4 reviews with an
average rating of 4.5. Ranking by average rating alone would be misleading here: product 2 is
the obvious selection for the user because its rating is backed by more reviews and is
therefore more reliable.
Scenario 2:
In scenario 2, the review date for product one changes. The product is relatively new to the
market and has one review with a rating of 5. The second product remains the same. In this
case, the user may prefer product one, because it is new to the market and there is every
chance that it will get good ratings in the future. If it does not, the algorithm will bring
it down, because it also considers the average days per single review.
Algorithm Steps:
1) Calculate the maximum number of reviews and the minimum time difference among all
the products.
2) Iterate over the products one at a time. For each product:
Calculate the average rating and the average days per single review.
tempWeightedAverage = (Number of reviews / Maximum number of reviews) * Average rating
WeightedAverage = (Minimum time difference / Average days per review) * tempWeightedAverage
Add the product with its corresponding WeightedAverage into the HashMap.
3) Terminate after finding the weighted average of all the products.
How to calculate maximum number of reviews? - For example, consider a dataset of 1000
products. If the 100th product has 50 reviews, which is the maximum among all the products,
then the maximum number of reviews is 50.
How to calculate minimum time difference? - It is the minimum difference, in days, between
the current date and the product launch date, taken over all the products.
The generated HashMap is then dumped into a TreeMap and sorted with respect to the
WeightedAverage.
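The two formulas can be traced with a small sketch. It makes several assumptions that the report does not state: "average days per review" is taken as days since launch divided by the number of reviews, products without reviews score 0, and a sorted list replaces the HashMap and TreeMap (a TreeMap keyed by score would drop products with equal scores). The class RankingSketch, the Product record and the day counts are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Sketch of the weighted-average ranking; all names here are hypothetical. */
final class RankingSketch {

    /** Minimal product record; daysSinceLaunch is current date minus launch date, in days. */
    static final class Product {
        final String name;
        final int reviews;
        final double avgRating;    // 0 to 5
        final int daysSinceLaunch; // clamped to at least 1 to avoid division by zero
        double score;              // WeightedAverage, filled in by rank()

        Product(String name, int reviews, double avgRating, int daysSinceLaunch) {
            this.name = name;
            this.reviews = reviews;
            this.avgRating = avgRating;
            this.daysSinceLaunch = Math.max(1, daysSinceLaunch);
        }
    }

    static List<Product> rank(List<Product> products) {
        // Step 1: dataset-wide maximum review count and minimum time difference.
        int maxReviews = 0;
        int minDays = Integer.MAX_VALUE;
        for (Product p : products) {
            maxReviews = Math.max(maxReviews, p.reviews);
            minDays = Math.min(minDays, p.daysSinceLaunch);
        }
        // Step 2: score each product with the two formulas.
        for (Product p : products) {
            if (p.reviews == 0) {
                p.score = 0.0; // assumption: unreviewed products score 0
                continue;
            }
            double avgDaysPerReview = (double) p.daysSinceLaunch / p.reviews; // assumed definition
            double tempWeightedAverage = ((double) p.reviews / maxReviews) * p.avgRating;
            p.score = (minDays / avgDaysPerReview) * tempWeightedAverage;
        }
        // Step 3: highest WeightedAverage first.
        List<Product> sorted = new ArrayList<>(products);
        sorted.sort(Comparator.comparingDouble((Product p) -> p.score).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        // Scenario 1 with assumed ages: both products launched 400 days ago.
        List<Product> ranked = rank(Arrays.asList(
                new Product("one review, rating 5", 1, 5.0, 400),
                new Product("four reviews, rating 4.5", 4, 4.5, 400)));
        for (Product p : ranked) {
            System.out.println(p.name + " -> " + p.score);
        }
    }
}
```

With these assumed ages, the four-review product scores (400 / 100) * 4.5 = 18 and the one-review product scores (400 / 400) * 1.25 = 1.25, so the well-reviewed product ranks first as in Scenario 1. If the one-review product is instead only 10 days old, the minimum time difference drops to 10 and the order flips, as in Scenario 2.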
c) Phrase Search
A phrase query matches documents containing a particular sequence of terms. A PhraseQuery is
built by QueryParser for input like title:"Sony 12 Megapixel".
For queries enclosed in double quotation marks, the indexer looks for the exact match in the
inverted index.
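The difference between a phrase match and a plain keyword match can be shown with a minimal sketch. It scans a tokenised document directly rather than using term positions stored in an index, so it only illustrates the matching rule; the class PhraseMatchSketch and the sample titles are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Phrase matching: the phrase terms must appear adjacent and in the same order. */
final class PhraseMatchSketch {

    /** Lower-cases and splits on whitespace; a simplifying assumption about tokenisation. */
    static List<String> tokens(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("\\s+"));
    }

    /** True if the phrase tokens occur as a contiguous run inside the document tokens. */
    static boolean containsPhrase(String document, String phrase) {
        return Collections.indexOfSubList(tokens(document), tokens(phrase)) >= 0;
    }

    public static void main(String[] args) {
        // Terms adjacent and in order: matches.
        System.out.println(containsPhrase("Sony 12 Megapixel Digital Camera", "Sony 12 Megapixel")); // true
        // Same terms, but not adjacent: no match.
        System.out.println(containsPhrase("Sony Digital 12 Megapixel Camera", "Sony 12 Megapixel")); // false
    }
}
```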
V. RESULTS
User Interface for the Advanced Search Engine
Notes:
i) User can search for multiple products in the same query.
For example, Query: (Brand: Sony Camera) or (Brand: Sony Computer)
ii) User can build complex Boolean query with the combination of AND, OR, NOT.
iii) User can select a maximum of 1000 results per page.
iv) User can select ranking if needed.
Result Set page
Query: (Brand: Sony Camera) or (Brand: Sony Computer)
Ranked Results
Query: (title: canon)
Notes:
i) Here, the ranking is based on the number of reviews, the review dates and the rating of
each product.
ii) If the product has no reviews or ratings, it is given a rating of 0.
iii) The rating is given on a scale of 0 to 5.
Wild Card Search
Query: (title: can*n) AND (Feature: S?R)
Notes:
i) * and ? can be used as wildcard characters.
ii) Multiple wildcards can be used in a term to match query strings.
iii) The wildcards can be used anywhere in the string except at the beginning of the
string.
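The wildcard rules in these notes can be sketched by translating a wildcard term into a regular expression, with * matching zero or more characters and ? matching exactly one. This is an illustration only, not how the indexer evaluates wildcard terms; the class WildcardSketch and its methods are hypothetical, and case-insensitive matching is an assumption.

```java
import java.util.regex.Pattern;

/** Wildcard terms translated to regular expressions: * = any run of characters, ? = one character. */
final class WildcardSketch {

    /** Builds a pattern for the term; a leading wildcard is rejected, as in note iii. */
    static Pattern compile(String wildcard) {
        if (wildcard.isEmpty() || wildcard.charAt(0) == '*' || wildcard.charAt(0) == '?') {
            throw new IllegalArgumentException("term must not be empty or start with * or ?");
        }
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c))); // literal character
            }
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    /** True if the whole term matches the wildcard expression. */
    static boolean matches(String wildcard, String term) {
        return compile(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("can*n", "canon")); // true
        System.out.println(matches("S?R", "SLR"));     // true
        System.out.println(matches("S?R", "SR"));      // false: ? needs exactly one character
    }
}
```

Rejecting a leading wildcard reflects a common design choice: a term that starts with * or ? gives no fixed prefix to narrow the lookup, so every term in the index would have to be tested.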
Phrase Search:
Query: Feature: (Compatible With Select Canon Digital SLR Cameras)
VI. EFFICIENCY EVALUATION
The system retrieved 1000 results in under approximately 2 seconds. We used a dataset of
30,000+ records to evaluate the performance of the system. The system was also evaluated with
a load testing tool called Web Performance Load Tester to assess how it behaves under load.
Since the application was deployed on a single Tomcat server, there were some failures when
the system was load tested with 7 concurrent users. This problem can be addressed with Tomcat
clustering, where several Tomcat servers are clustered and the traffic is load balanced
across them.
The output of the tool is published below.
Performance Goal Analysis
Page Duration
Page Completion Rate
Transaction (URL) Completion Rate
Failures
Bandwidth Consumption
Waiting Users
Summarized by the selected user levels, this table shows some of the key metrics that reflect
the performance of the test as a whole.
Time Based Analysis:
Page Duration:
Page Completion Rate
Transaction (URL) Completion Rate
Failures
Bandwidth Consumption
Waiting Users
Test summary metrics
Sorted by the elapsed test time, this table shows some of the key metrics that reflect
the performance of the test as a whole.
VII. CONCLUSION
This paper has introduced an efficient keyword search engine over an XML dataset, with results
ranked by the total number of reviews, the review dates and the rating of each product. The
data was extracted from the XML file with DOM and SAX parsers. The advanced search was
implemented with the operators AND, OR & NOT. The paper then explained phrase search and
wildcard search, which enhance the user's searching experience. An elaborate performance
analysis of the system was furnished with graphs and tables.
VIII. FUTURE WORK
i) Based on metrics such as click-through rate and conversion rate, the system can be
trained to provide more relevant results.
ii) Better personalization based on user profiles. By considering user activity over time,
the system can provide better personalized results for each user.