SEARCH ENGINE ENHANCEMENT BY EXTRACTING HIDDEN AJAX CONTENT IN
WEB APPLICATIONS

by

PAUL SUGANTHAN G C 20084053
MUTHUKUMAR V 20084041
NANDHAKUMAR B 20084043
A project report submitted to the
FACULTY OF INFORMATION AND
COMMUNICATION ENGINEERING
in partial fulfillment of the requirements
for the award of the degree of
BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE AND ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
ANNA UNIVERSITY CHENNAI
CHENNAI - 600025
MAY 2012
CERTIFICATE
Certified that this project report titled “SEARCH ENGINE ENHANCEMENT BY EXTRACTING HIDDEN AJAX CONTENT IN WEB APPLICATIONS” is the bonafide work of PAUL SUGANTHAN G C (20084053), MUTHUKUMAR V (20084041), NANDHAKUMAR B (20084043), who carried out the project work under my supervision, for the fulfillment of the requirements for the award of the degree of Bachelor of Engineering in Computer Science and Engineering. Certified further that, to the best of my knowledge, the work reported herein does not form part of any other thesis or dissertation on the basis of which a degree or an award was conferred on an earlier occasion on this or any other candidate.
Place: Chennai
Date:

Dr. V Vetriselvi
Project Guide,
Designation,
Department of Computer Science and Engineering,
Anna University Chennai,
Chennai - 600025
COUNTERSIGNED
Head of the Department,
Department of Computer Science and Engineering,
Anna University Chennai,
Chennai – 600025
ACKNOWLEDGEMENTS
We express our deep gratitude to our guide, Dr. V VETRISELVI for guiding us
through every phase of the project. We appreciate her thoroughness, tolerance and
ability to share her knowledge with us. We thank her for being easily approachable
and quite thoughtful. Apart from adding her own input, she has encouraged us to
think on our own and give form to our thoughts. We owe her for harnessing our
potential and bringing out the best in us. Without her immense support through
every step of the way, we could never have taken it to this extent.
We are extremely grateful to Dr. K.S. EASWARAKUMAR, Head of the
Department of Computer Science and Engineering, Anna University, Chennai
600025, for extending the facilities of the Department towards our project and for
his unstinting support.
We express our thanks to the panel of reviewers, Dr. ARUL SIROMONEY, Dr. A.P. SHANTHI and Dr. MADHAN KARKY, for their
valuable suggestions and critical reviews throughout the course of our project.
We thank our parents, family, and friends for bearing with us throughout the course
of our project and for the opportunity they provided us in undergoing this course
in such a prestigious institution.
Paul Suganthan G C Muthukumar V Nandhakumar B
ABSTRACT
Current search engines such as Google and Yahoo! are prevalent for searching the
Web. Search on dynamic client-side Web pages is, however, either nonexistent or
far from perfect, and not addressed by existing work, for example on the Deep Web.
This is a real impediment since AJAX and Rich Internet Applications are already
very common in the Web. AJAX applications are composed of states which can
be seen by the user, but not by the search engine, and changed by the user using
client-side events. Current search engines either ignore AJAX applications or
produce false negatives. The reason is that crawling client-side code is a difficult
problem that cannot be solved naively by invoking user events.
The project aims to propose a solution for crawling and extracting hidden
AJAX content, thus enabling search engines to enhance their search result
quality by indexing dynamic AJAX content. Though AJAX content can be crawled
by testing manually in a browser and invoking client-side events, enhancing a
search engine to crawl AJAX content automatically, as for traditional web
applications, has not been achieved.

The project describes the design and implementation of an AJAX Crawler and
then enables a search engine to index the crawled states of an AJAX page. The
performance of the AJAX Crawler is evaluated and compared with a traditional
crawler. The possible issues regarding crawling AJAX content and future
enhancements are also discussed.
From the above graphML file, the following inferences can be derived:

• Number of nodes (application states) = 4
• The application state changes from source to target on clicking the element
retrieved by the XPath expression stored in a key named target. Thus, from the
graphML format, the path from one state to another can be obtained (a
hypothetical fragment is sketched below).
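To make this concrete, a hypothetical fragment of such a graphML file is sketched below; only the key named target is taken from the description above, while the event key, node ids and XPath value are illustrative assumptions.

<key id="event" for="edge"/>
<key id="target" for="edge"/>
<graph edgedefault="directed">
    <node id="0"/>
    <node id="1"/>
    <edge source="0" target="1">
        <data key="event">click</data>
        <data key="target">//div[@onclick="load_content(1)"]</data>
    </edge>
</graph>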
4.2.3.1 Visualizing the State Machine
The state machine of the sample site at http://test.thurls.com/ajax/home.php can be
visualized as shown in Figure 4.2, from which we can infer that there are 8 states in total.
FIGURE 4.2: Visualizing State Machine
4.2.4 Indexing
While the state machine is being constructed by the AJAX Crawler, the states
have to be indexed simultaneously to enable searching through them. The
project uses the Lucene open source API for indexing.
4.2.5 Searching
Searching involves getting the input query and returning suitable results by
reading the index files. The Lucene Search API is used to perform the search
operation and return the results.
4.2.6 Reconstruction of state
Once a user searches for a query and the results are displayed, we need to
navigate directly to a particular state and display it in the browser when the
user views a result. We use Selenium Web Driver to navigate to a particular
state by finding the path between the initial state and the target state in the
state machine, and then invoking the events along that path.
4.3 User Interface Design
A user interface is provided for users to perform searches. The user enters the
search query in a text box and performs the search. Snapshots of the user
interface are provided in Appendix A.1.
4.4 Use Case Model
4.4.1 Use Case Diagram
Figure 4.3 shows the control flow pattern of our algorithm, along with the
involvement of the various software/hardware components in each section of the
flow. The actors represent these components and the use cases represent the
functionality.
FIGURE 4.3: Use Case Diagram
4.5 System Sequence Diagram
4.5.1 Event Invocation
Figure 4.4 shows the Sequence Diagram for Event Invocation. It shows the
sequence of events involved in invoking events and updating the DOM.
FIGURE 4.4: Sequence Diagram - Event Invocation
4.5.2 Searching
Figure 4.5 shows the Sequence Diagram for Searching the crawled states. It shows
the sequence of events involved in searching and reconstruction of result state.
FIGURE 4.5: Sequence Diagram - Searching
4.6 Data Flow Model
Figure 4.6 shows the Level 0 DFD of the system. Level 1 DFDs are shown in
Figure 4.7 and Figure 4.8.
4.6.1 Data Flow Diagram
FIGURE 4.6: Level 0 Data Flow Diagram
FIGURE 4.7: Level 1 Data Flow Diagram
FIGURE 4.8: Level 1 Data Flow Diagram
CHAPTER 5
SYSTEM DEVELOPMENT
5.1 Implementation
5.1.1 Tools Used
The following tools were employed to implement the project.
Operating System                        Windows 7
Languages used for development          Java, PHP, JSP
Libraries                               HtmlUnit, JSoup, Lucene, JUNG, Selenium Driver
Database                                MySQL
IDE                                     NetBeans
User Interface                          Mozilla Firefox
Performance Visualization (Graphs)      Google Charts, PowerPoint
TABLE 5.1: Tools Used
5.1.2 Implementation Description
This section provides the detailed implementation of the complete system and
discusses all the algorithms used in the project, with a brief explanation of
each algorithm and the need for it. The AJAX crawling algorithm forms the
basis of the AJAX Crawler and is described as follows:
5.1.2.1 Ajax Crawling Algorithm
The first step in crawling is to load the initial state and then wait for background
Javascript execution (this handles the case when an Ajax call is made using
onload event). Then all clickables in the initial state are found and the event is
invoked. The clickables are extracted using XPath expression. Those elements
matching the XPath expression are clicked. If there are DOM changes, then the
state machine is updated. The crawling is done in a breadth first manner. First all
states originating from the intial state are found. Then each state is crawled in a
similar fashion. HtmlUnit [2] Java library is used for implementing the AJAX
Crawling Algorithm. A WebClient object can be viewed as a Browser instance.
This covers the requirement of an AJAX Crawler to be capable of executing
Javascript.
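As a minimal illustration of this setup (a sketch, not the project's actual code), the following loads a state with HtmlUnit, waits for background Javascript and invokes clickables found by XPath; the URL, timeout values and XPath expression are illustrative assumptions.

import java.util.List;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class CrawlSketch {
    public static void main(String[] args) throws Exception {
        WebClient webclient = new WebClient();           // acts as a browser instance
        HtmlPage page = webclient.getPage("http://test.thurls.com/ajax/home.php");
        webclient.waitForBackgroundJavaScript(5000);     // handles AJAX calls made on onload
        // Find candidate clickables with an XPath expression and invoke the event.
        List<?> clickables = page.getByXPath("//div[@onclick]");
        for (Object o : clickables) {
            HtmlElement element = (HtmlElement) o;
            element.click();                             // may trigger an AJAX request
            webclient.waitForBackgroundJavaScript(5000);
            String dom = page.asXml();                   // DOM after the event, for state comparison
        }
        webclient.closeAllWindows();
    }
}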
Algorithm 1 Ajax Crawling algorithm
 1: procedure CRAWL(url)
 2:     Load url in HtmlUnit WebClient
 3:     Wait for background Javascript execution
 4:     StateMachine ← Initialize state machine
 5:     StateMachine.add(initial state)
 6:     while still some state uncrawled do
 7:         current state ← find some uncrawled state to crawl
 8:         webclient ← get web client(current state, StateMachine, url)
 9:         while current state still uncrawled do
10:             crawl state(webclient, current state, StateMachine)
11:             webclient ← get web client(current state, StateMachine, url)
12:         end while
13:     end while
14:     save the StateMachine
15: end procedure
Algorithm 2 Ajax Crawling algorithm (Continued)
 1: procedure GET WEB CLIENT(current state, StateMachine, url)
 2:     webclient ← Load url in HtmlUnit WebClient
 3:     Wait for background Javascript execution
 4:     path ← Find shortest path from initial state to current state
 5:     while current state not reached do
 6:         xpath ← Get Xpath to traverse to next state in path
 7:         Generate the click event on the element retrieved by xpath
 8:         Wait for background Javascript execution
 9:     end while
10:     return webclient
11: end procedure
One of the problems with the HtmlUnit WebClient is that once a DOM change
occurs and another state is reached, we cannot go back to the source state to
continue the breadth-first crawling process. We need to traverse again from the
initial state to the source state to continue crawling. This is done by the
function GET WEB CLIENT: we find the path from the initial state to the state
to be crawled, then invoke the events along the path to reach that state.
Another issue with the WebClient is that it cannot be serialized and stored;
thus, each time there is a DOM change, we need to traverse from the initial
state to the current state.
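A sketch of this re-traversal is shown below; it mirrors GET WEB CLIENT, assuming the shortest path is available from the state machine as a list of XPath expressions (the helper name and timeouts are illustrative; imports as in the previous sketch).

static HtmlPage getWebClient(WebClient webclient, String url,
                             List<String> pathXpaths) throws Exception {
    // Reload the initial state in the browser instance.
    HtmlPage page = webclient.getPage(url);
    webclient.waitForBackgroundJavaScript(5000);
    // Replay the recorded events along the path to reach the current state.
    for (String xpath : pathXpaths) {
        HtmlElement element = (HtmlElement) page.getFirstByXPath(xpath);
        element.click();
        webclient.waitForBackgroundJavaScript(5000);
    }
    return page; // the WebClient is now positioned at the current state
}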
The algorithm for crawling an individual state is described by the function
CRAWL STATE. Special care must be taken to avoid regenerating states that have
already been crawled (i.e., duplicate elimination). This problem is also
encountered in traditional search engines; however, traditional crawling can
usually solve it by comparing the URLs of the given pages, which is a quick
operation. AJAX crawling cannot rely on that, since all AJAX states have the
same URL. Currently, we compare the DOM tree as a whole to check whether two
states are the same.
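One simple way to make this whole-DOM comparison cheap is to fingerprint each state by hashing its serialized DOM; the MD5 fingerprint below is an illustrative assumption, not necessarily the project's exact method.

import java.math.BigInteger;
import java.security.MessageDigest;

public class StateFingerprint {
    // Two states are treated as identical when their DOM fingerprints match.
    public static String fingerprint(String domXml) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(domXml.getBytes("UTF-8"));
        return new BigInteger(1, digest).toString(16);
    }
}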
Algorithm 3 Ajax Crawling algorithm (Continued)
 1: procedure CRAWL STATE(webclient, current state, StateMachine)
 2:     elements ← Get all clickable elements using Xpath
 3:     while still an element remaining do
 4:         xpath ← Get Xpath of the current element
 5:         if current element is already clicked in the current state then
 6:             continue
 7:         end if
 8:         if current element is an anchor element then
 9:             href ← Get href attribute of the current element
10:             if href is null then
11:                 Generate the click event on current element
12:                 Wait for background Javascript execution
13:                 if dom is changed then
14:                     if new state is not already present in StateMachine then
15:                         Add the new state to StateMachine
16:                         Add a transition from current state to new state
17:                     end if
18:                     return
19:                 end if
20:             end if
21:         else
22:             Generate the click event on current element
23:             Wait for background Javascript execution
24:             if dom is changed then
25:                 if new state is not already present in StateMachine then
26:                     Add the new state to StateMachine
27:                     Add a transition from current state to new state
28:                 end if
29:                 return
30:             end if
31:         end if
32:     end while
33: end procedure
5.1.2.2 State Machine
The algorithm for maintaining the State Machine is shown below.
Algorithm 4 State Machine Representation
 1: transition ← Initialize a MultiKey Map
 2: crawl status ← Initialize a Bit Vector
 3: graph ← Initialize a Directed Multi Graph
 4: states ← Initialize an Array List
 5: url ← url currently being crawled
 6: procedure ADD NEW STATE(dom xml)
 7:     if dom xml NOT IN states then
 8:         state id = states.size();
 9:         states.add(dom xml);
10:         doc id = md5(url);
11:         index state(dom xml, url, doc id, state id);
12:         graph.addVertex(state id);
13:     end if
14: end procedure
15: procedure ADD TRANSITION(start state, end state, event, target xpath)
16:     if (start state, event, target xpath) NOT IN transition then
17:         transition.put(start state, event, target xpath, end state);
18:         graph.addEdge(start state, end state, event, target xpath);
19:     end if
20: end procedure
21: procedure UPDATE CRAWL STATUS(state id)
22:     crawl status.set(state id);
23: end procedure
24: procedure CHECK CRAWL STATUS
25:     num states = states.size() - 1;
26:     for i = 0 → num states do
27:         if !crawl status.get(i) then
28:             return false;
29:         end if
30:     end for
31:     return true;
32: end procedure
Algorithm 5 State Machine Representation (Continued)
 1: procedure GET NEXT STATE TO CRAWL
 2:     num states = states.size() - 1;
 3:     for i = 0 → num states do
 4:         if !crawl status.get(i) then
 5:             return i;
 6:         end if
 7:     end for
 8:     return -1;
 9: end procedure
10:
11: procedure CHECK STATE CRAWL STATUS(state id)
12:     if crawl status.get(state id) then
13:         return true;
14:     end if
15:     return false;
16: end procedure
17:
18: procedure SAVE STATE MACHINE
19:     layout ← Initialize a Circle Layout of graph
20:     graphWriter ← Initialize a Graph Writer
21:     output ← Initialize a Print Writer
22:     Add event type custom data for each edge in graph
23:     Add target XPath custom data for each edge in graph
24:     graphWriter.save(graph, output);
25: end procedure
Thus we represent the state machine as a directed multigraph in JUNG (Java
Universal Network/Graph Framework) [3]. Each time a new state is added, we
check whether its DOM is already in the state machine. Similarly, each time a
transition is added, we check that it is not a duplicate. The State
Machine is saved in graphML format.
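A sketch of this representation and of the graphML export using the JUNG API is shown below; the edge payload type and the output file name are assumptions made for illustration.

import java.io.FileWriter;
import java.io.PrintWriter;
import org.apache.commons.collections15.Transformer;
import edu.uci.ics.jung.graph.DirectedSparseMultigraph;
import edu.uci.ics.jung.io.GraphMLWriter;

public class StateMachineSketch {
    public static void main(String[] args) throws Exception {
        // States are integer ids; each edge carries {event type, target XPath}.
        DirectedSparseMultigraph<Integer, String[]> graph =
                new DirectedSparseMultigraph<Integer, String[]>();
        graph.addVertex(0);
        graph.addVertex(1);
        graph.addEdge(new String[]{"click", "//div[@onclick='load_content(1)']"}, 0, 1);

        // Attach event type and target XPath as custom edge data,
        // then save the state machine in graphML format.
        GraphMLWriter<Integer, String[]> writer = new GraphMLWriter<Integer, String[]>();
        writer.addEdgeData("event", null, "", new Transformer<String[], String>() {
            public String transform(String[] edge) { return edge[0]; }
        });
        writer.addEdgeData("target", null, "", new Transformer<String[], String>() {
            public String transform(String[] edge) { return edge[1]; }
        });
        PrintWriter out = new PrintWriter(new FileWriter("statemachine.graphml"));
        writer.save(graph, out);
        out.close();
    }
}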
5.1.2.3 Indexing
Indexing is the process of extracting text from web pages, tokenizing it and then
creating an index structure (inverted index) that can be used to quickly find which
pages contain a particular word. The purpose of storing an index is to optimize
speed and performance in finding relevant documents for a search query. Without
an index, the search engine would scan every document in the corpus, which
would require considerable time and computing power.The project uses Lucene
Open Source API [4] for indexing the crawled states.Another advantage with
Lucene is that it supports incremental indexing. Thus we need not index all
documents from begining each time. The index files can be updated each time.
Only the text part in the DOM is indexed. In the inverted file,we store the URL,
DOC ID and STATE ID. The algorithm for indexing is given below.
Algorithm 6 Indexing crawled states using Lucene
 1: procedure INDEX STATE(dom xml, doc id, url, state id)
 2:     indexWriter = new IndexWriter(path to index files, new
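A minimal sketch of this indexing step with a Lucene 3.x style API is shown below; the stored field names (url, docid, state) follow the PHP search code in the next section, while the Lucene version, analyzer and storage options are assumptions.

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexSketch {
    public static void indexState(String domText, String docId, String url,
                                  int stateId, String indexPath) throws Exception {
        FSDirectory dir = FSDirectory.open(new File(indexPath));
        // Opening an existing index appends to it, which gives incremental indexing.
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_36,
                        new StandardAnalyzer(Version.LUCENE_36)));
        Document doc = new Document();
        // Only the text part of the DOM is analyzed; ids and URL are stored as-is.
        doc.add(new Field("contents", domText, Field.Store.NO, Field.Index.ANALYZED));
        doc.add(new Field("url", url, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("docid", docId, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("state", String.valueOf(stateId), Field.Store.YES, Field.Index.NOT_ANALYZED));
        writer.addDocument(doc);
        writer.close();
    }
}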
5.1.2.4 Searching

Zend Framework's Lucene Search library allows the index files generated in Java
to be read in PHP. The search results are returned as an associative array
consisting of the values for the parameters we specified during indexing. For
each result, Lucene assigns a score based on the frequency of each word in the
document; the higher the score, the more relevant the result is to the search
query. The code snippet for searching in PHP using the Lucene Search library in
Zend Framework is shown below.
Algorithm 7 Searching Lucene indexed files in PHP
 1: procedure SEARCH(query)
 2:     $index = new Zend_Search_Lucene(path to index files);
 3:     $hits = $index->find($query);
 4:     foreach ($hits as $hit)
 5:     {
 6:         echo $hit->score;
 7:         echo $hit->docid;
 8:         echo $hit->url;
 9:         echo $hit->state;
10:     }
11: end procedure
5.1.2.5 Reconstruction of a particular state after crawling
After crawling the states of a particular URL, the states should be indexed to be
able to be searched by the search engine. [8] Thus a state needs to be reconstructed
for being displayed in search results. A web browser can load only the initial state
of a URL. But we need to load subsequent states which actually occur in browser
after a sequence of Javascript events are invoked. Thus the project uses Selenium
Web Driver [6] to load a particular state in browser directly. A Web Driver can be
viewed as a browser which can be controlled through code. Thus the project finds
the path from the initial state to the state to be loaded.The project initially loads the
initial state in Web Driver. Then the Javascript events along the path are invoked
in the Web Driver until the required state in reached. Thus the required state is
loaded in the browser to be viewed by the user. From then the user can continue
browsing from the required state like a normal browser.
Algorithm 8 Reconstruction of a particular state after crawling
 1: procedure RECONSTRUCT STATE(state)
 2:     Read the graphML file of the corresponding URL and construct a Directed Multigraph
 3:     path ← Find shortest path from initial state to the state to be constructed (Dijkstra's Algorithm)
 4:     Load the initial state in a Web Driver like Selenium
 5:     while state not reached do
 6:         xpath ← Get Xpath expression of the element to be clicked next
 7:         Generate the click event on the element retrieved by xpath
 8:         Wait for background Javascript execution
 9:     end while
10:     The required state is currently loaded in the Web Driver
11: end procedure
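A sketch of this reconstruction with the Selenium WebDriver API is shown below; the Firefox driver, the fixed wait time, and the assumption that the shortest path is already available as a list of XPath expressions are all illustrative.

import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class ReconstructSketch {
    public static void reconstructState(String url, List<String> pathXpaths)
            throws InterruptedException {
        WebDriver driver = new FirefoxDriver(); // a browser controlled through code
        driver.get(url);                        // load the initial state
        for (String xpath : pathXpaths) {
            // Invoke the Javascript events along the path to the required state.
            driver.findElement(By.xpath(xpath)).click();
            Thread.sleep(2000);                 // crude wait for background Javascript
        }
        // The required state is now loaded; the user can continue browsing from it.
    }
}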
CHAPTER 6
RESULTS AND DISCUSSION
In this chapter, we report the significant results obtained in our experiments.
6.1 Results
Table 6.1 contains the list of sample test cases used for evaluating the
performance of the AJAX Crawler.
Case    AJAX Site
C1      http://test.thurls.com/ajax/home.php
C2      http://spci.st.ewi.tudelft.nl/demo/aowe/
C3      http://www.itrix.co.in/
C4      http://demo.tutorialzine.com/2009/09/simple-ajax-website-jquery/demo.html
C5      http://test.thurls.com/ajax/home1.php
TABLE 6.1: Test Cases
Some of the sample clickables in each of the test cases are shown below.
• Sample Clickables in C1
<div onclick="load_content(1)">Great Wall of China</div>
<div onclick="load_content(2)">Petra</div>
• Sample Clickables in C2
<b>Home</b>
<b>Workshop Organizers</b>
<b>Program Committee</b>
<b>Call for Papers</b>
• Sample Clickables in C3
<p id="hel">About Us</p>
<p id="hel">Sponsors</p>
• Sample Clickables in C4
<a href="#page1">Page 1</a>
<a href="#page">Page 2</a>
• Sample Clickables in C5
<div onclick="load_content(24)">Test 16</div>
<div onclick="load_content(25)">Test 17</div>
Table 6.2 contains the experimental results obtained for the sample test cases.
Probable Clickables are those elements in the DOM which can be clicked.
Detected Clickables are those that actually trigger AJAX requests.
Case    Maximum DOM String Size (bytes)    Probable Clickables    Detected Clickables    Number of States
In traditional crawling,

Crawling time of a page = network latency + server response time

In AJAX crawling,

Crawling time of a state = network latency + server response time + AJAX request time
The crawl time of a page in traditional crawling is on the order of milliseconds,
whereas in AJAX crawling it is on the order of minutes. This is due to the time
spent executing Javascript. Table 6.3 contains the crawling time for each test
case; the crawling time per state is also shown.
Case    Number of States    Total Crawling time (in mins)    Crawling time per state (in mins)
C1      8                   11.44                            1.43
C2      11                  216.45                           19.68
C3      27                  607.5                            22.5
C4      5                   34.9                             6.98
C5      26                  103.13                           3.97
TABLE 6.3: Crawling Time
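For instance, for C1 a total crawling time of 11.44 minutes across 8 states gives 11.44 / 8 ≈ 1.43 minutes per state.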
6.2.1.1 Number of States Vs Crawling Time
Figure 6.1 shows the plot between number of states and Crawling time (in
minutes).
FIGURE 6.1: Number of States Vs Crawling Time (in minutes)
Inferences from the graph
• The variation between Crawling time and number of states is not uniform.
• Crawling time doesn’t depend directly on the number of states.
• For the same website, crawling time is not constant when measured at
different instances.
• Network latency and server response time are not constant.
• Crawling time doesn’t depend only on the number of states; it is a weighted
measure of network latency, server response time, AJAX request time and also
the number of states.
6.2.2 Clickable Selection Policy
Clickable selection refers to the process of identifying clickables for invoking
events. An ideal clickable selection policy should identify clickables in an
optimal way, such that most of them trigger AJAX requests or cause a change in
the DOM. A better clickable selection policy can reduce the Javascript wait time,
thus decreasing crawling time. We define a ratio called the Clickable Selection
Ratio, defined as the ratio of the number of AJAX requests to the number of
probable clickables. Table 6.4 contains the Clickable Selection Ratio for the
sample test cases.
Clickable Selection Ratio = No. of AJAX Requests / No. of Probable Clickables
Case    Number of Clickables    Number of AJAX Requests    Clickable Selection Ratio
C1      24                      8                          0.33
C2      61                      11                         0.18
C3      167                     27                         0.16
C4      23                      5                          0.21
C5      58                      26                         0.45
TABLE 6.4: Clickable Selection Policy
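For instance, for C1, 8 AJAX requests out of 24 probable clickables give a Clickable Selection Ratio of 8 / 24 ≈ 0.33.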
6.2.2.1 Number of AJAX Requests Vs Probable Clickables
Figure 6.2 shows the plot between Number of AJAX Requests and Probable
Clickables.
FIGURE 6.2: Number of AJAX Requests Vs Probable Clickables
Inferences from the graph
• The variation between Probable Clickables and Number of AJAX Requests
is not uniform.
• The number of clickables depends on the structure of the web page.
• The number of AJAX requests cannot be directly related to the number of
probable clickables.
6.2.2.2 Probable Clickables Vs Detected Clickables
Figure 6.3 shows the plot between Probable Clickables and Detected Clickables.
FIGURE 6.3: Probable Clickables Vs Detected Clickables
Inferences from the graph
• The variation between Probable Clickables and Detected Clickables is not
uniform.
• The number of Detected Clickables depends on the structure of the web page
rather than on the number of Probable Clickables.
• The number of Detected Clickables cannot be directly related to the number
of Probable Clickables.
• The number of Detected Clickables cannot be directly related to the number
of AJAX Requests.
• Number of AJAX Requests <= Number of Detected Clickables
6.2.3 Clickable Selection Ratio Vs Crawling Time
Figure 6.4 shows the plot between Clickable Selection Ratio and Crawling Time
per state (in minutes).
FIGURE 6.4: Clickable Selection Ratio Vs Crawling Time per state (in minutes)
Inferences from the graph
• The variation between Clickable Selection Ratio and Crawling Time is
uniform.
• Crawling time is inversely proportional to the Clickable Selection Ratio.
6.3 Search Result Quality
Search result quality is improved by indexing hidden AJAX content. AJAX content
is not visible to traditional crawlers and is therefore not indexed by them, so
the AJAX Crawler improves the quality of search results compared with
traditional crawlers. We will now see how Google Bot and the AJAX Crawler fetch
http://test.thurls.com/ajax/home.php (C1). Screenshots are provided in
Appendix A.2.

This is how Google Bot fetches http://test.thurls.com/ajax/home.php (C1):

HTTP/1.1 200 OK