Page 1: Crawling Rich Internet Applications


Gregor v. Bochmann (in collaboration with the SSRG group)

University of Ottawa, Canada

Oldenburg, December 16, 2013

Crawling Rich Internet Applications

Page 2: Crawling Rich Internet Applications

Overview

Background
– The evolving web
– Why crawling
– Our research project

Web Crawling
– Traditional web crawling
– RIA crawling
– Performance objectives, assumptions

Crawling strategies
– Breadth-first, Depth-first, Greedy
– Model-based strategies (Hypercube, Menu)
– Probabilistic strategy
– Component-based crawling

Distributed crawling
– Different architectures
– Experimental results

Conclusions

Page 3: Crawling Rich Internet Applications

Web Crawling is …

Exploring web applications automatically
Discovering the pages of a web application
Emulating the user behaviour to retrieve the states of a web application

Web crawling is as old as the web itself! From the early days of the web, keeping pace with its expansion has been a challenge.

Page 4: Crawling Rich Internet Applications

The evolving Web

Traditional Web
– static HTML pages stored as separate files, identified by a URL

Deep Web
– a server application accesses a database; the user fills in request forms
– HTML pages are dynamically created by the server, identified by a URL including the request parameters

Rich Internet Applications (RIA – “Web-2”)
– pages contain executable code (e.g. JavaScript, Silverlight, Adobe Flex...), executed in response to user interactions or time-outs (so-called events); a script may change the displayed page (the “state” of the application changes) – with the same URL
– AJAX: a script may interact asynchronously with the server to update the page

Page 5: Crawling Rich Internet Applications

Example of a traditional web application

Show my web site (http://www.eecs.uottawa.ca/~bochmann/ )

Simplified model of the web site

[Figure: simplified site model – pages “Bochmann” (Gregor von Bochmann), “DSRG” (research group), “Hobbies”, “Pub” (publications), “Painter B”; arrows are links (events); each page is identified by a URL]

Page 6: Crawling Rich Internet Applications

RIA examples – TestRIA, Altoro Mutual

Page 7: Crawling Rich Internet Applications

RIA example - Clipmarks


Page 8: Crawling Rich Internet Applications

RIA example – Google Mail


Page 9: Crawling Rich Internet Applications

The Graph Model of a Web Application

Graph model:
– Web page (client state of the application) → node
  • encoded in HTML – called the DOM
– Event (click, mouse-over, etc.) → edge
  • an event triggers a transition between states

[Figure: graph model of the example web site – the same diagram as on page 5: nodes are pages, edges are link events]

Page 10: Crawling Rich Internet Applications

RIA vs. Traditional Web (Web-1)

Graph model:              Web-1               RIA
– Web page (state):       has a URL           few pages have a URL
– Event:                  includes next URL   code execution

[Figure: the example site graph shown twice – left (Web-1): every page is identified by a URL; right (RIA): the states carry no URL and are reached only through events]

Page 11: Crawling Rich Internet Applications

Why crawling

Objective A: find all (or all “important”) pages
– for content indexing for search engines
– for security testing and vulnerability assessment
– for accessibility testing

Objective B: find all links between pages
– for ranking pages, e.g. Google ranking in search queries
– for automated testing and model checking of the web application
– for assuring that all pages have been found

Page 12: Crawling Rich Internet Applications

Software Security Research Group (SSRG), University of Ottawa
In collaboration with IBM

University of Ottawa:
– Prof. Gregor v. Bochmann
– Prof. Guy-Vincent Jourdan
– Suryakant Choudhary (Master student)
– Emre Dincturk (PhD student)
– Khaled Ben Hafaiedh (PhD student)
– Seyed M. Mir Taheri (PhD student)
– Ali Moosavi (Master student)

IBM – R&D (Ottawa):
– Iosif Viorel Onut (PhD)
– AppScan product team

Page 13: Crawling Rich Internet Applications

[Screenshot: detailed security issues reports – issues identified with static analysis (white-box view) and with dynamic analysis (black-box view), aggregated and correlated results, remediation tasks, security risk assessment]

Page 14: Crawling Rich Internet Applications

Overview

Background
– The evolving web
– Why crawling
– Our research project

Web Crawling
– Traditional web crawling
– RIA crawling
– Performance objectives and assumptions

Crawling strategies
– Breadth-first, Depth-first, Greedy
– Model-based strategies (Hypercube, Menu)
– Probabilistic strategy
– Component-based crawling

Distributed crawling
– Different architectures
– Experimental results

Conclusions

Page 15: Crawling Rich Internet Applications

Traditional Web Crawling

HTML page
– is a tree data structure, called the DOM. It includes
  • information about the display by the browser
  • events that can be activated by the user (for instance, clicking on certain displayed fields); each event carries the URL to be requested from the server through an HTTP request (link to the next page)
– The page returned by the server for a given URL depends, in general, on the server state and the values of cookies
– The displayed page is identified by its URL
  • if we ignore server state and cookies

Page 16: Crawling Rich Internet Applications

Traditional web crawling algorithm

Given:
– an initial seed URL
– a domain (or list of domains) defining the limit of the web space to be explored

Crawler variables (of type “set of URLs”):
– exploredURLs = empty
– unexploredURLs = {seedURL}

Algorithm:
– While unexploredURLs is not empty do
  • take a URL from unexploredURLs, add it to exploredURLs, request it from the server, analyse the returned page (according to the purpose of the crawl), extract the links in the page and add the corresponding URLs (if they are new, and if they are in the domain) to unexploredURLs

(See the sketch below.)
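A minimal sketch of this loop, assuming Python with the third-party requests library and a stdlib link extractor (illustrative only, not the crawler used in this work):

    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse

    import requests  # assumption: any HTTP client would do

    class LinkExtractor(HTMLParser):
        """Collects the href targets of <a> tags, resolved against the page URL."""
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = set()

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.add(urljoin(self.base_url, value))

    def crawl(seed_url, domains):
        """Explore every URL reachable from the seed within the given domains."""
        explored_urls = set()
        unexplored_urls = {seed_url}
        while unexplored_urls:
            url = unexplored_urls.pop()
            explored_urls.add(url)
            page = requests.get(url).text
            # ... analyse `page` according to the purpose of the crawl ...
            extractor = LinkExtractor(url)
            extractor.feed(page)
            for link in extractor.links:
                if link not in explored_urls and urlparse(link).netloc in domains:
                    unexplored_urls.add(link)
        return explored_urls

    # e.g. crawl("http://www.eecs.uottawa.ca/~bochmann/", {"www.eecs.uottawa.ca"})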


Page 17: Crawling Rich Internet Applications

RIA Crawling

Differences from the traditional web:
– Most pages have no URL and are therefore not directly accessible.
– When an event triggers the execution of a script, the script may change the DOM structure – which may lead to a new display and a new set of enabled events, that is, a new state of the application.

[Figure: the example site graph, now with states that have no URL]

Crawling means:
– finding all URLs that are part of the application, plus
– for each URL, finding all states reached (from the “seed” URL) by executing any sequence of events

• Important note: only the “seed” states are directly accessible by a URL.

Page 18: Crawling Rich Internet Applications

Difficulties for crawling RIA

State identification
– A state cannot be identified by a URL.
– Instead, we consider that the state is identified by the current DOM in the browser.

Most links (events) do not contain a URL
– An event included in the DOM may not explicitly identify the next state reached when this event is executed.
– To determine the state reached by such an event, we have to execute that event.
  • In traditional crawling, the event (link) contains the URL – the identification of the next state reached.

Accessibility of states
– Most states are not directly accessible (no URL) – only through a “seed” URL and a sequence of events (and intermediate states).

Page 19: Crawling Rich Internet Applications

Important consequence

For a complete crawl (a crawl that ensures that all states of the application are found), the crawler has to execute all events in all states of the application
– since for any of these events, we do not know, a priori, whether its execution in the current state will lead to a new state or not.

• Note: In the case of traditional web crawling, it is not necessary to execute all events on all pages; it is sufficient to extract the URLs from these events, and to get the page for each URL only once.

Page 20: Crawling Rich Internet Applications

Example

In the traditional site, the links “publications” in the pages “Bochmann” and “DSRG” have the same URL: the page “Pub” will be retrieved only once.

[Figure: the example site graph shown twice – as a traditional site (links with URLs) and as a RIA (events without URLs)]

In the RIA version, the events “publications” in the pages “Bochmann” and “DSRG” have no URL: both events must be executed, and the crawler finds out that they both lead to the same client state.

Page 21: Crawling Rich Internet Applications

AJAX: asynchronous interactions with the server

We ignore the intermediate states in our current work, by simply waiting until a new stable state is reached after each user input.

Page 22: Crawling Rich Internet Applications

RIA: Need for DOM equivalence

A given page often contains information that changes frequently, e.g. advertising or time-of-day information. This information is usually of no importance for the purpose of crawling.

In the traditional web, the page identification (i.e. the URL) does not change when this information changes. In RIA, states are identified by their DOM; therefore, similar states with different advertising would be identified as different states (which leads to an excessively large state space).

– We would like to have a state identifier that is independent of the unimportant changing information.
– We introduce a DOM equivalence; all states with equivalent DOMs have the same identifier.

Page 23: Crawling Rich Internet Applications

DOM equivalence

The DOM equivalence depends on the purpose of the crawl:
– in the case of security testing, we are not interested in the textual content of the DOM;
– however, the textual content is important for content indexing.

The DOM equivalence relation is realized by a DOM reduction algorithm which produces (from a given DOM) a reduced, canonical representation of the information that is considered relevant for the crawl. If the reduced DOMs obtained from two given DOMs are the same, then the given DOMs are considered equivalent, that is, they represent the same application state (for this purpose of the crawl).

Page 24: Crawling Rich Internet Applications

Form of the state identifiers

The reduced DOM could be used as the state identifier.
– However, it is quite voluminous:
  • we have to store the application model in memory during its exploration, and each edge in the graph contains the identifiers of the current and next states;
  • this is necessary to check whether a state obtained after the execution of some event is a new state or a known one.

Condensed state identifier:
– a hash of the reduced DOM
– The crawler also stores, for each state, the list of events included in the DOM, and whether they have been executed or not
  • used to select the next event to be executed during the crawl

(A small sketch of reduction and hashing follows.)
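As a toy illustration of the two preceding slides (a sketch under assumptions, not the SSRG algorithm): reduce the DOM by dropping text nodes and volatile attributes, then hash the canonical form. The choice of volatile attributes is hypothetical, and a real page would need an HTML parser rather than the XML parser used here:

    import hashlib
    import xml.etree.ElementTree as ET

    VOLATILE_ATTRS = {"id", "style"}  # assumption: attributes considered irrelevant

    def reduce_dom(element):
        """Canonical string for an element tree: tags and sorted attributes
        are kept; text content and volatile attributes are dropped."""
        attrs = "".join(
            f' {k}="{v}"'
            for k, v in sorted(element.attrib.items())
            if k not in VOLATILE_ATTRS
        )
        children = "".join(reduce_dom(child) for child in element)
        return f"<{element.tag}{attrs}>{children}</{element.tag}>"

    def state_id(markup):
        """Condensed state identifier: a hash of the reduced DOM."""
        return hashlib.sha1(reduce_dom(ET.fromstring(markup)).encode()).hexdigest()

    # Two DOMs differing only in text (e.g. advertising) get the same identifier:
    assert state_id("<div><p>ad of the day</p><a href='x'>go</a></div>") == \
           state_id("<div><p>other ad</p><a href='x'>go</a></div>")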


Page 25: Crawling Rich Internet Applications

Performance objectives

– Execution speed: how many events (state transitions) can be executed per hour?
– Complete crawl: given enough time, the strategy terminates the crawl when all states of the application have been found.
– Efficiency of finding states (“finding states fast”): if the crawl is terminated by the user before a complete crawl is attained, the number of discovered states should be as large as possible.
  • For many applications, a complete crawl cannot be obtained within a reasonable length of time.
  • Therefore the third objective is very important.

Page 26: Crawling Rich Internet Applications

Our working assumptions

– Deterministic RIA: the crawled RIA is deterministic from the point of view of the client (e.g. no dependence on updated database content).
– Given user input: we are provided with a set of user inputs for text fields and build the model that corresponds to these inputs.
– Reliable reset: we can reliably “reset” the application by reloading the seed URL (thus the graph is strongly connected).

Page 27: Crawling Rich Internet Applications

Overview

Background
– The evolving web
– Why crawling
– Our research project

Web Crawling
– Traditional web crawling
– RIA crawling
– Performance objectives

Crawling strategies
– Breadth-first, Depth-first, Greedy
– Model-based strategies (Hypercube, Menu)
– Probabilistic strategy
– Component-based crawling

Distributed crawling
– Different architectures
– Experimental results

Conclusions

Page 28: Crawling Rich Internet Applications

Crawling Strategies

Most work on crawling RIAs does not intend to build a complete model of the application.

Some consider standard strategies for the exploration of the graph model, such as Depth-First and Breadth-First.

We have developed more efficient strategies based on the assumed structure of the application (“model-based strategies”, see below).

Page 29: Crawling Rich Internet Applications

Example of a crawling sequence: Depth-first strategy

getURL(Bochmann); analyseDOM;
execute(publications) and find new state Pub; analyseDOM;
– go back (reset) – getURL(Bochmann);
execute(research group) and find new state DSRG; analyseDOM;
execute(publications) and find known state Pub;
– go back (reset) – getURL(Bochmann);
execute(hobbies) and find new state Hobbies; analyseDOM and find new URL PainterB;
getURL(PainterB); analyseDOM; etc.

[Figure: the example site graph]

Such a systematic approach will execute all events and eventually find all states.

Page 30: Crawling Rich Internet Applications

Resets

Each time there is a “go back” in the crawling sequence, the crawler has to go back to a seed URL (which takes more time than executing an event) and possibly execute several events in order to reach the desired state.
– For instance, in the Breadth-First strategy, the crawler has to later go back to the state DSRG in order to execute the event publications.

[Figure: the example site graph]

Resets are much more “expensive” (in terms of execution time) than event executions: the number of resets should be minimized.

Page 31: Crawling Rich Internet Applications

Disadvantages of the standard strategies

Breadth-First:
– no long sequences of event executions
– very many resets

Depth-First:
– Advantage: long sequences of event executions
– Disadvantage: when reaching a known state, the strategy takes a path back to a specific previous state for further event exploration. This path through known edges is often long and may involve a reset (overhead) – going back to another state with non-executed events may be much more efficient.

Page 32: Crawling Rich Internet Applications

Greedy and model-based crawling

The “Greedy” strategy
– forward exploration until a state with no unexecuted events is encountered
  • then find the closest state with an unexecuted event, and continue (see the sketch below)

Model-based crawling
– Meta-model: assumed structure of the application
– The crawling strategy is optimized for the case that the application follows these assumptions
– The crawling strategy must be able to adapt to applications that do not satisfy the meta-model
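A minimal sketch of the Greedy strategy under assumed interfaces (browser executes events and resets to the seed URL; graph stores the discovered model – these names are illustrative, not the SSRG implementation). The nearest state with remaining work is found by breadth-first search over already-executed transitions:

    from collections import deque

    def path_to_closest_work(graph, start):
        """Breadth-first search over already-executed transitions; returns the
        event path to the nearest state that still has an unexecuted event."""
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            state, path = queue.popleft()
            if graph.unexecuted_events(state):
                return path
            for event, next_state in graph.known_transitions(state):
                if next_state not in seen:
                    seen.add(next_state)
                    queue.append((next_state, path + [event]))
        return None  # no remaining work anywhere: the crawl is complete

    def greedy_crawl(browser, graph):
        state = browser.reset()  # load the seed URL
        while True:
            events = graph.unexecuted_events(state)
            if events:  # forward exploration
                event = events[0]
                new_state = browser.execute(event)
                graph.add_transition(state, event, new_state)
                state = new_state
            else:
                path = path_to_closest_work(graph, state)
                if path is None:
                    return graph
                for event in path:  # replay known events to reach that state
                    state = browser.execute(event)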


Page 33: Crawling Rich Internet Applications

Model-based crawling: two phases

State exploration phase
– finding all states, assuming that the application follows the assumptions of the meta-model

Transition exploration phase
– executing all remaining events in all known states (those that have not been executed during the state exploration phase)

Order of execution
– first state exploration, then transition exploration
– Adaptation: if new states are discovered during the transition exploration phase, go back to the state exploration phase, etc.

Page 34: Crawling Rich Internet Applications

[Chart: comparing the efficiency of finding states – cost (number of event executions + reset cost, log scale) vs. number of states discovered (129 in total). This is for one specific application; such comparisons should be done for many different types of applications. Note: Hypercube gives results similar to Greedy.]

Page 35: Crawling Rich Internet Applications

Comparing efficiency of exploring all edges

[Chart: cost (number of event executions + reset cost) vs. number of edges explored (10364 in total)]

Page 36: Crawling Rich Internet Applications

Model-based crawling: Hypercube

Hypercube meta-model:
– The state reached by a sequence of events from the initial state is independent of the order of the events.
– The enabled events at a state are those at the initial state minus those executed to reach that state.

Example: 4-dimensional hypercube [figure]

++ : one can find optimal paths for the state and transition exploration phases
-- : very few applications follow the hypercube model

(A sketch of the meta-model follows.)
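Under these two assumptions a state is fully determined by the set of events executed so far, so it can be modelled as a frozenset (a sketch for illustration, not the crawler's data structure):

    # Events enabled at the initial state of a 4-dimensional hypercube.
    INITIAL_EVENTS = frozenset({"e1", "e2", "e3", "e4"})

    def enabled(state):
        """Enabled events = events of the initial state minus those executed."""
        return INITIAL_EVENTS - state

    def next_state(state, event):
        """Order-independent: the reached state is just the executed-event set."""
        assert event in enabled(state)
        return state | {event}

    # The order of events does not matter:
    s1 = next_state(next_state(frozenset(), "e1"), "e2")
    s2 = next_state(next_state(frozenset(), "e2"), "e1")
    assert s1 == s2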

Page 37: Crawling Rich Internet Applications

Model-based crawling: Menu model

Example web site: Ikebana-Ottawa (ikebanaottawa.ca)

Hypothesis: there are three types of events:
– Menu events: the next state obtained is independent of the state where the event is executed
– Normal events: the next state depends on the current page
– Self-loop events: the next state is equal to the current state

Crawling strategy
– Explore Normal events before Menu events, because Menu events do not find any new states
– To classify the events, they must be executed from two different states

Page 38: Crawling Rich Internet Applications

Menu strategy: state exploration

From the current state, choose the next event according to the following event priority:

1. unclassified events – not yet executed
2. unclassified events – executed once, from a different state
3. Normal events
4. Menu events (we do not expect to find a new state)
5. Self-loop events (we do not expect to find a new state)

If all events have already been executed on the current page: find a “short” path to a page with an event of high priority. (A sketch of this choice follows.)
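A sketch of this priority rule; the event classes match the list above, but the attribute names are assumptions for illustration:

    PRIORITY = {
        "unclassified_unexecuted": 1,  # not yet executed anywhere
        "unclassified_once": 2,        # executed once, from a different state
        "normal": 3,
        "menu": 4,                     # no new state expected
        "self_loop": 5,                # no new state expected
    }

    def choose_event(events_on_page):
        """Pick the highest-priority event not yet executed on the current page;
        None means: find a short path to a page with a high-priority event."""
        candidates = [e for e in events_on_page if not e.executed_here]
        if not candidates:
            return None
        return min(candidates, key=lambda e: PRIORITY[e.category])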

Page 39: Crawling Rich Internet Applications

Menu model: finding a path to the next event

Find a path on the current application model, based on:
– executed edges
– predicted edges: edges not yet executed locally, but executed once elsewhere (globally), are predicted to be of type Menu

[Figure: partially explored model showing executed edges and predicted edges]

Page 40: Crawling Rich Internet Applications

Probability strategy

This is a variation of the Greedy strategy. Inspired by the Menu strategy, we introduce event priorities.

The priority of an event is based on statistical observations (during the crawl of the application) about the number of new states discovered when executing the given event.

The strategy is based on the belief that an event which was often observed to lead to new states in the past is more likely to lead to new states in the future.

Page 41: Crawling Rich Internet Applications

Probability strategy: event priorities

Priority of events from the current state: the probability of a given event e finding a new state from the current state is

    P(e) = ( S(e) + pS ) / ( N(e) + pN )

– S(e): number of states found by e; N(e): number of times e was executed
– This is a Bayesian formula; pS = 1 and pN = 2 give an initial probability of 0.5.

If the current state s has no non-executed event: find a locally non-executed event e on some nearby state s’ such that P(e) is high and the path from s to s’ is short.
– Note: the path from s to s’ uses events already executed.
– How to find an optimal combination of a high-priority event on a nearby state is described in our paper at ICWE 2012. (A one-line sketch of the priority formula follows.)
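In code, the priority is one line (pS and pN as on the slide; the function name and counters are illustrative):

    def priority(states_found, times_executed, p_s=1, p_n=2):
        """P(e) = (S(e) + pS) / (N(e) + pN); with pS = 1 and pN = 2, an event
        that has never been executed starts at probability 0.5."""
        return (states_found + p_s) / (times_executed + p_n)

    assert priority(0, 0) == 0.5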

Page 42: Crawling Rich Internet Applications

Experiments

We did experiments with the different crawling strategies using the following web sites:
– Periodic table (local version: http://ssrg.eecs.uottawa.ca/periodic/)
– Clipmarks (local version: http://ssrg.eecs.uottawa.ca/clipmarks/)
– TestRIA (http://ssrg.eecs.uottawa.ca/TestRIA/)
– Altoro Mutual (http://www.altoromutual.com/)

Page 43: Crawling Rich Internet Applications

Results: State exploration


Page 44: Crawling Rich Internet Applications

Results: Transition exploration

Cost for a complete crawl:
– Cost = number of event executions + R × number of resets
  • R = 18 for the Clipmarks web site
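For instance (illustrative numbers, not from the experiments): with R = 18, a Clipmarks crawl that performs 10,000 event executions and 200 resets has cost 10,000 + 18 × 200 = 13,600.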


Page 45: Crawling Rich Internet Applications

Component-based crawling

In many web sites, the number of “pages” is immense because of different orderings of elements or combinations of several components: a complete crawl is not feasible.

Revised coverage criterion: cover all components of pages in the application (but not all combinations or orderings of these components).

Assumption: components are independent of one another.

Page 46: Crawling Rich Internet Applications

Examples of components


Page 47: Crawling Rich Internet Applications

Assumed structure of a page


Page 48: Crawling Rich Internet Applications

Example: The Bebop application


Page 49: Crawling Rich Internet Applications

Performance


Page 50: Crawling Rich Internet Applications

Scalability

Execution time of the crawl as a function of the number of items stored in the application:
– as expected, normal crawling has exponential complexity
– component-based crawling appears to have quadratic complexity

Page 51: Crawling Rich Internet Applications

Overview

Background
– The evolving web
– Why crawling
– Our research project

Web Crawling
– Traditional web crawling
– RIA crawling
– Performance objectives

Crawling strategies
– Breadth-first, Depth-first, Greedy
– Model-based strategies (Hypercube, Menu)
– Probabilistic strategy
– Component-based crawling

Distributed crawling
– Different architectures
– Experimental results

Conclusions

Page 52: Crawling Rich Internet Applications

Distributed crawling

Observation: on average, executing an event and analysing the resulting state takes about 20 times longer than deciding on the next event to be executed.

Question: can the crawling of a complex application be accelerated by distributing the crawl over several computers / cores?

Page 53: Crawling Rich Internet Applications

Different distributed architectures

1. A central coordinator keeps the information about the discovered application model
   1.1 Each crawler contacts the coordinator after each execution of an event and obtains the next event to be executed (the coordinator performs the crawling strategy) – dynamic event allocation to crawlers
   1.2 Static event allocation to crawlers (crawlers obtain the application model from the coordinator and perform the crawling strategy locally, only for the allocated events)

2. Several coordinators share the information about the application model
   – A distributed hash table is used to allocate the states of the model to the different coordinators
   – Each coordinator is associated with approximately 20 crawlers
   – Coordinators perform the crawling strategy, but using partial model information
   – Different sharing schemes can be envisioned for exchanging information between the coordinators

(A minimal sketch of the state-to-coordinator mapping follows.)
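A minimal sketch of how a distributed hash table can allocate states to coordinators (simplified modulo hashing for illustration; the actual system may partition differently):

    import hashlib

    def coordinator_for(state_id, num_coordinators):
        """Map a state identifier to a coordinator deterministically,
        so that every crawler agrees on which coordinator owns a state."""
        digest = hashlib.sha1(state_id.encode()).hexdigest()
        return int(digest, 16) % num_coordinators

    # Example: with 4 coordinators, every node maps a given state the same way.
    owner = coordinator_for("reduced-dom-hash-of-some-state", 4)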


Page 54: Crawling Rich Internet Applications

Experimental results (architecture 1.2 – BF strategy)

Notes:
– The BF strategy has bad performance, but has the advantage that only the states of the model (not the transitions) must be shared with the crawlers.
– One sees the expected decrease in crawling time.
– The delay due to the coordinator is negligible, even for 15 crawlers.
– The static allocation of events leads to unequal loads – dynamic load sharing among crawlers may be useful.

Page 55: Crawling Rich Internet Applications

Experimental results (architecture 1.1 – Greedy strategy)

Notes:
– The Greedy strategy has good performance.
– In this architecture, the model information is not shared with the crawlers.
– Again, one sees the expected decrease in crawling time.

Page 56: Crawling Rich Internet Applications

Simulation results (architecture 2 – Greedy strategy)

Performance depends on the sharing scheme. If there is no unexecuted event from the current state, the coordinator has to find another state with an unexecuted event:
– Reset-only: use a reset to reach a different state
– Local knowledge: find the shortest path (SP) to a new state based on local knowledge of the application model
– Shared knowledge: use SP based on knowledge sharing, piggy-backed on other messages
– Forward exploration: a distributed algorithm for finding SP

Notes: fixed number of crawlers, varying number of coordinators (overload ignored)

Page 57: Crawling Rich Internet Applications

Overview

Background
– The evolving web
– Why crawling
– Our research project

Web Crawling
– Traditional web crawling
– RIA crawling
– Performance objectives

Crawling strategies
– Breadth-first, Depth-first, Greedy
– Model-based strategies (Hypercube, Menu)
– Probabilistic strategy
– Component-based crawling

Distributed crawling
– Different architectures
– Experimental results

Conclusions

Page 58: Crawling Rich Internet Applications

Conclusions

– RIA crawling is quite different from traditional web crawling.
– Different crawling strategies can improve the efficiency of crawling.
– The crawling of a RIA can be effectively distributed over several crawling engines.
– We have developed prototypes of our crawling strategies, integrated with the IBM AppScan product.

Page 59: Crawling Rich Internet Applications

References

Background:
– Mesbah, A., van Deursen, A. and Lenselink, S. (2011), Crawling Ajax-based Web Applications through Dynamic Analysis of User Interface State Changes, ACM Transactions on the Web (TWEB), 6(1), a23.

Our papers:
– Mirtaheri, S.M., Dincturk, M.E., Hooshmand, S., Bochmann, G.v., Jourdan, G.-V. and Onut, I.V., A Brief History of Web Crawlers, in Proceedings of CASCON 2013, November 2013. 15 pages.
– Mirtaheri, S.M., Zou, D., Bochmann, G.v., Jourdan, G.-V. and Onut, I.V., Dist-RIA Crawler: A Distributed Crawler for Rich Internet Applications, in Proceedings of the 8th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC 2013), Compiègne, France, October 2013. 8 pages.
– Choudhary, S., Dincturk, M.E., Mirtaheri, S.M., Jourdan, G.-V., Bochmann, G.v. and Onut, I.V., Building Rich Internet Applications Models: Example of a Better Strategy, in Proceedings of the 13th International Conference on Web Engineering (ICWE 2013), Aalborg, Denmark, July 2013. 15 pages.
– Choudhary, S., Dincturk, M.E., Mirtaheri, S.M., Moosavi, A., Bochmann, G.v., Jourdan, G.-V. and Onut, I.V., Crawling Rich Internet Applications: The State of the Art, in Proceedings of CASCON 2012, November 2012. 15 pages.
– Dincturk, M.E., Choudhary, S., Bochmann, G.v., Jourdan, G.-V. and Onut, I.V., A Statistical Approach for Efficient Crawling of Rich Internet Applications, in Proceedings of the 12th International Conference on Web Engineering (ICWE 2012), Berlin, Germany, July 2012. 8 pages.
– Choudhary, S., Dincturk, M.E., Bochmann, G.v., Jourdan, G.-V., Onut, I.V. and Ionescu, P., Solving Some Modeling Challenges when Testing Rich Internet Applications for Security, in The Third International Workshop on Security Testing (SECTEST 2012), Montreal, Canada, April 2012. 8 pages.
– Benjamin, K., Bochmann, G.v., Dincturk, M.E., Jourdan, G.-V. and Onut, I.V., A Strategy for Efficient Crawling of Rich Internet Applications, in Proceedings of the 11th International Conference on Web Engineering (ICWE 2011), Paphos, Cyprus, July 2011. 15 pages.
– Benjamin, K., Bochmann, G.v., Jourdan, G.-V. and Onut, I.V., Some Modeling Challenges when Testing Rich Internet Applications for Security, in First International Workshop on Modeling and Detection of Vulnerabilities (MDV 2010), Paris, France, April 2010. 8 pages.
– Dincturk, M.E., Jourdan, G.-V., Bochmann, G.v. and Onut, I.V., A Model-Based Approach for Crawling Rich Internet Applications, submitted to a journal.

Page 60: Crawling Rich Internet Applications

Questions ??

Comments ??

These slides can be downloaded from http://www.site.uottawa.ca/~bochmann/talks/RIAcrawling.pdf