EVALUATION of DOM TREE SIMILARITIES. UNDERGRADUATE THESIS, JUNE 2015. TEOMAN TURAN, 040100014. SUPERVISOR INSTRUCTOR: ASST. PROF. DR. TOLGA OVATMAN

My Undergraduate Thesis (Graduation Project) Presentation

Aug 12, 2015

Teoman Turan
Transcript
Page 1: My Undergraduate Thesis (Graduation Project) Presentation

EVALUATION of DOM TREE SIMILARITIES

UNDERGRADUATE THESIS, JUNE 2015

TEOMAN TURAN

040100014

SUPERVISOR INSTRUCTOR: ASST. PROF. DR. TOLGA OVATMAN

Page 2: My Undergraduate Thesis (Graduation Project) Presentation

A Brief Introduction to the Issue

Today, the total number of web pages around the world is growing at a tremendous pace. Within each category, many websites serve the same purpose: bulletin boards (forums), video sharing sites, social networks, video game distribution platforms, shopping sites, broadcasting sites, news portals, etc.

From http://www.internetlivestats.com/total-number-of-websites/

Page 3: My Undergraduate Thesis (Graduation Project) Presentation

A Brief Introduction to the Issue

• How can the design requirements of these millions of web pages be met?

• It is impractical to give every website a design that is unlike all the others.

• «Templates» can be considered a solution to the design issue. A template (in other words, a skeleton or schema) can serve as the basis for the design of thousands of websites, each modified by its designers.

Page 4: My Undergraduate Thesis (Graduation Project) Presentation

A Brief Introduction to the Issue

An example template:

Page 5: My Undergraduate Thesis (Graduation Project) Presentation

Issue: Evaluation of Web Page Similarities

• The growing number of similar web page designs, such as those built from «templates», calls for studying a specific problem: the evaluation of web page similarities.

• A similarity ratio with respect to some criteria: how similar are the designs of two web pages to each other?

Page 6: My Undergraduate Thesis (Graduation Project) Presentation

Brief Theoretical Background: HTML

• HyperText Markup Language, commonly called «HTML», is a markup language used to design web pages.

• Once a text file containing HTML code is saved with the .html or .htm extension, it becomes the source of a web page.

[Figure: HTML source code and its reflection, the output as interpreted by the web browser]

Page 7: My Undergraduate Thesis (Graduation Project) Presentation

Brief Theoretical Background: HTML

• Three major components form the syntax of HTML: elements, attributes, and text.

[Figure: an HTML snippet with its element, attribute, and text parts labeled]

Page 8: My Undergraduate Thesis (Graduation Project) Presentation

Brief Theoretical Background: DOM

• Document Object Model, commonly called «DOM»

• An interface for accessing and updating the components of markup code that has a nested syntax

• Provides a structural representation for HTML, XML and SVG documents

• Besides dedicated libraries such as dom4j for Java, JavaScript also has built-in support for the HTML DOM, which is the DOM examined in this thesis project.

Page 9: My Undergraduate Thesis (Graduation Project) Presentation

Brief Theoretical Background: DOM

• The major components of HTML correspond to the major DOM objects: element objects, attribute objects, and text objects.

• The nesting order of the components in HTML, XML or SVG code (and therefore of the DOM objects) forms a tree called the «DOM tree».

• A software solution using DOM can traverse the DOM tree extracted from HTML, XML or SVG code.
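The traversal mentioned above can be sketched as follows. This is a minimal sketch, not the thesis code: it assumes well-formed (XHTML-style) input and uses the W3C DOM API bundled with the JDK instead of dom4j so that it runs without extra libraries. The class and method names (`DomTraversal`, `elementNames`) are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: walks a DOM tree and lists its element names in document order.
class DomTraversal {

    // Parses well-formed markup into a W3C DOM document (assumption: the input is valid XML).
    static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }

    // Depth-first walk: record the node if it is an element, then visit its children.
    static void collectElementNames(Node node, List<String> names) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            names.add(node.getNodeName());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectElementNames(child, names);
        }
    }

    static List<String> elementNames(String markup) {
        List<String> names = new ArrayList<>();
        collectElementNames(parse(markup), names);
        return names;
    }

    public static void main(String[] args) {
        String markup = "<html><head><title>T</title></head>"
                + "<body><h1>H</h1><p>text</p></body></html>";
        System.out.println(elementNames(markup)); // [html, head, title, body, h1, p]
    }
}
```

Starting the walk at the document node rather than the root element keeps the code uniform: the document node is simply a non-element node whose only element child is the root.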

Page 10: My Undergraduate Thesis (Graduation Project) Presentation

DOM Tree Example

<element1>
  <element2>
    <element3>
    </element3>
    <element4>
    </element4>
  </element2>
</element1>

Page 11: My Undergraduate Thesis (Graduation Project) Presentation

The Major Components of an HTML DOM Tree

• Element nodes representing elements in an HTML code

• Attribute nodes representing attributes in an HTML code

• Text nodes representing texts in an HTML code

• The HTML document itself forms the «document node», which can be regarded as the root node of the tree.

• Comments also form «comment nodes» in the relevant DOM tree.
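To make the node kinds listed above concrete, the sketch below counts them in a small document. It is an illustration only, under two assumptions: the input is well-formed, and the JDK's W3C DOM API stands in for dom4j. In that API attributes are reached through `getAttributes()` rather than as child nodes, so they are counted separately. The class name `NodeKindCounter` is hypothetical.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: counts document, element, attribute, text and comment nodes.
class NodeKindCounter {

    static String kindOf(Node node) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE: return "document";
            case Node.ELEMENT_NODE:  return "element";
            case Node.TEXT_NODE:     return "text";
            case Node.COMMENT_NODE:  return "comment";
            default:                 return "other";
        }
    }

    static void walk(Node node, Map<String, Integer> counts) {
        counts.merge(kindOf(node), 1, Integer::sum);
        // getAttributes() is non-null only for element nodes.
        NamedNodeMap attributes = node.getAttributes();
        if (attributes != null) {
            counts.merge("attribute", attributes.getLength(), Integer::sum);
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, counts);
        }
    }

    static Map<String, Integer> count(String markup) {
        try {
            Node document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            Map<String, Integer> counts = new TreeMap<>();
            walk(document, counts);
            return counts;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        String markup = "<html><!-- note --><body class=\"x\" id=\"y\"><p>Hi</p></body></html>";
        System.out.println(count(markup));
        // {attribute=2, comment=1, document=1, element=3, text=1}
    }
}
```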

Page 12: My Undergraduate Thesis (Graduation Project) Presentation

The Major Components of an HTML DOM Tree

Page 13: My Undergraduate Thesis (Graduation Project) Presentation

The Main Problem: Similarity of DOM Trees

• The similarity ratio between the designs of two web pages is reflected in the similarity ratio between their HTML files.

• The similarity ratio between these HTML files is in turn reflected in the similarity ratio between their DOM trees.

• Thus, evaluating DOM tree similarity is a solution to the main problem of web page design similarity.

Page 14: My Undergraduate Thesis (Graduation Project) Presentation

The Objective of the Project

• To develop an algorithm that measures the similarity level between two DOM trees extracted from two HTML files

• Hence, the main objective: to develop a system that measures the similarity between two web pages with respect to their designs

Page 15: My Undergraduate Thesis (Graduation Project) Presentation

Major Steps for the Development

• 1 – Parse the two HTML files that have been loaded into the system.

• 2 – For each file, extract the DOM objects together with their relatives; that is, extract the DOM tree formed by the code in the file.

• 3 – Develop an algorithm to compare these DOM trees and to calculate the similarity level between them. (This is the core of the project.)

• 4 – Develop a graphical user interface (GUI) as a simple application.

Page 16: My Undergraduate Thesis (Graduation Project) Presentation

«DOM Similarity Evaluator» Application

• The project has been developed in the Java programming language, using the Eclipse IDE for Java EE Developers.

• The output of the project: «DOM Similarity Evaluator»

• A Java application with a simple GUI

• Can be launched directly from its JAR file

• It is open source and, being a Java application, cross-platform.

Page 17: My Undergraduate Thesis (Graduation Project) Presentation

dom4j Library of Java

• An easy-to-use Java library for working with HTML, XML (and XPath and similar languages), SVG, etc.

• The library parses such a file, then extracts the DOM tree formed by the code in the file, which can be traversed node by node.

• The built-in data types of the library correspond to the node types in the DOM tree, such as the element nodes of an HTML DOM tree extracted from a parsed HTML file.

• The methods of the library provide a way to obtain the parent and the children of a node, if they exist.

Page 18: My Undergraduate Thesis (Graduation Project) Presentation

About the Similarity Evaluation Algorithm

• Forms the core of the project

• The key value processed by the algorithm: the frequencies of distinct elements

• The «frequency» of a distinct element is the number of instances of that element in the DOM tree extracted from the relevant HTML file.

• For example, let the iteration over a DOM tree give the following sequence of elements:

html head title body h1 p p p ul li li li p ul li li a li button button

Here, the «distinct element-frequency» pairs are as follows:

html-1 head-1 title-1 body-1 h1-1 p-4 ul-2 li-6 a-1 button-2
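The frequency count described above can be reproduced with a short sketch. The class name `ElementFrequency` is hypothetical and the code is not taken from the thesis; it only shows one straightforward way to turn the element sequence from this slide into «distinct element-frequency» pairs.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: counts how often each distinct element name occurs.
class ElementFrequency {

    // A LinkedHashMap keeps the elements in the order of their first appearance.
    static Map<String, Integer> count(List<String> elementNames) {
        Map<String, Integer> frequencies = new LinkedHashMap<>();
        for (String name : elementNames) {
            frequencies.merge(name, 1, Integer::sum);
        }
        return frequencies;
    }

    public static void main(String[] args) {
        // The element sequence from the slide.
        List<String> sequence = Arrays.asList(
                "html head title body h1 p p p ul li li li p ul li li a li button button"
                        .split(" "));
        System.out.println(count(sequence));
        // {html=1, head=1, title=1, body=1, h1=1, p=4, ul=2, li=6, a=1, button=2}
    }
}
```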

Page 19: My Undergraduate Thesis (Graduation Project) Presentation

Sorts of Similarity Ratios Calculated

• The algorithm compares the nodes of two DOM trees, then calculates three sorts of similarity ratios, with respect to element nodes, attribute nodes, and text nodes respectively.

• There is also a fourth similarity ratio, called «overall similarity», calculated with the formula below. This is the major ratio that can be used to evaluate the design similarity between two web pages.

overall = (element * 60%) + (attribute * 30%) + (text * 10%)

Here, the weights have been assigned according to how strongly each node type influences the design of a web page.
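The overall formula is a plain weighted sum, as the sketch below shows. The class name `OverallSimilarity` and the input values in `main` are hypothetical; only the 60/30/10 weights come from the slide.

```java
// Hypothetical helper: combines the three ratios with the weights from the slide.
class OverallSimilarity {

    // All three inputs and the result are percentages in the range 0..100.
    static double overall(double element, double attribute, double text) {
        return element * 0.60 + attribute * 0.30 + text * 0.10;
    }

    public static void main(String[] args) {
        // Illustrative inputs only: 50% element, 40% attribute, 20% text similarity.
        // 50 * 0.6 + 40 * 0.3 + 20 * 0.1 = 30 + 12 + 2 = 44
        System.out.println(overall(50.0, 40.0, 20.0));
    }
}
```

Because the weights sum to 100%, two pages that score 100 on all three ratios also score 100 overall.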

Page 20: My Undergraduate Thesis (Graduation Project) Presentation

How to Calculate the Element Node Similarity

• 1 – Extract all distinct elements with their frequencies for both DOM trees. For example, let the following two «element-frequency» collections be obtained for the two trees.

Tree 1: html-1 head-1 title-1 body-1 h1-1 p-4 ul-2 li-6 a-1 button-2

Tree 2: html-1 head-1 title-1 body-1 h2-1 p-6 ul-3 li-12 a-3 img-2 table-1 td-2 tr-2

Page 21: My Undergraduate Thesis (Graduation Project) Presentation

How to Calculate the Element Node Similarity

• 2 – Find the elements that exist in both trees. In the previous slide, the following elements are common: html, head, title, body, p, ul, li, a.

• 3 – For each common element, take the lower of its two frequencies and push it to a special frequency list. (If the frequencies are equal, take that value directly.) In the current example, apart from the common elements with the same frequency in both trees, p, ul, li, and a have lower frequencies in the first tree.

Special frequency list: 1, 1, 1, 1, 4, 2, 6, 1

Page 22: My Undergraduate Thesis (Graduation Project) Presentation

How to Calculate the Element Node Similarity

• 4 – Sum up the frequencies in the list, then divide the total by the number of all element nodes in whichever tree contains more element nodes. Finally, multiply the result by 100 to obtain the percentage.

For the example being studied, the total is 17. The second tree has more element nodes than the first one: 36 versus 20.

17/36 ≈ 0.472

0.472 * 100 = 47.2%: the element node similarity between Tree 1 and Tree 2
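Steps 1 to 4 can be put together in a short sketch. It is an illustration under assumptions, not the thesis implementation: the class `ElementSimilarity`, its `parse` helper for the "name-frequency" notation, and the choice to return 0 for two empty trees are all hypothetical. For the two example trees it evaluates to 17/36, roughly 47.2%.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: element node similarity from two frequency maps.
class ElementSimilarity {

    // Reads the slide notation, e.g. "html-1 p-4", into a name -> frequency map.
    static Map<String, Integer> parse(String pairs) {
        Map<String, Integer> frequencies = new LinkedHashMap<>();
        for (String pair : pairs.trim().split("\\s+")) {
            int dash = pair.lastIndexOf('-');
            frequencies.put(pair.substring(0, dash), Integer.parseInt(pair.substring(dash + 1)));
        }
        return frequencies;
    }

    // Number of element nodes in a tree: the sum of all frequencies.
    static int total(Map<String, Integer> frequencies) {
        int sum = 0;
        for (int frequency : frequencies.values()) {
            sum += frequency;
        }
        return sum;
    }

    // Sum of the lower frequencies of the common elements, divided by the
    // element count of the larger tree, as a percentage.
    static double similarity(Map<String, Integer> first, Map<String, Integer> second) {
        int shared = 0;
        for (Map.Entry<String, Integer> entry : first.entrySet()) {
            Integer other = second.get(entry.getKey());
            if (other != null) {
                shared += Math.min(entry.getValue(), other);
            }
        }
        int larger = Math.max(total(first), total(second));
        // Assumption: two empty trees are reported as 0% similar.
        return larger == 0 ? 0.0 : 100.0 * shared / larger;
    }

    public static void main(String[] args) {
        Map<String, Integer> tree1 = parse(
                "html-1 head-1 title-1 body-1 h1-1 p-4 ul-2 li-6 a-1 button-2");
        Map<String, Integer> tree2 = parse(
                "html-1 head-1 title-1 body-1 h2-1 p-6 ul-3 li-12 a-3 img-2 table-1 td-2 tr-2");
        System.out.println(String.format(Locale.ROOT, "%.1f%%", similarity(tree1, tree2)));
    }
}
```

Dividing by the size of the larger tree keeps the ratio symmetric: swapping the two trees gives the same result.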

Page 23: My Undergraduate Thesis (Graduation Project) Presentation

How to Calculate the Attribute Node Similarity

• For the calculation of attribute node similarity, not only the attribute nodes of a DOM tree are considered but also their parents, that is, the element nodes to which they are attached as children.

• The procedure used for the element node similarity is applied to the attribute nodes themselves. The same procedure is also applied to these nodes’ parents, in other words, the element nodes whose children are these attribute nodes. As a result, two ratios are obtained.

• The average of these two values is the final attribute node similarity between the two HTML files. It is multiplied by 100 to obtain the percentage.
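The averaging step can be sketched as below. The slides do not spell out how parents are counted, so this sketch assumes two frequency maps per tree: one keyed by attribute name and one keyed by the name of the element that owns attributes. The class `AttributeSimilarity` and all sample values are hypothetical.

```java
import java.util.Map;

// Hypothetical helper: attribute node similarity as the average of two ratios.
class AttributeSimilarity {

    // Same procedure as for element nodes, returned as a fraction in 0..1.
    static double ratio(Map<String, Integer> first, Map<String, Integer> second) {
        int shared = 0;
        int totalFirst = 0;
        int totalSecond = 0;
        for (Map.Entry<String, Integer> entry : first.entrySet()) {
            totalFirst += entry.getValue();
            Integer other = second.get(entry.getKey());
            if (other != null) {
                shared += Math.min(entry.getValue(), other);
            }
        }
        for (int frequency : second.values()) {
            totalSecond += frequency;
        }
        int larger = Math.max(totalFirst, totalSecond);
        return larger == 0 ? 0.0 : (double) shared / larger; // assumption: empty vs empty is 0
    }

    // Averages the attribute ratio and the parent-element ratio, then scales to a percentage.
    static double similarity(Map<String, Integer> attributes1, Map<String, Integer> attributes2,
                             Map<String, Integer> parents1, Map<String, Integer> parents2) {
        return 100.0 * (ratio(attributes1, attributes2) + ratio(parents1, parents2)) / 2.0;
    }

    public static void main(String[] args) {
        // Illustrative frequencies only.
        Map<String, Integer> attributes1 = Map.of("class", 3, "href", 1);        // 4 attributes
        Map<String, Integer> attributes2 = Map.of("class", 2, "id", 2, "href", 1); // 5 attributes
        Map<String, Integer> parents1 = Map.of("div", 2, "a", 1);
        Map<String, Integer> parents2 = Map.of("div", 2, "a", 1, "p", 1);
        // Attribute ratio: (2 + 1) / 5 = 0.6; parent ratio: (2 + 1) / 4 = 0.75; average 0.675.
        System.out.println(similarity(attributes1, attributes2, parents1, parents2));
    }
}
```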

Page 24: My Undergraduate Thesis (Graduation Project) Presentation

How to Calculate the Text Node Similarity

• Here, the parents of the text nodes, in other words, the element nodes that own text nodes as their children, play the main role.

• The procedure used for the element node similarity is applied here as well, this time to the parent element nodes of the text nodes.
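One way to obtain the input for this step is to count, per element name, how many text nodes each kind of element owns. The sketch below does that with the JDK's W3C DOM API. It rests on assumptions the slides do not confirm: well-formed input, one count per text node, and whitespace-only text nodes being skipped. The class name `TextParentFrequency` is hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: frequencies of the elements that are parents of text nodes.
class TextParentFrequency {

    static void walk(Node node, Map<String, Integer> parents) {
        // Assumption: whitespace-only text (indentation, line breaks) is ignored.
        if (node.getNodeType() == Node.TEXT_NODE && !node.getNodeValue().trim().isEmpty()) {
            parents.merge(node.getParentNode().getNodeName(), 1, Integer::sum);
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, parents);
        }
    }

    static Map<String, Integer> count(String markup) {
        try {
            Node document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            Map<String, Integer> parents = new LinkedHashMap<>();
            walk(document, parents);
            return parents;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        String markup = "<body><h1>Title</h1><p>one</p>"
                + "<p>two <a href=\"#\">link</a></p></body>";
        System.out.println(count(markup)); // {h1=1, p=2, a=1}
    }
}
```

The resulting maps for two pages can then go through the same lower-frequency comparison that is used for element nodes.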

Page 25: My Undergraduate Thesis (Graduation Project) Presentation

What About Comment Nodes, Attribute Values and Text Values?

• The system deals with the design of two web pages.

• Comment lines in HTML code are not reflected on the output page, so they are not taken into consideration.

• Since only the schema (skeleton, structure, construction) of the DOM trees is considered, the values that attribute and text nodes take do not play a role here.

• «Node connections» (and, of course, the existence and number of nodes) play the basic role in this system.

Page 26: My Undergraduate Thesis (Graduation Project) Presentation

Thank you for listening!

(Here is also a short demonstration for the application…)