8/10/2019 Optimizing keyword queries in XML tree structure
Running Head: Optimizing Keyword Queries in XML tree structures 1
Optimizing Keyword Queries in XML Tree Structure
Name:
Instructor:
Institution:
Date
ABSTRACT
XML (Extensible Markup Language) remains the most popular and frequently
used format for representing and exchanging data on the World Wide Web. It
is applied widely across the many different data types and applications
that exist. The data may take different forms, which may include
unstructured heterogeneous, semi-structured and
structured data types. XML has been a progressive language, extending its
functionality through successive inventions and research up to the
development of data streaming applications. These inventions have received
considerable attention from many experienced users of the web, and they
have made efficient processing and querying of XML streams a central
concern.
This study focuses on retrieving answers to queries that combine structural
constraints with keywords as the search tool, an essential function in XML
data management systems. Such queries are expected to yield best-case
answers as effectively and efficiently as traditional keyword search while
factoring in the additional constraints that may exist. The study defines
the new problem of top-k keyword search over probabilistic XML data, with
the aim of retrieving the k SLCA results that have the highest probability
of existence. Finally, the study reviews other forms of keyword search and
compares them through an analysis of the algorithms that have been used.
TABLE OF CONTENTS
1.0 INTRODUCTION
    Problem Definition
    Proposal
2.0 OVERVIEW OF RELATED WORKS
    Cost-Based Query Optimization in RDBMSs
    Query Optimization Frameworks
    XML Query Optimization
    Keyword Query in Ordinary XML Documents
    Probabilistic XML
    XQuery Streaming Optimization
        Querying XML streams
        XML temporal model
    Time Intervals and Model
    Node Relationship Evaluation
    Probability
    Dominance Lowest Common Ancestor (DLCA)
        Dominance relationship
        Dominance
3.0 MEASUREMENT OF THE RELATIONSHIP BETWEEN NODES IN A DATA TREE
    Mutual Information Concepts
        Mutual information and entropy
        Mutual information
4.0 ANSWERS RETRIEVED FROM TOP-K
    Dominating Score
    Dominated Score
    Dominance Score
5.0 ALGORITHMS USED TO RETRIEVE TOP-K RESULTS
    Naïve Algorithm for Selection of Top-K Answers
    Top-K Dominated Algorithm (TKDD)
    Top-K Dominating Algorithm (TKDG)
6.0 EXPERIMENTAL EVALUATION
    Experimental Setup
    Query Sets
    Search Quality
    Efficiency and Scalability of Top-K Algorithms
7.0 CONCLUSIONS
REFERENCES
LIST OF FIGURES
Figure 1: Probabilistic XML document [18]
Figure 2: An example of an XML data tree structure, T [19]
Figure 3: Data Tree T2 [19]
Figure 4: List of candidates sorted in descending order using their M() values [24]
Figure 5: Ranking efficiency of TKDD, TKDG and TKD algorithms [9]
LIST OF TABLES
Table 1: The join probability of the two specific nodes at context node: /paper
Table 2: The join probability of the two specific nodes at context node: /proceeding
Table 3: Two dimension set candidate data
Table 4: Search spaces LA and LB derived from a list l of candidates
Table 5: Candidates list sorted in descending order of f() values
Table 6: Dominance checks count used in calculating dominated candidate scores
Table 7: Precision and recall of queries on mondial data
Table 8: Precision and recall of queries on auction data
Table 9: Precision and recall of queries on dblp data
Table 10: Comparisons on ranking effectiveness of the algorithms
1.0 INTRODUCTION
XML (Extensible Markup Language) has over the years evolved into a de facto
standard for the exchange and representation of data, which has resulted in
a proliferation of XML documents spread all over the internet. In the past,
various query languages were used to retrieve XML documents and data,
including XQuery, XPath and twig pattern queries. These languages required
users to be versed in the specific query language and the relevant data
schemas in order to execute XML queries efficiently [5]. This limited their
use to advanced users, since both the query languages and the data schemas
are complex concepts to understand. Searching data through XQuery/XPath was
therefore a major limiting factor.
The use of keywords to search for documents has been widely accepted as a
very convenient way of retrieving resources from the various remote servers
that hold the data on the internet. The majority of search engines, such as
Google, Bing, Yahoo and many more, have adopted these technologies to
facilitate data mining and data warehousing efficiently. The adoption of
keywords for querying databases has attracted a great deal of research from
the database and information retrieval (IR) communities [5][2][3]. Keyword
search is an efficient way of retrieving documents because it does not
require users to learn any query concepts. It is an advancement on earlier
practice, which required users to know the particular IP (Internet
Protocol) addresses of documents or information content and to type them
into the URL bar. This later advanced so that the IP addresses could be
attached to web links. It is from this that search engines were developed,
with more interactive and responsive algorithms able to handle a lot of
bits and pieces of
information including data mining [6]. A variety of approaches have been
assessed and reviewed for answering keyword queries over XML data. The
basic approaches that currently exist use lowest common ancestor (LCA)
semantics, rather than general graph-based techniques, to identify the hit
list for a given keyword query. This approach generates results composed of
all candidates, also known as subtrees, that contain an instance of the
queried keywords [5]. The LCAs returned can be numerous, yet the user may
be interested in only a portion of the whole hit list. Identifying the
exact data required by the user therefore remains an unsolved issue.
Ideally, the system would return exactly the piece the user requires rather
than a whole set of hits that the user must then filter
[1][2][3][4][5][6]. Existing studies have not succeeded in implementing
exact retrieval for search queries, hence it remains an unresolved and
challenging issue.
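
The LCA semantics described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each node is identified by a Dewey-style label (a tuple of child positions from the root), so that the LCA of two nodes is the longest common prefix of their labels. The function names and the sample keyword matches are hypothetical and are not taken from the cited systems.

```python
from itertools import product

def lca(a, b):
    """Longest common prefix of two Dewey labels (tuples of child positions)."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)

def lca_candidates(match_lists):
    """All LCAs obtained by picking one matching node per keyword."""
    results = set()
    for combo in product(*match_lists):
        node = combo[0]
        for other in combo[1:]:
            node = lca(node, other)
        results.add(node)
    return results

# Hypothetical matches: keyword k1 occurs at two nodes, keyword k2 at two others.
k1_matches = [(0, 0, 1), (0, 1, 0)]
k2_matches = [(0, 0, 2), (0, 1, 1)]
candidates = lca_candidates([k1_matches, k2_matches])
print(sorted(candidates))
```

Even this tiny input produces three candidate roots, including the document root, which shows why the raw LCA hit list can be much larger than what the user wants.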
Many proposals have been made to improve the precision of this baseline
approach. Most of them rest on heuristic rules that prune the tree search
so that the best answers are found in the shortest possible time. Although
intuitive, this approach is ad hoc in the sense that most of the work
consists of data filtration: candidates that meet the specified criteria
are separated from those that do not, and the output of this filtration is
presented as the result of the computation [2][3][4][5]. However, studies
show that, although these algorithms are assumed to yield best-case
results, they not only miss relevant results (false negatives) but also
return irrelevant results (false positives).
The results returned by XML keyword queries can, however, be made more
reliable if three considerations are addressed. First, the candidate hits
should be measured in terms of relevance. Second, there should be a
mechanism that filters the relevant candidate hits further to produce a
more specific list that closely matches the user's search [5][3]. Third,
the candidate hits should be ranked in descending order, from the most
probable to the least probable, among the best-case hits. This paper
focuses on the optimization of keyword queries in XML tree structures so as
to yield efficient results.
Smallest Lowest Common Ancestor (SLCA) is a widely accepted semantics for
keyword search results on a deterministic XML tree, named T in our
scenario. A node v is considered an SLCA if:
(1) the subtree Tsub(v) rooted at v contains all the keywords, and
(2) there is no descendant node v' of v such that Tsub(v') also contains
all the keywords.
For instance, consider a keyword query {k1, k2} on a p-document as shown in
the figure below. The SLCAs in this case are {MUX3, IND3}. When an
algorithm attempts a keyword search on this p-document, which serves as a
running example of a probabilistic XML document, the following challenges
are encountered [5][2].
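
The two SLCA conditions can be checked directly on a small deterministic tree. The following Python sketch is an illustration under stated assumptions rather than any of the cited algorithms: nodes are identified by hypothetical Dewey-style labels (tuples of child positions), and `matches` maps each keyword to the labels of the nodes that contain it.

```python
def contains_all(node, matches):
    """True if every keyword has a match inside the subtree rooted at node."""
    n = len(node)
    return all(any(m[:n] == node for m in nodes) for nodes in matches.values())

def slca(matches):
    """Nodes whose subtree covers all keywords while no descendant does."""
    # Every ancestor-or-self of a matching node is a potential root.
    prefixes = {m[:i]
                for nodes in matches.values()
                for m in nodes
                for i in range(1, len(m) + 1)}
    covering = {p for p in prefixes if contains_all(p, matches)}
    # Condition (2): drop nodes that have a covering proper descendant.
    return {v for v in covering
            if not any(len(w) > len(v) and w[:len(v)] == v for w in covering)}

# Hypothetical matches for the query {k1, k2}.
matches = {"k1": [(0, 0, 1), (0, 1, 0)], "k2": [(0, 0, 2), (0, 1, 1)]}
print(sorted(slca(matches)))
```

The root `(0,)` covers both keywords but is discarded because two of its descendants already do, which is exactly the filtering that condition (2) expresses. This brute-force check is quadratic in the number of covering nodes and is meant only to make the definition concrete.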
Problem Definition
Current keyword searches in XML can be divided into tree-based and
graph-based searches, both of which rely largely on structural features of
the document. However, these structural approaches do not comprehensively
use the hidden semantics within XML documents, which leads to problems in
processing specific classes of keyword query. The growing popularity of XML
has intensified the need for an accessible and precise XML query interface
that is based on natural language and on search procedures that exploit XML
structure, so that ordinary users can query XML databases simply.
Conventional methodologies, however, process queries using ad hoc and
intuitive heuristics, which frequently return false positives and unranked
answers.
Proposal
This paper systematically explores XML structure-based answers and user
expectations in order to identify the significance of XML keyword search
semantics. It further posits a semantics-based methodology for XML keyword
queries, principally through data-centric coherency ranking, which is
rooted in the design of the domain and database and is based on data
dependence and mutual information models.
Consequently, keyword query results are produced under schema
reorganization structures, with algorithms that process, rank and present
answers through coherency ranking. Experiments on actual XML data indicate
that coherency ranking achieves the highest precision, recall and ranking
quality compared with other approaches.
2.0 OVERVIEW OF RELATED WORKS
In this section, relevant publications on optimization frameworks, XML
query optimization and relational cost-based query optimization are briefly
reviewed.
Cost-Based Query Optimization in RDBMSs
The first cost-based query optimizer was proposed as part of System R, the
prototype relational database system. The optimizer had various
capabilities, which included the optimization of simple and linear select,
project, and join (S
introductions of Cartesian products. Graefe and DeWitt presented the EXODUS
Optimizer Generator, which was not confined to a specific data model but
supported the specification of algebraic transformations as rules
[5][17][25]. Together with a concrete data model, the rules serve as input
to the optimizer generator, which creates a tailor-made query optimizer.
This study explores the retrieval of answers to queries that combine
structural constraints with keywords as the search tool, an essential
function in XML data management systems. By examining evidence-based
research and building on related works, the study is expected to yield
best-case answers as effectively and efficiently as conventional keyword
search while factoring in the constraints that may exist.
Query Optimization Frameworks
Lanzelotte and Valduriez contributed an extensible framework for query
optimization that models the search space independently of the particular
search strategy [19][21]. Using this approach, developers can build highly
extensible plan enumeration frameworks. Kabra and DeWitt proposed OPT++, an
object-oriented (OO) approach to highly extensible query optimization [16].
By combining extensible search components with an extensible logical
representation and physical algebra, it lifts the work of Lanzelotte and
Valduriez to the object-oriented level [9].
XML Query Optimization
Work on XML query optimization targets the strictly limited and isolated
problem of optimizing path expressions using navigational access paths, and
lacks support for TJ and SJ operators. Wu et al. proposed five novel
dynamic programming algorithms for structural join reordering [10]. Their
approach is orthogonal; for instance, it can be used to select the most
efficient join order in S
framework for cost-based optimization, and a full-fledged DBMS does not
seem to provide the solution. It described a first approach to cost-based
XPath optimization. In contrast to this proposal, which supports the
optimization of XQuery expressions, it did not consider TJ operators or
advanced access paths such as CAS indexes or other indexes [9].
Figure 1: Probabilistic XML document [18]
Keyword Query in Ordinary XML Documents
Keyword search over XML databases has been studied extensively. Given a
keyword query on an XML data source, most related work returns the lowest
common ancestors (LCAs), or the smallest LCAs (SLCAs), of the matching
nodes as results. Schema-Free XQuery and XRANK compute SLCAs and developed
stack-based algorithms [12][28]. The Indexed Lookup Eager algorithm was
introduced for cases in which the keywords appear with significantly
different frequencies, while the Scan Eager algorithm takes over when the
keywords have similar frequencies. Most previous work focused on inferring
the definition of the returned results and discussed how results differ.
Some researchers provided more meaningful results by using the underlying
statistics of the XML data to identify the return node types [9].
Researchers also proposed that algorithms for cleaning keyword queries
optimally could be developed. This led to the design of the MS approach for
computing SLCAs for keyword queries in a multiway manner. Others took the
Valuable LCA as results by
intentionally avoiding the false negatives and false positives of SLCA and
LCA [12][25]. Researchers also proposed Indexed Stack, an efficient
algorithm for finding answers based on the semantics of Exclusive LCA. In
addition, other related works process keyword search by integrating
keywords into structured queries. XMLQL, a new query language, separates
the keywords from the structure of the query. Research has also introduced
a method for embedding keywords into XQuery in order to process keyword
search [9].
Probabilistic XML
Probabilistic XML has been studied recently, and the majority of the
proposed models have been developed together with methods for evaluating
structured queries. Nierman et al. first introduced ProTDB, with the
probabilistic types MUX (mutually exclusive) and IND (independent).
Researchers have also modeled probabilistic XML in the form of acyclic
graphs, which support arbitrary distributions over sets of children. A
probabilistic tree approach has been adopted for data integration, in which
the possibility and probability nodes are similar to IND and MUX
respectively [9][21][29].
A p-document, which is a probabilistic document written in XML, specifies a
probability distribution over a space of deterministic XML documents. Each
deterministic document that belongs to this space is referred to as a
possible world. A p-document is represented as a labeled tree that has
distributional and ordinary nodes [11][14]. Ordinary nodes are regular XML
nodes that may appear in deterministic documents, whereas distributional
nodes are used only to define the probabilistic process that generates the
deterministic documents and do not appear in those documents [5][12][24].
In the PrXML{ind, mux} model adopted here, two types of distributional node
appear in a p-document: MUX and IND [10][13].
Example 1: Consider Figure 1(a), which shows the p-document T. Ordinary
nodes are shown by their tag names, for instance C1, C2, B1 and B2. As for
the distributional nodes, MUX nodes are drawn as rounded rectangles while
IND nodes are drawn as circles. The IND2 node has two child nodes, B2 and
C1, with respective existence probabilities 0.6 and
0.5 [5]. Therefore, the probability that neither C1 nor B2 appears is
\[ (1 - 0.6) \times (1 - 0.5) = 0.2. \]
The MUX2 node has three children, IND3, E2 and D1, whose existence
probabilities are 0.5, 0.3 and 0.1 respectively. Therefore, the probability
that none of them exists is
\[ 1 - 0.5 - 0.3 - 0.1 = 0.1. \]
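
The two absence probabilities above follow different rules because IND children are independent while MUX children are mutually exclusive. A minimal Python sketch of the arithmetic follows; the function names are illustrative and do not come from the source.

```python
from math import prod

def ind_absence(child_probs):
    """Probability that none of an IND node's independent children appears."""
    return prod(1 - p for p in child_probs)

def mux_absence(child_probs):
    """Probability that a MUX node selects no child at all."""
    return 1 - sum(child_probs)

print(round(ind_absence([0.6, 0.5]), 2))       # IND2 with children B2 and C1
print(round(mux_absence([0.5, 0.3, 0.1]), 2))  # MUX2 with children IND3, E2 and D1
```

For a MUX node the child probabilities must sum to at most 1, otherwise the absence probability would be negative; the sketch does not validate this.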
Given a p-document tree T, all possible deterministic documents can be
generated by traversing T top-down. Two situations arise that must be
handled separately:
(1) For an IND node with m child nodes, 2^m copies of T are generated and
the IND node is deleted. In each copy, the m child nodes are replaced by
one distinct subset of them, and every child in that subset is connected to
the ordinary parent node of the IND node. The probability of each copy is
the product of the existence probabilities of the child nodes in the subset
and the absence probabilities (that is, 1 minus the existence probability)
of the child nodes not in the subset [5].
(2) For a MUX node with m child nodes, m + 1 copies of T are generated and
the MUX node is deleted. In each copy, the m child nodes are replaced by
either no child or one distinct child node, which is connected to the
ordinary parent node of the MUX node. The probability of each copy is the
existence probability of the chosen child node or, when no child appears,
the absence probability.
Each generated copy of T is then traversed top-down in the same way until
all distributional nodes have been deleted [5].
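
The possible-world generation described above can be sketched recursively. The Python below is an illustrative sketch, not the cited algorithm: it assumes a hypothetical tuple encoding in which an ordinary node is `("ord", label, children)` and a distributional node is `("ind", edges)` or `("mux", edges)`, with `edges` a list of `(probability, child)` pairs. Each result pairs a probability with a forest of deterministic trees of the form `(label, children)`.

```python
from itertools import product

def worlds(node):
    """Return [(probability, forest)] for the sub-document rooted at node."""
    kind = node[0]
    if kind == "ord":
        _, label, children = node
        out = []
        # Combine one alternative from every child independently.
        for combo in product(*(worlds(c) for c in children)):
            prob, forest = 1.0, []
            for child_prob, child_forest in combo:
                prob *= child_prob
                forest.extend(child_forest)
            out.append((prob, [(label, tuple(forest))]))
        return out
    _, edges = node
    if kind == "mux":
        # m + 1 alternatives: no child at all, or exactly one child.
        out = [(1 - sum(p for p, _ in edges), [])]
        for p, child in edges:
            out.extend((p * cp, cf) for cp, cf in worlds(child))
        return out
    if kind == "ind":
        # 2^m alternatives: each child is kept or dropped independently.
        out = [(1.0, [])]
        for p, child in edges:
            options = [(p * cp, cf) for cp, cf in worlds(child)] + [(1 - p, [])]
            out = [(q * op, f + of) for q, f in out for op, of in options]
        return out
    raise ValueError(f"unknown node kind: {kind}")

def leaf(label):
    return ("ord", label, [])

# Hypothetical parent A holding the IND2 node from Example 1.
doc = ("ord", "A", [("ind", [(0.6, leaf("B2")), (0.5, leaf("C1"))])])
for prob, forest in worlds(doc):
    print(round(prob, 2), forest)
```

The IND node with two children yields four deterministic documents whose probabilities sum to 1, and the document in which neither child appears has probability 0.2, matching the worked example. Enumeration is exponential in the number of distributional nodes, so a sketch like this is useful only for checking small cases.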
Various researchers have proposed a fuzzy tree model in which nodes are
associated with conjunctions of probabilistic event variables. A full
complexity analysis of queries and updates on fuzzy trees has also been
presented. Efficient algorithms were also proposed for solving the
constraint-satisfaction problem [6][18][24][29], so that sampling and query
evaluation under a set of constraints are well defined and yield the
expected query results efficiently. Other publications summarized and
extended the previously proposed probabilistic XML models, and discussed
the expressiveness of the different models and the tractability of queries
with respect to MUX and IND [5][13][15][18][27].
Various studies considered the problem of evaluating twig queries over
probabilistic XML, which may generate partial and incomplete answers with
respect to a user-specified probability threshold. Researchers also
addressed the problem of ranking the answers of a twig query by their top-k
probabilities. In summary, the cited work focused on structured XML
queries, for instance twig queries, over various probabilistic XML data
models [1][7][16][23]. Our research differs in that it critically examines
the keyword search problem over probabilistic XML data [9].
XQuery Streaming Optimization
Querying XML streams
Several streaming algorithms exist that focus on the querying problem and
the filtering procedure. Many of these algorithms center their operations
on tree-pattern queries (TPQs). TPQs correspond to XPath queries that
mainly involve the descendant and child axes [14][26]. TPQ streaming
algorithms can be extended to handle XPath queries with ordered axes
(preceding, preceding-sibling, following and following-sibling) [10], and
processing techniques have therefore been introduced for ordered axes.
Streaming algorithms broadly fall into three categories: the array-based
approach, the automaton-based approach and the stack-based approach.
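
To make the stack-based category concrete, the following Python sketch evaluates a descendant-axis path query in a single pass over an XML stream. It is a simplified illustration under stated assumptions, not one of the cited algorithms: it supports only queries of the form `//s1//s2//...`, and it uses the standard library's `xml.etree.ElementTree.iterparse` as the event source. The function names and the sample document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

def is_subsequence(steps, stack):
    """True if the query steps occur in order among the open element tags."""
    it = iter(stack)
    return all(step in it for step in steps)

def stream_descendant_matches(xml_text, steps):
    """Yield root-to-node tag paths matching //steps[0]//steps[1]//..."""
    stack = []  # tags of the elements that are currently open
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            if elem.tag == steps[-1] and is_subsequence(steps, stack):
                yield "/".join(stack)
        else:
            stack.pop()
            elem.clear()  # release the finished subtree to bound memory

xml_doc = ("<dblp><proceeding><paper><author>Liu</author></paper></proceeding>"
           "<paper><author>Wang</author></paper></dblp>")
print(list(stream_descendant_matches(xml_doc, ["proceeding", "author"])))
```

Only the `author` element nested under `proceeding` is reported. The stack holds at most one entry per open element, so memory grows with document depth rather than document size, which is the main attraction of stack-based streaming.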
XML temporal model
Previous studies of time-based XML models have identified several
disadvantages and benefits. The bitemporal approach includes both valid
time and transaction time in timestamp attributes [22][26]. Normative texts
always comprise four time intervals. Normative texts with temporal values
in an XML database introduce new interval attributes, for instance efficacy
time and publication time. This approach to XML tree partitioning
guarantees that data is distributed into partitions of equal size, taking
into account both the query processing load and the data storage cost [10].
Time Intervals and Model
The intervals are publication time, efficacy time, transaction time and
validity time. Transaction time refers to the time at which a transaction
is reflected in the database, an important factor for all transactions that
occur in time-referenced databases. Valid time refers to the interval
during which the data is valid for general use; outside it, the data may be
invalid and unusable [8][19][20]. Efficacy time is when data is used under
various conditions or in specialized cases only. Publication time
represents the time at which the data is published. Reason nodes hold data
that is sensitive and may find utility in decision support systems [10].
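
The four intervals can be pictured as attributes attached to a node. The Python sketch below is one possible representation, offered as an assumption for illustration only; the class names, field names and dates are hypothetical and are not part of the temporal model in the cited work.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Interval:
    start: date
    end: date

    def contains(self, day):
        """True if day falls inside the closed interval [start, end]."""
        return self.start <= day <= self.end

@dataclass(frozen=True)
class TemporalNode:
    label: str
    validity: Interval      # when the data is valid for general use
    efficacy: Interval      # when the data applies to specialized cases
    transaction: Interval   # when the data is reflected in the database
    publication: Interval   # when the data is published

    def valid_on(self, day):
        return self.validity.contains(day)

year_2019 = Interval(date(2019, 1, 1), date(2019, 12, 31))
node = TemporalNode("article", year_2019, year_2019, year_2019, year_2019)
print(node.valid_on(date(2019, 8, 10)))
```

Keeping the intervals as separate fields lets a query filter on any one time dimension without touching the others.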
Node Relationship Evaluation
This research deviates from other work, which evaluates the relationship
between multiple nodes using intuitive, heuristics-based rules. It focuses
instead on measuring the relationship between multiple nodes in a data tree
structure by adapting the concept of mutual information, which is applied
in various data mining processes [5][6][11] and operates on the correlation
between the attributes of a database relation. Because it is common for XML
data trees to contain nodes that have the same label but occur in different
contexts, prefix label paths are used to denote the types of nodes. A
prefix label path refers to the sequence of element names appearing along
the path from the root node to the node in question. The node types are
used to identify the specific nodes that are found in the data tree
[16][22].
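
The notion of a prefix label path can be made concrete with a short Python sketch that groups the elements of a small document by their root-to-node tag sequence. The sample document and the function name are hypothetical; the element names follow the dblp/proceeding/paper structure used in this section.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def node_types(xml_text):
    """Map each prefix label path to the text values of its instances."""
    root = ET.fromstring(xml_text)
    instances = defaultdict(list)

    def walk(elem, path):
        path = f"{path}/{elem.tag}" if path else elem.tag
        instances[path].append((elem.text or "").strip())
        for child in elem:
            walk(child, path)

    walk(root, "")
    return dict(instances)

# Hypothetical document; titles are made up for illustration.
sample = """<dblp><proceeding>
<paper><title>XML search</title><author>Liu</author></paper>
<paper><title>Top-k ranking</title><author>Wang</author></paper>
</proceeding></dblp>"""

types = node_types(sample)
print(types["dblp/proceeding/paper/author"])
```

Both `author` elements share the prefix label path `dblp/proceeding/paper/author`, so they are two instances of one node type, each with its own value.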
Figure 2: An example of an XML data tree structure, T [19]
For instance, at node 4 the prefix label path in the data tree is
dblp/proceeding/paper/author. A prefix label path can have many occurrences
in the XML data tree, and these occurrences are referred to as node
instances [2][7][21][25]. All the instances of a specific node therefore
have the same prefix label path. Every instance has a value, which is the
set of keywords contained directly in that instance. For instance, using
the tree structure in Figure 2, the prefix label path
dblp/proceeding/paper/author has instances 28, 20, 15, 11 and 4, which
have the values Richards, Wang, Zhang, Liu and
TABLE 1:
This indicates that the mutual information between two nodes varies with
the context in which they are considered. The mutual information between
the author(s) and title in the paper context is higher than the mutual
information between the author(s) and title of two different papers, that
is, when they are considered in the context of a proceeding. We can
therefore conclude that MI is a good measure of how closely nodes are
related to each other. The scale of MI values has no fixed range, as shown
by property 5 [25]. The property states that the MI of two nodes is bounded
by the minimum of their entropies. To apply this concept properly, we
require a unified scale for measuring MI across global node sets [4][18].
Node Relationship
Considering two nodes, u and v, which are joined at the context node c, the relationship of the two nodes is defined as:
$$rel(u; v \mid c) = \frac{I(u; v \mid c)}{\min\{H(u), H(v)\}} \qquad (1)$$
In this case, H(u) and H(v) refer to the entropies of nodes u and v respectively, and the values are calculated in the same way as the entropies of random variables [16].
When the value of rel(u; v | c) is high, the relationship between nodes u and v is strong at the context node c. For instance, the entropy of nodes dblp/proceeding/paper/title and dblp/proceeding/paper/author can be obtained by:
$$H(\text{dblp/proceeding/paper/title}) = -\left[\tfrac{1}{5}\log\tfrac{1}{5} + \tfrac{1}{5}\log\tfrac{1}{5} + \tfrac{1}{5}\log\tfrac{1}{5} + \tfrac{1}{5}\log\tfrac{1}{5} + \tfrac{1}{5}\log\tfrac{1}{5}\right] = -\log\tfrac{1}{5} = \log 5 = 0.70$$
$$H(\text{dblp/proceeding/paper/author}) = -\left[\tfrac{1}{5}\log\tfrac{1}{5} + \tfrac{1}{5}\log\tfrac{1}{5} + \tfrac{1}{5}\log\tfrac{1}{5} + \tfrac{1}{5}\log\tfrac{1}{5} + \tfrac{1}{5}\log\tfrac{1}{5}\right] = -\log\tfrac{1}{5} = \log 5 = 0.70$$
This implies that the relationship between dblp/proceeding/paper/title and dblp/proceeding/paper/author can be derived for each context node at which they meet. At the context node dblp/proceeding:
$$rel(\text{dblp/proceeding/paper/author}; \text{dblp/proceeding/paper/title} \mid \text{dblp/proceeding}) = \frac{0.49}{0.70} = 0.7$$
Similarly, at the context node dblp/proceeding/paper:
$$rel(\text{dblp/proceeding/paper/author}; \text{dblp/proceeding/paper/title} \mid \text{dblp/proceeding/paper}) = \frac{0.70}{0.70} = 1.0$$
This shows that the relationship between any two given nodes u and v at any context node c in any XML data tree must fall in the range [0, 1]:
$$0 \le rel(u; v \mid c) \le 1$$
This can be proved from Properties 3 and 5 of mutual information. Property 3 states that I(u; v | c) ≥ 0, which implies that rel(u; v | c) ≥ 0. Property 5 states that I(x; y | c) ≤ H(y) and I(x; y | c) ≤ H(x), which gives I(u; v | c) ≤ min{H(u), H(v)} and therefore rel(u; v | c) ≤ 1.
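The worked values above can be checked with a short sketch. It assumes that the relationship is the mutual information divided by the smaller of the two entropies (consistent with the ratios 0.49/0.70 and 0.70/0.70) and that the logarithm is base 10, since log 5 is reported as 0.70. The function names `entropy` and `rel` are illustrative, and the mutual information values 0.49 and 0.70 are taken from the text rather than recomputed.

```python
import math

def entropy(probabilities, base=10):
    """Entropy of a discrete distribution (base 10 is an assumption)."""
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)

def rel(mutual_information, h_u, h_v):
    """Assumed normalisation: MI divided by the smaller entropy."""
    return mutual_information / min(h_u, h_v)

# Five equally likely values, as in the title and author examples.
h_title = entropy([1 / 5] * 5)
h_author = entropy([1 / 5] * 5)

print(round(h_title, 2))                  # 0.7
print(round(rel(0.49, 0.70, 0.70), 2))    # 0.7  (context dblp/proceeding)
print(round(rel(0.70, 0.70, 0.70), 2))    # 1.0  (context dblp/proceeding/paper)
```

Because the mutual information never exceeds either entropy, the normalised value stays within [0, 1].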
Dominance Lowest Common Ancestor (DLCA)
In order to retrieve the relevant answers from a large number of LCA-based candidates, this research proposes new semantics referred to as Dominance LCA. We begin by introducing the relationship between LCA-based candidates.
Dominance relationship
Query candidates are subtrees of the data tree and are represented by their root nodes. For instance, given a keyword query Q = {k1, ..., kq}, a candidate S of the query Q is represented in the form S(n_lca, {n1; ...; nq}), where ni refers to a leaf node that contains ki and n_lca is the lowest common ancestor of the nodes {n1; ...; nq}. Each node in the candidate is identified by an identifier, which in this research is encoded as a Dewey code.
The Dewey code is derived from the Dewey Decimal Classification, which was developed for the classification of general knowledge [1][23]. With Dewey coding, each node is assigned a vector that represents the path from the tree root to the node. Each component along the path represents the local order of an ancestor node. This is illustrated in Figure 3:
Figure 3: Data Tree T2 [19]
The researcher chose to encode the node identifiers with Dewey codes because they are very useful for representing the hierarchical relationships between the nodes of a tree, which are an important aspect of the tree structure. The label path of a node can be found from its Dewey code. For instance, in the sample data tree T2 in Figure 3, every node is identified by a Dewey code. For the node identifier [0.1.0.0], the corresponding label path is n1/n3/n4/n5. We therefore name ID2LP(id) the function that takes a Dewey code as input and returns the corresponding label path [13][24]. A keyword may have many occurrences in a specific subtree candidate S(n_lca, {n1; ...; nq}). Every keyword ki (1 ≤ i ≤ q) therefore yields a set Li = {ni | val(ni) contains the keyword ki}.
The relationship between two keywords ki and kj in the candidate is given as:
$$rel(k_i, k_j) = \max_{n_i \in L_i,\; n_j \in L_j} rel\big(ID2LP(n_i);\, ID2LP(n_j) \mid ID2LP(lca(n_i, n_j))\big) \qquad (2)$$
In this case, ID2LP(ni) is the function that returns the node type of the node with Dewey code ni, and rel(ID2LP(ni); ID2LP(nj) | ID2LP(lca(ni, nj))), which is calculated by formula (1), measures the correlation between the nodes tagged with the Dewey codes ni and nj at their lowest common ancestor [18][25]. The relationship between the keywords ki and kj in a candidate is thus the maximum relationship between any two nodes that contain the two keywords in that candidate [13].
For instance, for a query Q = {k1, k2, k3} over the data tree T2, only one subtree candidate S is present. It is rooted at node n3 [0.1] in tree T2 and can be represented in the form S(0.1, {0.1.0.0; 0.1.1.0; 0.1.1.1}). The relationships between the query keywords in the candidate S can be calculated as follows:
$$rel(k_2, k_3) = rel(ID2LP(0.1.1.0);\, ID2LP(0.1.1.1) \mid ID2LP(0.1.1)) = rel(n_1/n_3/n_6/n_7;\, n_1/n_3/n_6/n_8 \mid n_1/n_3/n_6)$$
$$rel(k_1, k_3) = rel(ID2LP(0.1.0.0);\, ID2LP(0.1.1.1) \mid ID2LP(0.1)) = rel(n_1/n_3/n_4/n_5;\, n_1/n_3/n_6/n_8 \mid n_1/n_3)$$
$$rel(k_1, k_2) = rel(ID2LP(0.1.0.0);\, ID2LP(0.1.1.0) \mid ID2LP(0.1)) = rel(n_1/n_3/n_4/n_5;\, n_1/n_3/n_6/n_7 \mid n_1/n_3)$$
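One practical benefit of Dewey codes is that the lowest common ancestor of a set of nodes is simply the longest common prefix of their codes. The sketch below illustrates this for the nodes named in the example. The `ID2LP` dictionary is a hypothetical lookup limited to the label paths quoted in the text, not a full encoding of tree T2.

```python
def lca(*dewey_codes):
    """Lowest common ancestor of Dewey codes = longest common prefix."""
    parts = [code.split(".") for code in dewey_codes]
    prefix = []
    for components in zip(*parts):
        if len(set(components)) != 1:
            break
        prefix.append(components[0])
    return ".".join(prefix)

# Hypothetical ID2LP table covering only the nodes mentioned in the example.
ID2LP = {
    "0.1": "n1/n3",
    "0.1.1": "n1/n3/n6",
    "0.1.0.0": "n1/n3/n4/n5",
    "0.1.1.0": "n1/n3/n6/n7",
    "0.1.1.1": "n1/n3/n6/n8",
}

print(ID2LP[lca("0.1.1.0", "0.1.1.1")])          # n1/n3/n6  (context of k2, k3)
print(ID2LP[lca("0.1.0.0", "0.1.1.1")])          # n1/n3     (context of k1, k3)
print(lca("0.1.0.0", "0.1.1.0", "0.1.1.1"))      # 0.1       (root of candidate S)
```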
Given the keyword query Q = {k1, ..., kq}, the relationship of each pair of query keywords in a subtree candidate S is stored in the vector D_S. The keyword relationship vector is defined by:
$$D_S = [\, rel(k_i, k_j) \mid k_i, k_j \in Q \wedge (i < j) \,]$$
Since there are C_q^2 combinations of two keywords from a set of q keywords {k1, ..., kq}, the vector D_S contains C_q^2 elements, which is denoted as |D_S| = C_q^2. For instance, the keyword relationship vector that corresponds to the candidate S(0.1, {0.1.0.0, 0.1.1.0, 0.1.1.1}) of the query Q = {k1, k2, k3} consists of
$$C_3^2 = \frac{3!}{2!\,(3-2)!} = 3$$
elements:
$$D_S = [\, rel(k_1, k_2),\; rel(k_1, k_3),\; rel(k_2, k_3) \,]$$
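The construction of the keyword relationship vector can be sketched as follows. The pairwise values in `pair_rel` are hypothetical placeholders (the text does not give numeric values for tree T2), and `relationship_vector` is an illustrative helper name.

```python
from itertools import combinations
from math import comb

def relationship_vector(keywords, rel):
    """Return [rel(ki, kj) for every pair with i < j], in pair order."""
    return [rel(ki, kj) for ki, kj in combinations(keywords, 2)]

# Hypothetical pairwise relationship values, for illustration only.
pair_rel = {("k1", "k2"): 0.7, ("k1", "k3"): 0.4, ("k2", "k3"): 1.0}

d_s = relationship_vector(["k1", "k2", "k3"], lambda a, b: pair_rel[(a, b)])
print(d_s)                       # [0.7, 0.4, 1.0]
print(len(d_s) == comb(3, 2))    # True
```

`itertools.combinations` yields the pairs in the order (k1, k2), (k1, k3), (k2, k3), which matches the ordering of D_S in the example.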
Letting D_S and D_S' be the keyword relationship vectors of the candidates S and S', the dominance relationship between the candidates S and S' can be defined as follows [12].
Dominance
Let S and S' be two candidates of the XML keyword query Q over a given database T. S dominates S', represented as S ≻ S', if and only if the following conditions are met:
$$\exists j\,(1 \le j \le d):\; D_S[j] > D_{S'}[j] \quad \text{and} \quad \forall i\,(1 \le i \le d):\; D_S[i] \ge D_{S'}[i]$$
Here d refers to the length of the keyword relationship vectors of S and S', that is, d = |D_S| = |D_S'| = C_q^2, and D_S[i] is the i-th element of the vector D_S. In other words, S dominates S' when S is at least as good as S' in every keyword relationship and strictly better in at least one [4][11][24].
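The dominance test reduces to a component-wise comparison of two vectors of equal length. The sketch below assumes that larger relationship values are better, so that S dominates S' when it is nowhere smaller and somewhere strictly larger. The function name `dominates` and the sample vectors are illustrative.

```python
def dominates(d_s, d_t):
    """True if d_s is >= d_t in every component and > in at least one."""
    if len(d_s) != len(d_t):
        raise ValueError("keyword relationship vectors must have equal length")
    pairs = list(zip(d_s, d_t))
    return all(a >= b for a, b in pairs) and any(a > b for a, b in pairs)

print(dominates([0.7, 0.4, 1.0], [0.7, 0.3, 0.9]))  # True
print(dominates([0.9, 0.1], [0.1, 0.9]))            # False (incomparable)
print(dominates([0.5, 0.5], [0.5, 0.5]))            # False (identical vectors)
```

Two candidates can be incomparable, which is why scores that count dominators or dominated candidates are needed to produce a ranking.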
3.0 MEASUREMENT OF THE RELATIONSHIP BETWEEN NODES IN A DATA TREE
This section reviews the mutual information (MI) concept together with other related concepts. The concept is discussed in detail, with emphasis on its adaptation to measuring the meaningful relationship between the nodes of an XML data tree.
Mutual Information Concepts
Mutual information and entropy
These are central concepts in the field of information theory. Entropy is a measure of the uncertainty of a random variable, while MI quantifies the mutual dependence of two random variables [3][8][19][27].
Entropy: Consider a discrete random variable x that takes values v_x from the set dom(x) according to a probability distribution function p(v_x). The entropy of x is defined as follows:
$$H(x) = -\sum_{v_x \in dom(x)} p(v_x) \log p(v_x)$$
Conditional entropy: The conditional entropy of a random variable y given a second variable x, written H(y | x), has the following definition:
$$H(y \mid x) = -\sum_{v_x \in dom(x)} \sum_{v_y \in dom(y)} p(v_y, v_x) \log p(v_y \mid v_x)$$
In this equation, p(v_y, v_x) refers to the joint probability of (y = v_y) and (x = v_x), whereas p(v_y | v_x) is the conditional probability of (y = v_y) given (x = v_x) [10][18].
Mutual information: For two random variables, mutual information is a quantity that measures the mutual dependence between the two variables [6]. For discrete random variables x and y, their mutual information is defined as:
$$I(x; y) = \sum_{v_x \in dom(x)} \sum_{v_y \in dom(y)} p(v_x, v_y) \log \frac{p(v_x, v_y)}{p(v_x)\, p(v_y)}$$
Here p(v_x, v_y) refers to the joint probability of (x = v_x) and (y = v_y), and p(v_x) and p(v_y) are the probabilities of (x = v_x) and (y = v_y) respectively.
Mutual information has various properties, some of which are detailed as follows:
Property 1:
$$I(x; y) = H(x) - H(x \mid y) = H(y) - H(y \mid x)$$
This property gives the interpretation of mutual information. The information that y provides about x is the reduction in the uncertainty of x given knowledge of y, and the same holds for the information that x provides about the random variable y. The greater the mutual information, the more information the variables x and y reveal about each other [16][20][24][29].
Property 2:
$$I(x; y) = I(y; x)$$
Mutual information is symmetric: the information that x provides about y is the same as the information that y conveys about x [16][20][24].
Property 3:
$$I(x; y) \ge 0$$
This gives the lower bound of mutual information. When I(x; y) = 0, we get p(v_x, v_y) = p(v_x) p(v_y) for all possible values of x and y. This means that the variables x and y are independent, so knowing the value of x provides no clue about the value of y, which puts their mutual information at zero [5][16][20][29].
Property 4:
$$I(x; x) = H(x)$$
The mutual information of a variable x with itself is the entropy of x. For this reason entropy is also referred to as self-information [16][24].
Property 5:
$$I(x; y) \le H(x) \quad \text{and} \quad I(x; y) \le H(y)$$
The mutual information between two variables is bounded by the minimum of their entropies [16][20][24].
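The five properties can be checked numerically on a small joint distribution. The sketch below uses a hypothetical joint distribution over two binary variables and base-2 logarithms (the properties hold for any base). All function names are illustrative.

```python
import math
from collections import defaultdict

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginals(joint):
    px, py = defaultdict(float), defaultdict(float)
    for (vx, vy), p in joint.items():
        px[vx] += p
        py[vy] += p
    return dict(px), dict(py)

def conditional_entropy(joint, given):
    """H(other | given): given=0 conditions on x, given=1 conditions on y."""
    margin = marginals(joint)[given]
    return -sum(p * math.log2(p / margin[key[given]])
                for key, p in joint.items() if p > 0)

def mutual_information(joint):
    px, py = marginals(joint)
    return sum(p * math.log2(p / (px[vx] * py[vy]))
               for (vx, vy), p in joint.items() if p > 0)

# Hypothetical joint distribution p(x, y), for illustration only.
joint = {("a", "c"): 0.4, ("a", "d"): 0.1, ("b", "c"): 0.1, ("b", "d"): 0.4}
px, py = marginals(joint)
i_xy = mutual_information(joint)
swapped = {(vy, vx): p for (vx, vy), p in joint.items()}
self_joint = {(v, v): p for v, p in px.items()}

print(round(i_xy, 4))                                                    # 0.2781
print(math.isclose(i_xy, entropy(px) - conditional_entropy(joint, 1)))   # Property 1
print(math.isclose(i_xy, mutual_information(swapped)))                   # Property 2
print(i_xy >= 0)                                                         # Property 3
print(math.isclose(mutual_information(self_joint), entropy(px)))         # Property 4
print(i_xy <= min(entropy(px), entropy(py)))                             # Property 5
```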
4.0 ANSWERS RETRIEVED FROM TOP-K
The researcher observes that the set of DLCA answers varies with the search query. When searching for information, users are usually interested in the top-k answers, sorted in descending order of their relevance to the users' information need. This section defines three ranking functions used to identify the top-k results of a keyword-based search over XML data. The ranking functions exploit different aspects of the dominance relationships between query candidates to rank their relevance to the search query [1][20][28].
Given C(Q, T) as the set of candidates of a query Q in an XML database T, the degree of relevance of a candidate is measured on the basis of the following three ranking scores:
Dominating Score
Given a candidate answer S, the dominating score of S is defined as follows:
$$score_{dg}(S) = |\{ S' \in C(Q, T) \mid S \succ S' \}| \qquad (3)$$
The dominating score score_dg(S) of a candidate is the number of candidates that S dominates. A candidate is more relevant if it dominates as many other candidates as possible. Therefore, a higher dominating score denotes that S is more significant to the query [7][12].
Example of an Instance 1:
Let S ∈ C(Q, T) and S' ∈ C(Q, T) be two candidates of a query Q in an XML data tree T. If S ≻ S', then score_dg(S) ≥ score_dg(S').
This can be proved using the transitive property of the dominance relationship. For any two candidates S ∈ C(Q, T) and S' ∈ C(Q, T) of a query Q, if S ≻ S', then for every Si ∈ C(Q, T) with S' ≻ Si we also have S ≻ Si. Therefore,
$$|\{ S_i \in C(Q, T) \mid S \succ S_i \}| \ge |\{ S_i \in C(Q, T) \mid S' \succ S_i \}|$$
that is, score_dg(S) ≥ score_dg(S').
This guarantees that if candidate S dominates candidate S', then S is ranked higher than S' in the top-k results returned [4][11][24].
Dominated Score
Given a candidate answer S, the dominated score of S is defined as follows:
$$score_{dd}(S) = |\{ S' \in C(Q, T) \mid S' \succ S \}| \qquad (4)$$
The dominated score score_dd(S) is the number of other candidates that dominate S. Therefore, the lower the dominated score, the more meaningful candidate S is to the query [20][24]. In other words, S is more relevant when it is dominated by as few candidates as possible.
Dominance Score
Example of Instance 3
Let S ∈ C(Q, T) and S' ∈ C(Q, T) be two candidates of a query Q in an XML data tree T. If S ≻ S', then score_dd(S) ≤ score_dd(S').
This can be proved in a similar manner to the previous example. For any two candidates S ∈ C(Q, T) and S' ∈ C(Q, T), if S ≻ S', then for every Si ∈ C(Q, T) with Si ≻ S we have Si ≻ S' [16][20][24]. Therefore,
$$|\{ S_i \in C(Q, T) \mid S_i \succ S \}| \le |\{ S_i \in C(Q, T) \mid S_i \succ S' \}|$$
that is, score_dd(S) ≤ score_dd(S').
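The two scores and the two monotonicity statements can be illustrated on a small candidate set. The vectors below are hypothetical, and the dominance test assumes that larger relationship values are better. The names `score_dg` and `score_dd` mirror the notation of the text.

```python
def dominates(a, b):
    pairs = list(zip(a, b))
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)

def score_dg(s, candidates):
    """Dominating score: how many candidates s dominates."""
    return sum(dominates(s, other) for other in candidates)

def score_dd(s, candidates):
    """Dominated score: how many candidates dominate s."""
    return sum(dominates(other, s) for other in candidates)

# Hypothetical keyword relationship vectors.
cands = {"A": (0.9, 0.8), "B": (0.6, 0.7), "C": (0.5, 0.2), "D": (0.1, 0.9)}
vectors = list(cands.values())

dg = {name: score_dg(vec, vectors) for name, vec in cands.items()}
dd = {name: score_dd(vec, vectors) for name, vec in cands.items()}
print(dg)  # {'A': 2, 'B': 1, 'C': 0, 'D': 0}
print(dd)  # {'A': 0, 'B': 1, 'C': 2, 'D': 0}

# If S dominates S', S never has a lower dominating score
# and never has a higher dominated score.
for s, vs in cands.items():
    for t, vt in cands.items():
        if dominates(vs, vt):
            assert dg[s] >= dg[t] and dd[s] <= dd[t]
```

Candidate D is incomparable with the others: it dominates nothing and is dominated by nothing, which shows why the two scores can rank candidates differently.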
5.0 ALGORITHMS USED TO RETRIEVE TOP-K RESULTS
This section introduces algorithms that identify the relevant results and the top-k answers, based on skyline semantics and in accordance with the ranking criteria defined above. To obtain the set of LCA-based candidates of a given keyword query, the research adopts inverted indexes, as do other significant approaches in the literature [17]. These indexes are built offline while the XML database tree is parsed. Specifically, let Q = {k1, ..., kq} be a given keyword query and ILi be the inverted list of keyword ki. Every entry in the inverted list ILi is the Dewey code of a node containing the keyword ki. The candidate set C of query Q is defined as
$$C = \{\, lca(n_1, \ldots, n_q) \mid n_1 \in IL_1, \ldots, n_q \in IL_q \,\}$$
where lca(n1, ..., nq) is an operation that gives the lowest common ancestor of {n1, ..., nq}. The keyword relationship vector of every candidate is computed concurrently during the candidate generation process. The generated candidates are stored in a list ordered by the values of their keyword relationship vectors [4][17][25]. The detailed explanations are given in the following subsections.
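The definition of the candidate set can be read directly as a procedure: pick one node from each inverted list and take the lowest common ancestor. The sketch below does this by brute force over a hypothetical inverted index. It is only meant to make the definition concrete, since enumerating every combination is exponential in the number of keywords and is not presented here as the study's implementation.

```python
from itertools import product

def lca(*dewey_codes):
    """Lowest common ancestor of Dewey codes = longest common prefix."""
    parts = [code.split(".") for code in dewey_codes]
    prefix = []
    for components in zip(*parts):
        if len(set(components)) != 1:
            break
        prefix.append(components[0])
    return ".".join(prefix)

# Hypothetical inverted lists: keyword -> Dewey codes of nodes containing it.
inverted_lists = {
    "k1": ["0.1.0.0"],
    "k2": ["0.1.1.0", "0.2.0"],
    "k3": ["0.1.1.1"],
}

def candidate_roots(query, index):
    """Roots of all LCA-based candidates for the query keywords."""
    roots = set()
    for nodes in product(*(index[keyword] for keyword in query)):
        roots.add(lca(*nodes))
    return sorted(roots)

print(candidate_roots(["k1", "k2", "k3"], inverted_lists))  # ['0', '0.1']
```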
Naïve Algorithm for Selection of Top-K Answers
Algorithm 1: Naïve Algorithm [17]
The naïve algorithm for identifying the top-k results according to their dominated scores (and, similarly, their dominance and dominating scores) is illustrated in Algorithm 1. The algorithm iterates through every candidate in the candidate set and calculates its score by performing pairwise dominance checks between this candidate and all other candidates in the set [17][21][23]. The result set is then updated according to a comparison between the score of the new candidate and that of the current k-th candidate in the current top-k results [17][21][24][28].
The major drawback of this algorithm is its very high computational cost: regardless of the value of k, it must iterate through every candidate in the candidate set and calculate the score of each candidate by performing pairwise dominance checks between that candidate and all other candidates in the set [16][20][24]. In other words, whatever the value of k, the algorithm exhaustively performs all pairwise dominance tests across all candidates.
TABLE 3: TWO DIMENSION SET CANDIDATE DATA [24]
        D1      D2
S1      0.95    0.9
S2      0.15    0.5
S3      0.1     0.95
S4      0.5     0.4
S5      0.8     0.8
S6      0.9     0.4
S7      0.4     0.4
S8      0.3     0.2
S9      0.7     0.6
S10     0.3     0.3
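The naïve procedure can be sketched on the candidates of Table 3. The sketch assumes that larger values are better in both dimensions and ranks by dominated score (fewest dominators first), breaking ties by candidate name. The function name and the tie-breaking rule are illustrative choices, not taken from Algorithm 1.

```python
TABLE_3 = {
    "S1": (0.95, 0.9), "S2": (0.15, 0.5), "S3": (0.1, 0.95), "S4": (0.5, 0.4),
    "S5": (0.8, 0.8), "S6": (0.9, 0.4), "S7": (0.4, 0.4), "S8": (0.3, 0.2),
    "S9": (0.7, 0.6), "S10": (0.3, 0.3),
}

def dominates(a, b):
    pairs = list(zip(a, b))
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)

def naive_top_k_dominated(candidates, k):
    """Score every candidate against every other one, then keep the best k."""
    checks = 0
    scored = []
    for name, vec in candidates.items():
        score = 0
        for other_name, other_vec in candidates.items():
            if other_name == name:
                continue
            checks += 1                      # one pairwise dominance check
            if dominates(other_vec, vec):
                score += 1
        scored.append((score, name))
    scored.sort()                            # fewest dominators first
    return scored[:k], checks

top3, checks = naive_top_k_dominated(TABLE_3, 3)
print(top3)    # [(0, 'S1'), (0, 'S3'), (1, 'S5')]
print(checks)  # 90
```

The number of checks does not depend on k, which is the cost that the pruning-based algorithm is designed to avoid.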
For instance, given the set of candidates in Table 3, identifying the top-3 results requires calculating the score of each candidate Si (1 ≤ i ≤ 10) by iterating over the 9 other candidates and conducting a pairwise dominance check with each [24][28]. This takes 10 × 9 = 90 pairwise dominance checks. In general, calculating the score of a candidate in a set of n candidates requires pairwise dominance checks between that candidate and the (n − 1) other candidates in the set.
Top-K Dominated Algorithm (TKDD)
The chief aim of the TKDD algorithm is to efficiently find, for each candidate, the number of other candidates that dominate it, while avoiding exhaustive pairwise comparisons between the candidates [2][8][24]. After k results have been retrieved, the score of the k-th result is used as a maximum threshold, and candidates whose dominated scores exceed the threshold are pruned [24]. In addition, the algorithm can terminate safely once the scores of all the remaining candidates exceed the threshold. More specifically, TKDD proceeds through the following four steps:
(i) Initialization: the result set and minValue are initialized;
(ii) Termination condition: if the M(S) value of the current candidate S is below the minimum value of the current k-th candidate in the result set, the algorithm terminates and the result set is returned;
(iii) Dominance checks: pairwise dominance checks are performed between S and every other candidate S' in the search space of S. The dominated score of S is increased by 1 every time another candidate dominates it [17][22];
(iv) Result updates: if k results exist and the dominated score of the k-th candidate is larger than the current candidate's score, the k-th candidate is ejected and the current candidate is put into the result set; otherwise, if fewer than k results exist in the result set, the current candidate is inserted into it. Finally, once the size of the result set is k, the threshold minValue is updated.
Algorithm 2: TKDD [17]
TABLE 4: SEARCH SPACES LA AND LB USED TO CALCULATE score_dd(Si) AND score_dg(Si) RESPECTIVELY, DERIVED FROM A LIST L OF CANDIDATES WHICH ARE SORTED IN DESCENDING ORDER OF SPECIFIC M(S)-VALUES [24]
TABLE 5: CANDIDATES LIST SORTED IN DESCENDING ORDER OF M(S)-VALUES [24]
TABLE 6: DOMINANCE CHECKS COUNT USED FOR CALCULATION OF THE DOMINATED CANDIDATE SCORES [24]
Figure 4: List of candidates sorted in descending order using their M(S) values [24]
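The four TKDD steps can be sketched as follows. This is a hedged reading of the description rather than a reproduction of Algorithm 2. It assumes that M(S) is the largest component of a candidate's vector, that larger values are better, and that the search space of a candidate consists of the candidates whose M value is not smaller than its own (a dominator can never have a smaller maximum). The data are the candidates of Table 3, and all names are illustrative.

```python
TABLE_3 = {
    "S1": (0.95, 0.9), "S2": (0.15, 0.5), "S3": (0.1, 0.95), "S4": (0.5, 0.4),
    "S5": (0.8, 0.8), "S6": (0.9, 0.4), "S7": (0.4, 0.4), "S8": (0.3, 0.2),
    "S9": (0.7, 0.6), "S10": (0.3, 0.3),
}

def dominates(a, b):
    pairs = list(zip(a, b))
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)

def tkdd_sketch(candidates, k):
    """Top-k by fewest dominators, with threshold pruning and early exit."""
    if k <= 0:
        return [], 0
    # Assumption: candidates are ordered by M(S) = max component, descending.
    ordered = sorted(candidates.items(), key=lambda item: max(item[1]), reverse=True)
    results = []   # (i) entries are (dominated_score, name, vector), best first
    checks = 0
    for name, vec in ordered:
        full = len(results) == k
        # (ii) every remaining candidate is dominated by the k-th result
        if full and max(vec) < min(results[-1][2]):
            break
        threshold = results[-1][0] if full else None
        score = 0
        # (iii) dominance checks, restricted to the search space of this candidate
        for other_name, other_vec in ordered:
            if max(other_vec) < max(vec):
                break
            if other_name == name:
                continue
            checks += 1
            if dominates(other_vec, vec):
                score += 1
                if full and score >= threshold:
                    break                     # cannot beat the k-th result
        # (iv) result update
        if full:
            if score >= threshold:
                continue
            results.pop()
        results.append((score, name, vec))
        results.sort(key=lambda entry: entry[0])
    return [(score, name) for score, name, _ in results], checks

top3, checks = tkdd_sketch(TABLE_3, 3)
print(top3)
print(checks)
```

On this data the sketch returns three candidates with dominated scores 0, 0 and 1 while performing far fewer than the 90 checks of the exhaustive approach. Ties at the k-th position may be resolved differently from an exhaustive ranking.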
... candidate found to exist in the search space provided will be performed [24]. Concurrently, the dominating score of the candidate is calculated;
(iv) Result set update ...
6.0 EXPERIMENTAL EVALUATION
The researcher designed and performed several experiments to analyze the search performance of the approach. The experiments evaluate the results in order to compare the efficiency and quality of the researchers' approach with those of other possible approaches [3][5][6][11][17][24][28].
Experimental Setup
The experiments were conducted on a Pentium 4, 3.2 GHz computer running Windows XP Professional with 2 GB of internal memory. ... .7 MB, Mondial 1 MB and DBLP Computer Science Bibliography 877 MB. The DBLP Computer Science Bibliography includes bibliographic information on major computer science proceedings and journals. Mondial, on the other hand, is a worldwide geographic database that has been integrated from the CIA World Factbook, the TERRA database and the International Atlas, among many other sources. Auction is a synthetic benchmark data set generated by the XML generator using the default DTD from XMark [13][25][27].
Query Sets
The researchers asked a group of learners to submit fifty keyword queries to search and evaluate on every data set. Every query contained a set of search keywords, together with a brief description that was necessary to understand and identify the key intention of the query [3][5][6][11][17][24][28]. The researchers also observed that searching a specific domain, such as the three data sets used in the experiments, was not effective because the keyword queries were ambiguous, which made it hard for the users to express the search intention. As a result, it is sometimes difficult to obtain the relevant results of the queries at hand, which are a prerequisite for analyzing the performance of the researchers' approach and of other available approaches [3][5][24][28].
Search Quality
The researchers compared the quality of the DLCA approach with various existing approaches: ELCA, CVLCA, XReal, MLCA, SLCA and XSearch. The quality of these approaches was measured with three metrics that are popular in information retrieval: recall (R), precision (P) and F-measure [6][11][17][28]. In order to compute precision and recall, the researchers manually reformulated the keyword queries into schema-aware queries based on the schemas of the data sets and the descriptions of the keyword queries. The results of the transformed queries were then taken as the baseline against which precision and recall were computed, as follows. Given a keyword query Q and its corresponding transformed XQuery [2][18][27], the accurate result set obtained from the XQuery is taken as the relevant results, while the result of a specific algorithm on Q is recorded as the retrieved results [2][11][24][28]. The precision and recall of this algorithm can then be defined as follows.
The precision is the fraction of the retrieved results that are relevant to the search:
$$P = \frac{|(\text{relevant results}) \cap (\text{retrieved results})|}{|(\text{retrieved results})|}$$
The recall is the fraction of the relevant results that are successfully retrieved by the search system:
$$R = \frac{|(\text{relevant results}) \cap (\text{retrieved results})|}{|(\text{relevant results})|}$$
The F-measure, which shows the trade-off between recall and precision, is computed as:
$$F\text{-measure} = \frac{(1 + \beta^2)\, P\, R}{\beta^2 P + R}$$
Where β = 1, recall and precision are weighted equally; where β < 1, precision is emphasized; and where β > 1, recall is emphasized.
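The three metrics can be computed directly from the sets of relevant and retrieved results. The sketch below uses hypothetical result identifiers, and the function names are illustrative.

```python
def precision_recall(relevant, retrieved):
    """Precision and recall of a retrieved set against the relevant set."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def f_measure(precision, recall, beta=1.0):
    """F-measure; beta > 1 favours recall, beta < 1 favours precision."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator

# Hypothetical result identifiers.
relevant = {"n1", "n2", "n3", "n4"}
retrieved = {"n2", "n3", "n9"}

p, r = precision_recall(relevant, retrieved)
print(round(p, 3), round(r, 3))            # 0.667 0.5
print(round(f_measure(p, r), 3))           # 0.571
print(round(f_measure(p, r, beta=2), 3))   # 0.526 (pulled towards recall)
print(round(f_measure(p, r, beta=0.5), 3)) # 0.625 (pulled towards precision)
```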
From these definitions it is clear that the relevant results of each keyword query need to be determined before the evaluation metrics can be calculated. To acquire the relevant results of the tested queries, the researchers manually formed the corresponding schema-aware XQuery with the assistance of users [6][17]. The results of these queries were then used as the basis for the performance evaluation of the researchers' approach and of other available approaches.
The researchers conducted experiments with a set of 50 keyword queries using the various approaches, and measured the recall and precision of every approach by averaging the recall and precision values of the tested queries. The comparisons of recall and precision on the three data sets are shown below.
TABLE 7: PRECISION AND RECALL OF QUERIES ON MONDIAL DATA [11]
            ELCA    SLCA    XSearch  CVLCA   MLCA    XREAL   DLCA
Precision   0.503   0.712   0.635    0.670   0.712   0.721   0.922
Recall      1.000   0.624   0.943    0.910   0.903   0.647   0.939
TABLE 8: PRECISION AND RECALL OF QUERIES ON AUCTION DATA [11]
            ELCA    SLCA    XSearch  CVLCA   MLCA    XREAL   DLCA
Precision   0.478   0.706   0.623    0.640   0.699   0.703   0.901
Recall      1.000   0.650   0.931    0.920   0.907   0.650   0.931
TABLE 9: PRECISION AND RECALL OF QUERIES ON DBLP DATA [11]
            ELCA    SLCA    XSearch  CVLCA   MLCA    XREAL   DLCA
Precision   0.523   0.733   0.640    0.688   0.720   0.733   0.934
Recall      1.000   0.647   0.941    0.911   0.923   0.647   0.941
TABLE 10: COMPARISONS ON RANKING EFFECTIVENESS OF THE ALGORITHMS [11]
            MAP     R-PREC  bpref   R-RANK  P@1     P@5     P@10
TKDD        0.870   0.830   0.750   0.860   0.890   0.870   0.810
TKDG        0.850   0.820   0.790   0.870   0.920   0.890   0.840
TKD-0.25    0.840   0.800   0.790   0.860   0.910   0.910   0.860
TKD-0.50    0.860   0.820   0.760   0.860   0.900   0.870   0.820
TKD-0.75    0.870   0.850   0.730   0.880   0.880   0.850   0.800
XRANK       0.670   0.750   0.610   0.710   0.690   0.680   0.650
XSEARCH     0.700   0.770   0.630   0.680   0.730   0.680   0.660
All the ranking algorithms make it possible to identify the top ten results at a precision ranging between eighty and eighty-five percent. The mean average precision of the algorithm is 85%, and the researcher could achieve even higher precision by selecting a suitable value that balances the dominating and dominated scores [6][11][24][28].
Efficiency and Scalability of Top-K Algorithms
The researchers tested ten queries of various lengths on every data set. By default, about five thousand candidates were tested and the number of results found was thirty. For queries that had fewer candidates than required, the researchers replicated the candidates repeatedly until they obtained the required number of candidates [3][11], and then randomly selected the required number of candidates from the duplicated set. The computation cost of the algorithms is shown in the figure below. It is clear that when the number of candidates increases, the processing time of the algorithms also increases, but with different trends [3][9][14][17]. TKDD is the most efficient method in this case, as it is the least affected by the increase in the number of candidates. TKDD is only concerned with the results that are dominated by the fewest candidates. These results are usually located at the top of the sorted candidate list, so the algorithm searches only a small portion of the list. For TKDG the search space is much larger, and as a result a delay is expected. The lower performance of TKD, on the other hand, is also a result of computing the dominating score, which explains why its processing time rises in the same way as that of TKDG, with a small overhead for calculating the dominating score [5][9][13][24][28]. From the results in Figure 5, it is clear that the processing time of TKDD is less affected by the increase in the number k of returned results: it can return from ten to one hundred results from a set of five thousand candidates within a second. The processing time of the TKDG algorithm is more affected by the change of this parameter, and it takes 2.5 seconds to return the top one hundred results from a set of five thousand candidates.
Figure 5: Ranking efficiency of TKDD, TKDG and TKD algorithms [9]
7.0 CONCLUSIONS
In the thesis the researchers hae studied the issue of identifying the most
accurate outcomes and results and the top(k appropriate results for XMLkeyword questions or queries in this matter. "he use of maGor keywords tosearch for documents has been widely accepted as a ery conenient way inretrieing resources from arious remote serers that hold that speci$c typeof data on the internet. MaGority of the search engines such as Coogle# 1ing#Jahoo and many more hae adopted the use of these technologies so as toe&ciently facilitate the process of data mining and data warehousing. "headaptation and use of keywords for querying arious databases has attractedarious researches to be conducted by the research community from thea!ected $elds of database and information retrieal ?I4@. XML documents arecomposed of nested XML attributes from the root elements to the nested
sub(elements. XML elements often reference other elements which arequeried as XML alues and therefore the text content is captured using thedeputation contains ?u# k@. *onsequently# the predicate returns true when theelement u has keyword k while an XML query 2 is mapped from an XMLdatabase / to XML documents that characteri%e the query output. +s aresult# when the XML database enironment is K/ is and the XML documentsequence enironment is )# the outcome is 2D K/ ). 2?/@ is the result ofquery 2 oer database / whereby the query is identi$ed using XML querylanguage for instance X2uery. "herefore considering a sequence s# then e \s is true when e is in s. *onsider a p(document# which is a probabilisticdocument written in XML speci$es a probability distribution across space of
deterministic documents written in XML. Each deterministic document thatbelongs to this space is referred to as a possible word. + probabilisticdocument referenced as a tree that has been labeled has distributional andordinary nodes. 3rdinary nodes are basically regular and normal XML nodesand their appearance may be seen in deterministic documents# whereasdistributional nodes are only used in the de$nition of probabilistic processthat inoles the generating of deterministic documents and their occurrenceis not isible in those documents. In the adaptation of rXML Oind# muxP as
part of the probabilistic XML model, two types of distributional nodes appear in a p-document: MUX and IND.
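
The possible-world semantics of a p-document can be illustrated with a minimal sketch. It assumes the usual reading of the two distributional node types: an IND node keeps each child independently with that child's probability, and a MUX node keeps at most one child. The `Node` class, the helper names, and the example document with its probabilities are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                       # "ord" (ordinary), "ind", or "mux"
    label: str = ""                 # used only by ordinary nodes
    children: list = field(default_factory=list)   # (child, probability) pairs

def combine(option_lists):
    """Combine independent choices: concatenate forests, multiply probabilities."""
    results = [((), 1.0)]
    for options in option_lists:
        results = [(f1 + f2, p1 * p2)
                   for f1, p1 in results
                   for f2, p2 in options if p2 > 0]
    return results

def worlds(node):
    """Return (forest, probability) pairs; a forest is a tuple of (label, subforest) trees."""
    if node.kind == "ord":
        # Children of an ordinary node are always kept, so the edge probability is ignored.
        kids = combine([worlds(child) for child, _ in node.children])
        return [(((node.label, forest),), p) for forest, p in kids]
    if node.kind == "ind":
        # Each child is kept or dropped independently of its siblings.
        return combine([[((), 1.0 - p)] + [(f, p * q) for f, q in worlds(child)]
                        for child, p in node.children])
    if node.kind == "mux":
        # At most one child is kept; the leftover probability mass means "no child".
        leftover = 1.0 - sum(p for _, p in node.children)
        options = [((), leftover)] if leftover > 1e-12 else []
        for child, p in node.children:
            options += [(f, p * q) for f, q in worlds(child)]
        return options
    raise ValueError(f"unknown node kind: {node.kind}")

def labels(forest):
    """Yield every ordinary-node label that occurs in a forest."""
    for label, sub in forest:
        yield label
        yield from labels(sub)

def prob_contains(root, *keywords):
    """Probability that a random possible world contains all the given labels."""
    return sum(p for forest, p in worlds(root)
               if set(keywords) <= set(labels(forest)))

# Hypothetical p-document: a book with an uncertain title and one of two possible authors.
doc = Node("ord", "book", [
    (Node("ind", children=[(Node("ord", "title"), 0.8)]), 1.0),
    (Node("mux", children=[(Node("ord", "Ann"), 0.6), (Node("ord", "Bob"), 0.3)]), 1.0),
])
```

Distributional nodes never appear in the generated worlds: the surviving children of an IND or MUX node are attached directly to the nearest ordinary ancestor, which is why `worlds` returns forests rather than single trees. Enumerating all worlds is exponential in general, so this is only suitable for small illustrative documents.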
"he researchers hae stried to address the three ital requirements andconditions for e!ectie keyword searches of the XML. "he researchers haeintroduced new methods of analy%ing the relationship between query keywords in the candidates using mutual information idea and come up with anew /L*+ keyword queries semantic. "he researchers also hae a proposedstrategy and method of selecting the results of /L*+ from multiplecandidates and the three ranking methods used in selecting top(k resultsbased on skyline queries semantics. )ome of the properties which hae beenproen hae been acquired to accelerate proposed algorithms. "he $ndingsand experiments hae been conducted to analy%e and ealuate theresearchers experimental results and approach and they show that theapproach performs better than the approaches that hae been used in thedata sets that hae been tested and the ealuation metrics. "his is a erye&cient way of facilitating the retrieal of documents because it does notinole the learning of any concepts. "his process is an adancement of thetraditional search algorithms that were speci$cally inoling and required themastering of the particular I ?Internet rotocol@ addresses of ariousdocuments or information content and typing them in the K4L bar. is lateradanced and the I addresses were able to be attached to arious weblinks. It is from this that the search engines were deeloped with a moreinteractie and responsie algorithm that was able to handle a lot of bits andpieces of information including data mining. + ariety of approaches haebeen accessed and preiewed to $nd alternaties to the keyword queries asopposed to the XML data. "he basic approaches that currently exist uselowest common ancestor ?L*+@ type of semantics as opposed to the commongraph theory for identi$cation of the hit list gien a certain keyword query."his particular approach generates results composed of all candidates# alsoknown as sub trees# containing an instance of the queried keywords. 
"he L*+returned alues can be numerous yet the user may Gust be interested in aportion or bit of the whole hit list. It therefore remains and unsoled issue tobe able to identify the exact dataset that is required by the user of thesystem. "he ideal situation and the best case scenario would be for thesystem to be able to generate an exact piece that is required by the user asopposed to proiding a whole set of hits which also gies the user an extraGob to $lter the content until they obtain an exact piece. "he researchers
hae stried to address the three ital requirements and conditions fore!ectie keyword searches of the XML. "he researchers hae introduced newmethods of analy%ing the relationship between query key words in thecandidates using mutual information idea and come up with a new /L*+keyword queries semantic. "he researchers also hae a proposed strategyand method of selecting the results of /L*+ from multiple candidates andthe three ranking methods used in selecting top(k results based on skylinequeries semantics. )ome of the properties which hae been proen hae
been applied to accelerate the proposed algorithms. Experiments were conducted to evaluate the researchers' approach, and the results show that it performs better than the existing approaches on the data sets tested and the evaluation metrics examined.
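
The flow described above, from LCA candidates to a skyline and then to top-k results, can be sketched as follows. This is only an illustration under stated assumptions: nodes are identified by Dewey labels (tuples of child positions), the two score dimensions (depth of the LCA node and negated spread of its keyword matches) are hypothetical stand-ins, and the final ordering by the sum of scores is a placeholder. The study's DLCA semantics, mutual-information measure, and three ranking methods are not reproduced here.

```python
from itertools import product

def lca(*labels):
    """Lowest common ancestor of Dewey labels = their longest common prefix."""
    prefix = []
    for parts in zip(*labels):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return tuple(prefix)

def candidates(match_lists):
    """Map each LCA node to a score vector (depth, -spread) from its tightest match combination."""
    scored = {}
    for combo in product(*match_lists):
        node = lca(*combo)
        # spread = total number of edges from the LCA down to the matched nodes
        spread = sum(len(m) - len(node) for m in combo)
        score = (len(node), -spread)
        if node not in scored or score > scored[node]:
            scored[node] = score
    return scored

def dominates(a, b):
    """a dominates b if it is at least as good everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def skyline(scored):
    """Keep only the candidates that no other candidate dominates."""
    return {n: s for n, s in scored.items()
            if not any(dominates(t, s) for t in scored.values())}

def top_k(scored, k, rank=sum):
    """Order the skyline by a ranking function (placeholder: sum of scores) and keep k results."""
    sky = skyline(scored)
    return sorted(sky, key=lambda n: rank(sky[n]), reverse=True)[:k]

# Hypothetical keyword matches, one list of Dewey labels per query keyword.
matches = {
    "XML": [(1, 1, 1), (1, 2, 1), (1, 3, 1, 1)],
    "Liu": [(1, 1, 2), (1, 3, 1, 2, 1, 1)],
}
scored = candidates(list(matches.values()))
```

In this example the root `(1,)` is also an LCA of some match combinations, but it is dominated by the more specific candidate `(1, 1)` and therefore does not reach the skyline. The brute-force enumeration over all match combinations is for clarity only and would not scale to large match lists.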

A simple cost model introduced by the authors was based on CPU costs and weighted I/O, and used statistics on the number of data pages occupied by relations to bind the cost model to concrete values. The dynamic programming algorithm selects an optimal operator for specific access paths. After that, an optimal join order is determined based on an assumption of local optimality. To prune the search space early, not all possible enumerations are considered. Instead, the focus is on interesting join orders, for instance orders that can do without introducing additional Cartesian products. Graefe and DeWitt presented the EXODUS Optimizer Generator; this system was not confined to a specific data model but supported the specification of algebraic transformations as rules. Combined with a concrete data model, the rules serve as input for the optimizer generator, which creates a tailor-made query optimizer.

This paper systematically explores XML structure-based answers and user expectations in order to identify the significance of XML keyword search semantics. It further proposes a semantics-based methodology for XML keyword queries, principally through data-centric coherency ranking, which is rooted in the design of the domain and database and is based on data dependence and mutual information models. Consequently, keyword query results are produced under schema reorganization structures, with query processing and ranking algorithms that use coherency ranking to develop answers. Experiments on actual XML data indicate that coherency ranking achieves the highest precision, recall and ranking quality compared with other approaches.
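
The dynamic-programming join ordering described earlier in this passage can be sketched as below. The sketch builds left-deep join orders bottom-up and skips any step that would introduce a Cartesian product. It swaps in a deliberately simplified cost model, the sum of estimated intermediate result sizes, in place of the CPU and weighted I/O costs mentioned above; the function name, the relation cardinalities, and the join selectivities are all hypothetical.

```python
from itertools import combinations

def best_join_order(card, sel):
    """card: {relation: row count}; sel: {frozenset({a, b}): join selectivity}.

    Returns (cost, estimated rows, left-deep order), or None if the relations
    cannot be joined without a Cartesian product.
    """
    rels = sorted(card)
    # best[subset] = cheapest plan found so far for joining exactly that subset
    best = {frozenset([r]): (0.0, float(card[r]), (r,)) for r in rels}
    for size in range(2, len(rels) + 1):
        for subset in map(frozenset, combinations(rels, size)):
            for r in sorted(subset):            # r is the relation joined last
                rest = subset - {r}
                if rest not in best:
                    continue                    # no product-free plan for the rest
                edges = [sel[frozenset({r, o})] for o in rest
                         if frozenset({r, o}) in sel]
                if not edges:
                    continue                    # joining r here would be a Cartesian product
                cost, rows, order = best[rest]
                out_rows = rows * card[r]
                for s in edges:
                    out_rows *= s               # apply every join predicate linking r to the rest
                new_cost = cost + out_rows      # simplified cost: sum of intermediate sizes
                if subset not in best or new_cost < best[subset][0]:
                    best[subset] = (new_cost, out_rows, order + (r,))
    return best.get(frozenset(rels))

# Hypothetical statistics: A joins B, and B joins C; A and C share no join predicate.
card = {"A": 1000, "B": 100, "C": 10}
sel = {frozenset({"A", "B"}): 0.01, frozenset({"B", "C"}): 0.1}
plan = best_join_order(card, sel)
```

Because each subset keeps only its cheapest plan, the optimal plan for a larger subset is assembled from optimal plans for smaller ones, which is the local-optimality assumption the text refers to. Restricting the search to left-deep orders and to product-free extensions is what keeps the enumeration small.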

Current keyword searches in XML can be divided into tree-based and graph-based searches, which rely largely on structural document features. However, these structural approaches do not comprehensively utilize the hidden semantics within XML documents, leading to issues in processing specific classes of keyword queries. The growing popularity of XML has intensified the need for an accessible and precise XML query interface based on natural language, and for search procedures that exploit XML structure to simplify queries by ordinary users of XML databases.