CroRDF: Optimization for RDF Query on monetary cost via Crowdsourcing
Depeng Dang, Member, IEEE
Abstract—The proliferation of structured data and advances in knowledge graphs have enabled the construction of knowledge bases that use the RDF data model to represent various resources and their relationships. However, some RDF queries cannot be answered completely from the existing data. In this paper, we present CroRDF, a query system that provides users with low-cost query services based on existing and crowdsourced RDF data. We propose crowdsourcing query plan (CQP) enumeration optimization algorithms that enumerate the CQPs in the search space based on the selection of high-scoring acquisition rules for each triple pattern in the basic graph pattern (BGP). To find the optimal CQP, we describe a monetary cost estimation algorithm. The algorithms reduce the total time required to traverse the search space and improve the optimization efficiency. We present the monetary cost estimation algorithm, which considers the relationships between triple patterns, in detail; this algorithm is combined with the multiple choices of crowdsourcing direction to estimate the monetary cost of a query. To evaluate CroRDF, we create different queries on the DBpedia dataset. The crowd uses Amazon Mechanical Turk to contribute their knowledge. Experimental results clearly show that our solution achieves low monetary cost by integrating crowdsourcing platforms with the existing data.
Index Terms—Crowdsourcing, RDF, Monetary cost estimation, Crowdsourcing cost optimization
—————————— ◆ ——————————
1 INTRODUCTION
Since Google optimized its search services with knowledge graphs, knowledge graphs have grown rapidly, and a variety of semantic knowledge bases have emerged in both industry and academia, such as DBpedia, YAGO-NAGA, Freebase and GeoNames. The Resource Description Framework (RDF) is a W3C standard for describing network resources. It is widely used to represent various resources and their relationships in the knowledge graph. RDF is a semi-structured data model in which entities are represented as resources; connections between resources are described as triples composed of subjects, predicates and objects [1]. Many semantic knowledge bases use the RDF semantic model to express millions of fact entities and their relations. Rich and substantial knowledge bases provide not
The existing data in the knowledge base are as shown in Fig. 4. If the number of results obtained is less than the target β (here, 5), the query process switches to the Collect phase. In this phase, based on the partial results obtained in the Search phase, the TPGenerate processor generates ordered BGP graphs according to certain rules, i.e., different execution sequences of the triple patterns. The Acquire processor determines the crowdsourcing direction and the acquisition rule set of each triple pattern according to the acquisition rule scores, generating the candidate optimal CQPs in the effective search space. Then, the CostEst module is used to estimate the crowdsourcing cost and to find the optimal plan. Finally, the CreateQuestions and LoadAnswer processors in the crowdsourcing module publish the crowdsourcing questions and collect the results.
This paper focuses on crowdsourcing query optimization.
Therefore, the specific query optimization process of the
Search phase is not discussed. In Sections 4 and 5, we will
explain the cost estimation for a CQP and the acquisition rule
evaluation algorithms used in the Collect phase.
3 SEARCH PHASE
In this phase, the SPARQL query process is transformed into a sub-graph matching problem using graph exploration [31]. The triple patterns in the SPARQL query are sorted into a processing order {q1, ..., qn}, and the matching set of the i-th triple pattern qi is calculated over the whole graph. According to the matching set of qi, qi+1 is mapped with the graph exploration query. In an ordered set of triple patterns, the triples interact with and constrain each other, and each matching step is based on the previous results, which reduces the intermediate result sets and improves the query performance.
Algorithm 1 illustrates the main process of the Search phase, where q→ represents a triple pattern with a direction, i.e., the crowdsourcing direction from the subject to the object, indicating that the node shared with another triple pattern is the subject, and q← represents the crowdsourcing direction from the object to the subject. We call the source of q→ and q← "src" and their target "tgt"; "p" represents the predicate, and "dir" represents the correspondence between "src" and "tgt". When src is a variable, LoadNodes initializes the candidate set B(src) through the predicate indexes; when src is a constant, B(src) is initialized as the constant. Then, for each candidate item in B(src), SelectByPredicate searches for the suitable candidate set of tgt. A result is added into R only when the tgt matches B(tgt).
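To make this matching step concrete, the following Python sketch performs a MatchPattern-style directed match over a toy in-memory triple store. The index structures and the LoadNodes/LoadNeighbors/SelectByPredicate counterparts are illustrative assumptions rather than the actual CroRDF implementation.

from collections import defaultdict

class TripleStore:
    # Toy in-memory RDF store with the indexes Algorithm 1 assumes (illustrative only).
    def __init__(self, triples):
        self.by_pred = defaultdict(list)    # predicate -> list of (s, p, o)
        self.out_edges = defaultdict(list)  # subject -> [(p, o)]
        self.in_edges = defaultdict(list)   # object  -> [(p, s)]
        for s, p, o in triples:
            self.by_pred[p].append((s, p, o))
            self.out_edges[s].append((p, o))
            self.in_edges[o].append((p, s))

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def match_pattern(store, src, p, tgt, forward, bindings):
    # Directed match of one triple pattern; forward=True is the subject-to-object direction (q→).
    # bindings maps variables to their current candidate sets B(.).
    if is_var(src):
        b_src = bindings.get(src) or {t[0] if forward else t[2] for t in store.by_pred[p]}
    else:
        b_src = {src}
    results = set()
    for s in b_src:
        neighbors = store.out_edges[s] if forward else store.in_edges[s]   # LoadNeighbors
        n = {o for (pred, o) in neighbors if pred == p}                    # SelectByPredicate
        b_tgt = bindings.get(tgt) if is_var(tgt) else {tgt}
        for o in (n if b_tgt is None else n & b_tgt):                      # keep only matches of B(tgt)
            results.add((s, p, o) if forward else (o, p, s))
    return results

# Usage with Fig. 4 style data: match q→ {?doctor WorkIn ?hospital}.
store = TripleStore([("wang3", "WorkIn", "Chinese Medicine Hospital"),
                     ("wang1", "WorkIn", "Jishuitan Hospital")])
print(match_pattern(store, "?doctor", "WorkIn", "?hospital", True, {}))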
In Example 2, assume that the existing data in the knowledge base are as shown in Fig. 4. For q→ {?doctor WorkIn ?hospital}, there will be 4 matching results in R according to Algorithm 1: {(wang3, WorkIn, Chinese Medicine Hospital), (wang1, WorkIn, Jishuitan Hospital),
and questions dynamically. Then, the crowdsourcing
platform can handle the crowdsourcing questions and
collect new data later.
4.1 Generate Ordered BGP Graphs
For a SPARQL query Q, we first construct a BGP graph
to describe the structural relationship between the triple
patterns. Then, all possible ordered BGP graphs of the triple
patterns that describe the process orders are determined.
Based on the BGP graphs, we construct all possible logical
plans.
Definition 1. A Logical Plan is a sequence of triple
patterns corresponding to an ordered BGP graph.
Assume that the triple pattern set TP1 = {q1, q2, ..., qn}, in the order in which the triple patterns appear in the query, is the initial ordered BGP graph. Based on TP1, the positions of pairs of triple patterns are exchanged according to the rule position(qi) ↔ position(qj) (i ≠ j) to form different triple pattern sequences corresponding to different ordered BGP graphs. When there are n triple patterns, n! triple pattern sequences are generated. The generation process of the triple pattern sequences is shown in Algorithm 2.
Different ordered BGP graphs may have different crowdsourcing costs.
Fig. 4. Existing data in the knowledge base (doctors wang1, wang2 and wang3, connected by WorkIn, MajorIn, Has_level, Has_rate and PositionTitle edges to hospitals, departments, levels, rates and position titles).

Algorithm 1 MatchPattern
Input: Triple pattern e (e = q← or e = q→)
Output: The matching set R
1: Initialize src, tgt, p and dir from e
2: if src is a variable then
3:   B(src) = LoadNodes(p, dir)
4: else if src is a constant then
5:   B(src) = {src}
6: for each s in B(src) do
7:   Id_ListSet = LoadNeighbors(s, dir) // Get the adjacency list corresponding to s
8:   N = SelectByPredicate(Id_ListSet, p)
9:   for each o in N ∩ B(tgt) do
10:    R = R ∪ {(s, p, o)}
11: return R
When the number of triple patterns in the TP set is large, a large number of TP sequences are generated, which may affect the cost estimation and the efficiency of the crowdsourcing optimization process. Therefore, a pruning process is necessary. Given that the triple patterns are crowdsourced in a certain order, there exists a binding set of associated values among them that limits the candidates and reduces unnecessary acquisition rule generation. Therefore, when generating TP sequences, we consider only the TP sequences (line 6) that have an association between every two triple patterns, which effectively reduces the number of candidate ordered BGP graphs.
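As a rough illustration of this pruning, the sketch below enumerates triple pattern orderings and keeps only those in which every triple pattern shares a variable with an earlier one; the exact association test used by Algorithm 2 may differ, so this is only a sketch under that assumption.

from itertools import permutations

def variables(tp):
    # Variable names (terms starting with '?') of a triple pattern (s, p, o).
    return {t for t in tp if isinstance(t, str) and t.startswith("?")}

def connected_orderings(triple_patterns):
    # Enumerate ordered BGP graphs, pruning sequences whose triple patterns are not
    # chained by shared variables (an assumed reading of the association-based pruning).
    kept = []
    for seq in permutations(triple_patterns):
        bound = variables(seq[0])
        connected = True
        for tp in seq[1:]:
            if not (variables(tp) & bound):   # no association with any earlier triple pattern
                connected = False
                break
            bound |= variables(tp)
        if connected:
            kept.append(seq)
    return kept

# Example BGP: ?doctor WorkIn ?hospital . ?doctor MajorIn ?dept . ?hospital Has_rate ?rate
bgp = [("?doctor", "WorkIn", "?hospital"),
       ("?doctor", "MajorIn", "?dept"),
       ("?hospital", "Has_rate", "?rate")]
for seq in connected_orderings(bgp):
    print([p for (_, p, _) in seq])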
4.2 Evaluate Acquisition Rules
Based on the ordered BGP graphs, there are different acquisition rules for each triple pattern that generate different crowdsourcing questions.
4.2.1 Acquisition Rules
Definition 2. An Acquisition Rule is the rule extracted from a triple pattern in a BGP graph that defines how to generate crowdsourcing questions and acquire data from crowdsourcing platforms.
The general form of the acquisition rule is Predicate(subject, object). There are two specific forms when generating acquisition rules: one is Predicate(?, object), with a known object and an unknown subject; the other is Predicate(subject, ?), with a known subject and an unknown object. The acquisition process obtains an unknown value according to a known value. We can set a certain reward for each acquisition rule based on the predicate and pay workers when they complete the crowdsourcing question generated by the acquisition rule later. We take the hospital system as an example. Some acquisition rules are as follows:
Is(?, doctor): Ask for a doctor's name.
WorkTime(NAME, ?): Ask for the working time according to the name of the doctor.
A triple pattern in the WHERE clause of a SPARQL query can generate a specific set of acquisition rules. The triple pattern is formally expressed as ?_var1 <P> ?_var2/CONST, where ?_var1 and ?_var2 represent variables (the subject and the object); the object may also be a constant. According to the definition of the acquisition rules, we can generate the following three types of acquisition rules: I: P(?_var1, CONST); II: Is(?_var1, var1), Is(?_var2, var2); III: P(VAR1, ?_var2), P(?_var1, VAR2). Here, var1 and var2 represent the categories to which the subject and the object nodes belong, and VAR1 and VAR2 denote the corresponding values of the subject and the object. Different acquisition rules can be selected under different conditions, and the data for the corresponding triple pattern can be acquired.
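The mapping from a triple pattern to its candidate acquisition rules can be sketched in Python as follows. The branching (Type I when the object is a constant, Types II and III when both ends are variables), the rule strings and the category lookup are illustrative assumptions, not the exact CroRDF interface.

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def acquisition_rules(subject, predicate, obj, category_of=None):
    # Candidate acquisition rules for a triple pattern ?_var1 <P> ?_var2/CONST.
    # category_of is an assumed helper mapping a variable to its category (e.g., ?doctor -> doctor).
    category_of = category_of or (lambda var: var.lstrip("?"))
    rules = []
    if not is_var(obj):
        rules.append(f"{predicate}({subject}, {obj})")              # Type I: known constant object
    else:
        rules.append(f"Is({subject}, {category_of(subject)})")      # Type II: ask for category instances
        rules.append(f"Is({obj}, {category_of(obj)})")
        rules.append(f"{predicate}(VAR1, {obj})")                   # Type III: one end given a concrete
        rules.append(f"{predicate}({subject}, VAR2)")               #           value of the other end
    return rules

print(acquisition_rules("?doctor", "WorkIn", "?hospital"))
print(acquisition_rules("?doctor", "WorkIn", '"Beiyi Hospital"'))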
4.2.2 Acquisition Rules Selection
Definition 3. A Physical Plan is a sequence of
acquisition rules. It is converted from a logical plan by
choosing the crowdsourcing direction for each triple
pattern in the logical plan and determining the
acquisition rule for the corresponding triple pattern.
Fig. 5. CQPs and acquisition rules for plans A and B
Algorithm 3: SearchBestPlanOriginal Procedure
1  bestPlan <- NULL
2  minCost <- ∞
3  for each seqBGP do
4    for each fetchRuleSet do
5      plan <- GeneratePlan(seqBGP, fetchRuleSet)
6      plan.TriplePossEst()
7      cost <- plan.CostEst(plan.poss)
8      if cost < minCost then
9        minCost <- cost, bestPlan <- plan
10 return bestPlan
executable physical plans by selecting different
acquisition rules.
4.3.2 Enumeration Algorithms
Definition 5. PossiNum is the number of possible result tuples needed for each candidate acquisition rule for a triple pattern, which is related to the cost of the corresponding crowdsourcing plan. The details of how to estimate the PossiNum are discussed in Section 5.
Note: Different physical plans have different acquisition rules, and different acquisition rules have different turns ratios, which means that the one-to-one crowdsourcing questions they generate need different numbers of result tuples (PossiNum) to find the right answer. The number of result tuples needed is directly related to the monetary cost of crowdsourcing.
We now consider the problem of efficiently
enumerating all CQPs in the search space. In CroRDF,
the same logical plan may correspond to different
physical plans, resulting in different crowdsourcing
costs. Thus, the PossiNum estimation is applied at the
physical plan level to help select the optimal CQP.
Moreover, the CroRDF PossiNum estimation is holistic
and is based on an ordered triple pattern sequence in
which the PossiNum of each triple pattern partly
depends on the other parts of the CQP and affects the
other triple patterns. Therefore, the goal of the
enumeration algorithm is to generate a complete CQP
in the search space while maximally reusing the
common triple pattern subsequence. First, we propose
a naive enumeration algorithm. Then, we propose an
improved efficient enumeration algorithm based on
reuse. The performance of the two enumeration
algorithms is compared in the experiment.
4.3.2.1 Naive Algorithm
The naive enumeration algorithm iteratively
generates all valid CQPs in the search space.
Algorithm 3 illustrates the whole process. First, all
ordered BGP graphs (line 3) are enumerated using the
EnumerateBGP algorithm in Section 4.1. For one
ordered BGP, a set of complete acquisition rules is
generated and combined according to the evaluation
scores proposed in Section 4.2, which constructs a
candidate CQP (lines 4 and 5). The optimal CQP is then
selected by using the PossiNum estimation and cost
model (lines 6-9).
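A compact Python sketch of this naive search is given below; GeneratePlan, the PossiNum estimator and the cost model are stubbed with assumed interfaces, so only the enumeration structure of Algorithm 3 is meant to be faithful.

import math
from itertools import product

def search_best_plan(ordered_bgps, rule_sets_for, estimate_possinum, cost_of):
    # Naive CQP enumeration: try every ordered BGP graph with every complete acquisition
    # rule combination, estimate its PossiNum, and keep the cheapest plan.
    best_plan, min_cost = None, math.inf
    for seq_bgp in ordered_bgps:                                 # line 3: each ordered BGP graph
        for rule_combo in product(*rule_sets_for(seq_bgp)):      # line 4: each acquisition rule set
            plan = list(zip(seq_bgp, rule_combo))                # line 5: build a candidate CQP
            poss = estimate_possinum(plan)                       # line 6: PossiNum estimation
            cost = cost_of(plan, poss)                           # line 7: cost model
            if cost < min_cost:                                  # lines 8-9: keep the cheapest plan
                min_cost, best_plan = cost, plan
    return best_plan, min_cost

# Toy usage with stubbed estimators (all values illustrative).
bgps = [(("?d", "WorkIn", "?h"), ("?d", "Has_level", "?l"))]
plan, cost = search_best_plan(
    bgps,
    rule_sets_for=lambda seq: [["WorkIn(?d, VAR2)", "Is(?h, hospital)"], ["Has_level(VAR1, ?l)"]],
    estimate_possinum=lambda plan: [3] * len(plan),
    cost_of=lambda plan, poss: sum(poss))        # unit cost per result tuple
print(cost)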
4.3.2.2 Improved Algorithm
The naive enumeration algorithm processes each
CQP independently. Since different CQPs may have
common triple pattern subsequences, it is possible to
generate a duplicate estimation for the same
subsequence. To improve the enumeration efficiency,
we can record the estimated results of these common
triple subsequences. Note that there are associated
values between the triple patterns, but we cannot
directly save the estimated PossiNum, although saving
the PossiNum calculation relationship between the
triples is feasible. Therefore, the algorithm does not
have to repeat to determine the relationship between
two triple patterns and can perform the calculation
directly based on the input parameters.
For a SPARQL query with n triple patterns, although the computational complexity increases with n, the computational time is reduced compared to repeatedly calculating the triple patterns of all CQPs. Therefore, we can enumerate the physical plans by using the combination of every two triple patterns while considering the acquisition rules. The naive enumeration algorithm first selects an ordered BGP graph and then enumerates the physical plans by selecting the rules for the triple patterns. All possible CQPs are thus enumerated.
5 MONETARY COST ESTIMATION
This section describes how the CroRDF system
estimates the cost of a CQP. Assume that each
acquisition rule has a fixed cost that can be set by the
CroRDF system. Although the cost may vary with different acquisition rules, we adopt the simplifying assumption that the cost of each acquisition rule does not depend on the specific predicate. Therefore, we convert the cost estimation into a PossiNum estimation, i.e., estimating the number of possible result tuples that the acquisition rule of each triple pattern in the SPARQL query needs to generate to satisfy the overall query target. The cost estimation formula is as follows:
estimation cost = $\sum_{q_i \in TP} \sum_{f_{ij} \in F_i} c_{ij} \times f_{ij}$, (1 ≤ i ≤ n, 1 ≤ j ≤ mi),
where TP is the set of triple patterns in the SPARQL query, qi is a triple pattern, Fi is the set of candidate acquisition rules generated by qi, fij is the PossiNum of the j-th acquisition rule in Fi, and cij is the cost of the acquisition rule corresponding to fij. To estimate the PossiNum of a triple pattern, we should fully consider the associations and restrictions among the triple patterns.
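Before turning to the PossiNum estimation itself, the formula can be evaluated as in the small Python snippet below; the per-rule costs and PossiNum values are made-up illustrative numbers, not measured ones.

# Candidate acquisition rules per triple pattern: (rule, cost c_ij, PossiNum f_ij); values are illustrative.
plan = {
    "q1: ?doctor WorkIn ?hospital": [("WorkIn(?doctor, VAR2)", 0.05, 12)],
    "q2: ?doctor Has_level ?level": [("Has_level(VAR1, ?level)", 0.05, 6),
                                     ("Is(?level, level)", 0.02, 6)],
}
estimation_cost = sum(c * f for rules in plan.values() for (_, c, f) in rules)
print(f"estimated cost = {estimation_cost:.2f}")   # 0.05*12 + 0.05*6 + 0.02*6 = 1.02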
5.1 PossiNum Estimation
When executing a SPARQL query, CroRDF generates a BGP graph composed of triple patterns. A CQP corresponds to an ordered BGP graph that indicates the order in which the triple pattern is executed. Therefore, the PossiNum estimation algorithm can be regarded as a graph exploration and traversal process that considers the association among triple patterns. Based on the resolution rule turns ratio and predicate density, the whole process starts from the extended query target, estimates the result tuples that each triple pattern needs to deliver to the next triple pattern, and computes the PossiNum of each triple pattern until the entire BGP graph traversal is complete and returns the calculation result.
5.2 Important Parameters
In the PossiNum estimation, the resolution rule turns ratio and predicate density can be applied to estimate the PossiNum.
5.2.1 Resolution Rules
Resolution rules are applied to eliminate the ambiguity and inconsistency of the crowdsourced result triples, and the resolved results are returned to the knowledge base. The form of a resolution rule is Rule(S->O, predicate), where S and O represent the subject and the object (S can be empty), respectively, and predicate is the predicate involved in the rule. The resolution process groups all crowdsourced result tuples by S, and for each group it regards the set of values in O as the input and outputs a result according to the specific resolution rule. Each resolution rule limits the number of inputs to a minimum or average number, and more inputs must be collected if the available inputs do not satisfy this limit. The number of inputs can be used for the query cost estimation. The resolution rules involved in the query process include distinct, majority, average, etc. In the example of the hospital system, there may be resolution rules such as the following:
Distinct(∅->hospital, Is): Remove duplicate values.
Average-3(doctor->score, Has_rate): Calculate the average of three scores.
Majority-3(doctor->hospital, WorkIn): Take the majority of the three results.
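A minimal Python sketch of applying such resolution rules, grouping the crowdsourced (S, O) result tuples by S and resolving each group's O values, is shown below; the three rule bodies follow the examples above, and everything else is an assumption.

from collections import Counter, defaultdict

def resolve(result_tuples, key_of, rule):
    # Group crowdsourced (S, O) result tuples by S and resolve each group's O values.
    groups = defaultdict(list)
    for s, o in result_tuples:
        groups[key_of(s)].append(o)
    return {s: rule(values) for s, values in groups.items()}

distinct = lambda values: sorted(set(values))                          # Distinct(∅->hospital, Is)
average3 = lambda values: sum(values[:3]) / 3                          # Average-3(doctor->score, Has_rate)
majority3 = lambda values: Counter(values[:3]).most_common(1)[0][0]    # Majority-3(doctor->hospital, WorkIn)

answers = [("wang1", "Jishuitan Hospital"),
           ("wang1", "Jishuitan Hospital"),
           ("wang1", "Beiyi Hospital")]
print(resolve(answers, key_of=lambda s: s, rule=majority3))   # {'wang1': 'Jishuitan Hospital'}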
5.2.2 Resolution Rule Turns Ratio
The resolution rule turns ratio can estimate the average number of output tuples for each input tuple. For example, the resolution rule Average-n represents the average value of n values and the turns ratio is 1/n;
Majority-n represents the majority of n results; when n = 3, the turns ratio is between 1/3 and 1/2 (1/2 when the first two results are consistent, 1/3 when they are inconsistent).
5.2.3 Predicate Density
The predicate density of an acquisition rule is the probability that a possible RDF resource owns the predicate. The predicate density is related to the predicate category; for example, the acquisition rules Is(?, doctor) and WorkIn(?doctor, "Beiyi Hospital") may have predicate densities of 1 and 0.1, respectively.
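Putting the two parameters together, one simple way to picture the per-triple estimate is the sketch below, where the PossiNum of a triple pattern is taken as target / (turns ratio × predicate density). This combination rule is only our assumption for illustration; the actual TriplePossEst computation is described in Section 5.3.

import math

def possinum(target, turns_ratio, density):
    # Assumed per-triple estimate: how many crowdsourced result tuples are needed so that,
    # after resolution (turns_ratio outputs per input) and filtering by predicate density,
    # roughly `target` usable tuples remain. Illustrative only.
    raw = target / (turns_ratio * density)
    return math.ceil(round(raw, 9))   # round first to avoid floating-point noise before the ceiling

# Example: 4 resolved answers are needed, Majority-3 resolution (turns ratio 1/3),
# and predicate density 0.1 as for WorkIn(?doctor, "Beiyi Hospital").
print(possinum(target=4, turns_ratio=1/3, density=0.1))   # 120 crowdsourced result tuples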
5.3 Calculate the PossiNum
First, we define four types of relationships
between triple patterns, as shown in Table 1. The
crowdsourcing process for each triple pattern has a
direction, which refers to the direction between the
source and target, represented by src and tgt,
respectively. The source and target differ from the
subject and object. The right arrow '→' represents the matching direction from subject to object, whereas the left arrow '←' indicates the direction from object to subject. For example, for q2←, src represents the object of the triple, whereas for q2→, src indicates the subject of the triple.
TP Relationship    Description (example in Fig. 3)
R1: src-src        q1→ and q3→
R2: tgt-src        q3→ and q4→
R3: src-tgt        q1→ and q2←
R4: tgt-tgt        q1← and q3←
Table 1. Relationships between the triple patterns
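The classification in Table 1 can be expressed directly in code. The sketch below assumes a directed triple pattern is represented as (subject, predicate, object, direction); this is also the kind of pairwise computation that the improved enumeration algorithm of Section 4.3.2.2 can cache and reuse.

def src_tgt(tp):
    # Return (src, tgt) of a directed triple pattern (s, p, o, direction).
    s, _, o, direction = tp
    return (s, o) if direction == "->" else (o, s)

def relationship(tp_a, tp_b):
    # Classify two directed triple patterns into R1-R4 by the node they share (Table 1).
    src_a, tgt_a = src_tgt(tp_a)
    src_b, tgt_b = src_tgt(tp_b)
    if src_a == src_b:
        return "R1: src-src"
    if tgt_a == src_b:
        return "R2: tgt-src"
    if src_a == tgt_b:
        return "R3: src-tgt"
    if tgt_a == tgt_b:
        return "R4: tgt-tgt"
    return "no shared node"

q1 = ("?doctor", "WorkIn", "?hospital", "->")
q3 = ("?doctor", "Has_level", "?level", "->")
print(relationship(q1, q3))   # R1: src-src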
Now, we explain the TriplePossEst PossiNum
estimation algorithm in terms of the four types of
relationships between triple patterns. The basic
process unit of the algorithm is a single triple pattern.
In the implementation process, two input parameters
are involved:
target: The number of target tuples to be
output for one triple pattern.
binding: The candidate set of association
values between the triples.
According to the input parameters and a CQP,
the TriplePossEst algorithm estimates the PossiNum
for a specific triple pattern, and the output is passed
as the target input of the next triple pattern. Then, the
total estimated cost of all tuples is calculated
cumulatively. Four local variables are referenced in
each triple pattern estimation:
fets: The acquisition rule set of a triple
pattern.
preds: The predicate set with the density of
the involved triple pattern.
res_sel: The resolution rule set and its
turns-ratio.
poss: The PossiNum of the current triple
pattern.
Algorithm 4 illustrates the basic process of the
TriplePossEst algorithm. The input is the CQP,
including the process order and crowdsourcing
direction of TP. The output is the estimated PossiNum
of CQP, which is the number of possible result tuples
Maria-Esther Vidal. Enhancing answer completeness of SPARQL queries via crowdsourcing. Web Semantics: Science, Services and Agents on the World Wide Web, vol. 45, pp. 41-62, 2017.
Depeng Dang received his PhD degree in Computer Science and Technology from Huazhong University of Science and Technology, China, in 2003. From July 2003 to June 2005, he did his postdoctoral research in the Department of Computer Science and Technology, Tsinghua University, China. He is now a full professor and Ph.D. supervisor in Computer Science and Technology at Beijing Normal University, China. Up to now, he has chaired four NSFC projects. His research interests include crowdsourcing computing and RDF data management.
Wenhui Yu received her Bachelor's degree in Computer Science and Technology from Beijing Normal University. She is currently studying at the College of Information Science and Technology, Beijing Normal University, China. Her research interests include RDF data management and crowdsourcing computing.
Shaofei Wang received her Master's degree in Computer Software and Theory from Northwestern Polytechnical University. She is currently studying at the College of Information Science and Technology, Beijing Normal University, China. Her research interests include crowdsourcing computing and RDF data management.
Nan Wang received her Bachelor's degree in Computer Science and Technology from Beijing Normal University. She is currently studying at the College of Information Science and Technology, Beijing Normal University, China. Her research interests include crowdsourcing computing and RDF data management.