Dynamic Query Forms for Database Queries

Liang Tang, Tao Li, Yexi Jiang, and Zhiyuan Chen
Abstract—Modern scientific databases and web databases maintain large and heterogeneous data. These real-world databases contain hundreds or even thousands of relations and attributes. Traditional predefined query forms are not able to satisfy various ad-hoc queries from users on those databases. This paper proposes DQF, a novel database query form interface, which is able to dynamically generate query forms. The essence of DQF is to capture a user's preference and rank query form components, assisting him/her to make decisions. The generation of a query form is an iterative process guided by the user. At each iteration, the system automatically generates ranking lists of form components and the user then adds the desired form components into the query form. The ranking of form components is based on the captured user preference. A user can also fill the query form and submit queries to view the query result at each iteration. In this way, a query form can be dynamically refined until the user is satisfied with the query results. We utilize the expected F-measure for measuring the goodness of a query form. A probabilistic model is developed for estimating the goodness of a query form in DQF. Our experimental evaluation and user study demonstrate the effectiveness and efficiency of the system.
Index Terms—Query Form, User Interaction, Query Form Generation, F-measure
1 INTRODUCTION

Query form is one of the most widely used user interfaces for querying databases. Traditional query forms are designed and predefined by developers or DBAs in various information management systems. With the rapid development of web information and scientific databases, modern databases have become very large and complex. In the natural sciences, such as genomics and diseases, databases have hundreds of entities for chemical and biological data resources [22] [13] [25]. Many web databases, such as Freebase and DBPedia, typically have thousands of structured web entities [4] [2]. Therefore, it is difficult to design a set of static query forms to satisfy various ad-hoc database queries on those complex databases.

Many existing database management and development tools, such as EasyQuery [3], Cold Fusion [1], SAP and Microsoft Access, provide several mechanisms to let users create customized queries on databases. However, the creation of customized queries totally depends on users' manual editing [16]. If a user is not familiar with the database schema in advance, those hundreds or thousands of data attributes would confuse him/her.
1.1 Our Approach

In this paper, we propose a Dynamic Query Form system, DQF, a query interface which is capable of dynamically generating query forms for users. Different from traditional document retrieval, users in database

Liang Tang, Tao Li, and Yexi Jiang are with the School of Computer Science, Florida International University, Miami, Florida, 33199, U.S.A. E-mail: {ltang002, taoli, [email protected]}
Zhiyuan Chen is with the Information Systems Department, University of Maryland Baltimore County, Baltimore, MD, 21250, U.S.A.
TABLE 1
Interactions Between Users and DQF

Query Form Enrichment:
  1) DQF recommends a ranked list of query form components to the user.
  2) The user selects the desired form components into the current query form.

Query Execution:
  1) The user fills out the current query form and submits a query.
  2) DQF executes the query and shows the results.
  3) The user provides feedback about the query results.
retrieval are often willing to perform many rounds of actions (i.e., refining query conditions) before identifying the final candidates [7]. The essence of DQF is to capture user interests during user interactions and to adapt the query form iteratively. Each iteration consists of two types of user interactions: Query Form Enrichment and Query Execution (see Table 1). Figure 1 shows the work-flow of DQF. It starts with a basic query form which contains very few primary attributes of the database. The basic query form is then enriched iteratively via the interactions between the user and our system until the user is satisfied with the query results. In this paper, we mainly study the ranking of query form components and the dynamic generation of query forms.
1.2 Contributions

Our contributions can be summarized as follows:

• We propose a dynamic query form system which generates query forms according to the user's desire at run time. The system provides a solution

IEEE Transactions on Knowledge and Data Engineering, Vol. PP, No. 99, 2013
Fig. 1. Flowchart of Dynamic Query Form
for the query interface in large and complex databases.

• We apply the F-measure to estimate the goodness of a query form [30]. The F-measure is a typical metric for evaluating query results [33]. This metric is also appropriate for query forms, because query forms are designed to help users query the database. The goodness of a query form is determined by the query results generated from the query form. Based on this, we rank and recommend the potential query form components so that users can refine the query form easily.

• Based on the proposed metric, we develop efficient algorithms to estimate the goodness of the projection and selection form components. Here efficiency is important because DQF is an online system where users often expect a quick response.
The rest of the paper is organized as follows. Section 2 describes the related work. Section 3 defines the query form and introduces how users interact with our dynamic query form. Section 4 defines a probabilistic model to rank query form components. Section 5 describes how to estimate the ranking score. Section 6 reports experimental results, and finally Section 7 concludes the paper.
2 RELATED WORK

How to let non-expert users make use of a relational database is a challenging topic. Many research works focus on database interfaces which assist users to query the relational database without SQL. QBE (Query-By-Example) [36] and Query Form are the two most widely used database querying interfaces. At present, query forms have been utilized in most real-world business or scientific information systems. Current studies and works mainly focus on how to generate the query forms.

Customized Query Form: Existing database clients and tools make great efforts to help developers design and generate query forms, such as EasyQuery [3], Cold Fusion [1], SAP, Microsoft Access, and so on. They provide visual interfaces for developers to create or customize query forms. The problem with those tools is that they are provided for professional developers who are familiar with their databases, not for end-users [16]. [17] proposed a system which allows end-users to customize an existing query form at run time. However, an end-user may not be familiar with the database. If the database schema is very large, it is difficult for them to find appropriate database entities and attributes and to create desired query forms.

Automatic Static Query Form: Recently, [16] [18] proposed automatic approaches to generate database query forms without user participation. [16] presented a data-driven method. It first finds a set of data attributes which are most likely to be queried, based on the database schema and data instances. Then, query forms are generated based on the selected attributes. [18] is a workload-driven method. It applies a clustering algorithm to historical queries to find representative queries. The query forms are then generated based on those representative queries. One problem with the aforementioned approaches is that, if the database schema is large and complex, user queries could be quite diverse. In that case, even if we generate lots of query forms in advance, there are still user queries that cannot be satisfied by any one of the query forms. Another problem is that, when we generate a large number of query forms, letting users find an appropriate and desired query form becomes challenging. A solution that combines keyword search with query form generation is proposed in [12]. It automatically generates a large number of query forms in advance. The user inputs several keywords to find relevant query forms among the pre-generated query forms. It works well in databases which have rich textual information in data tuples and schemas. However, it is not appropriate when the user does not have concrete keywords to describe the queries at the beginning, especially for numeric attributes.

Autocompletion for Database Queries: In [26], [21], novel user interfaces have been developed to assist the user in typing database queries based on the query workload, the data distribution, and the database schema. Different from our work, which focuses on query forms, the queries in their work are in the form of SQL and keywords.

Query Refinement: Query refinement is a common practical technique used by most information retrieval systems [15]. It recommends new terms related to the query, or modifies the terms according to the navigation path of the user in the search engine. But for a database query form, a database query is a structured relational query, not just a set of terms.

Dynamic Faceted Search: Dynamic faceted search is a type of search engine in which relevant facets are presented to the users according to their navigation paths [29] [23]. Dynamic faceted search engines
are similar to our dynamic query forms if we only consider Selection components in a query. However, besides Selections, a database query form has other important components, such as Projection components. Projection components control the output of the query form and cannot be ignored. Moreover, the designs of Selection and Projection inherently influence each other.

Database Query Recommendation: Recent studies introduce collaborative approaches to recommend database query components for database exploration [20] [9]. They treat SQL queries as items in the collaborative filtering approach, and recommend similar queries to related users. However, they do not consider the goodness of the query results. [32] proposes a method to recommend an alternative database query based on the results of a query. The difference from our work is that their recommendation is a complete query, while our recommendation is a query component for each iteration.

Dynamic Data Entry Form: [11] develops an adaptive forms system for data entry, which can be dynamically changed according to the previous data input by the user. Our work is different, as we are dealing with database query forms instead of data-entry forms.

Active Feature Probing: Zhu et al. [35] develop the active feature probing technique for automatically generating clarification questions to provide appropriate recommendations to users in database search. Different from their work, which focuses on finding appropriate questions to ask the user, DQF aims to select appropriate query components.
3 QUERY FORM INTERFACE

3.1 Query Form

In this section we formally define the query form. Each query form corresponds to an SQL query template.

Definition 1: A query form F is defined as a tuple (A_F, R_F, σ_F, ⋈(R_F)), which represents a database query template as follows:

    F = (SELECT A_1, A_2, ..., A_k FROM ⋈(R_F) WHERE σ_F),

where A_F = {A_1, A_2, ..., A_k} is the set of k attributes for projection, k > 0. R_F = {R_1, R_2, ..., R_n} is the set of n relations (or entities) involved in this query, n > 0. Each attribute in A_F belongs to one relation in R_F. σ_F is a conjunction of expressions for selections (or conditions) on relations in R_F. ⋈(R_F) is a join function that generates a conjunction of expressions for joining the relations of R_F.

In the user interface of a query form F, A_F is the set of columns of the result table. σ_F is the set of input components for users to fill. Query forms allow users to fill in parameters to generate different queries. R_F and ⋈(R_F) are not visible in the user interface; they are usually generated by the system according to the database schema. For a query form F, ⋈(R_F) is automatically constructed according to the foreign keys among the relations in R_F. Meanwhile, R_F is determined by A_F and σ_F: it is the union of the relations which contain at least one attribute of A_F or σ_F. Hence, the components of query form F are actually determined by A_F and σ_F. As we mentioned, only A_F and σ_F are visible to the user in the user interface. In this paper, we focus on the projection and selection components of a query form. Ad-hoc join is not handled by our dynamic query form, because join is not a part of the query form and is invisible to users. As for Aggregation and Order by in SQL, there are limited options for users. For example, Aggregation can only be MAX, MIN, AVG, and so on; and Order by can only be increasing order or decreasing order. Our dynamic query form can easily be extended to include those options by implementing them as dropdown boxes in the user interface of the query form.
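As a concrete illustration of Definition 1, a query form can be rendered into its SQL template. The sketch below is ours, not DQF's actual generator (which also derives the join clause from foreign keys); the function and argument names are illustrative:

```python
# Minimal sketch of Definition 1: render a query form (A_F, join(R_F), sigma_F)
# into its SQL template string.
def render_form(proj_attrs, join_expr, conditions):
    """proj_attrs = A_F, join_expr = join(R_F), conditions = sigma_F (conjunction)."""
    sql = "SELECT {} FROM {}".format(", ".join(proj_attrs), join_expr)
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql
```

For instance, `render_form(["C2", "C5"], "D", ["C2 = 'b1' OR C2 = 'b2'"])` yields `SELECT C2, C5 FROM D WHERE C2 = 'b1' OR C2 = 'b2'`.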
3.2 Query Results

To decide whether a query form is desired or not, a user does not have time to go over every data instance in the query results. In addition, many database queries output a huge number of data instances. In order to avoid this Many-Answer problem [10], we first output only a compressed result table that shows a high-level view of the query results. Each instance in the compressed table represents a cluster of actual data instances. Then, the user can click through interesting clusters to view the detailed data instances. Figure 2 shows the flow of user actions. The compressed high-level view of query results is proposed in [24]. There are many one-pass clustering algorithms for generating the compressed view efficiently [34], [5]. In our implementation, we choose the incremental data clustering framework [5] for efficiency reasons. Certainly, different data clustering methods would produce different compressed views for the users. Also, different clustering methods are preferable for different data types. In this paper, clustering is used only to provide a better view of the query results for the user. The system developers can select a different clustering algorithm if needed.
Fig. 2. User Actions
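The one-pass compression of query results described above can be sketched generically. The following leader-style routine is only an illustrative stand-in for the incremental clustering framework of [5], with `dist` and `threshold` as assumed parameters:

```python
# Generic one-pass (leader) clustering sketch for building the compressed view:
# each cluster keeps its first instance as representative; a new instance joins
# the first cluster whose representative is within `threshold`, otherwise it
# starts a new cluster. Illustrative stand-in, not the exact framework of [5].
def one_pass_cluster(instances, dist, threshold):
    clusters = []  # list of (representative, members)
    for inst in instances:
        for rep, members in clusters:
            if dist(rep, inst) <= threshold:
                members.append(inst)
                break
        else:  # no existing cluster was close enough
            clusters.append((inst, [inst]))
    return clusters
```

Each `(representative, members)` pair then becomes one row of the compressed result table, and clicking the row expands `members`.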
Another important usage of the compressed view is to collect user feedback. Using the collected feedback, we can estimate the goodness of a query form so that we can recommend appropriate query form components. In the real world, end-users are reluctant to provide explicit feedback [19]. The click-through on the compressed view table is an implicit feedback mechanism that tells our system which cluster (or subset) of data instances is desired by the user. The clicked subset is denoted by D_uf. Note that D_uf is only a subset of all user-desired data instances in the database. But it can help our system generate recommended form components that help users discover more desired data instances. In some recommendation systems and search engines, end-users are also allowed to provide negative feedback, i.e., a collection of the data instances that are not desired by the users. In the query form results, we assume most of the queried data instances are not desired by the users, because if they were already desired, then the query form generation would be almost done. Therefore, positive feedback is more informative than negative feedback in query form generation. Our proposed model can easily be extended to incorporate negative feedback.
4 RANKING METRIC

Query forms are designed to return the user's desired result. There are two traditional measures to evaluate the quality of query results: precision and recall [30]. Query forms are able to produce different queries by different inputs; different queries output different query results and achieve different precisions and recalls. We therefore use expected precision and expected recall to evaluate the expected performance of the query form. Intuitively, expected precision is the expected proportion of the query results which are of interest to the current user. Expected recall is the expected proportion of user-interested data instances which are returned by the current query form. The user interest is estimated based on the user's click-through on query results displayed by the query form. For example, if some data instances are clicked by the user, these data instances must have high user interest. Then, the query form components which can capture these data instances should be ranked higher than other components. Next we introduce some notations and then define expected precision and recall.

Notations: Table 2 lists the symbols used in this paper. Let F be a query form with selection condition σ_F and projection attribute set A_F. Let D be the collection of instances in ⋈(R_F). N is the number of data instances in D. Let d be an instance in D with a set of attributes A = {A_1, A_2, ..., A_n}, where n = |A|. We use d_{A_F} to denote the projection of instance d on attribute set A_F, and we call it a projected instance. P(d) is the occurrence probability of d in D. P(σ_F|d) is the probability that d satisfies σ_F; P(σ_F|d) ∈ {0, 1}.
TABLE 2
Symbols and Notations

F         query form
R_F       set of relations involved in F
A         set of all attributes in ⋈(R_F)
A_F       set of projection attributes of query form F
A_r(F)    set of relevant attributes of query form F
σ_F       set of selection expressions of query form F
OP        set of relational operators in selections
d         data instance in ⋈(R_F)
D         the collection of data instances in ⋈(R_F)
N         number of data instances in D
d_{A1}    data instance d projected on attribute set A1
D_{A1}    set of unique values of D projected on attribute set A1
Q         database query
D_Q       results of Q
D_uf      user feedback as clicked instances in D_Q
α         fraction of instances desired by users
P(σ_F|d) = 1 if d is returned by σ_F, and P(σ_F|d) = 0 otherwise. Since query form F projects instances onto attribute set A_F, we have D_{A_F} as a projected database and P(d_{A_F}) as the probability of projected instance d_{A_F} in the projected database. Since there are often duplicated projected instances, P(d_{A_F}) may be greater than 1/N. Let P_u(d) be the probability of d being desired by the user, and P_u(d_{A_F}) be the probability of the user being interested in a projected instance. We give an example below to illustrate those notations.
TABLE 3
Data Table

ID  C1  C2  C3  C4  C5
I1  a1  b1  c1  20  1
I2  a1  b2  c2  20  100
I3  a1  b2  c3  30  99
I4  a1  b1  c4  20  1
I5  a1  b3  c4  10  2
Example 1: Consider a query form F_i with one relational data table, shown in Table 3. There are 5 data instances in this table, D = {I1, I2, ..., I5}, with 5 data attributes A = {C1, C2, C3, C4, C5}, N = 5. Query form F_i executes a query Q as SELECT C2, C5 FROM D WHERE C2 = b1 OR C2 = b2. The query result is D_Q = {I1, I2, I3, I4}, projected on C2 and C5. Thus P(σ_{F_i}|d) is 1 for I1 to I4 and zero for I5. Instances I1 and I4 have the same projected values, so we can use I1 to represent both of them, and P(I1_{C2,C5}) = 2/5.

Metrics: We now describe the two measures, expected precision and expected recall, for query forms.

Definition 2: Given a set of projection attributes A and a universe of selection expressions Σ, the expected precision and expected recall of a query form F = (A_F, R_F, σ_F, ⋈(R_F)) are Precision_E(F) and Recall_E(F) respectively, i.e.,

    Precision_E(F) = [Σ_{d∈D_{A_F}} P_u(d_{A_F}) P(d_{A_F}) P(σ_F|d) N] / [Σ_{d∈D_{A_F}} P(d_{A_F}) P(σ_F|d) N],    (1)

    Recall_E(F) = [Σ_{d∈D_{A_F}} P_u(d_{A_F}) P(d_{A_F}) P(σ_F|d) N] / (αN),    (2)

where A_F ⊆ A, σ_F ∈ Σ, and α is the fraction of instances desired by the user, i.e., α = Σ_{d∈D} P_u(d) P(d).
The numerators of both equations represent the expected number of data instances in the query result that are desired by the user. In the query result, each data instance is projected onto the attributes in A_F, so P_u(d_{A_F}) represents the user's interest in instance d in the query result. P(d_{A_F})·N is the expected number of rows in D that the projected instance d_{A_F} represents. Further, given a data instance d ∈ D, d being desired by the user and d satisfying σ_F are independent. Therefore, the product of P_u(d_{A_F}) and P(σ_F|d) can be interpreted as the probability of d being desired by the user and, at the same time, d being returned in the query result. Summing over all data instances gives the expected number of data instances in the query result that are desired by the user.

Similarly, the denominator of Eq. (1) is simply the expected number of instances in the query result. The denominator of Eq. (2) is the expected number of instances desired by the user in the whole database. In both equations N cancels out, so we do not need to consider N when estimating precision and recall. The probabilities in these equations can be estimated using the methods described in Section 5. α = Σ_{d∈D} P_u(d) P(d) is the fraction of instances desired by the user. P(d) is given by D. P_u(d) can be estimated by the method described in Section 5.1.

For example, suppose that in Example 1, after projecting on C2, C5, there are only 4 distinct instances I1, I2, I3, and I5 (I4 has the same projected values as I1). The probabilities of these projected instances are 0.4, 0.2, 0.2, and 0.2, respectively. Suppose P_u for I2 and I3 is 0.9, and P_u for I1 and I5 is 0.03. The expected precision equals (0.03·0.4 + 0.9·0.2 + 0.9·0.2 + 0) / (0.4 + 0.2 + 0.2 + 0) = 0.465. Suppose α = 0.4; then the expected recall equals (0.03·0.4 + 0.9·0.2 + 0.9·0.2 + 0) / 0.4 = 0.93.

Considering both expected precision and expected recall, we derive the overall performance measure, expected F-Measure, as shown in Equation 3. Note that β is a constant parameter to control the preference for expected precision or expected recall.

Definition 3: Given a set of projection attributes A and a universe of selection expressions Σ, the expected F-Measure of a query form F = (A_F, R_F, σ_F, ⋈(R_F)) is FScore_E(F), i.e.,

    FScore_E(F) = (1 + β²) · Precision_E(F) · Recall_E(F) / (β² · Precision_E(F) + Recall_E(F)).

Problem Definition: In our system, we provide a ranked list of query form components for the user. Problem 1 is the formal statement of the ranking problem.

Problem 1: Let the current query form be F_i and the next query form be F_{i+1}. Construct a ranking of all candidate form components, in descending order of FScore_E(F_{i+1}), where F_{i+1} is the query form F_i enriched by the corresponding form component.

FScore_E(F_{i+1}) is the estimated goodness of the next query form F_{i+1}. Since we aim to maximize the goodness of the next query form, the form components are ranked in descending order of FScore_E(F_{i+1}). In the next section, we will discuss how to compute FScore_E(F_{i+1}) for a specific form component.
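The metric can be checked numerically against the worked example above (the per-instance probabilities are the ones assumed in that example; β = 1 is our choice here):

```python
# Expected precision/recall/F-measure from Definitions 2 and 3, evaluated on
# the worked example (alpha = 0.4; beta = 1 assumed). Each tuple holds
# (P(d_AF), Pu(d_AF), P(sigma_F|d)) for one distinct projected instance.
projected = [
    (0.4, 0.03, 1),  # I1 (also stands for I4)
    (0.2, 0.90, 1),  # I2
    (0.2, 0.90, 1),  # I3
    (0.2, 0.03, 0),  # I5, excluded by the selection
]
alpha, beta = 0.4, 1.0
num = sum(p * pu * sel for p, pu, sel in projected)  # expected desired results
den = sum(p * sel for p, pu, sel in projected)       # expected result size
precision = num / den                                 # 0.465
recall = num / alpha                                  # 0.93
fscore = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
```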
5 ESTIMATION OF RANKING SCORE
5.1 Ranking Projection Form Components
DQF provides a two-level ranked list for projection components. The first level is the ranked list of entities. The second level is the ranked list of attributes within the same entity. We first describe how to rank each entity's attributes locally, and then describe how to rank entities.
5.1.1 Ranking Attributes

Suggesting projection components is actually suggesting attributes for projection. Let the current query form be F_i and the next query form be F_{i+1}. Let A_{F_i} = {A_1, A_2, ..., A_j}, and A_{F_{i+1}} = A_{F_i} ∪ {A_{j+1}}, j+1 ≤ |A|. A_{j+1} is the projection attribute we want to suggest for F_{i+1}, which maximizes FScore_E(F_{i+1}). From Definition 3, we obtain FScore_E(F_{i+1}) as follows:

    FScore_E(F_{i+1}) = (1 + β²) · Precision_E(F_{i+1}) · Recall_E(F_{i+1}) / (β² · Precision_E(F_{i+1}) + Recall_E(F_{i+1}))
                      = (1 + β²) · Σ_{d∈D_{A_{F_{i+1}}}} P_u(d_{A_{F_{i+1}}}) P(d_{A_{F_{i+1}}}) P(σ_{F_{i+1}}|d) / [Σ_{d∈D} P(d_{A_{F_{i+1}}}) P(σ_{F_{i+1}}|d) + β²α].    (3)

Note that adding a projection component A_{j+1} does not affect the selection part of F_i. Hence, σ_{F_{i+1}} = σ_{F_i} and P(σ_{F_{i+1}}|d) = P(σ_{F_i}|d). Since F_i has already been used by the user, we can estimate P(d_{A_{F_{i+1}}}) P(σ_{F_{i+1}}|d) as follows. For each query submitted on form F_i, we keep the query results, including all columns in R_F. Clearly, for those instances not in the query results, P(σ_{F_{i+1}}|d) = 0, and we do not need to consider them. For each instance d in the query results, we simply count the number of times it appears in the results; P(d_{A_{F_{i+1}}}) P(σ_{F_{i+1}}|d) then equals the occurrence count divided by N.
Now we only need to estimate P_u(d_{A_{F_{i+1}}}). As for the projection components, we have:

    P_u(d_{A_{F_{i+1}}}) = P_u(d_{A_1}, ..., d_{A_j}, d_{A_{j+1}}) = P_u(d_{A_{j+1}}|d_{A_{F_i}}) · P_u(d_{A_{F_i}}).    (4)

P_u(d_{A_{F_i}}) in Eq. (4) can be estimated from the user's click-through on the results of F_i. The click-through D_uf ⊆ D is the set of data instances which were clicked by the user in previous query results. We apply the kernel density estimation method to estimate P_u(d_{A_{F_i}}). Each x ∈ D_uf represents a Gaussian distribution of the user's interest. Then,

    P_u(d_{A_{F_i}}) = (1/|D_uf|) Σ_{x∈D_uf} [1/√(2πσ²)] · exp(−d(d_{A_{F_i}}, x_{A_{F_i}})² / (2σ²)),

where d(·, ·) denotes the distance between two data instances and σ² is the variance of the Gaussian models. For numerical data, the Euclidean distance is a conventional choice of distance function. For categorical data, such as strings, previous literature proposes several context-based similarity functions which can be employed for categorical data instances [14] [8].

P_u(d_{A_{j+1}}|d_{A_{F_i}}) in Eq. (4) is not visible in the run-time data, since d_{A_{j+1}} has not been used before F_{i+1}. We can only estimate it from other data sources. We mainly consider the following two data-driven approaches to estimate the conditional probability P_u(d_{A_{j+1}}|d_{A_{F_i}}).

Workload-Driven Approach: The conditional probability P_u(d_{A_{j+1}}|d_{A_{F_i}}) can be estimated from the query results of historic queries. If many users queried attributes A_{F_i} and A_{j+1} together on instance d, then P_u(d_{A_{j+1}}|d_{A_{F_i}}) must be high.

Schema-Driven Approach: The database schema implies the relations of the attributes. If two attributes are contained in the same entity, then they are more relevant.
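The kernel density estimate above can be sketched directly. The bandwidth `sigma` and the Euclidean default are assumption-level choices here, as the paper leaves them open:

```python
import math

# Sketch of the Gaussian kernel density estimate of Pu(d_AF) from the clicked
# instances D_uf. Instances are numeric tuples here; categorical data would
# need a context-based similarity function instead of Euclidean distance.
def estimate_pu(d, clicked, sigma=1.0):
    norm = 1.0 / (math.sqrt(2 * math.pi) * sigma)
    total = sum(norm * math.exp(-math.dist(d, x) ** 2 / (2 * sigma ** 2))
                for x in clicked)
    return total / len(clicked)
```

Instances close to the user's clicks receive higher estimated interest, which is exactly what pushes the form components covering them up the ranking.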
Each of the two approaches has its own drawback. The workload-driven approach has the cold-start problem, since it needs a large number of queries. The schema-driven approach is not able to distinguish among the attributes of the same entity. In our system, we combine the two approaches as follows:

    P_u(d_{A_{j+1}}|d_{A_{F_i}}) = (1 − λ) · P_b(d_{A_{j+1}}|d_{A_{F_i}}) + λ · sim(A_{j+1}, A_{F_i}),

where P_b(d_{A_{j+1}}|d_{A_{F_i}}) is the probability estimated from the historic queries, sim(A_{j+1}, A_{F_i}) is the similarity between A_{j+1} and A_{F_i} estimated from the database schema, and λ is a weight parameter in [0, 1] utilized to balance the workload-driven estimation and the schema-driven estimation. Note that

    sim(A_{j+1}, A_{F_i}) = 1 − [Σ_{A∈A_{F_i}} d(A_{j+1}, A)] / (|A_{F_i}| · d_max),

where d(A_{j+1}, A) is the schema distance between attributes A_{j+1} and A in the schema graph, and d_max is the diameter of the schema graph. The idea of considering a database schema as a graph was initially proposed by [16], who presented a PageRank-like algorithm to compute the importance of an attribute in the schema according to the schema graph. In this paper, we utilize the schema graph to compute the relevance of two attributes. A database schema graph is denoted by G = (R, FK, δ, A), in which R is the set of nodes representing the relations, A is the set of attributes, FK is the set of edges representing the foreign keys, and δ: A → R is an attribute labeling function indicating which relation contains each attribute. Based on the database schema graph, the schema distance is defined as follows.

Definition 4: (Schema Distance) Given two attributes A_1, A_2 in a database schema graph G = (R, FK, δ, A), A_1 ∈ A, A_2 ∈ A, the schema distance between A_1 and A_2 is d(A_1, A_2), which is the length of the shortest path between node δ(A_1) and node δ(A_2).
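Under Definition 4, computing d(A_1, A_2) is a shortest-path query on the schema graph; a breadth-first sketch (relation and attribute names in the test are ours, purely illustrative):

```python
from collections import deque

# BFS sketch of Definition 4: schema distance = shortest-path length between
# the relations delta(A1) and delta(A2) over foreign-key edges. `delta` maps
# attribute -> relation; `fk_edges` are undirected (relation, relation) pairs.
def schema_distance(a1, a2, delta, fk_edges):
    start, goal = delta[a1], delta[a2]
    if start == goal:
        return 0
    adj = {}
    for u, v in fk_edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # relations not connected by foreign keys
```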
5.1.2 Ranking Entities

The ranking score of an entity is simply the averaged FScore_E(F_{i+1}) of that entity's attributes. Intuitively, if one entity has many high-scoring attributes, then it should have a higher rank.
5.2 Ranking Selection Form Components
The selection attributes must be relevant to the currently projected entities; otherwise the selection would be meaningless. Therefore, the system should first find the relevant attributes for creating the selection components. We first describe how to select relevant attributes, and then describe a naive method and a more efficient One-Query method to rank selection components.
5.2.1 Relevant Attribute Selection

The relevance of attributes in our system is measured based on the database schema as follows.

Definition 5: (Relevant Attributes) Given a database query form F with a schema graph G = (R, FK, δ, A), the set of relevant attributes is A_r(F) = {A | A ∈ A, ∃A_j ∈ A_F, d(A, A_j) ≤ t}, where t is a user-defined threshold and d(A, A_j) is the schema distance defined in Definition 4.

The choice of t depends on how compactly the schema is designed. For instance, some databases put all attributes of one entity into a single relation; then t could be 1. Other databases separate the attributes of one entity into several relations; then t could be greater than 1. Using a depth-first traversal of the database schema graph, A_r(F) can be obtained in O(|A_r(F)| · t).
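Definition 5 then reduces to a distance filter over the attribute set; a minimal sketch, taking the schema distance as a supplied function:

```python
# Sketch of Definition 5: Ar(F) keeps every attribute within schema distance t
# of some projection attribute of the form. `dist(a, b)` is assumed to return
# the Definition-4 schema distance.
def relevant_attributes(all_attrs, form_attrs, dist, t):
    return {a for a in all_attrs if any(dist(a, aj) <= t for aj in form_attrs)}
```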
5.2.2 Ranking Selection Components

For enriching the selection form components of a query form, the set of projection components A_F is fixed, i.e., A_{F_{i+1}} = A_{F_i}. Therefore, FScore_E(F_{i+1}) only depends on σ_{F_{i+1}}.

For simplicity of the user interface, most query forms' selection components are simple binary relations of the form A_j op c_j, where A_j is an attribute, c_j is a constant, and op is a relational operator such as =, <, ≤, and so on. In each cycle, the system provides a ranked list of such binary relations for users to enrich the selection part. Since the total number of binary relations is so large, we only select the best selection component for each attribute.

For attribute A_s, A_s ∈ A_r(F), let σ_{F_{i+1}} = σ_{F_i} ∪ {s}, s ∈ Σ, where s contains A_s. According to the formula of FScore_E(F_{i+1}), in order to find the s ∈ Σ that maximizes FScore_E(F_{i+1}), we only need to estimate P(σ_{F_{i+1}}|d) for each data instance d ∈ D. Note that, in our system, σ_F represents a conjunctive expression which connects all elemental binary expressions by AND. σ_{F_{i+1}} holds if and only if both σ_{F_i} and s hold; hence σ_{F_{i+1}} ⟺ σ_{F_i} ∧ s. Then, we have:

    P(σ_{F_{i+1}}|d) = P(σ_{F_i}, s|d) = P(s|σ_{F_i}, d) · P(σ_{F_i}|d).    (5)
P(σ_{F_i}|d) can be estimated from previous queries executed on query form F_i, as discussed in Section 5.1. P(s|σ_{F_i}, d) is 1 if and only if d satisfies σ_{F_i} and s; otherwise it is 0. The only problem is to determine the space of s, since we have to enumerate all s to compute their scores. Note that s is a binary expression of the form A_s op_s c_s, in which A_s is fixed and given, op_s ∈ OP, where OP is a finite set of relational operators {=, <, ≤, ...}, and c_s belongs to the data domain of A_s in the database. Therefore, the space of s is the finite set OP × D_{A_s}. In order to efficiently estimate the new FScore induced by a query condition s, we propose the One-Query method in this paper. The idea of One-Query is simple: we sort the values of the attribute in s and incrementally compute the FScore over all possible values of that attribute.

To find the best selection component for the next query form, the first step is to query the database to retrieve the data instances. In Section 5.2, Eq. (5) shows that P(σ_{F_{i+1}}|d) depends on the previous query conditions σ_{F_i}. If P(σ_{F_i}|d) = 0, then P(σ_{F_{i+1}}|d) must be 0. Hence, in order to compute P(σ_{F_{i+1}}|d) for each d ∈ D, we do not need to retrieve all data instances in the database. What we need is only the set of data instances D' ⊆ D such that each d ∈ D' satisfies P(σ_{F_i}|d) > 0. So the selection condition of One-Query's query is the union of the query conditions executed on F_i.

In addition, the One-Query algorithm does not send each query condition s to the database engine to select data instances, which would be a heavy burden for the database engine since the number of query conditions is large. Instead, it retrieves the set of data instances D' and checks every data instance against every query condition by itself. For this purpose, the algorithm needs to know the values of all selection attributes of D'. Hence, One-Query adds all the selection attributes into the projections of the query. Algorithm 1 describes One-Query's query construction. The function GenerateQuery generates the database query based on the given set of projection attributes A_one with selection expression σ_one.
Algorithm 1: QueryConstruction
Data: Q = {Q_1, Q_2, ...} is the set of previous queries executed on F_i.
Result: Q_one is the query of One-Query.
begin
    σ_one ← false
    for Q ∈ Q do
        σ_one ← σ_one ∨ σ_Q
    A_one ← A_{F_i} ∪ A_r(F_i)
    Q_one ← GenerateQuery(A_one, σ_one)
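In Python-like terms, Algorithm 1 amounts to OR-ing the previous selection conditions and projecting the union of the form's attributes and the relevant attributes. This sketch emits the SQL text; the table expression `T` and the helper name are placeholders, not DQF's API:

```python
# Sketch of Algorithm 1 (QueryConstruction): sigma_one is the disjunction of
# all previous query conditions on Fi; A_one adds the relevant attributes to
# the form's projections. "T" stands in for the joined relations.
def build_one_query(prev_conditions, form_attrs, relevant_attrs):
    sigma_one = " OR ".join("({})".format(c) for c in prev_conditions)
    a_one = list(dict.fromkeys(list(form_attrs) + list(relevant_attrs)))  # ordered union
    return "SELECT {} FROM T WHERE {}".format(", ".join(a_one), sigma_one)
```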
When the system receives the result of the query Q_one from the database engine, it calls the second algorithm of One-Query to find the best query condition. We first discuss the "≤" condition. The basic idea of this algorithm is based on a simple property. For a specific attribute A_s with a data instance d, given two conditions:

    s_1: A_s ≤ a_1,    s_2: A_s ≤ a_2,

and a_1 ≤ a_2, if s_1 is satisfied, then s_2 must also be satisfied. Based on this property, we can incrementally compute the FScore of each query condition in a single pass over the data instances. There are two steps:
1) First, we sort the values of A_s in the order a_1 ≤ a_2 ≤ ... ≤ a_m, where m is the number of A_s's values. Let D_aj denote the set of data instances in which A_s's value is equal to a_j.

2) Then, we go through every data instance in the order of A_s's value. Let query condition s_j = "A_s ≤ a_j" and its corresponding FScore be fscore_j. According to Eq. (3), fscore_j can be computed as

    fscore_j = (1 + β²) · n_j / d_j,

    n_j = Σ_{d ∈ D_Qone} Pu(d_AFi) · P(d_AFi) · P(F_i|d) · P(s_j|d),

    d_j = Σ_{d ∈ D_Qone} P(d_AFi) · P(F_i|d) · P(s_j|d) + β².
For j > 1, n_j and d_j can be calculated incrementally:

    n_j = n_{j-1} + Σ_{d ∈ D_aj} Pu(d_AFi) · P(d_AFi) · P(F_i|d) · P(s_j|d),

    d_j = d_{j-1} + Σ_{d ∈ D_aj} P(d_AFi) · P(F_i|d) · P(s_j|d).
Algorithm 2 shows the pseudocode for finding the best "≤" condition.

Algorithm 2: FindBestLessEqCondition
Data: β is the F-measure parameter (reflecting the fraction of instances desired by the user); D_Qone is the query result of Q_one; A_s is the selection attribute.
Result: s* is the best query condition of A_s.
begin
    // sort D_Qone by A_s into an ordered set D_sorted
    D_sorted ← Sort(D_Qone, A_s)
    s* ← ∅, fscore* ← 0
    n ← 0, den ← β²
    for i ← 1 to |D_sorted| do
        d ← D_sorted[i]
        s ← "A_s ≤ d_As"
        // compute the fscore of "A_s ≤ d_As"
        n ← n + Pu(d_AFi) · P(d_AFi) · P(F_i|d) · P(s|d)
        den ← den + P(d_AFi) · P(F_i|d) · P(s|d)
        fscore ← (1 + β²) · n / den
        if fscore ≥ fscore* then
            s* ← s
            fscore* ← fscore
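A minimal Python sketch of this one-pass search. The per-instance weights stand in for the probability products: the dictionary key `u` plays the role of Pu(d_AFi)·P(d_AFi)·P(F_i|d) and `p` the role of P(d_AFi)·P(F_i|d); both key names are ours:

```python
def find_best_leq_condition(instances, attr, beta=2.0):
    """One-pass incremental search for the best 'attr <= a' condition.
    instances: list of dicts holding the attribute value plus two
    precomputed weights 'u' (numerator term) and 'p' (denominator term)."""
    ordered = sorted(instances, key=lambda d: d[attr])  # step 1: sort by A_s
    best_val, best_score = None, 0.0
    n, den = 0.0, beta ** 2                             # denominator starts at beta^2
    for d in ordered:                                   # step 2: single scan
        n += d["u"]                                     # incremental numerator
        den += d["p"]                                   # incremental denominator
        score = (1 + beta ** 2) * n / den
        if score >= best_score:
            best_val, best_score = d[attr], score
    return best_val, best_score
```

Because each instance only extends the running sums, the scan costs O(|D_Qone|) after sorting, matching the complexity analysis below.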
Complexity: For other query conditions, such as "=" and "≥", we can find similar incremental approaches to compute their FScores. They all share the sorting result of the first step, and in the second step all incremental computations can be merged into a single scan of D_Qone. Therefore, the time complexity of finding the best query condition for one attribute is O(|D_Qone| · |A_Fi|), and ranking the selection components of every attribute is O(|D_Qone| · |A_Fi| · |A_r(F_i)|).
5.2.3 Diversity of Selection Components

Two selection components may have a large overlap (or redundancy). For example, if a user is interested in customers with age between 30 and 45, then two selection components, "age > 28" and "age > 29", would obtain similar FScores and similar sets of data instances; the two selections are therefore redundant. Besides high precision, we also require the recommended selection components to have high diversity. Diversity is a recent research topic in recommendation systems and web search engines [6], [28]. However, simultaneously maximizing precision and diversity is an NP-hard problem [6], so it cannot be efficiently implemented in an interactive system. In our dynamic query form system, we observe that most redundant selection components are constructed from the same attribute. Thus, we only recommend the best selection component for each attribute.
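This per-attribute heuristic can be sketched as follows; the `(attribute, condition, fscore)` triples are an assumed representation of the candidate components, not the system's actual data structure:

```python
def best_per_attribute(candidates):
    """Keep only the top-scoring condition per attribute, then rank the
    per-attribute winners by FScore (highest first). This avoids
    recommending near-duplicate conditions on the same attribute."""
    best = {}
    for attr, cond, score in candidates:
        if attr not in best or score > best[attr][1]:
            best[attr] = (cond, score)
    return sorted(((a, c, s) for a, (c, s) in best.items()),
                  key=lambda t: -t[2])
```

With the paper's example, "age > 28" and "age > 29" collapse to the single higher-scoring "age > 28", while conditions on other attributes survive.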
6 EVALUATION

The goal of our evaluation is to verify the following hypotheses:

H1: Is DQF more usable than existing approaches such as static query forms and customized query forms?

H2: Is DQF more effective at ranking projection and selection components than the baseline method and the random method?

H3: Is DQF efficient enough to rank the recommended query form components in an online user interface?
6.1 System Implementation and Experimental Setup

We implemented the dynamic query forms as a web-based system using JDK 1.6 with Java Server Pages. The dynamic web interface for the query forms used the open-source JavaScript library jQuery 1.4. We used MySQL 5.1.39 as the database engine. All experiments were run on a machine with an Intel Core 2 [email protected] CPU and 3.5G main memory, running Windows XP SP2. Figure 3 shows a system prototype.

Data sets: Three databases, NBA¹, Green Car², and Geobase³, were used in our experiments. Table 4 gives a general description of these databases.
TABLE 4: Data Description

Name       #Relations  #Attributes  #Instances
NBA        10          180          44,590
Green Car  1           17           2,187
Geobase    9           32           1,329
Form Generation Approaches: We compared three approaches to generating query forms:

- DQF: The dynamic query form system proposed in this paper.
- SQF: The static query form generation approach proposed in [18]. It also uses the query workload. Queries in the workload are first divided into clusters, and each cluster is converted into a query form.
- CQF: The customized query form generation used by many existing database clients, such as Microsoft Access, EasyQuery, and ActiveQueryBuilder.
User Study Setup: We conducted a user study to evaluate the usability of our approach. We recruited 20 participants, consisting of graduate students, UI designers, and software engineers. The system prototype is shown in Figure 3. The user study contained two phases: a query collection phase and a testing phase. In the collection phase, each participant used our system to submit some queries, and we collected these queries. There were 75 queries collected for NBA, 68 for Green Car, and 132 for Geobase. These queries were used as the query workload to train our system (see Section 5.1). In the second phase, we asked each participant to complete 12 tasks (none of which appeared in the workload), listed in Table 5. Each participant used all three form generation approaches to form queries. The order of the three approaches was randomized to remove bias. We set the parameter to 0.001 in our experiments because our databases collect a certain amount of historic queries, so we mainly consider the probability estimated from the historic queries.

Simulation Study Setup: We also used the collected queries in a larger-scale simulation study. We used a cross-validation approach that partitions the queries into a training set (used as workload information) and a testing set. We then report the average performance over the testing sets.

1. http://www.databasebasketball.com
2. http://www.epa.gov/greenvehicles
3. Geobase is a database of geographic information about the USA, which is used in [16].

Fig. 3. Screenshot of Web-based Dynamic Query Form
6.2 User Study Results

Usability Metrics: In this paper, we employ metrics widely used in Human-Computer Interaction and Software Quality to measure the usability of a system [31], [27]. These metrics are listed in Table 7.

TABLE 7: Usability Metrics

Metric   Definition
ACmin    The minimal number of actions for users
AC       The actual number of actions performed by users
ACratio  ACmin / AC × 100.0%
FNmax    The total number of UI functions provided for users to choose from
FN       The number of UI functions actually used by the user
FNratio  FN / FNmax × 100%
Success  The percentage of users who successfully completed a specific task
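The two ratio metrics and the success rate follow directly from the definitions in Table 7; the function names below are ours, not part of the DQF system:

```python
def ac_ratio(ac_min, ac):
    """ACratio = ACmin / AC * 100.0% (higher is better)."""
    return ac_min / ac * 100.0

def fn_ratio(fn, fn_max):
    """FNratio = FN / FNmax * 100% (higher is better)."""
    return fn / fn_max * 100.0

def success_rate(completed, total):
    """Success = percentage of users who completed a specific task."""
    return completed / total * 100.0
```

For instance, a task needing at least 6 actions that users complete in 7.0 actions on average gives an ACratio of about 85.7%, matching the granularity of the values reported in Table 6.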
In database query forms, one action means a mouse click or a keyboard input into a textbox. ACmin is the minimal number of actions for a querying task. One function means an option provided for the user to use, such as a query form or a form component. In a web-page-based system, FNmax is the total number of UI components in the web pages explored by the user. In this user study, each page contains at most 5 UI components. The smaller ACmin, AC, FNmax, and FN, the better the usability; similarly, the higher ACratio, FNratio, and Success, the better the usability. There is a trade-off between ACmin and FNmax. An extreme case is to generate all possible query forms in one web page: the user then only needs to choose one query form to finish his or her query task, so ACmin is 1. However, FNmax would be the number
TABLE 5: Query Tasks

T1: SELECT ilkid, firstname, lastname FROM players
    (Find all NBA players' IDs and full names.)
T2: SELECT p.ilkid, p.firstname, p.lastname FROM players p, player_playoffs_career c WHERE p.ilkid = c.ilkid AND c.minutes > 5000
    (Find players who have played more than 5000 minutes in the playoffs.)
T4: SELECT t.team, t.location, c.firstname, c.lastname, c.year FROM teams t, coaches c WHERE t.team = c.team AND t.location = 'Los Angeles'
    (Find the names of teams located in Los Angeles with their coaches.)
T5: SELECT Models, Hwy_MPG FROM cars WHERE City_MPG > 20
    (Find the highway MPG of cars whose city road MPG is greater than 20.)
T6: SELECT Models, Displ, Fuel FROM cars WHERE Sales_Area = 'CA'
    (Find the model, displacement, and fuel of cars sold in California.)
T7: SELECT Models, Displ FROM cars WHERE Veh_Class = 'SUV'
    (Find the displacement of all SUV cars.)
T8: SELECT Models FROM cars WHERE Drive = '4WD'
    (Find all four-wheel-drive cars.)
T9: SELECT t0.population FROM city t0 WHERE t0.state = 'california'
    (Find all cities in California with the population of each city.)
T10: SELECT t1.state, t0.area FROM state t0, border t1 WHERE t1.name = 'wisconsin' AND t0.area > 80000 AND t0.name = t1.name
    (Find the neighbor states of Wisconsin whose area is greater than 80,000 square miles.)
T11: SELECT t0.name, t0.length FROM river t0 WHERE t0.state = 'illinois'
    (Find all rivers across Illinois with each river's length.)
T12: SELECT t0.name, t0.elevation, t0.state FROM mountain t0 WHERE t0.elevation > 5000
    (Find all mountains whose elevation is greater than 5000 meters and each mountain's state.)
of all possible query forms with their components, which is a huge number. On the other hand, if users interact heavily with a system, that system learns more about the users' desires. In that case, the system can cut down many unnecessary functions, so FNmax could be smaller, but ACmin would be higher since many user interactions are required.

User Study Analysis: Table 6 shows the average results of the usability experiments for the query tasks. For SQF, we generated 10 static query forms based on the collected user queries for each database (i.e., 10 clusters were generated on the query workload). The results show that users often did not accomplish querying tasks with SQF. The reason is that SQF is built from the query workload and may not be able to answer ad hoc queries in the query tasks. For example, SQF does not contain any relevant attributes for query tasks T3 and T11, so users failed to accomplish those queries with SQF.
TABLE 8: Statistical Test on FNmax (with CQF)

Task     T1      T4  T5  T10  T11  T12
P Value  0.0106
TABLE 6: Usability Results

Task  Form  ACmin  AC    ACratio  FNmax  FN  FNratio  Success
T1    DQF   6      6.7   90.0%    40.0   3   7.5%     100.0%
T1    CQF   6      7.0   85.7%    60.0   3   5.0%     100.0%
T1    SQF   1      1.0   100.0%   35.0   3   8.6%     44.4%
T2    DQF   7      7.7   91.0%    65.0   4   6.2%     100.0%
T2    CQF   8      10.0  80.0%    86.7   4   4.6%     100.0%
T2    SQF   1      1.0   100.0%   38.3   4   10.4%    16.7%
T3    DQF   10     10.7  93.5%    133.3  6   3.8%     100.0%
T3    CQF   12     13.3  90.2%    121.7  6   4.9%     100.0%
T3    SQF   1      N/A   N/A      N/A    6   N/A      0.0%
T4    DQF   11     11.7  94.0%    71.7   6   8.4%     100.0%
T4    CQF   12     13.3  90.2%    103.3  6   5.8%     100.0%
T4    SQF   1      1.0   100.0%   70.0   6   8.6%     16.7%
T5    DQF   5      5.7   87.7%    28.3   3   10.6%    100.0%
T5    CQF   6      6.7   90.0%    56.7   3   5.3%     100.0%
T5    SQF   1      1.0   100.0%   10.0   3   30.0%    66.7%
T6    DQF   7      7.7   91.0%    61.7   4   6.5%     100.0%
T6    CQF   8      10.0  80.0%    61.7   4   6.5%     100.0%
T6    SQF   1      1.0   100.0%   23.3   4   17.2%    41.7%
T7    DQF   5      6.0   83.3%    48.3   3   6.2%     100.0%
T7    CQF   6      6.7   90.0%    50.0   3   6.0%     100.0%
T7    SQF   1      1.0   100.0%   18.3   3   16.4%    44.4%
T8    DQF   3      3.3   91.0%    21.7   2   9.2%     100.0%
T8    CQF   4      4.7   85.1%    26.7   2   7.5%     100.0%
T8    SQF   1      1.0   100.0%   38.3   2   5.2%     66.7%
T9    DQF   5      6.3   79.3%    31.7   3   9.5%     100.0%
T9    CQF   6      6.7   90.0%    36.7   3   8.2%     100.0%
T9    SQF   1      1.0   100.0%   106.7  3   2.8%     66.7%
T10   DQF   6      6.7   90.0%    43.3   4   9.2%     100.0%
T10   CQF   8      8.7   92.0%    63.3   4   6.3%     100.0%
T10   SQF   1      1.0   100.0%   75.0   4   5.3%     33.3%
T11   DQF   5      6.3   79.4%    36.7   3   8.2%     100.0%
T11   CQF   6      6.7   90.0%    50.0   3   6.0%     100.0%
T11   SQF   1      N/A   N/A      N/A    3   N/A      0.0%
T12   DQF   7      7.7   91.0%    46.7   4   8.6%     100.0%
T12   CQF   8      10.0  80.0%    85.0   4   4.7%     100.0%
T12   SQF   1      1.0   100.0%   31.7   4   12.6%    25.0%
On the premise of satisfying all users' queries, the complexity of a query form should be as small as possible. DQF generates one customized query form for each query; the average form complexity is 8.1 for NBA, 4.5 for Green Car, and 6.7 for Geobase. For SQF, by contrast, the complexity is 30 for NBA and 16 for Green Car (2 static query forms). This result shows that, in order to satisfy various query tasks, statically generated query forms have to be more complex.
6.4 Effectiveness

We compare the ranking function of DQF with two other ranking methods: a baseline method and a random method. The baseline method ranks projection and selection attributes in ascending order of their schema distance (see Definition 4) to the current query form. For the query condition, it chooses the most frequently used condition in the training set for that attribute. The random method randomly suggests one query form component. The ground truth of the query form component ranking is obtained from the query workloads, as stated in Section 6.1.

Ranking projection components: Ranking score is a supervised measure of recommendation accuracy. It is obtained by comparing the computed ranking with the optimal ranking, in which the component actually selected by the user is ranked first. The ranking score thus evaluates how far from first the actually selected component is ranked. It is computed as follows:

    RankScore(Q, A_j) = 1 / (log(r̂(A_j)) + 1),

where Q is a test query, A_j is the j-th projection attribute of Q, and r̂(A_j) is the computed rank of A_j.

Figure 4 shows the average ranking scores for all
ranking scores for all
queries in the workload. We compare three methods:DQF, Baseline,
and Random. The x-axis indicates theportion of the training
queries, and the rest queriesare used as testing queries. The
y-axis indicates theaverage ranking scores among all the testing
queries.DQF always outperforms the baseline method andrandom
method. The gap also grows as the portionof training queries
increases because DQF can betterutilize the training
queries.Ranking selection components: F-Measure is uti-lized to
measure ranking of selection components.Intuitively, if the query
result obtained by using the
[Fig. 4. Ranking Scores of Suggested Selection Components: (a) NBA Data, (b) Green Car Data, (c) Geobase Data. Each panel plots the average ranking score (y-axis) against the training queries ratio (x-axis, 0.3 to 0.9) for DQF, Baseline, and Random.]
[Fig. 5. Average F-Measure of Suggested Selection Components: (a) NBA Data (top 5 ranked components), (b) Green Car Data (top 5 ranked components), (c) Geobase Data (top 3 ranked components). Each panel plots the average F-Measure (y-axis) against the feedback ratio (x-axis) for DQF, Baseline, and Random.]
[Fig. 6. Scalability of Ranking Selection Components: running time in milliseconds (y-axis) versus the number of data instances in the query result (x-axis) for queries Q1 to Q4 on (a) NBA, (b) Green Car, and (c) Geobase Data.]
suggested selection component is closer to the actual query result, the F-Measure should be higher. For a test query Q, we define the ground truth as the set of data instances returned by Q. We also constructed a query Q̂ that is identical to Q except for the last selection component, which is constructed from the top-ranked component returned by one of the three ranking methods. We then compared the results of Q̂ to the ground truth to compute the F-Measure. We randomly selected half of the queries in the workload as the training set and the rest as the testing set. Since DQF uses users' click-through as implicit feedback, we randomly selected a small portion (the Feedback Ratio) of the ground truth as click-through.

Figure 5 shows the F-Measure (β = 2) of all methods on the data sets. The x-axis of those figures indicates the Feedback Ratio over the whole ground truth, and the y-axis is the average F-Measure over all collected queries. As the figures show, DQF performs well even when there is no click-through information (Feedback Ratio = 0).
6.5 Efficiency

The run-time cost of ranking projection and selection components for DQF depends on the current form components and the query result size. We therefore selected 4 complex queries with large result sizes for each data set. Table 10, Table 11, and Table 12 list these queries, where the join conditions are implicit inner joins written in the WHERE clause. We varied the query result size by query paging in the MySQL engine. The running times of ranking projection components are all less than 1 millisecond, since DQF only computes the schema distance and conditional probabilities of attributes. Figure 6 shows the time for DQF to rank selection components for queries on the data sets. The results show that the execution time grows approximately linearly with the query result size. The execution time is between 1 and 3 seconds for NBA when the results contain 10,000 records, less than 0.11 seconds for Green Car when the results contain 2,000 records, and less than 0.5 seconds for Geobase when the results contain 10,000 records. So DQF can be used in an interactive environment.
TABLE 10: NBA's Queries in Scalability Test

Q1: SELECT t0.coachid, t2.leag, t2.location, t2.team, t3.d_blk, t1.fta, t0.season_win, t1.fgm FROM coaches t0, player_regular_season t1, teams t2, team_seasons t3 WHERE t0.team = t1.team AND t1.team = t3.team AND t3.team = t2.team
Q2: SELECT t2.lastname, t2.firstname, t1.won FROM player_regular_season t0, team_seasons t1, players t2 WHERE t1.team = t0.team AND t0.ilkid = t2.ilkid
Q3: SELECT t0.lastname, t0.firstname FROM players t0, player_regular_season t1, team_seasons t2 WHERE t2.team = t1.team AND t1.ilkid = t0.ilkid
Q4: SELECT t0.won, t3.name, t2.h_feet FROM team_seasons t0, player_regular_season t1, players t2, teams t3 WHERE t3.team = t0.team AND t0.team = t1.team AND t1.ilkid = t2.ilkid
TABLE 11: Green Car's Queries in Scalability Test

Q1: SELECT Underhood_ID, Displ, Hwy_MPG FROM cars WHERE City_MPG
[12] E. Chu, A. Baid, X. Chai, A. Doan, and J. F. Naughton. Combining keyword search and forms for ad hoc querying of databases. In Proceedings of ACM SIGMOD Conference, pages 349-360, Providence, Rhode Island, USA, June 2009.
[13] S. Cohen-Boulakia, O. Biton, S. Davidson, and C. Froidevaux. BioGuideSRS: querying multiple sources with a user-centric perspective. Bioinformatics, 23(10):1301-1303, 2007.
[14] G. Das and H. Mannila. Context-based similarity measures for categorical databases. In Proceedings of PKDD 2000, pages 201-210, Lyon, France, September 2000.
[15] W. B. Frakes and R. A. Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, 1992.
[16] M. Jayapandian and H. V. Jagadish. Automated creation of a forms-based database query interface. In Proceedings of the VLDB Endowment, pages 695-709, August 2008.
[17] M. Jayapandian and H. V. Jagadish. Expressive query specification through form customization. In Proceedings of International Conference on Extending Database Technology (EDBT), pages 416-427, Nantes, France, March 2008.
[18] M. Jayapandian and H. V. Jagadish. Automating the design and construction of query forms. IEEE TKDE, 21(10):1389-1402, 2009.
[19] T. Joachims and F. Radlinski. Search engines that learn from implicit feedback. IEEE Computer, 40(8):34-40, 2007.
[20] N. Khoussainova, M. Balazinska, W. Gatterbauer, Y. Kwon, and D. Suciu. A case for a collaborative query management system. In Proceedings of CIDR, Asilomar, CA, USA, January 2009.
[21] N. Khoussainova, Y. Kwon, M. Balazinska, and D. Suciu. SnipSuggest: Context-aware autocompletion for SQL. PVLDB, 4(1):22-33, 2010.
[22] J. C. Kissinger, B. P. Brunk, J. Crabtree, M. J. Fraunholz, B. Gajria, A. J. Milgram, D. S. Pearson, J. Schug, A. Bahl, S. J. Diskin, H. Ginsburg, G. R. Grant, D. Gupta, P. Labo, L. Li, M. D. Mailman, S. K. McWeeney, P. Whetzel, C. J. Stoeckert, and D. S. Roos. The Plasmodium genome database: Designing and mining a eukaryotic genomics resource. Nature, 419:490-492, 2002.
[23] C. Li, N. Yan, S. B. Roy, L. Lisham, and G. Das. Facetedpedia: dynamic generation of query-dependent faceted interfaces for Wikipedia. In Proceedings of WWW, pages 651-660, Raleigh, North Carolina, USA, April 2010.
[24] B. Liu and H. V. Jagadish. Using trees to depict a forest. PVLDB, 2(1):133-144, 2009.
[25] P. Mork, R. Shaker, A. Halevy, and P. Tarczy-Hornoch. PQL: a declarative query language over dynamic biological schemata. In Proceedings of American Medical Informatics Association Fall Symposium, pages 533-537, San Antonio, Texas, 2007.
[26] A. Nandi and H. V. Jagadish. Assisted querying using instant-response interfaces. In Proceedings of ACM SIGMOD, pages 1156-1158, 2007.
[27] J. Nielsen. Usability Engineering. Morgan Kaufmann, San Francisco, 1993.
[28] D. Rafiei, K. Bharat, and A. Shukla. Diversifying web search results. In Proceedings of WWW, pages 781-790, Raleigh, North Carolina, USA, April 2010.
[29] S. B. Roy, H. Wang, U. Nambiar, G. Das, and M. K. Mohania. DynaCet: Building dynamic faceted search systems over databases. In Proceedings of ICDE, pages 1463-1466, Shanghai, China, March 2009.
[30] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1984.
[31] A. Seffah, M. Donyaee, R. B. Kline, and H. K. Padda. Usability measurement and metrics: A consolidated model. Software Quality Journal, 14(2):159-178, 2006.
[32] Q. T. Tran, C.-Y. Chan, and S. Parthasarathy. Query by output. In Proceedings of SIGMOD, pages 535-548, Providence, Rhode Island, USA, September 2009.
[33] G. Wolf, H. Khatri, B. Chokshi, J. Fan, Y. Chen, and S. Kambhampati. Query processing over incomplete autonomous databases. In Proceedings of VLDB, pages 651-662, 2007.
[34] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proceedings of SIGMOD, pages 103-114, Montreal, Canada, June 1996.
[35] S. Zhu, T. Li, Z. Chen, D. Wang, and Y. Gong. Dynamic active probing of helpdesk databases. Proc. VLDB Endow., 1(1):748-760, Aug. 2008.
[36] M. M. Zloof. Query-by-example: the invocation and definition of tables and forms. In Proceedings of VLDB, pages 1-14, Framingham, Massachusetts, USA, September 1975.
Liang Tang received the master's degree in computer science from the Department of Computer Science, Sichuan University, Chengdu, China, in 2009. He is currently a Ph.D. student in the School of Computing and Information Sciences, Florida International University, Miami. His research interests are data mining, computing system management, and machine learning.

Tao Li received the Ph.D. degree in computer science from the Department of Computer Science, University of Rochester, Rochester, NY, in 2004. He is currently an Associate Professor with the School of Computing and Information Sciences, Florida International University, Miami. His research interests are data mining, computing system management, information retrieval, and machine learning. He is a recipient of the NSF CAREER Award and multiple IBM Faculty Research Awards.

Yexi Jiang received the master's degree in computer science from the Department of Computer Science, Sichuan University, Chengdu, China, in 2010. He is currently a Ph.D. student in the School of Computing and Information Sciences, Florida International University, Miami. His research interests are system-oriented data mining, intelligent cloud, large-scale data mining, databases, and the semantic web.

Zhiyuan Chen is an associate professor at the Department of Information Systems, University of Maryland, Baltimore County. He received the Ph.D. in Computer Science from Cornell University in 2002, and M.S. and B.S. degrees in computer science from Fudan University, China. His research interests are in database systems and data mining, including privacy-preserving data mining, data exploration and navigation, health informatics, data integration, XML, automatic database administration, and database compression.