XWRAPComposer: A Multi-Page Data Extraction Service for Bio-Computing Applications

Ling Liu, Jianjun Zhang, Wei Han, Calton Pu, James Caverlee, Sungkeun Park
College of Computing, Georgia Institute of Technology
{lingliu, zhangjj, weihan, calton, caverlee, mungooni}@cc.gatech.edu

Terence Critchlow, Matthew Coleman, David Buttler
Lawrence Livermore National Laboratory, California, USA
{critchlow1, coleman16, buttler1}@llnl.gov

Author Responsible for Correspondence:
Jianjun Zhang, College of Computing, Georgia Institute of Technology, Atlanta, GA 30318
Phone: 404-385-2027
Abstract

Bio-computing applications usually require gathering information from multiple bioinformatics services, such as protein databases and various BLAST services. Although Web service technology such as WSDL, SOAP, and UDDI provides a standardized remote invocation interface, other types of heterogeneity remain in terms of query capability, content structure, and content delivery logic, due to the inherent diversity of different services. A popular approach to handling such heterogeneity is to use wrappers as mediators that facilitate the automation of collecting and extracting data from multiple diverse data providers.

This paper presents a service-oriented framework for the development of wrapper code generators, including a methodology for designing an effective wrapper program construction facility and a concrete implementation, called XWRAPComposer. Three unique features distinguish XWRAPComposer from existing wrapper development approaches. First, XWRAPComposer is designed to enable multi-stage and multi-page data extraction. Second, XWRAPComposer is the only wrapper generation system that promotes the separation of information extraction logic from query-answer control logic, allowing a higher level of robustness against changes in a service provider's web site design or infrastructure. Third, XWRAPComposer provides a user-friendly plug-and-play interface, allowing seamless incorporation of external services and continuously changing service interfaces and data formats.
1 Introduction

With the wide deployment of Web service technology, the Internet and the World Wide Web (Web) have become one of the most popular means for disseminating scientific data from a variety of disciplines. A vast and growing amount of life sciences data resides today in specialized bioinformatics data sources, many of which are accessible online with specialized query processing capabilities. The Molecular Biology Database Collection, for instance, currently holds over 500 data sources [1], not even including the many tools that analyze the information contained therein. Bioinformatics data sources over the Internet have a wide range of query processing capabilities. Typically, many Web-based sources allow only limited types of selection queries. To compound the problem, data from one source often must be combined with data from other sources to provide scientists with the information they need.
The extraordinary growth of service-oriented computing has been fueled by the enhanced ability to make a growing amount of information available through the Web. This brings good news and bad news. The good news is that Web services provide a standard invocation interface for remote service calls, and the bulk of useful and valuable information is designed and published in a human browsing format (HTML or XML). The bad news is that these "human-oriented" Web pages returned by Web services make it difficult for programs to capture and extract information of interest automatically, and to fuse and integrate data from multiple autonomous yet heterogeneous data producer services. In addition, different Web service providers use different and evolving custom data formats.
A popular approach to this problem is to write data wrappers that encapsulate access to Web sources and automate information extraction tasks on behalf of humans. A wrapper is a software program specialized to a single data source or single Web service (e.g., a web site), which converts the source documents and queries from the source data model to another, usually more structured, data model [15]. Several projects have implemented hand-coded wrappers for a variety of sources [11, 3, 14, 13]. However, manually writing such a wrapper and making it robust is costly due to the irregularity, heterogeneity, and frequent updates of Web sites and the data presentation formats they use. Hand-coding wrappers becomes a major pain point when data integration applications are interested in integrating new data sources or frequently changing Web sources. We observe that, with a good design methodology, only a relatively small part of the wrapper code deals with source-specific details; the rest of the code is either common among wrappers or can be expressed at a higher level, in a more structured fashion. There are a number of challenging issues in automating the wrapper code generation process.
• First, most Web pages are HTML or XML documents, which are semi-structured text files annotated with various HTML presentation tags. Due to the frequent changes in the presentation style of HTML documents, the lack of semantic description of their information content, and the difficulty of making all applications in one domain use the same XML schema, it is hard to identify the content of interest using common pattern recognition technology such as the string regular expression specifications used in LEX and YACC.
• Second, wrappers for Web sources should be more robust and adaptive in the presence of changes in both the presentation style and the information content of Web pages. It is expected that wrappers generated by wrapper generation systems will have lower maintenance overhead than handcrafted wrappers in the face of unexpected changes.
• Third, wrappers often serve as interface programs and pass the extracted Web data to application-specific information broker agents or information integration mediators for more sophisticated data analysis and manipulation. Thus it is desirable to provide a wrapper interface language that is simple and self-describing, yet powerful enough to extract and capture information from most Web pages.
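To make the first point concrete, the sketch below (our own illustration, not XWRAPComposer code) shows how an extractor built on a plain regular expression is tied to presentation tags: the same content survives a markup change, but the pattern no longer matches. The HTML snippets and field names are invented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a regex anchored to presentation tags breaks as soon as
// the site changes its markup, even though the content is unchanged.
public class RegexFragility {
    // Pattern tied to a specific <b> presentation tag (hypothetical page layout).
    private static final Pattern GENE_ID =
            Pattern.compile("<b>Gene ID:</b>\\s*(\\w+)");

    public static String extractGeneId(String html) {
        Matcher m = GENE_ID.matcher(html);
        return m.find() ? m.group(1) : null;  // null when the layout changed
    }
}
```

A tree-based or rule-driven extractor that addresses content by structure rather than by literal tag text is less exposed to this kind of churn, which is the motivation behind the parse-tree approach described later.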
In scientific computing domains such as bioinformatics and bioengineering, information extraction over multiple different pages imposes additional challenges for wrapper code generation systems due to the varying correlation of the pages involved. The correlation can be either horizontal, when grouping data from homogeneous documents (such as multiple result pages from a single search), or vertical, when joining data from heterogeneous but related documents (a series of pages containing information about a specific topic). Furthermore, the correlation can be extended into a graph of workflows, as we will describe in Figure 2. Therefore, there is an increasing demand for automated wrapper code generation systems to incorporate a multi-page information extraction service. A multi-page wrapper not only enriches the capability of wrappers to extract information of interest but also increases the sophistication of wrapper code generation.
Surprisingly, almost all existing wrappers generated by application code generators [8, 19, 2] are single-page wrappers, in the sense that the wrapper program responds to a keyword query by analyzing only the page immediately returned. Most wrappers cannot follow the links within this page to continue information extraction from other linked pages, unless separate queries are issued to locate those pages.
Bearing all these issues in mind, we develop a code generation framework for building a semi-automated wrapper code generation system that can generate wrappers capable of extracting information from multiple inter-linked Web documents, and we implement this framework in XWRAPComposer, a toolkit for semi-automatically generating Java wrapper programs that collect and extract data from multiple inter-linked pages automatically. XWRAPComposer has three unique features with regard to supporting multi-page data extraction.
• First, we introduce an interface specification, an outerface specification, and a composer script for each wrapper program we generate. By encoding wrapper developers' knowledge in the Interface Specification, Outerface Specification, and Composer Script, XWRAPComposer integrates single-page wrapper programs into a composite wrapper capable of extracting information across multiple inter-linked pages from one service provider.
• Second, XWRAPComposer transforms the multi-page information extraction problem into an integration problem over multiple single-page data extraction results, and uses the composer script to interconnect a sequence of single-page data extraction results, offering flexible execution choices to address the diverse needs of different users. It generates platform-independent Java code that can be executed locally on a user's machine. It also provides a WSDL plug-in module that allows users to produce WSDL-enabled wrappers as Web services [24].
• Third, but not least, XWRAPComposer supports micro-workflow management, such as intermediate information flow or result auditing. We demonstrate this capability by integrating XWRAPComposer and its generated wrappers with process modeling tools such as Ptolemy [4], allowing users to interactively manage the different components of a wrapper and the interactions between them.
2 The Design Framework

Multi-page wrapper code generation is a complex process, and it is not reasonable, either from a logical point of view or from an implementation point of view, to consider the construction process as occurring in one single step. For this reason, we partition the wrapper construction process into a series of sub-processes called phases, as shown in Figure 1. A phase is a logically cohesive operation that takes as input one representation of the source document and produces as output another representation. XWRAPComposer wrapper generation goes through six phases to construct and release a Java wrapper. Tasks within a phase run concurrently using a synchronized queue; each runs in its own thread. For example, we run the task of fetching a remote document and the task of repairing the bad formatting of the fetched document in two concurrently synchronized threads, in a single pass over the source document. The task of generating a syntactic-token parse tree from an HTML document requires the entire document as input; thus, it cannot be done in the same pass as the remote document fetching and the syntax reparation. Similar analysis applies to the other tasks such as code generation, testing, and packaging.
The interaction and information exchange between any two of the phases is performed through communication with the bookkeeping and error handling routines. The bookkeeping routine of the wrapper generator collects information about all the data objects that appear in the retrieved source document, keeps track of the names used by the program, and records essential information about each. For example, a wrapper needs to know how many arguments a tag expects and whether an element represents a string or an integer. The data structure used to record this information is called a symbol table. The error handler is designed for detecting and reporting errors in the fetched source document. The error messages should allow a wrapper developer to determine exactly where the errors occurred. Errors can be encountered at virtually all phases of a wrapper. Whenever a phase of the wrapper discovers an error, it must report the error to the error handler, which issues an appropriate diagnostic message. Once the error has been noted, the wrapper must modify the input to the phase detecting the error, so that the latter can continue processing its input, looking for subsequent errors. Good error handling is difficult because certain errors can mask subsequent errors, and other errors, if not properly handled, can spawn an avalanche of spurious errors. Techniques for error recovery are beyond the scope of this paper.
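A minimal sketch of such a symbol table might look as follows; the class and field names are our own invention, chosen only to mirror the facts mentioned above (argument counts, string vs. integer elements).

```java
import java.util.HashMap;
import java.util.Map;

// Toy bookkeeping symbol table: for each name seen in the retrieved document,
// it records the value type and how many arguments a tag expects.
public class SymbolTable {
    public static final class Entry {
        public final String type;   // e.g. "string" or "integer"
        public final int argCount;  // arguments a tag expects (0 for data elements)

        public Entry(String type, int argCount) {
            this.type = type;
            this.argCount = argCount;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    public void record(String name, String type, int argCount) {
        entries.put(name, new Entry(type, argCount));
    }

    public Entry lookup(String name) {
        return entries.get(name);  // null when the name was never recorded
    }
}
```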
Figure 1 presents an architecture sketch of the XWRAPComposer system. The system architecture consists of four major components: (1) Remote Connection and Source-specific Parser; (2) Multi-page Data Extraction; (3) Code Generation and Packaging; and (4) Debugging and Release. Other components include the GUI interface, bookkeeping, and error handling. The GUI interface allows wrapper developers to interactively specify the workflow of the multi-page data extraction, the request-respond flow control rules, and the cross-page data extraction rules.
Remote Connection and Source-specific Parser is the first component; it prepares and sets up the environment for the information extraction process by performing the following three tasks. First, it accepts a URL selected and entered by the XWRAPComposer user, issues an HTTP request to the remote service provider identified by the given URL, and fetches the corresponding web document (or so-called page object). During this process, XWRAPComposer learns the search interface and the remote service invocation procedure in the background and generates a set of rules that describe the list of interface functions and parameters as well as how they are used to fetch a remote document from a given web source. The list of interface functions includes the declarations of the standard library routines for establishing the network connection, issuing an HTTP request to the remote web server through an HTTP Get or HTTP Post method, and fetching the corresponding web page. Other desirable functions include building the correct URL to access the given service and pass the correct parameters, and handling redirection, failures, or authorization if necessary. Second, it cleans up bad HTML tags and syntactical errors using an XWRAPComposer plugin such as HTML TIDY [18, 22]. Third, it transforms the retrieved page object into a parse tree, or so-called syntactic token tree. This page object will be used as a sample for XWRAPComposer to interact with the user to learn and derive the important information extraction rules, and the list of linked pages the user is interested in extracting information from in conjunction with this page. In addition, all wrappers generated by XWRAPComposer use the streaming mode instead of the blocking mode (recall Section 2). Namely, the wrapper reads the web page one block¹ at a time. An interface specification is created in this phase.

Figure 1: XWRAPComposer System Architecture
Multi-page Data Extraction is the second component; it is responsible for deriving the information flow control logic and the multi-page extraction logic, both represented in the form of rules. The former describes the flow control logic of the targeted service in responding to a service request, and the latter describes how to extract information content of interest from the answer page and the linked pages of interest. XWRAPComposer performs the multi-page information extraction task in four steps: (1) specify the structure of the retrieved document (page object) in a declarative extraction rule language; (2) identify the interesting regions of the main page object and generate information extraction rules for this page; (3) identify the list of URLs referenced in the extracted regions of the main page; and (4) generate information extraction rules for each of the pages linked from the interesting regions of the main page object. We perform the single-page data extraction process using the XWRAPElite [8] toolkit, a single-page data extraction service developed by the XWRAP team at Georgia Tech. At the end of this phase, XWRAPComposer produces two specifications: an outerface specification that describes the output format in which the extraction result will be produced, and a composer script that describes both the information flow control patterns and the multi-page data extraction patterns.

¹A block here refers to a line of 256 characters or a transfer unit defined implicitly by the HTTP protocol.
Code Generation and Packaging is the third component; it generates the wrapper program code by applying three sets of rules about the target service produced in the first two components: (1) the search and remote invocation rules, (2) the request-respond flow control rules, and (3) the information extraction rules. A key technique in our implementation is the encoding of these three types of semantic knowledge in an active XML-template format (see Section 4 for details). The code generator interprets the XML-template rules by linking each executable component with the corresponding rule sets. The code generator also produces the XML representation of the retrieved sample page object as a byproduct.
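As a rough analogy to this template-driven generation, the fragment below splices rule values into a code template; the ${...} placeholder syntax is our own and far simpler than the active XML-template format the paper describes.

```java
import java.util.Map;

// Toy template instantiation: rule values are spliced into a code template.
public class TemplateCodeGen {
    public static String instantiate(String template, Map<String, String> rules) {
        String out = template;
        for (Map.Entry<String, String> e : rules.entrySet()) {
            out = out.replace("${" + e.getKey() + "}", e.getValue());
        }
        return out;
    }
}
```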
Debugging and Release is the fourth component and the final phase of the multi-page wrapping process. It allows the user to enter a set of alternative service requests to the same service provider to debug the generated wrapper program by running XWRAPComposer's code debugging module. For each page object obtained, the debugging module automatically goes through the syntactic structure normalization to rule out syntactic errors, and then through the flow control and information extraction steps to check whether new or updated flow control rules or data extraction rules should be included. In addition, a debug-monitoring window pops up to allow the user to browse the debug report. Whenever an update to any of the three sets of rules occurs, the debugging module reruns the code generator to create a new version of the wrapper program. Once the user is satisfied with the test results, he or she may invoke the release step to obtain the release version of the wrapper program; this includes assigning the version release number and packaging the wrapper program with application plug-ins and a user manual into a compressed tar file.
The XWRAPComposer wrapper generator takes the following three inputs: an interface specification, an outerface specification, and a composer script, and compiles them into a Java wrapper program, which can be further extended into either a multi-page data extraction Web service (with a WSDL specification) or a Ptolemy wrapper actor, which can be used for large-scale data integration.
In the subsequent sections, we first provide a walkthrough example to illustrate the multi-page extraction process. Next, we focus our discussion primarily on the multi-page data extraction component of XWRAPComposer, and give a brief description of the wrapping interface and remote invocation component as the necessary preprocessing step for information extraction, together with a short summary of code generation as the post-processing for the multi-page extraction.
3 Example Walkthrough

Before describing the detailed techniques used in designing multi-page data extraction services, we first present a walkthrough of XWRAPComposer using the motivating example introduced in Figure 2, where a biologist first uses a program called Clusfavor to cluster genes that have changed significantly in a micro-array analysis experiment. After extracting all gene ids from the Clusfavor result, he feeds them into the NCBI BLAST service, which searches all related sequences over a variety of data sources. The returned sequences will be further examined to find promoter sequences. Let us focus on the NCBI BLAST service. Figure 2 shows the workflow of how a BLAST service request to NCBI is served. It consists of four steps: the BLAST Response step presents the user with a request ID; the BLAST Delay step presents the user with the time delay for the result; BLAST Summary presents the user with an overview of all gene ids that match well with the given gene sequence id; finally, BLAST Detail shows, for each gene id listed in the summary page, the full sequence detail. The goal is to extract approximately 1000-5000 bases of the DNA sequence around the alignment to capture the promoter regulatory elements, the region of a gene where RNA polymerase can bind and begin transcription to create the proteins that can regulate cell function.
Figure 2: A Scientific Data Integration Example Scenario (microarray analysis → CLUSFAVOR statistical clustering of genes → NCBI BLAST search over a variety of data sources → data integration of promoter sequences)
Figure 3 illustrates a typical BLAST query using the NCBI service [17]. A BLAST query involves four steps. The first step is to feed a gene sequence into the text entry of the query interface. Due to the time complexity of a BLAST search, the NCBI service provider typically returns a response page with a request ID and a first estimate of the waiting time for the BLAST search. The biologist may later ask NCBI for the BLAST results using the request ID (Step 2); the NCBI service presents a delay page if the BLAST search is not completed and results are not yet ready to display (Step 3). Once the BLAST results are delivered, they are displayed in a BLAST summary page, which contains a summary of all genes matching the search query condition. Each of the matching genes provides a link to the NCBI BLAST Detail page (Step 4). If the gene id used for the BLAST query is incorrect, or NCBI does not provide BLAST service for the given gene id, an error page is displayed. If the summary page does not include the detailed information that the biologist is interested in, he has to visit each detail page (Step 5) through the URLs embedded in the summary page.
A critical challenge in providing system-level support for scientists to achieve such complex data integration tasks is the problem of locating, accessing, and fusing information from a rapidly growing, heterogeneous, and distributed collection of data sources available on the Web. This is a complex search problem for two reasons. First, as the example in Figure 2 shows, scientists today have much more complex data collection requirements than ordinary surfers on the Web. They often want to collect a set of data from a sequence of searches over a large selection of heterogeneous data sources, and the data selected from one search step often forms the filter condition for the next search step, turning a keyword-based query into a sophisticated search and information extraction workflow. Second, such complex workflows are performed manually every day by scientists or data collection lab researchers (computer science specialists). Automating such complex search and data collection workflows presents three major challenges.
• Different service providers use different request-respond flow control logic to present the answer pages to search queries.

• Cross-page data extraction has more complex extraction logic than single-page extraction systems. In addition, different applications require different sets of data to be extracted by the cross-page data extraction engine. Typically, only portions of one page, plus the links that lead the extraction to the next page, need to be extracted.

• Data items extracted from multiple inter-linked pages need to be associated with a semantically meaningful naming convention. Thus, mechanisms that can incorporate the knowledge of the domain scientists who issued such a cross-page extraction job are critical.
There are several ways to design an NCBI BLAST wrapper. First, we can develop two wrappers, one for NCBI BLAST Summary and one for NCBI BLAST Detail. The NCBI BLAST Summary wrapper can be integrated with the NCBI BLAST Detail wrapper by service composition. In this approach, we need to capture the request-respond flow control through flow control logic in the composer script of the NCBI Summary wrapper. The outerface specification of the NCBI Summary wrapper consists of the general overview of the given gene id and the list of gene ids that are relevant to the given gene id. The NCBI BLAST Detail wrapper needs to extract approximately 1000-5000 bases of the DNA sequence around the alignment. The composite wrapper NCBI BLAST is thus composed of the NCBI Summary wrapper and a list of executions of the NCBI BLAST Detail wrapper. In the next section we describe the XWRAPComposer design using this example.
Figure 3: Multi-page query with the NCBI web site (Steps 1-5)
4 Multi-Page Data Extraction Service

We have developed a methodology and a framework for the extraction of information from multiple pages connected via web page links. The main idea is to separate what to extract from how to extract, and to distinguish information extraction logic from request-respond flow control logic. The control logic describes the different ways in which a service request (query) could be answered by a given service provider. The data extraction logic describes the cross-page extraction steps, including what information is important to extract on each page and how such information is used as a complex filter in the next search and extraction step.
We use the interface description to specify the necessary input objects for wrapping the target service and the outerface description to describe what should be extracted and presented as the final result by the wrapper program. We design and develop an XWRAPComposer script language (a set of functional constructs) to describe the request-respond flow control logic and the multi-page data extraction logic, and to implement the output alignment and tagging of the extracted data items based on the outerface specification.
The compilation process of XWRAPComposer generates code based on three sets of rules: (1) the remote connection and interface rules; (2) the request-respond flow control logic and multi-page extraction logic outlined in the composer script; and (3) the correct output alignment and semantically meaningful tagging based on the outerface specification.
4.1 Interface and Outerface Specification

The interface specification describes the schema of the data that the wrapper takes as input. It defines the source location and the service request (query) interface for the wrapper to be generated. The outerface specification describes the schema of the result that the wrapper outputs. It defines the type and structure of the objects extracted. The composer script consists of two sets of rule-based scripts. The request-respond flow control script describes the alternative ways that the target service may respond to a remote service request, including no result found, multiple results found, a single result found, or server errors. The multi-page data extraction script describes (1) the extraction rules for the main page, (2) the extraction rules for each of the interesting pages linked from the main page, and (3) the rules on how to glue the single-page data extraction components together. XWRAPComposer's scripting language has domain-specific plugins to facilitate the incorporation of domain-dependent correlations between the fragments of information extracted and the domain-specific tagging scheme. Each wrapper generated by XWRAPComposer is associated with an interface specification, an outerface description, and a composer script.
The design of the XWRAPComposer interface and outerface specification serves two important objectives. First and foremost, it eases the use of XWRAP wrappers as external services in any data integration application. Second, it facilitates the generation of Java code by the XWRAPComposer wrapper code generation system. Therefore, some components of the specification may not be directly useful for the users of these wrappers. In the first release of the XWRAPComposer implementation, we describe the input and output schema of a multi-page (composite) wrapper in XML Schema and use the two XML schemas as the interface and outerface specification. Concretely, the interface specification describes the wrapper name and which data provider's service needs to be wrapped, by giving the source URL and other related information. The outerface specification describes what data items should be extracted and produced by the wrapper and the semantically meaningful names to be used to tag those data items. Figure 4 shows a fragment of the interface and outerface description of an example NCBI BLAST Summary wrapper [21].
/* Start constructing wrapper ncbisummary. */
WrapperName "ncbisummary";
/* Construct the URL for NCBI Blast search */
Generate blastSummaryPage :: ConstructHttpQuery (input) {
  Set inputSource {
    Set url {"http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?QUERY=$$&..."};
    Set queryString { };
    Set method {"get"};
    Set variable { [text()] };
  }
}
Generate blastSummaryData :: FetchDocument (blastSummaryPage) { }
Generate recordid :: ExtractContent (blastSummaryData) {
  GrabSubstring {
    Set BeginMatch {"The request ID is
Figure 9: Ptolemy Wrapper Actor Result Example – NCBI Blast Detail
4.4 WSDL-enabled Wrappers
Figure 10: Web-service Enabled Wrappers
XWRAPComposer is developed with two objectives in mind. First, we want to generate wrapper programs that can be used on the command line or embedded in an application system as a wrapper procedure. Second, we want XWRAPComposer to be able to generate WSDL-enabled wrappers, allowing each wrapper program to be used as a Web service [23]. Our discussion so far has focused on the first objective. In this section we briefly describe how to generate WSDL-enabled wrappers.
To enable XWRAPComposer to generate WSDL-enabled wrapper services, we add two extensions to the XWRAPComposer wrapper generation system. First, we encapsulate an XWRAPComposer wrapper in a general Web service servlet. The servlet automatically extracts the input from a SOAP request, feeds it into the wrapper, and inserts the wrapping results into a SOAP envelope before sending it back to the user. Second, we incorporate a WSDL generator to automatically generate the Web service description by binding the wrapper's interface and outerface with the servlet configuration. Figure 10 shows the extensions added to XWRAPComposer to produce wrappers as WSDL web services.
5 Related Work and Conclusion

The very nature of scientific research and discovery leads to the continuous creation of information that is new in content, representation, or both. Despite the efforts to fit molecular biology information into standard formats and repositories such as the PDB (Protein Data Bank) and NCBI, the number of databases and their content have kept growing, pushing the envelope of standardization efforts such as mmCIF [26]. Providing integrated and uniform access to these databases has been a serious research challenge. Several efforts [6, 7, 10, 12, 16, 20] have sought to alleviate the interoperability issue by translating queries from a uniform query language into the native query capabilities supported by the individual data sources. Typically, these previous efforts address the interoperability problem from a digital library point of view, i.e., they treat individual databases as well-known sources of existing information. While they provide a valuable service, due to the growing rate of scientific discovery, an increasing amount of new information (the kind of hot-off-the-bench information that scientists would be most interested in) falls outside the capability of these previous interoperability systems or services.
Wrappers have been developed either manually or with software assistance, and used as components of agent-based systems, sophisticated query tools, and general mediator-based information integration systems [27].
XWRAPComposer differs from those systems in three aspects. First, we explicitly separate the tasks of building wrappers that are specific to a Web service from the tasks that are repetitive across services, so that the repetitive code can be generated as wrapper library components and reused automatically by the wrapper generator system. Second, we use inductive learning algorithms that derive information flow and data extraction patterns by reasoning about sample pages or sample specifications. More importantly, we design a declarative rule-based script language for multi-page information extraction, encouraging a clean separation of the information extraction semantics from the information flow control and execution logic of wrapper programs.
Three unique features distinguish XWRAPComposer from existing wrapper development approaches. First, XWRAPComposer is designed to enable multi-stage and multi-page data extraction. Second, XWRAPComposer is the only wrapper generation system that promotes the distinction of information extraction logic from request-response flow control logic, allowing a higher level of robustness against changes in the service provider's web site design or infrastructure. Third, XWRAPComposer provides a user-friendly plug-and-play interface, allowing seamless incorporation of external services and continuously changing service interfaces and data formats.
References

[1] DBCAT, the public catalog of databases. http://www.infobiogen.fr/services/dbcat.
[2] R. Baumgartner, S. Flesca, and G. Gottlob. Visual web information extraction with Lixto. In Proceedings of VLDB, 2001.
[3] R. Bayardo, W. Bohrer, R. B., et al. Semantic integration of information in open and dynamic environments. In Proceedings of ACM SIGMOD Conference, 1997.
[4] Berkeley. Ptolemy group in EECS. http://ptolemy.eecs.berkeley.edu/, 2003.
[5] D. Buttler, L. Liu, and C. Pu. A fully automated object extraction system for the world wide web. In Proceedings of IEEE ICDCS, April 2001.
[6] T. Critchlow, K. Fidelis, M. Ganesh, R. Musick, and T. Slezak. DataFoundry: Information management for scientific data. IEEE Transactions on Information Technology in Biomedicine, 4(1):52-57, March 2000.
[7] S. Davidson, O. Buneman, J. Crabtree, V. Tannen, G. Overton, and L. Wong. BioKleisli: Integrating biomedical data and analysis packages. In Bioinformatics: Databases and Systems, S. Letovsky, editor, Kluwer Academic Publishers, Norwell, MA, pages 201-211, 1999.
[8] DISL Group, Georgia Institute of Technology. XWRAP Elite Project, 2000.
[9] DISL Group, Georgia Institute of Technology. XWRAPComposer. http://www.cc.gatech.edu/projects/disl/XWRAPComposer/, 2003.
[10] C. A. Goble, R. Stevens, G. Ng, S. Bechhofer, N. Paton, P. G. Baker, M. Peim, and A. Brass. Transparent access to multiple bioinformatics information sources. IBM Systems Journal, 40(2):532-551, 2001.
[11] L. Haas, D. Kossmann, E. Wimmers, and J. Yan. Optimizing queries across diverse data sources. In Proceedings of VLDB, 1997.
[12] L. Haas, P. Schwarz, P. Kodali, E. Kotlar, J. Rice, and W. Swope. DiscoveryLink: A system for integrated access to life sciences data sources. IBM Systems Journal, 40(2):489-511, 2001.
[13] C. A. Knoblock, S. Minton, J. L. Ambite, N. Ashish, P. J. Modi, I. Muslea, A. Philpot, and S. Tejada. Modeling web sources for information integration. In Proceedings of AAAI Conference, 1998.
[14] C. Li, R. Yerneni, V. Vassalos, H. Garcia-Molina, Y. Papakonstantinou, J. Ullman, and M. Valiveti. Capability based mediation in TSIMMIS. In Proceedings of ACM SIGMOD Conference, 1997.
[15] L. Liu, C. Pu, and W. Han. XWrap: An extensible wrapper construction system for Internet information sources. Technical report, OGI/CSE, February 1999.
[16] S. McGinnis. GenBank user services, National Center for Biotechnology Information (NCBI), National Library of Medicine, US National Institutes of Health. Personal communication, Jan.
[17] NCBI. National Center for Biotechnology Information – BLAST databases. http://www.ncbi.nlm.nih.gov/BLAST/, 2003.
[18] D. Raggett. Clean up your web pages with HTML TIDY. http://www.w3.org/People/Raggett/tidy/, 1999.
[19] A. Sahuguet and F. Azavant. WysiWyg Web Wrapper Factory (W4F). In Proceedings of WWW Conference, 1999.
[20] A. C. Siepel, A. N. Tolopko, A. D. Farmer, P. A. Steadman, F. D. Schilkey, B. Perry, and W. D. Beavis. An integration platform for heterogeneous bioinformatics software components. IBM Systems Journal, 40(2):570-591, 2001.
[21] L. Team. LDRD Project, 2004.
[22] W3C. Reformulating HTML in XML. http://www.w3.org/TR/WD-html-in-xml/, 1999.
[23] W3C. Web services. http://www.w3c.org/2002/ws/, 2002.
[24] W3C. Web Services Description Language (WSDL) version 1.2 part 1: Core language. http://www.w3c.org/TR/wsdl12/, 2003.
[25] H. Wei. Wrapper Application Generation for Semantic Web: An XWRAP Approach. PhD thesis, Georgia Institute of Technology, 2003.
[26] J. Westbrook and P. Bourne. STAR/mmCIF: An extensive ontology for macromolecular structure and beyond. Bioinformatics, 16(2):159-168, 2000.
[27] G. Wiederhold. Mediators in the architecture of future information systems. IEEE Computer, 1992.