Hybrid Keyword Search across Peer-to-Peer Federated Data PhD Dissertation Defense Florida State University Jungkee (Jake) Kim.


Motivation

Internet: Where is the information?

Outline

Two Typical Search Paradigms

Problem Statements of Current Approaches

Hybrid Keyword Search

Hybrid Search on Distributed Databases

Hybrid Search across Peer-to-Peer Federated Databases

Two Typical Search Paradigms

Searching over structured data: Relational Databases

Searching over unstructured data: Information Retrieval

Internet Environment

Semistructured Data – XML

Keyword Search in DB

Web Search Engines – Technologies from Information Retrieval

Hybrid Keyword Search?

Current Approaches – Keyword-only Search

Web Search Engines: Web crawlers visit Web pages and collect keyword-based text indexes; fast information retrieval.

Keyword Search in databases: Web integration on legacy DBMS; dynamic Web publication through embedded DB; easy to use without knowledge of the DB schema.

Problems of Current Approaches – Keyword-based

Web Search Engines: cannot collect every connected resource; query results are often unrelated.

Keyword Search in Databases: loses the inherent meaning of the schema; query results are not based on the semantic schema.

Current Approaches – Semantic

Semantic Web: multiple relation links form directed labeled graphs, so machines can understand the relationships between different resources.

Describes metadata about resources.

Represents the relations of objects on the Web; the object terms are defined under a specific description, an Ontology.

Problems of Current Approaches – Semantic Web

Ontology design is sophisticated

Lack of unified definition

Limited adoption

Our Approach

Hybrid search mechanisms – Semantic metadata + Keyword search

Semantic Solution: Semantic Web might be better than Hybrid search

Hybrid search must be better than Web search engines

Simplicity: Hybrid search is simpler than Semantic Web

Hybrid Keyword Search Service

A search service fetches target information data against a search query.

Unstructured data: a file containing data – MS Word, PDF, PS documents

Metadata: structured or semistructured data – XML

We utilized an XML-enabled relational DBMS and a native XML DB along with a text search library (Apache Xindice + Jakarta Lucene) to address the search against metadata and text.

How to Combine? (1)

Two entities and a relationship in relational DBMS

We can obtain the hybrid search result using a nested subquery
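
The nested-subquery combination can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module rather than the XML-enabled RDBMS from the study, and the table names (metadata, document, describes), the columns, and the sample rows are all hypothetical. The inner query selects documents by metadata; the outer query filters them by keyword.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two entities and one relationship.

    The schema and rows are illustrative assumptions, not the schema
    used in the original experiments.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE metadata (meta_id INTEGER PRIMARY KEY, author TEXT, year INTEGER);
        CREATE TABLE document (doc_id INTEGER PRIMARY KEY, body TEXT);
        CREATE TABLE describes (meta_id INTEGER, doc_id INTEGER);
        """
    )
    conn.executemany(
        "INSERT INTO metadata VALUES (?, ?, ?)",
        [(1, "Kim", 2004), (2, "Fox", 2002), (3, "Kim", 2002)],
    )
    conn.executemany(
        "INSERT INTO document VALUES (?, ?)",
        [
            (10, "hybrid keyword search"),
            (20, "grid services"),
            (30, "keyword search in databases"),
        ],
    )
    conn.executemany(
        "INSERT INTO describes VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)]
    )
    return conn


def hybrid_search(conn, author, keyword):
    """Return ids of documents that contain `keyword` and whose metadata
    record names `author`, using a nested subquery for the metadata part."""
    sql = """
        SELECT d.doc_id
        FROM document AS d
        WHERE d.body LIKE ?
          AND d.doc_id IN (
              SELECT r.doc_id
              FROM describes AS r
              JOIN metadata AS m ON m.meta_id = r.meta_id
              WHERE m.author = ?
          )
        ORDER BY d.doc_id
    """
    return [row[0] for row in conn.execute(sql, (f"%{keyword}%", author))]
```

A production system would use a full-text index in place of the LIKE pattern, which is used here only to keep the sketch self-contained.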

How to Combine? (2)

A hash table is used for joining search results in a non-DBMS-based system (Apache Xindice + Lucene)
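
The hash-table join can be sketched as follows, assuming each search returns a list of pairs keyed by a shared document identifier. The function name, the input shapes, and the sample values are hypothetical and do not reflect the Xindice or Lucene APIs.

```python
def hash_join(metadata_hits, keyword_hits):
    """Join two result lists on document id with a hash table.

    metadata_hits: iterable of (doc_id, metadata_record) pairs
    keyword_hits:  iterable of (doc_id, score) pairs
    Returns (doc_id, metadata_record, score) for ids present in both,
    in the order of keyword_hits.
    """
    # Build phase: index the metadata results by document id.
    table = {doc_id: record for doc_id, record in metadata_hits}
    # Probe phase: keep keyword results whose id is in the table.
    return [
        (doc_id, table[doc_id], score)
        for doc_id, score in keyword_hits
        if doc_id in table
    ]


# Example inputs (made-up values).
metadata_hits = [("d1", {"year": 2004}), ("d3", {"year": 2004})]
keyword_hits = [("d3", 0.9), ("d2", 0.7), ("d1", 0.4)]
joined = hash_join(metadata_hits, keyword_hits)
```

Building the table costs time proportional to the metadata result size and each probe is a constant-time lookup on average, so the join avoids rescanning one result list for every row of the other.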

Local Query Processing – XML (1)

XML-enabled RDB

DBLP XML records (1,000 – 10,000)

Non-indexed matches, except the year match, bound by the number of matches.

Combined query time depends on the number of year query results

Figure: Average XML Query Time

Local Query Processing – XML (2)

Apache Xindice

DBLP XML records (1,000 – 10,000)

Indexed approximate matches for text elements in XML instances as bad as non-indexed queries

Exact matches bound by the number of matches.

Figure: Average XML Query Time

Local Query Processing – Hybrid (1)

Hybrid search query performance measurement

XML-enabled RDB

For 100,000 XML instances and 100,000 text documents

Small result set: 4 XML and a keyword matches

Large result set: 7,752 XML and 41,889 documents

Metadata:       Author       Year (Nested subquery)   Year (Hash table)
Few Keywords    0.04 Sec.    82.9 Sec.                5.70 Sec.
Many Keywords   0.48 Sec.    Half hour                6.96 Sec.

Local Query Processing – Hybrid (2)

Hybrid search query performance measurement

Apache Xindice + Jakarta Lucene

For 10,000 XML instances and 10,000 text documents

Small result set: 2 XML and a keyword matches

Large result set: 192 XML and 4,562 documents

Discussion – Local Hybrid Search

XML-enabled RDB provides proper response except under some extreme query loads.

A native XML DB (Apache Xindice) had very limited scalability. (No accurate query result over 16,000 XML instances)

We will generalize hybrid search to a distributed environment.

Hybrid Search on Distributed Databases

Data Independence: logically and physically independent; the same schema – no change, data encapsulation in each machine

Network Transparency: depends on MOM or P2P framework

No replication – restricted to a computer cluster

Fragment: full partition; horizontal fragmentation

The query result for the distributed databases is the collection of query results from individual database queries.
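
Because the data is horizontally fragmented with no replication, a distributed query reduces to running the same local query on every fragment and collecting the partial results. The sketch below simulates that with in-process dictionaries and a thread pool. The fragment layout, the function names, and the sample records are hypothetical, and the messaging layer is left out.

```python
from concurrent.futures import ThreadPoolExecutor


def local_search(fragment, year, keyword):
    """Run the hybrid query on one fragment.

    fragment maps doc_id -> (year, text); this shape is an assumption
    made for the sketch.
    """
    return [
        doc_id
        for doc_id, (doc_year, text) in fragment.items()
        if doc_year == year and keyword in text
    ]


def distributed_search(fragments, year, keyword):
    """Send the same query to every fragment and collect all results."""
    with ThreadPoolExecutor(max_workers=max(1, len(fragments))) as pool:
        partials = pool.map(
            lambda frag: local_search(frag, year, keyword), fragments
        )
        # Fragments are disjoint (full partition), so concatenating the
        # partial results cannot produce duplicates.
        return sorted(doc_id for part in partials for doc_id in part)


# Three made-up fragments, one per simulated machine.
fragments = [
    {1: (2004, "hybrid keyword search")},
    {2: (2004, "grid services"), 3: (2002, "keyword search")},
    {4: (2004, "keyword index")},
]
```

The response time of such a query is governed by the slowest fragment, since the client has to wait for every partial result before the collection is complete.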

Scalable Hybrid Search Architecture on DDBS

Figure: Distributed Databases based on NaradaBrokering Network. Clients and SearchServices are connected through MessageBrokers in a Cooperating Broker Network. Each Client is a publisher for a query topic and a subscriber for a temporary topic; each SearchService is a subscriber for a query topic and a publisher for a temporary topic. QueryMessages travel from the Client to the SearchServices, and ResultMessages travel back.
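
The query-topic and temporary-topic exchange can be sketched with a toy in-memory broker. This is not the NaradaBrokering or JMS API: the Broker class, the topic names, and the message fields are all hypothetical, and delivery is synchronous to keep the example short.

```python
import itertools
from collections import defaultdict

QUERY_TOPIC = "search/query"  # hypothetical topic name
_reply_ids = itertools.count(1)


class Broker:
    """Toy topic-based broker with synchronous delivery."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def unsubscribe(self, topic):
        self._subscribers.pop(topic, None)

    def publish(self, topic, message):
        # Copy the list so callbacks may subscribe during delivery.
        for callback in list(self._subscribers.get(topic, [])):
            callback(message)


class SearchService:
    """Subscriber for the query topic, publisher for the temporary topic."""

    def __init__(self, broker, name, docs):
        self.broker = broker
        self.name = name
        self.docs = docs  # doc_id -> text (assumed shape)
        broker.subscribe(QUERY_TOPIC, self.on_query)

    def on_query(self, query):
        hits = sorted(
            doc_id
            for doc_id, text in self.docs.items()
            if query["keyword"] in text
        )
        # Reply on the temporary topic named in the query message.
        self.broker.publish(
            query["reply_to"], {"service": self.name, "hits": hits}
        )


def client_query(broker, keyword):
    """Publisher for the query topic, subscriber for a temporary topic."""
    reply_topic = f"tmp/{next(_reply_ids)}"
    results = []
    broker.subscribe(reply_topic, results.append)
    broker.publish(QUERY_TOPIC, {"keyword": keyword, "reply_to": reply_topic})
    broker.unsubscribe(reply_topic)
    return results
```

The temporary topic gives each query its own return channel, so results for concurrent clients do not mix. A real deployment would deliver asynchronously and would need a timeout to decide when the result collection is complete.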

Query Processing – DDBS (1)

100,000 XML and 100,000 Documents in 8 machines – 12,500 each

Few keyword match (1-3) on 1 machine only

RDB – 0.04 Sec. for few keyword match

Figure: Avg. response time for an author exact match query over 8 search services

Query Processing – DDBS (2)

100,000 XML and 100,000 Documents in 8 machines – 12,500 each

RDB – half hour or 6.96 Sec. (Hash table)

Figure: Avg. response time for a year match query over 8 search services

Data Integration Hub

Partial integration – possible method to increase the data portion queried; c.f. Supernode in P2P

We designed a partial integration architecture through a message-oriented middleware – the NaradaBrokering system

NaradaBrokering system: JMS compliant topic-based communication; scalability by brokers' hierarchical connection; passive queries / static binding

We attached a RDBMS to store the metadata and index the contents of the data

Architecture of Data Integration Hub

Coupling vs. Scalability

From ICDE 2002 Tutorial

Query Propagate and Results back on a P2P Network

Peer group architecture of the P2P Search

Performance Test for Peer Group Communication (JXTA)

Figure: Client Peer, Rendezvous Peer, and Search Service Peers across Subnet A, Subnet B, and Subnet C, connected by Group Propagation and a Point-to-point Pipe Connection.

Performance for Group Peer Communication – 1 Peer per Node

Figure: Average Response Time for a Query

Performance for Group Peer Communication – Multiple Peers per Node Allowed (1)

Figure: Average Response Time for a Query with Multiple Peers per Node Allowed

Performance for Group Peer Communication – Multiple Peers per Node Allowed (2)

Figure: Message Response Time for 32 Group Peers

Related Works (1)

Distributed lookup in routing to reduce unnecessary communications: Distributed Hash Table (DHT) – Chord, CAN, Pastry, and Tapestry; JXTA: DHT + multiple random walks

Look up peers based on reputation

Hristidis et al. – Exploiting a context on existing RDBMS while reducing the schema loss of Keyword Search in DB

Related Works (2)

Method           Metadata (XML)   Contents       Note
PlanetP          No               Yes            Gossiping; thousands of peers
ODISSEA          No               Yes            Dist. global index; Pastry
Galanis et al.   Yes              No             Dist. directories; Chord, thousands
XRANK            Yes              Yes (in XML)   No P2P

Conclusion

We addressed the semantic loss of keyword-only search while remaining a simpler solution than the Semantic Web

Low cost scalability over heterogeneous resources through customized overlay networks

A practical bridging role on the road towards the ideal of information represented by Semantic Web?

Contributions

Demonstration of a hybrid search – combining metadata search with a keyword search over unstructured context data

A way to increase locality and integrate several dispersed resources through a data integration hub

Extension of the scalability of a native XML database and performance improvement for some queries compared to those on a single machine

Generalization of our hybrid search architecture on potentially more scalable P2P overlay network

Publications

J. Kim and G. Fox. Scalable Hybrid Search on Distributed Databases. Accepted for presentation in 3rd International Workshop on Autonomic Distributed Data and Storage Systems Management (ADSM) in conjunction with ICCS. To appear in Lecture Notes in Computer Science. May, 2005.

J. Kim and G. Fox. A Hybrid Keyword Search across Peer-to-Peer Federated Databases. In Proceedings of 8th East-European Conference on Advances in Databases and Information Systems (ADBIS), September, 2004.

J. Kim, O. Balsoy, M. Pierce, and G. Fox. Design of a Hybrid Search in the Online Knowledge Center. In Proceedings of IASTED International Conference on Information and Knowledge Sharing, November, 2002.

G. Aydin, H. Altay, M. S. Aktas, M. N. Aysan, G. Fox, C. Ikibas, J. Kim, A. Kaplan, A. E. Topcu, M. Pierce, B. Yildiz, and O. Balsoy. Online Knowledge Center Tools for Metadata Management. Technical report, DoD HPCMP Users Group Meeting, June, 2003.

O. Balsoy, M. S. Aktas, G. Aydin, M. N. Aysan, C. Ikibas, A. Kaplan, J. Kim, M. Pierce, A. Topcu, B. Yildiz, and G. Fox. The Online Knowledge Center: Building a Component Based Portal. In Proceedings of the International Conference on Information and Knowledge Engineering, June, 2002.

G. Fox, S. Ko, M. Pierce, O. Balsoy, J. Kim, S. Lee, K. Kim, S. Oh, X. Rao, M. Varank, H. Bulut, G. Gunduz, X. Qiu, S. Pallickara, A. Uyar, and C. Youn. Grid services for earthquake science. Concurrency and Computation: Practice and Experience, 14:371–393, May–June 2002.