Hybrid Keyword Search across Peer-to-Peer Federated Data
PhD Dissertation Defense
Florida State University
Jungkee (Jake) Kim
Motivation
Internet: Where is the Information?
Outline
Two Typical Search Paradigms
Problem Statements of Current Approaches
Hybrid Keyword Search
Hybrid Search on Distributed Databases
Hybrid Search across Peer-to-Peer Federated Databases
Two Typical Search Paradigms
Searching over structured data
  Relational Databases
Searching over unstructured data
  Information Retrieval
Internet Environment
  Semistructured Data – XML
  Keyword Search in DB
  Web Search Engines – Technologies from Information Retrieval
Hybrid Keyword Search?
Current Approaches – Keyword-only Search
Web Search Engines
  Web crawlers visit Web pages and collect keyword-based text indexes.
  Fast information retrieval
Keyword Search in databases
  Web integration on legacy DBMS
  Dynamic Web publication through embedded DB
  Easy to use without knowledge of DB schema
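The keyword-based text index collected by a crawler can be pictured as an inverted index. The following is only a minimal sketch under that assumption; the function names, the sample pages, and the whitespace tokenizer are hypothetical and are not taken from the slides.

```python
from collections import defaultdict

def build_index(pages):
    """Map each lowercase token to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in text.lower().split():
            index[token].add(page_id)
    return index

def search(index, query):
    """Return the ids of pages containing every query token (AND semantics)."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Hypothetical sample pages, for illustration only.
pages = {
    "p1": "peer to peer keyword search",
    "p2": "relational database keyword query",
    "p3": "semantic web ontology",
}
index = build_index(pages)
```

A query such as search(index, "keyword search") needs no knowledge of any schema, which is the ease-of-use point made above, but it also carries no structural meaning.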
Problems of Current Approaches – Keyword-based
Web Search Engines
  Cannot collect every connected resource
  Query results are often unrelated
Keyword Search in Databases
  Loses the inherent meaning of the schema
  Query results are not based on the semantic schema
Current Approaches – Semantic
Semantic Web
  Multiple relation links with directed labeled graphs, so that machines can understand the relationship between different resources
  Describes metadata about resources
  Represents the relations of the objects on the Web; the object terms are defined under a specific description – an Ontology
Problems of Current Approaches – Semantic Web
Ontology design is sophisticated
Lack of unified definition
A search service fetches target information data against a search query.
Unstructured data: A file containing data – MS Word, PDF, PS documents
Metadata: Structured or semistructured data – XML
We utilized an XML-enabled relational DBMS and a native XML DB along with a text search library (Apache Xindice + Jakarta Lucene) to address the search against metadata and text.
How to Combine? (1)
Two entities and a relationship in relational DBMS
We can obtain the hybrid search result using a nested subquery
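A nested subquery over two entities and a relationship might look like the sketch below. The table names (metadata, document, describes), the columns, and the sample rows are assumptions made for illustration; the slides do not give the actual schema, and SQLite stands in here for the XML-enabled relational DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: two entities and one relationship.
    CREATE TABLE metadata (meta_id INTEGER PRIMARY KEY, author TEXT, year INTEGER);
    CREATE TABLE document (doc_id INTEGER PRIMARY KEY, body TEXT);
    CREATE TABLE describes (
        meta_id INTEGER REFERENCES metadata(meta_id),
        doc_id  INTEGER REFERENCES document(doc_id)
    );
""")
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?)",
                 [(1, "Kim", 2004), (2, "Fox", 2002), (3, "Kim", 2002)])
conn.executemany("INSERT INTO document VALUES (?, ?)",
                 [(10, "hybrid keyword search over peers"),
                  (20, "grid services for science"),
                  (30, "keyword search in databases")])
conn.executemany("INSERT INTO describes VALUES (?, ?)",
                 [(1, 10), (2, 20), (3, 30)])

def hybrid_search(conn, author, keyword):
    """Return ids of documents that match the keyword AND whose metadata matches the author."""
    sql = """
        SELECT doc_id FROM document
        WHERE body LIKE ?
          AND doc_id IN (
              SELECT doc_id FROM describes
              WHERE meta_id IN (SELECT meta_id FROM metadata WHERE author = ?)
          )
        ORDER BY doc_id
    """
    return [row[0] for row in conn.execute(sql, (f"%{keyword}%", author))]
```

Here the inner queries resolve the metadata condition and the outer query applies the keyword condition, so hybrid_search(conn, "Kim", "keyword") returns only documents satisfying both.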
How to Combine? (2)
A hash table is used for joining search results in a non-DBMS based system (Apache Xindice + Lucene)
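The hash-table join of the two result lists can be sketched as follows. This is a generic build-and-probe join, assuming both result sets share a document identifier; the identifiers, the tuple layout, and the function name are hypothetical, and no Xindice or Lucene API is used.

```python
def hash_join(metadata_hits, keyword_hits):
    """Join metadata hits and keyword hits on a shared document id.

    metadata_hits: list of (doc_id, metadata) pairs from the XML query.
    keyword_hits:  list of (doc_id, score) pairs from the text search.
    """
    # Build phase: index the metadata hits by document id.
    table = {doc_id: meta for doc_id, meta in metadata_hits}
    # Probe phase: keep keyword hits whose document id is in the table.
    joined = []
    for doc_id, score in keyword_hits:
        if doc_id in table:
            joined.append((doc_id, table[doc_id], score))
    return joined

# Hypothetical result lists, for illustration only.
metadata_hits = [("d1", {"year": 2004}), ("d2", {"year": 2004})]
keyword_hits = [("d2", 0.9), ("d3", 0.4)]
joined = hash_join(metadata_hits, keyword_hits)
```

The join takes one pass over each list, which is why it does not degrade with the size of the metadata result the way a repeated inner lookup would.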
Local Query Processing – XML (1)
XML-enabled RDB
DBLP XML records (1,000 – 10,000)
Non-indexed matches, except year match, bound by the number of matches.
Combined query time depends on the number of year query results
[Figure: Average XML Query Time]
Local Query Processing – XML (2)
Apache Xindice
DBLP XML records (1,000 – 10,000)
Indexed approximate matches for text elements in XML instances are as bad as non-indexed queries
Exact matches bound by the number of matches.
[Figure: Average XML Query Time]
Local Query Processing – Hybrid (1)
Hybrid search query performance measurement
XML-enabled RDB
For 100,000 XML instances and 100,000 text documents
Small result set: 4 XML matches and a keyword match
Large result set: 7,752 XML and 41,889 documents

Metadata        Author      Year (Nested subquery)   Year (Hash table)
Few Keywords    0.04 Sec.   82.9 Sec.                5.70 Sec.
Many Keywords   0.48 Sec.   Half hour                6.96 Sec.
Local Query Processing – Hybrid (2)
Hybrid search query performance measurement
Apache Xindice + Jakarta Lucene
For 10,000 XML instances and 10,000 text documents
Small result set: 2 XML matches and a keyword match
Large result set: 192 XML and 4,562 documents
Discussion – Local Hybrid Search
XML-enabled RDB provides proper response times except under some extreme query loads.
A native XML DB (Apache Xindice) had very limited scalability. (No accurate query result over 16,000 XML instances)
We will generalize hybrid search to a distributed environment.
Hybrid Search on Distributed Databases
Data Independence: logically and physically independent; the same schema – no change, data encapsulation in each machine
Network Transparency: depends on MOM or P2P framework
No replication – restricted to a computer cluster
Fragment: full partition; horizontal fragmentation
The query result for the distributed databases is the collection of query results from individual database queries.
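Collecting the per-database results over horizontal fragments can be sketched as a scatter-gather step. In this sketch, threads stand in for the separate machines and a list of dictionaries stands in for each local database; the record fields and function names are assumptions, not the system's actual interfaces.

```python
from concurrent.futures import ThreadPoolExecutor

def search_fragment(fragment, predicate):
    """Run the same query against one horizontal fragment."""
    return [record for record in fragment if predicate(record)]

def distributed_search(fragments, predicate):
    """Send the query to every fragment and concatenate the partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(fragments))) as pool:
        partials = pool.map(lambda frag: search_fragment(frag, predicate), fragments)
        results = []
        for partial in partials:
            results.extend(partial)
    return results

# Hypothetical fragments: same schema on every machine, no replication.
fragments = [
    [{"id": 1, "year": 2002}, {"id": 2, "year": 2004}],
    [{"id": 3, "year": 2004}],
    [{"id": 4, "year": 2001}],
]
hits = distributed_search(fragments, lambda record: record["year"] == 2004)
```

Because the fragments are disjoint and share one schema, the global result is simply the concatenation of the local results, with no deduplication step.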
Scalable Hybrid Search Architecture on DDBS

100,000 XML and 100,000 documents in 8 machines – 12,500 each
Few keyword match (1-3) on 1 machine only
RDB – 0.04 Sec. for few keyword match
[Figure: Avg. response time for an author exact match query over 8 search services]

100,000 XML and 100,000 documents in 8 machines – 12,500 each
RDB – half hour or 6.96 Sec. (Hash table)
[Figure: Avg. response time for a year match query over 8 search services]
Data Integration Hub
Partial integration – possible method to increase the data portion queried (cf. Supernode in P2P)
We designed a partial integration architecture through a message-oriented middleware – the NaradaBrokering system
NaradaBrokering system
  JMS compliant topic-based communication
  Scalability through hierarchical connection of brokers
  Passive queries / Static binding
We attached an RDBMS to store the metadata and index the contents of the data
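The hub idea, topic-based messages feeding an attached RDBMS, can be illustrated with a toy in-process broker. This is not the JMS or NaradaBrokering API; the classes TopicBroker and IntegrationHub, the topic name, and the message fields are all hypothetical, and SQLite stands in for the attached RDBMS.

```python
import sqlite3
from collections import defaultdict

class TopicBroker:
    """Toy in-process stand-in for a topic-based message broker."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message to every subscriber of this topic.
        for callback in self._subscribers[topic]:
            callback(message)

class IntegrationHub:
    """Stores metadata published by dispersed sources in a local RDBMS."""

    def __init__(self, broker, topic):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE metadata (source TEXT, doc_id TEXT, author TEXT, year INTEGER)"
        )
        broker.subscribe(topic, self._store)

    def _store(self, msg):
        self.db.execute(
            "INSERT INTO metadata VALUES (?, ?, ?, ?)",
            (msg["source"], msg["doc_id"], msg["author"], msg["year"]),
        )

    def query_by_year(self, year):
        rows = self.db.execute(
            "SELECT source, doc_id FROM metadata WHERE year = ? ORDER BY source, doc_id",
            (year,),
        )
        return rows.fetchall()

broker = TopicBroker()
hub = IntegrationHub(broker, "metadata/updates")
broker.publish("metadata/updates", {"source": "peerA", "doc_id": "d1", "author": "Kim", "year": 2004})
broker.publish("metadata/updates", {"source": "peerB", "doc_id": "d7", "author": "Fox", "year": 2004})
broker.publish("metadata/updates", {"source": "peerB", "doc_id": "d8", "author": "Fox", "year": 2002})
```

Once the sources have published, the hub can answer a query locally, which is the sense in which partial integration increases the portion of data reachable from one place.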
Architecture of Data Integration Hub
Coupling vs. Scalability
From ICDE 2002 Tutorial
Query Propagate and Results back on a P2P Network
Peer group architecture of the P2P Search
Performance Test for Peer Group Communication (JXTA)
[Figure: Client Peer, Rendezvous Peer, and Search Service Peers across Subnet A, Subnet B, and Subnet C; Group Propagation; Point-to-point Pipe Connection]
Performance for Group Peer Communication – 1 Peer per Node
[Figure: Average Response Time for a Query]
Performance for Group Peer Communication – Multiple Peers per Node Allowed (1)
[Figure: Average Response Time for a Query with Multiple Peers per Node Allowed]
Performance for Group Peer Communication – Multiple Peers per Node Allowed (2)
[Figure: Message Response Time for 32 Group Peers]
Related Works (1)
Distributed lookup in routing to reduce unnecessary communications
  Distributed Hash Table (DHT) – Chord, CAN, Pastry, and Tapestry
  JXTA: DHT + multiple random walks
Look up peers based on reputation
Hristidis et al. – Exploiting context in an existing RDBMS to reduce the schema loss of Keyword Search in DB
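The DHT-style lookup mentioned under distributed lookup can be illustrated with a simplified identifier ring. This sketch only shows the successor rule used by Chord-like systems (a key belongs to the first node whose identifier is at or after the key's identifier); it omits finger tables and routing hops, and the class and peer names are hypothetical.

```python
import hashlib
from bisect import bisect_left

def ring_id(name):
    """Hash a node name or key onto the identifier ring (full SHA-1 value)."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")

class Ring:
    """Simplified identifier ring: successor lookup only, no routing tables."""

    def __init__(self, node_names):
        self.nodes = sorted((ring_id(name), name) for name in node_names)
        self.ids = [node_id for node_id, _ in self.nodes]

    def successor(self, key):
        """Return the node responsible for the key, wrapping around the ring."""
        pos = bisect_left(self.ids, ring_id(key))
        return self.nodes[pos % len(self.nodes)][1]

# Hypothetical peers, for illustration only.
names = ["peer-1", "peer-2", "peer-3", "peer-4"]
ring = Ring(names)
owner = ring.successor("hybrid keyword search")
```

Every peer that applies the same rule reaches the same owner for a key, so a query can be sent to one peer instead of being propagated to the whole group.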
Related Works (2)

Method           Metadata (XML)   Contents       Note
PlanetP          No               Yes            Gossiping; thousands of peers
ODISSEA          No               Yes            Dist. global index; Pastry
Galanis et al.   Yes              No             Dist. directories; Chord, thousands
XRANK            Yes              Yes (in XML)   No P2P
Conclusion
We addressed the semantic loss of keyword-only search while remaining a simpler solution than the Semantic Web
Low-cost scalability over heterogeneous resources through customized overlay networks
A practical bridging role on the road towards the ideal of information represented by the Semantic Web?
Contributions
Demonstration of a hybrid search – combining metadata search with a keyword search over unstructured context data
A way to increase locality and integrate several dispersed resources through a data integration hub
Extension of the scalability of a native XML database and performance improvement for some queries compared to those on a single machine
Generalization of our hybrid search architecture on a potentially more scalable P2P overlay network
Publications
J. Kim and G. Fox. Scalable Hybrid Search on Distributed Databases. Accepted for presentation in 3rd International Workshop on Autonomic Distributed Data and Storage Systems Management (ADSM) in conjunction with ICCS. To appear in Lecture Notes in Computer Science. May, 2005.
J. Kim and G. Fox. A Hybrid Keyword Search across Peer-to-Peer Federated Databases. In Proceedings of 8th East-European Conference on Advances in Databases and Information Systems (ADBIS), September, 2004.
J. Kim, O. Balsoy, M. Pierce, and G. Fox. Design of a Hybrid Search in the Online Knowledge Center. In Proceedings of IASTED International Conference on Information and Knowledge Sharing, November, 2002.
G. Aydin, H. Altay, M. S. Aktas, M. N. Aysan, G. Fox, C. Ikibas, J. Kim, A. Kaplan, A. E. Topcu, M. Pierce, B. Yildiz, and O. Balsoy. Online Knowledge Center Tools for Metadata Management. Technical report, DoD HPCMP Users Group Meeting, June, 2003.
O. Balsoy, M. S. Aktas, G. Aydin, M. N. Aysan, C. Ikibas, A. Kaplan, J. Kim, M. Pierce, A. Topcu, B. Yildiz, and G. Fox. The Online Knowledge Center: Building a Component Based Portal. In Proceedings of the International Conference on Information and Knowledge Engineering, June, 2002.
G. Fox, S. Ko, M. Pierce, O. Balsoy, J. Kim, S. Lee, K. Kim, S. Oh, X. Rao, M. Varank, H. Bulut, G. Gunduz, X. Qiu, S. Pallickara, A. Uyar, and C. Youn. Grid services for earthquake science. Concurrency and Computation: Practice and Experience, 14:371–393, May–June 2002.