SEMANTIC SEARCH OVER THE WEB By ALİ ERKAN
Introduction
• Semantic search aims to improve the accuracy of the search process by understanding the context and limiting ambiguity.
• Semantic search aims to make the semantics of Web content machine-understandable.
• The semantic Web creates associations between different representations of the same real-world entity.
• Such associations allow data from many different sources to be interlinked (the linked open data cloud).
• Existing solutions are either search engines that simply index semantic Web data or traditional search engines enhanced with some basic form of synonym usage, as supported by Google and Bing.
• The semantic Web is a huge distributed database that we can query to get information coming from different sources.
Nature of Semantic Data
Resource Description Framework (RDF)
• All data items in RDF are uniformly represented as triples of the form (subject, predicate, object) or (subject, property, value).
• RDF extends the linking structure of the Web by using URIs to name the relationship between things as well as the two ends of the link.
• This linking structure forms a directed, labeled graph.
• The graph view is the easiest possible mental model for RDF.
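The graph view can be made concrete with a small sketch. The triples below are hypothetical illustration data (the prefixed names and example.com URIs are assumptions, not taken from a real dataset); each triple becomes one directed edge labeled with its predicate.

```python
from collections import defaultdict

# Hypothetical example triples (subject, predicate, object).
TRIPLES = [
    ("http://example.com/Peter", "rdf:type", "foaf:Person"),
    ("http://example.com/Peter", "foaf:name", "Peter Smith"),
    ("http://example.com/Peter", "foaf:knows", "http://example.com/People/Paula"),
]


def build_graph(triples):
    """Map each subject node to its outgoing (edge label, target node) pairs."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)


GRAPH = build_graph(TRIPLES)

# Print every labeled edge of the directed graph.
for node, edges in GRAPH.items():
    for label, target in edges:
        print(f"{node} --{label}--> {target}")
```

The adjacency-list layout is only one possible choice; it makes it cheap to follow outgoing edges from a subject, which is what navigation over the graph needs.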
Advantages of RDF
• RDF offers a standardized and flexible framework for publishing structured data on the Web such that:
• (1) data can be linked, incorporated, extended, and reused by other RDF data across the Web;
• (2) heterogeneous data from independent sources can be automatically integrated by software agents;
• (3) the meaning of data can be well defined using ontologies.
Web of Data
• Today, most Web sites are generated from structured data that is stored in relational databases.
• The main benefit of using the ontology is that the corresponding data are clean and well structured.
• A lot of Web sites embed structured data into HTML pages.
• Google, Yahoo!, and Microsoft have jointly agreed on a set of vocabularies for describing over 200 different types of entities.
• Question: “How can we embed structured data into HTML pages and link them to each other?”
Topology of the Web of Data
• Microformats
• Microformats is a technique for marking up structured data about specific types of entities.
• RDFa
• W3C started in 2004 to standardize RDFa as an alternative.
• Microdata
• Microdata is an alternative proposal for embedding structured data into Web pages, which was initially presented as part of the HTML5 standardization effort in 2009.
• Linked Data
• The term Linked Data refers to a set of best practices for publishing structured data directly on the Web.
Microformats
• Designed for humans first, machines second.
• Microformats require the development of specialized parsers for each format.
• Microformats are used to address specific use cases.
• Microformats consist of a definition of a vocabulary (names for classes and properties), as well as a set of rules (e.g., required properties, correct nesting of elements).
• HTML/XHTML attributes are used for inserting markup.
• The microformats community encourages mixing microformats and reusing existing formats when creating new ones.
Microformats Syntax
• The figure shows the Microformat representation of the example data for Peter Smith.
• The vcard is a root class name indicating the presence of an hCard.
• The properties are url (Peter’s home page) and fn (full name).
• The markup also states that Peter knows Paula, using the properties met and acquaintance.
Microformats Deployment on the Web
• Yahoo! Search indexes semantic markup including hCard, hCalendar, hReview, hAtom, and XFN.
• Google parses the hCard, hReview, and hProduct microformats and uses them to populate search result pages.
• Facebook publishes event pages annotated with hCalendar.
• Yelp.com adds hReview and hCard to all of their listings.
• Wikipedia templates are able to automatically generate microformats such as geo, hCard, and hCalendar markup.
RDFa Syntax
• RDFa allows one to embed RDF triples within the HTML document object model (DOM).
• The RDFa syntax specifies how HTML elements may be annotated with entity identifiers, entity types, string properties, and relationship properties.
• The HTML attribute @about indicates the entity being described, here identified by the URI reference http://example.com/Peter.
• The HTML attribute @rel specifies a relationship property between the HTML element and the target URL.
• The property foaf:knows is used to state that Peter knows Paula.
• For string properties, the attribute @property is used (here foaf:name, to express Peter’s name).
• A central idea of RDFa is the support for multiple, decentralized, independent, extensible vocabularies, in contrast to the community-driven centralized management of microformats.
Microdata
• Microdata is an attempt to provide a simpler alternative to RDFa and Microformats.
• It defines five new HTML attributes (as compared to zero for Microformats and eight for RDFa).
• It provides a unified syntax (in contrast to Microformats).
• It allows for the usage of any vocabularies (similarly to RDFa).
• W3C currently has two draft specifications (Microdata and RDFa) with the same objective.
Microdata Syntax
• Microdata consists of a group of name–value pairs.
• The groups are called items, and each name–value pair is a property.
• In order to mark up an item, the itemscope attribute is applied to an HTML element.
• To add a property to an item, the itemprop attribute is used.
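As a rough illustration of how a consumer might read such markup, the sketch below extracts the name–value pairs of one flat item using only Python's standard html.parser module. The HTML snippet, the MicrodataParser class, and the choice of the schema.org Person type are assumptions made for this example; nested items and most of the real Microdata processing rules are deliberately ignored.

```python
from html.parser import HTMLParser

# Hypothetical page fragment with one item (itemscope) and two properties (itemprop).
HTML = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Peter Smith</span>
  <a itemprop="url" href="http://example.com/Peter">home page</a>
</div>
"""


class MicrodataParser(HTMLParser):
    """Collect itemprop name-value pairs from a single flat (non-nested) item."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._pending = None  # property name waiting for its text value

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = attributes.get("itemprop")
        if name is None:
            return
        if "href" in attributes:
            # For link-like elements the property value is the URL.
            self.properties[name] = attributes["href"]
        else:
            self._pending = name

    def handle_data(self, data):
        # For other elements the property value is the text content.
        if self._pending is not None and data.strip():
            self.properties[self._pending] = data.strip()
            self._pending = None


parser = MicrodataParser()
parser.feed(HTML)
PROPERTIES = parser.properties
print(PROPERTIES)
```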
Linked Data
• The term Linked Data refers to a set of best practices for publishing structured data directly on the Web.
• Linked Data uses hyperlinks to connect disparate data into a single global data space.
• A Linked Data application that has looked up a URI and retrieved RDF data can discover more data by following links.
• In a Linked Data context, if an RDF link connects URIs in different namespaces, it ultimately connects resources in different datasets.
Linked Data Principles
1. Use HTTP URIs as names for things.
2. When someone looks up a URI, provide useful information, using recommended standards (RDF, SPARQL).
3. Include links to other URIs, so that they can discover more things.
4. Whenever a Linked Data client looks up an HTTP URI over the HTTP protocol, the corresponding Web server returns an RDF description of the identified object using the RDF/XML syntax.
Linked Data (RDF/XML) Syntax
• The example uses FOAF, a vocabulary for describing people.
• The URI http://example.com/Peter is of type foaf:Person.
• foaf:name states that this thing has the name Peter Smith.
• foaf:knows states that Peter Smith knows Paula Jones, who is identified by the URI reference http://example.com/People/Paula.
Evaluation Data for Search Engines
• A number of publicly available evaluation datasets have been crawled from the Web and can be used for evaluating semantic search applications:
• ClueWeb09
• TREC Entity
• CommonCrawl
• WebDataCommons
• Sindice
• Billion Triple Challenge
• SemSearch
• Alternatively, Web data can be obtained with publicly available crawling software, such as Nutch for crawling Web pages and LDSpider for crawling Linked Data.
Challenges of the “Web of Data”
• Applications that want to exploit the Web of Data are facing two main challenges today:
• Semantic Heterogeneity. The different techniques that are used to publish data on the Web lead to a certain degree of syntax heterogeneity.
• Data Quality. The Web is an open medium and everybody can publish data on the Web. Thus, the Web will always contain data that is outdated, conflicting, or intentionally wrong (spam).
Storing and Indexing Structured Data
Perspectives on storage and indexing of RDF datasets
• The Relational Perspective
• An RDF graph is just a particular type of relational data, so techniques developed for storing, indexing, and answering queries on relational data can be reused.
• The Entity Perspective
• Resources in the RDF graph are interpreted as “objects” or “entities”. Each entity is determined by a set of attribute–value pairs.
• The Graph-Based Perspective
• The focus is on supporting navigation in the RDF graph when viewed as a classical graph in which subjects and objects form the nodes, and triples specify directed, labeled edges.
Storing and Indexing Under the Relational Perspective
• There are two different approaches for storing RDF data in relational databases.
• The vertical representation:
• Stores all triples in an RDF graph as a single table over the relation schema (subject, predicate, object).
• Due to the large size of RDF graphs and the potentially large number of self-joins required to answer queries, this representation needs careful indexing to answer queries efficiently.
• The horizontal representation interprets triple predicate values as column names, and stores RDF graphs in one or more wide tables.
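The vertical representation and the self-joins it implies can be sketched with Python's built-in sqlite3 module. The table name, index name, and the ex: and foaf: sample triples are assumptions chosen for illustration only.

```python
import sqlite3

# In-memory database holding one triple table (the vertical representation).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("ex:Peter", "foaf:name", "Peter Smith"),
        ("ex:Peter", "foaf:knows", "ex:Paula"),
        ("ex:Paula", "foaf:name", "Paula Jones"),
    ],
)
# One possible index choice; real triple stores often index several orderings.
conn.execute("CREATE INDEX idx_pso ON triples (predicate, subject, object)")

# "Names of the people Peter knows" needs one self-join of the triple table:
# alias k matches the foaf:knows triple, alias n matches the foaf:name triple.
ROWS = conn.execute(
    """
    SELECT n.object
    FROM triples AS k
    JOIN triples AS n ON n.subject = k.object
    WHERE k.subject = 'ex:Peter'
      AND k.predicate = 'foaf:knows'
      AND n.predicate = 'foaf:name'
    """
).fetchall()
print(ROWS)
```

Each additional triple pattern in a query adds another alias of the same table, which is where the large number of self-joins comes from.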
Horizontal Representation
• RDF data are conceptually stored in a single table of the following format:
• The table has one column for each predicate value that occurs in the RDF graph and one row for each subject value. For each (s, p, o) triple, the object o is placed in the p column of row s.
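A minimal sketch of this pivot, using plain dictionaries instead of a database table, might look as follows. The function name to_horizontal and the sample triples are hypothetical, and a predicate with several objects for the same subject would simply overwrite the earlier value here, which is a simplification.

```python
def to_horizontal(triples):
    """Pivot (s, p, o) triples into one row per subject and one column per predicate."""
    predicates = sorted({p for _, p, _ in triples})
    rows = {}
    for s, p, o in triples:
        # Missing values stay None, like NULLs in a wide relational table.
        row = rows.setdefault(s, {column: None for column in predicates})
        row[p] = o
    return predicates, rows


# Hypothetical sample data.
TRIPLES = [
    ("ex:Peter", "foaf:name", "Peter Smith"),
    ("ex:Peter", "foaf:knows", "ex:Paula"),
    ("ex:Paula", "foaf:name", "Paula Jones"),
]

COLUMNS, TABLE = to_horizontal(TRIPLES)
print(COLUMNS)
for subject, row in TABLE.items():
    print(subject, row)
```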
Disadvantages and Advantages
• There is a weakness when answering queries that do not specify the predicate value.
• The relational schema must be changed whenever a new predicate value is added to the RDF graph.
• On the positive side, the horizontal representation makes it easy to support typing of object values.
• It is easy to integrate existing relational data with RDF data.
Storing and Indexing Under the Entity Perspective
• Resources in the RDF graph are interpreted as “objects,” or “entities.”
• Each entity is determined by a set of attribute–value pairs.
• Heavy use is made of the inverted index data structure.
• Typically, the following two general types of queries are to be supported:
• Simple keyword queries: A keyword query returns all entities that contain an attribute, relationship, and/or value relevant to a given keyword.
• Conditional entity-centric queries: A conditional entity-centric query returns all known entities that satisfy some given conditions on a combination of attributes, relationships, and values at the same time.
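The two query types can be illustrated with a toy inverted index. Everything here (the entity identifiers, the two-index layout, and the function names) is an assumption made for this sketch rather than the design of any particular system.

```python
from collections import defaultdict

# Hypothetical entities, each described by attribute-value pairs.
ENTITIES = {
    "ex:Peter": {"name": "Peter Smith", "city": "London"},
    "ex:Paula": {"name": "Paula Jones", "city": "Paris"},
    "ex:SmithCafe": {"name": "Smith Cafe", "city": "Paris"},
}


def build_indexes(entities):
    """Build token -> entities and (attribute, token) -> entities inverted indexes."""
    keyword_index = defaultdict(set)
    attribute_index = defaultdict(set)
    for entity_id, attributes in entities.items():
        for attribute, value in attributes.items():
            for token in value.lower().split():
                keyword_index[token].add(entity_id)
                attribute_index[(attribute, token)].add(entity_id)
    return keyword_index, attribute_index


def keyword_query(keyword_index, keyword):
    """Simple keyword query: entities with any value containing the keyword."""
    return set(keyword_index.get(keyword.lower(), set()))


def conditional_query(attribute_index, conditions):
    """Conditional entity-centric query: every attribute condition must hold."""
    matches = [
        attribute_index.get((attribute, token.lower()), set())
        for attribute, token in conditions.items()
    ]
    return set.intersection(*matches) if matches else set()


KEYWORD_INDEX, ATTRIBUTE_INDEX = build_indexes(ENTITIES)
print(keyword_query(KEYWORD_INDEX, "Smith"))
print(conditional_query(ATTRIBUTE_INDEX, {"name": "smith", "city": "paris"}))
```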
Storing and Indexing Under the Graph-Based Perspective
• The focus is on supporting navigation in the RDF graph in which subjects and objects form the nodes, and predicates specify directed, labeled edges.
• Typical query patterns are graph-theoretic queries such as reachability between nodes.
• The major issue under this perspective is how to explicitly and efficiently store and index the implicit graph structure.
• A structural index is used to obtain a reduced version of this graph where certain nodes have been merged while maintaining all edges.
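A reachability query of the kind mentioned above can be answered with a plain breadth-first search over the triples. This is only a baseline sketch with hypothetical node names; it does not use a structural index, which is what a real system would add to avoid traversing the full graph.

```python
from collections import deque


def reachable(triples, source, target):
    """Return True if target can be reached from source along directed edges."""
    adjacency = {}
    for s, _predicate, o in triples:
        adjacency.setdefault(s, []).append(o)
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


# Hypothetical sample graph: a -> b -> c, and d -> a.
TRIPLES = [
    ("ex:a", "ex:linksTo", "ex:b"),
    ("ex:b", "ex:linksTo", "ex:c"),
    ("ex:d", "ex:linksTo", "ex:a"),
]

print(reachable(TRIPLES, "ex:a", "ex:c"))
print(reachable(TRIPLES, "ex:c", "ex:a"))
```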
Further Research on Indexing
• A major open issue is the incorporation of schema and ontology reasoning (e.g., RDFS and OWL) in storage and indexing.
• Little work exists on the impact of reasoning on disk-based data structures.
• Efficient maintenance of storage and indexing structures as datasets evolve.
• In the entity perspective, investigation of support for richer query languages and integration with techniques from the other two perspectives.
• Study of richer structural indexing techniques and related query processing strategies.
Semantic Wiki
• Semantic wikis are wikis that add machine-processable annotations to wiki pages.
• Annotations exist for data items, most frequently wiki pages and tags, but also smaller portions of text.
• The annotations may be freely chosen tags, or more formal mechanisms such as RDF backed by (imported) RDFS or OWL ontologies may be offered as well.
• The annotations may be used for several processes: consistency checking, improved navigation, search, querying, personalization, context-dependent presentation, and reasoning.
Semantic Wiki Queries
• Annotations are often represented in RDF. They are compatible with SPARQL.
• Semantic wikis usually provide simple full-text search for the querying of textual content or RDF literals.
• A standard RDF query language such as SPARQL or RDQL can often be used for querying the annotations.
• A number of semantic wikis also come with their own language for querying annotations (e.g., KWQL in KiWi).
DBpedia
• DBpedia is structured content extracted from Wikipedia.
• This structured information is made available on the World Wide Web.
• DBpedia allows users to semantically query relationships and properties of Wikipedia resources.
• DBpedia includes links to other related datasets.
• It is possible to ask complex queries of DBpedia through its SPARQL endpoint.
DBpedia SPARQL
• Suppose we are interested in knowing the movies in which Hugh Grant and Colin Firth starred together. We could ask DBpedia the following SPARQL query:
SELECT ?movie WHERE {
  ?movie <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Film> .
  ?movie <http://dbpedia.org/ontology/starring> <http://dbpedia.org/resource/Hugh_Grant> .
  ?movie <http://dbpedia.org/ontology/starring> <http://dbpedia.org/resource/Colin_Firth>
}
Keyword-based Search Systems
Keyword-based search systems address the following key steps:
• Composing a valid semantic query, since it is difficult for a user to master a query language (e.g., SPARQL) and acquire sufficient knowledge about the ontology or the schema of the data source.
• Identifying (substructures holding) data matching the input keywords, by using an indexing system or a database engine. Indexing may be based on the shortest paths to root nodes.
• Linking identified data (substructures) into solutions, since data is usually scattered across multiple places, e.g., in different tables or XML elements.
• Ranking solutions according to a relevance criterion (i.e., a suitable scoring function). A specific implementation of TF/IDF may be used for scoring keyword elements.
• Only the top-k solutions with the highest scores are returned to the users as query answers.
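The ranking and top-k steps can be sketched as follows. The scoring function is one common smoothed TF/IDF variant chosen for illustration, not the scoring used by any specific system, and the solutions are represented here simply as hypothetical text snippets keyed by an identifier.

```python
import math
from collections import Counter


def tf_idf_top_k(solutions, keywords, k):
    """Rank candidate solutions by a TF/IDF score and return the k best identifiers."""
    tokenized = {sid: text.lower().split() for sid, text in solutions.items()}
    n_solutions = len(tokenized)
    # Document frequency: in how many solutions each token occurs.
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    scores = {}
    for sid, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for keyword in keywords:
            keyword = keyword.lower()
            if counts[keyword]:
                tf = counts[keyword] / len(tokens)
                # Smoothed IDF (an assumption) so that very common terms still count a little.
                idf = math.log(1 + n_solutions / doc_freq[keyword])
                score += tf * idf
        if score > 0:
            scores[sid] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical candidate solutions.
SOLUTIONS = {
    "s1": "hugh grant film",
    "s2": "colin firth film",
    "s3": "hugh grant colin firth film",
}

TOP = tf_idf_top_k(SOLUTIONS, ["grant", "firth"], k=2)
print(TOP)
```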
Interactive Query Construction with QUICK
• QUICK creates semantic queries from keywords.
• The QUICK interface consists of three parts:
• A search field (on the top),
• The construction pane showing query construction options (on the left),
• The query pane showing semantic queries (on the right).
Performance Measurements of Search
• Three measures have been proposed to evaluate performance:
• Exhaustivity measures the relevance of a solution in terms of the number of keywords it contains.
• Specificity measures the precision of a solution in terms of the number of keywords it contains with respect to other irrelevant terms occurring in the solution.
• Overlap measures the information content of a solution in terms of its intersection with other solutions.
• Clearly, the best ranking strategy balances exhaustivity and specificity while reducing overlap.
Semantic Web Search Engines
• Hidden Web/Deep Web Approaches
• RDF-Centric Search Engines
• Distributed Web Search Architectures
Hidden Web/Deep Web Approaches
• A vast amount of the information available on the Web is hidden behind sites with heavy dynamic content, usually backed by relational databases.
• Manually constructed, site-specific wrappers are used to extract structured data from HTML pages or to communicate directly with the underlying database of such sites.
• Automatic crawlers exist; however, this approach is “task specific” and not appropriate for general crawling.
• The Semantic Web may represent a future direction for bringing Deep Web information to the surface by using RDF as a common and flexible data model.
RDF-Centric Search Engines
• Early prototypes are Ontobroker and SHOE, which use the concepts of ontologies and semantics on the Web.
• Swoogle offers search over RDF documents by means of an inverted keyword index and a relational database.
• Watson also provides keyword search facilities over Semantic Web documents but additionally provides search over entities.
• Sindice is a registry and lookup service for RDF files based on Lucene and a MapReduce framework.
• The Falcons search engine offers entity-centric searching for entities (and concepts) over RDF data.
• It ranks entities using a logarithm of the count of documents in which they are mentioned.
• The GoWeb system demonstrates the benefit of searching structured data for the biomedical domain.
Distributed Web Search Architectures
• Distributed architectures have long been common in traditional Web search engines.
• The system architecture includes an incremental crawler, ranker and storage manager, indexer, and query processor.
• Some systems use a distributed inverted index (based on an embedded database system) over a large corpus of Web pages, for subsequent analysis and query processing.
Semantic Web Search Engine (SWSE) System Architecture
SWSE
• SWSE consists of crawling, data enhancing, indexing, and a user interface for search, browsing, and retrieval of information; it operates over RDF Web data (Linked Data).
• SWSE allows users to specify keyword queries in an input box and responds with a ranked list of result snippets.
• The results refer to entities, not documents (entity search over instance data).
• Users can subsequently navigate to related entities, thereby browsing the Web of Data.
SWSE Preprocessing
• The crawler accepts a set of seed URIs and retrieves a large set of RDF data from the Web.
• The consolidation component tries to find synonymous (i.e., equivalent) identifiers in the data, and combines the data according to the equivalences found.
• The ranking component performs links-based analysis over the crawled data and derives scores indicating the importance of individual elements in the data (PageRank).
• The reasoning component produces new data which is implied by the inherent semantics of the input data.
• The indexing component prepares an index which supports the information retrieval tasks required by the user interface (inverted index).
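To give a feel for the consolidation step, the sketch below groups equivalent identifiers with a union-find structure. This is an illustrative technique chosen for the example, not a description of how SWSE implements consolidation; the input pairs and the rule of picking the lexicographically smallest identifier as the canonical one are assumptions.

```python
def consolidate(equivalent_pairs):
    """Map every identifier to one canonical identifier of its equivalence class."""
    parent = {}

    def find(identifier):
        parent.setdefault(identifier, identifier)
        while parent[identifier] != identifier:
            # Path halving keeps the lookup chains short.
            parent[identifier] = parent[parent[identifier]]
            identifier = parent[identifier]
        return identifier

    for left, right in equivalent_pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            # Assumption: the lexicographically smallest identifier becomes canonical.
            parent[max(root_left, root_right)] = min(root_left, root_right)
    return {identifier: find(identifier) for identifier in list(parent)}


# Hypothetical equivalences, e.g. as stated by owl:sameAs links between datasets.
PAIRS = [("ex:b", "ex:a"), ("ex:c", "ex:b"), ("ex:x", "ex:y")]

CANONICAL = consolidate(PAIRS)
print(CANONICAL)
```

Once every identifier maps to a canonical one, the triples of all synonyms can be rewritten to that identifier so that data about the same entity ends up together.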
SWSE Query Processing
• With the distributed index built and prepared on the slave machines, the query processor is able to accept user queries.
• For a top-k keyword query, the coordinating machine requests k result identifiers and ranks from each of the slave machines.
• The coordinating machine then computes the aggregated top-k hits.
• To provide the raw data required, the master machine directly requests data from the respective slave machine (focus view).
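The top-k aggregation described above can be sketched in a few lines. The sketch simulates the slave machines as in-process dictionaries of scores and assumes that each entity is indexed on exactly one machine; the function names and sample scores are hypothetical.

```python
import heapq


def local_top_k(scored_entities, k):
    """What one slave machine returns: its k best (identifier, rank) pairs."""
    return heapq.nlargest(k, scored_entities.items(), key=lambda item: item[1])


def aggregate_top_k(partitions, k):
    """What the coordinating machine does: merge the local lists into a global top-k."""
    candidates = []
    for scored_entities in partitions:
        candidates.extend(local_top_k(scored_entities, k))
    return heapq.nlargest(k, candidates, key=lambda item: item[1])


# Hypothetical per-machine scores for one keyword query.
PARTITIONS = [
    {"ex:a": 0.9, "ex:b": 0.2},
    {"ex:c": 0.8, "ex:d": 0.7},
]

TOP = aggregate_top_k(PARTITIONS, k=2)
print(TOP)
```

Requesting k results from every machine is sufficient under the stated assumption, because any entity in the global top-k must also be in the local top-k of the machine that holds it.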
Results view for keyword query “bill Clinton”; focus view for entity “Bill Clinton”
SWSE Search
Watson (http://watson.kmi.open.ac.uk/WatsonWUI/)
A Recommender System for Linked Data: MORE (MORE than Movie Recommendation)
• Systems are needed that recommend items based on user preferences.
• Such systems should allow an easy and friendly exploration of the information/data related to a particular domain of interest.
• New challenges arise with the huge amount of interlinked data coming from the semantic Web.
Semantic Vector Space Model (MORE)
• In the VSM, weights are assigned to index terms in queries and in documents (sets of terms).
• Weights are used to compute the degree of similarity between each document in the collection and the query.
• The whole RDF graph may be represented as a three-dimensional tensor where each two-dimensional slice refers to an ontology property.
• Given a property, each movie is seen as a vector whose components are TF-IDF weights (here, resource frequency-inverse movie frequency).
• For a particular property, the similarity degree between two movies is represented by the correlation between the two vectors.
• To obtain the global correlation between two movies, a weighted sum of the per-property correlations is calculated.
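The per-property similarity and its weighted combination can be sketched as below. This is an illustrative reading of the model, not the MORE implementation: cosine similarity is assumed as the correlation measure, and the movie data, property names, and weights are all hypothetical.

```python
import math

# Hypothetical tensor slices: property -> movie -> resources linked by that property.
DATA = {
    "starring": {"m1": ["actorA", "actorB"], "m2": ["actorB", "actorC"], "m3": ["actorD"]},
    "director": {"m1": ["dirX"], "m2": ["dirX"], "m3": ["dirY"]},
}


def tf_idf_vectors(movie_resources):
    """Build one sparse TF-IDF vector per movie for a single property."""
    n_movies = len(movie_resources)
    movie_freq = {}
    for resources in movie_resources.values():
        for resource in set(resources):
            movie_freq[resource] = movie_freq.get(resource, 0) + 1
    return {
        movie: {
            resource: resources.count(resource) * math.log(n_movies / movie_freq[resource])
            for resource in set(resources)
        }
        for movie, resources in movie_resources.items()
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors (assumed correlation measure)."""
    dot = sum(weight * v.get(resource, 0.0) for resource, weight in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def similarity(data, weights, movie_a, movie_b):
    """Weighted sum of the per-property similarities between two movies."""
    total = 0.0
    for prop, movie_resources in data.items():
        vectors = tf_idf_vectors(movie_resources)
        total += weights.get(prop, 0.0) * cosine(vectors[movie_a], vectors[movie_b])
    return total


WEIGHTS = {"starring": 0.5, "director": 0.5}  # hypothetical importance weights
SIM_12 = similarity(DATA, WEIGHTS, "m1", "m2")
SIM_13 = similarity(DATA, WEIGHTS, "m1", "m3")
print(SIM_12, SIM_13)
```

In this toy data, m1 and m2 share a director and one actor, so they come out more similar than m1 and m3, which share nothing.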
Tensor representation of the RDF graph
Importance weights of the properties
The properties involved in the similarity detection process do not have the same importance. Each property can have a different importance for the user, which can be specified through a weight in MORE.
Sample of RDF graph related to the movie domain
The figure shows a sketch of our RDF graph on movies. It contains 2 movies, 3 actors, 2 directors, 3 categories, 1 genre, and 5 different predicates.
Exploratory Search Applications
• They are designed to satisfy the needs of users with specific aims.
• They support the publishing and integration of data sources for vertical domains.
• The user will be able to select sources based on individual or collective trust.
• Systems will be able to route queries to such sources and to provide easy-to-use interfaces for combining them within search strategies.
Deployment Architecture
• The deployment of exploratory Web applications integrating data sources requires a number of software components and sophisticated interactions between them:
• The processing modules are in charge of invoking services that query the data sources.
• The execution engine is a data- and control-driven query engine specifically designed to handle multidomain queries.
• The control layer is the controller of the architecture; it is designed to handle several system interactions.
• The repository contains the set of components and data storages used by the system.
Exploratory Search Applications Examples
• Night Planner
• Weekend Browser
• Real-Estate Browser
• Job-House Combination Browser
Night Planner
• A night planner is a short-term Web application presenting several geolocalized services, describing restaurants, shows, movies, family events, music concerts, and the like.
• Selected restaurants are ranked by distance from the user and possibly by their score.
Weekend Browser
• A weekend browser is a short-term Web application presenting to users the events which are occurring in one or more selected cities of interest.
• Once the user is considering a particular location, she/he is offered additional services for completing the weekend plan.
Real-Estate Browser
• A real-estate browser is a long-lived, hierarchical application.
• It is centered around real-estate offers.
• A user may select some house offers and evaluate them according to some search dimensions (e.g., distance from work, school).
• The designer may simplify the interaction by combining several services into one query (e.g., walkability and vicinity to markets and parks).
Job-House Combination Browser
• A job-house combination browser is a long-lived, hierarchical application with two hierarchical roots, one centered on work offers and one on house offers.
• The application is designed for applicants to PhD programs, where openings are linked to doctoral schools, then to their professors, then to their research programs, and to on-campus housing.
References
• Roberto De Virgilio, Francesco Guerra, Yannis Velegrakis, Semantic Search over the Web, Springer, 2012.
• https://rdfa.info/
• http://microformats.org/wiki/Main_Page
• https://schema.org/docs/gs.html
• http://wiki.dbpedia.org/