Making the Web Searchable, by Peter Mika (Researcher, Data Architect, Yahoo! Research)
Semantic Search Summer School2009

Sep 12, 2014

Page 1: Semantic Search Summer School2009

Making the Web Searchable

Peter Mika

Researcher, Data Architect

Yahoo! Research

Page 2: Semantic Search Summer School2009

Yahoo! Research (research.yahoo.com)

Page 3: Semantic Search Summer School2009

Yahoo! Research Barcelona

• Established January, 2006

• Led by Ricardo Baeza-Yates

• Research areas

– Web Mining

• content, structure, usage

– Distributed Web retrieval

– Multimedia retrieval

– NLP and Semantics

Page 4: Semantic Search Summer School2009

Yahoo! by numbers (April, 2007)

• There are approximately 500 million users of Yahoo! branded services, meaning we reach 50 percent – or 1 out of every 2 users – online, the largest audience on the Internet (Yahoo! Internal Data).

• Yahoo! is the most visited site online with nearly 4 billion visits and an average of 30 visits per user per month in the U.S. and leads all competitors in audience reach, frequency and engagement (comScore Media Metrix, US, Feb. 2007).

• Yahoo! accounts for the largest share of time Americans spend on the Internet with 12 percent (comScore Media Metrix, US, Feb. 2007) and approximately 8 percent of the world’s online time (comScore WorldMetrix, Feb. 2007).

• Yahoo! is the #1 home page with 85 million average daily visitors on Yahoo! homepages around the world, an increase of nearly 5 million visitors in a month (comScore WorldMetrix, Feb. 2007).

• Yahoo!’s social media properties (Flickr, delicious, Answers, 360, Video, MyBlogLog, Jumpcut and Bix) have 115 million unique visitors worldwide (comScore WorldMetrix, Feb. 2007).

• Yahoo! Answers is the largest collection of human knowledge on the Web with more than 90 million unique users and 250 million answers worldwide (Yahoo! Internal Data).

• There are more than 450 million photos in Flickr in total and 1 million photos are uploaded daily. 80 percent of the photos are public (Yahoo! Internal Data).

• Yahoo! Mail is the #1 Web mail provider in the world with 243 million users (comScore WorldMetrix, Feb. 2007) and nearly 80 million users in the U.S. (comScore Media Metrix, US, Feb. 2007)

• Interoperability between Yahoo! Messenger and Windows Live Messenger has formed the largest IM community approaching 350 million user accounts (Yahoo! Internal Data).

• Yahoo! Messenger is the most popular in time spent with an average of 50 minutes per user, per day (comScore WorldMetrix, Feb. 2007).

• Nearly 1 in 10 Internet users is a member of a Yahoo! Group (Yahoo! Internal Data).

• Yahoo! is one of only 26 companies to be on both the Fortune 500 list and Fortune’s “Best Place to Work” list (2006).

Page 5: Semantic Search Summer School2009

Agenda

• Publishing metadata on the Semantic Web

– A brief history of publishing metadata in HTML

• Semantic Search research and applications

Page 6: Semantic Search Summer School2009

The many faces of the Semantic Web

• Six ways of publishing RDF

1. Linked RDF files (linked data)

2. Metadata inside webpages

3. SPARQL endpoints

4. Feeds

5. XSLT/GRDDL

6. Automated tools

• Non-exclusive, but in practice most publishers choose one

Page 7: Semantic Search Summer School2009

Option 1: Standalone RDF documents

• RDF documents linked to other RDF documents

– Use rdfs:seeAlso to point to a related document

• It says: Go and look at that document if you want to know more

• Advantages:

– No change to the publishing of the HTML documents

– Data can be published by third party

• Tools

– RDB-to-RDF mappers such as D2RQ or Triplify

– Linked Data browsers

• Examples: Most datasets in the Linked Data cloud

[Figure: example RDF graph, shown once per linked document. Nodes: #PeterM, #Bud, #Hun. Edges: #PeterM label “Peter Mika”; #PeterM born #Bud; #Bud label “Budapest”; #Bud population “2,000,000”; #Bud capital-of #Hun.]
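The "go and look at that document" idea behind rdfs:seeAlso can be sketched in a few lines of Python. This is only an illustration: it assumes each RDF document has already been parsed into (subject, predicate, object) triples, and the document URLs, the `DOCS` store and the `crawl` helper are hypothetical. A real Linked Data client would fetch and parse the documents over HTTP.

```python
from collections import deque

SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"

# Hypothetical in-memory "web" of RDF documents: URL -> list of triples.
DOCS = {
    "http://example.org/peter.rdf": [
        ("#PeterM", "label", "Peter Mika"),
        ("#PeterM", "born", "#Bud"),
        ("#PeterM", SEE_ALSO, "http://example.org/budapest.rdf"),
    ],
    "http://example.org/budapest.rdf": [
        ("#Bud", "label", "Budapest"),
        ("#Bud", "population", "2,000,000"),
        ("#Bud", "capital-of", "#Hun"),
    ],
}


def crawl(start_url, docs):
    """Collect triples from start_url and every document reachable via rdfs:seeAlso."""
    seen, queue, triples = set(), deque([start_url]), []
    while queue:
        url = queue.popleft()
        if url in seen or url not in docs:
            continue  # already visited, or document not available
        seen.add(url)
        for s, p, o in docs[url]:
            if p == SEE_ALSO:
                queue.append(o)  # follow the pointer to the related document
            else:
                triples.append((s, p, o))
    return triples


all_triples = crawl("http://example.org/peter.rdf", DOCS)
for triple in all_triples:
    print(triple)
```

Starting from one document, the crawl ends up with the statements of both, which is the point of linking RDF documents to each other.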

Page 8: Semantic Search Summer School2009

Option 1: cntd.

• For discovery, the metadata is often linked from HTML pages

<link rel="meta" type="application/rdf+xml" title="FOAF" href="http://www.cs.vu.nl/~pmika/foaf.rdf" />

• Additional advantages:

– Discovery from the webpage

– It’s clear that the metadata is a machine representation of the human-targeted content of the page

• Examples: FOAF profiles, BestBuy

[Figure: an HTML page with the text “Peter Mika was born in Budapest.” linked to an RDF document holding the example graph (#PeterM label “Peter Mika”, born #Bud; #Bud label “Budapest”, population “2,000,000”, capital-of #Hun).]
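Discovery from the webpage can be sketched with Python's standard HTML parser. The snippet below is a minimal illustration, not how any particular crawler works: the `MetaLinkFinder` class is a hypothetical helper, and the sample page is assembled from the link element and text shown on the slide.

```python
from html.parser import HTMLParser


class MetaLinkFinder(HTMLParser):
    """Collect href values of <link rel="meta" type="application/rdf+xml"> elements."""

    def __init__(self):
        super().__init__()
        self.rdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        rels = (a.get("rel") or "").lower().split()
        if "meta" in rels and a.get("type") == "application/rdf+xml" and a.get("href"):
            self.rdf_links.append(a["href"])


page = """<html><head>
<link rel="meta" type="application/rdf+xml" title="FOAF"
      href="http://www.cs.vu.nl/~pmika/foaf.rdf" />
</head><body>Peter Mika was born in Budapest.</body></html>"""

finder = MetaLinkFinder()
finder.feed(page)
print(finder.rdf_links)
```

A client would then fetch each discovered URL and parse it as RDF/XML.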

Page 9: Semantic Search Summer School2009

Option 2: Metadata inside web pages

• Using microformats, RDFa, MicroData (more later)

• Advantages:

– No content negotiation required

– No separate database export required

– Browser plug-in friendly

– Search engine friendly

– Copy-paste friendly

• Tools:

– XML editors (e.g. Oxygen)

– Triplr

– RDFa Distiller

– RDFa bookmarklet

– Ubiquity RDFa plugin

– Optimus microformat parser

• Examples: many, including SlideShare, YouTube, LinkedIn, Digg, Myspace, Facebook…

[Figure: web pages with the text “Peter Mika was born in Budapest.”, each carrying the example RDF graph inside the page (#PeterM label “Peter Mika”, born #Bud; #Bud label “Budapest”, population “2,000,000”, capital-of #Hun).]

Page 10: Semantic Search Summer School2009

Metadata in HTML

Page 11: Semantic Search Summer School2009

Brief history of the Annotated Web

• 1995: HTML meta tags

• 1996: Simple HTML Ontology Extensions (SHOE)

• 1998: RDF/XML

– RDF/XML in HTML

– RDF linked from HTML

• 2003: Web 2.0

– Tagging

– Microformats

– Metadata in Wikipedia

– Machine tags in Flickr

• 2005: eRDF

• 2008: RDFa

• 2009: Microdata (?)

Page 12: Semantic Search Summer School2009

HTML meta tags

<HTML>
<HEAD profile="http://dublincore.org/documents/dcq-html/">
<META name="DC.author" content="Peter Mika">
<LINK rel="DC.rights copyright" href="http://www.example.org/rights.html" />
<LINK rel="meta" type="application/rdf+xml" title="FOAF" href="http://www.cs.vu.nl/~pmika/foaf.rdf">
</HEAD>
…
</HTML>

Page 13: Semantic Search Summer School2009

SHOE example (Heflin & Hendler, 1996)

<ONTOLOGY "our-ontology" VERSION="1.0">
<ONTOLOGY-EXTENDS "organization-ontology" VERSION="2.1" PREFIX="org" URL="http://www.ont.org/orgont.html">
<ONTDEF CATEGORY="Person" ISA="org.Thing">
<ONTDEF RELATION="lastName" ARGS="Person STRING">
<ONTDEF RELATION="firstName" ARGS="Person STRING">
<ONTDEF RELATION="marriedTo" ARGS="Person Person">
<ONTDEF RELATION="employee" ARGS="org.Organization Person">
</ONTOLOGY>

<HEAD>
<META HTTP-EQUIV="Instance-Key" CONTENT="http://www.cs.umd.edu/~george">
<USE-ONTOLOGY "our-ontology" VERSION="1.0" PREFIX="our" URL="http://ont.org/our-ont.html">
</HEAD>
<BODY>
<CATEGORY "our.Person">
<RELATION "our.marriedTo" TO="http://www.cs.umd.edu/~helena">
<RELATION "our.employee" FROM="http://www.cs.umd.edu">
My name is
<ATTRIBUTE "our.firstName"> George </ATTRIBUTE>
<ATTRIBUTE "our.lastName"> Cook </ATTRIBUTE> and I live at...

Page 14: Semantic Search Summer School2009

SHOE system

Page 15: Semantic Search Summer School2009

SHOE Text-based query interface

Page 16: Semantic Search Summer School2009

SHOE Graphical Query Interface

Page 17: Semantic Search Summer School2009

Example: Creative Commons

Embedding CC license in HTML (now deprecated):

<HTML><HEAD>… </HEAD><BODY>…
<!--
<rdf:RDF xmlns="http://creativecommons.org/ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<Work rdf:about="http://www.yergler.net/averages/">
<dc:title>The Law of Averages</dc:title>
<dc:description>...because eventually i&apos;ll be right...</dc:description>
<license rdf:resource="http://creativecommons.org/licenses/by-nc/1.0/" />
</Work>
<License rdf:about="http://creativecommons.org/licenses/by-nc/1.0/">
<requires rdf:resource="http://web.resource.org/cc/Notice" />
<permits rdf:resource="http://web.resource.org/cc/Reproduction" />
<permits rdf:resource="http://web.resource.org/cc/Distribution" />
<prohibits rdf:resource="http://web.resource.org/cc/CommercialUse" />
</License>
</rdf:RDF>
-->

Page 18: Semantic Search Summer School2009

Example: Creative Commons

• Current: rel attribute (HTML4)

This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/3.0/us/">Creative Commons Attribution 3.0 United States License</a>.

• Use of the “rel” attribute for semantic annotation is the birth of the microformat…

Page 19: Semantic Search Summer School2009

Microformats (μf)

• Community centered around microformats.org

– Specifications and discussions are hosted there

• Agreements on the way to encode certain kinds of metadata in HTML

– Reuse of semantic-bearing HTML elements

– Based on existing standards

– Minimality

• Microformats exist for a limited set of objects

– hCard (persons and organizations)

– hCalendar (events)

– hResume

– hProduct

– hRecipe

• Varying degrees of support and stability

– hCard and rel-tag are widely supported

Page 20: Semantic Search Summer School2009

Example: microformats

<cite class="vcard">

<a class="fn url" rel="friend colleague met"

href="http://meyerweb.com/">Eric Meyer</a>

</cite> wrote a post (<cite>

<a href="http://meyerweb.com/eric/thoughts/2005/12/16/tax-relief/">

Tax Relief</a></cite>) about an unintentionally humorous letter

he received from the <span class="vcard">

<a class="fn org url" href="http://irs.gov/">

Internal Revenue Service</a> </span>.

<div class="vcard">

<a class="email fn" href="mailto:[email protected]">Joe Friday</a>

<div class="tel">+1-919-555-7878</div>

<div class="title">Area Administrator, Assistant</div>

</div>
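To show how little machinery a consumer needs for this markup, here is a rough Python sketch that pulls the formatted name (class "fn") out of every hCard (class "vcard"). It is a simplification under stated assumptions, not a conforming microformats parser: `HCardNames` is a hypothetical helper, it ignores every hCard property other than `fn`, and it assumes reasonably well-nested HTML.

```python
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "input", "meta", "link"}  # elements without end tags


class HCardNames(HTMLParser):
    """Collect the text of class="fn" elements found inside class="vcard" elements."""

    def __init__(self):
        super().__init__()
        self.stack = []       # one set of class names per open element
        self.names = []
        self._buf = None      # text buffer while inside an "fn" element
        self._fn_depth = 0    # stack depth at which the "fn" element was opened

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = set((dict(attrs).get("class") or "").split())
        in_vcard = "vcard" in classes or any("vcard" in c for c in self.stack)
        self.stack.append(classes)
        if self._buf is None and in_vcard and "fn" in classes:
            self._buf, self._fn_depth = [], len(self.stack)

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in VOID or not self.stack:
            return
        if self._buf is not None and len(self.stack) == self._fn_depth:
            # Closing the "fn" element: normalize whitespace and store the name.
            self.names.append(" ".join("".join(self._buf).split()))
            self._buf = None
        self.stack.pop()


html = """<cite class="vcard">
<a class="fn url" rel="friend colleague met" href="http://meyerweb.com/">Eric Meyer</a>
</cite> wrote a post about a letter from the <span class="vcard">
<a class="fn org url" href="http://irs.gov/">
Internal Revenue Service</a> </span>."""

parser = HCardNames()
parser.feed(html)
print(parser.names)
```

Note that the extraction logic is tied to the hCard class names; a different microformat would need its own rules.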

Page 21: Semantic Search Summer School2009

Microformats: limitations

• No shared syntax

– Each microformat has a separate syntax tailored to the vocabulary

• No formal schemas

– Limited reuse, extensibility of schemas

– Unclear which combinations are allowed

• No datatypes

• No namespaces, unique identifiers (URIs)

– no interlinking

– mapping between instances is required

• Relationship to page context is often unclear

Page 22: Semantic Search Summer School2009

RDFa

• RDF-in-attributes

– World Wide Web Consortium (W3C) recommendation for encoding RDF triples in HTML

– Full RDF support

• Recommendation specifies the algorithm for parsing the triples out of HTML

– Requires XHTML in principle

• In practice, no one cares

Page 23: Semantic Search Summer School2009

RDFa in a slide

<p xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:foaf="http://xmlns.com/foaf/0.1/" typeof="foaf:Person"

about="http://example.org/staff/jo">

<span property="rdfs:label foaf:name">

Jo Smith</span>.

<span property="foaf:title">Web hacker

</span> at

<a rel="vcard:org" property="foaf:name"

href="http://example.org"> Acme Corp</a>.

You can contact me <a rel="foaf:mbox" href="mailto:[email protected]">

via email </a>.

</p> ...

Assign the prefixes rdfs and foaf to the RDFS and FOAF namespaces

(as in XML, RDF/XML etc.)

Create a new resource of type foaf:Person

Assign a value to a property

Give it a URI

Link to another resource and assign a name to it
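The prefix mechanism in the annotations above amounts to string concatenation. The Python sketch below expands the prefixed names from the example and writes out, by hand, some of the triples an RDFa processor would be expected to produce for the first part of the snippet. It is not an RDFa parser: `expand_curie` is a hypothetical helper, the triples are derived manually, and literal whitespace is trimmed for readability.

```python
def expand_curie(curie, prefixes):
    """Expand a prefix:local name into a full URI using xmlns-style prefix mappings."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise KeyError(f"undeclared prefix in {curie!r}")
    return prefixes[prefix] + local


# Prefix declarations taken from the xmlns attributes in the example.
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

subject = "http://example.org/staff/jo"  # from the about attribute
triples = [
    (subject, RDF_TYPE, expand_curie("foaf:Person", PREFIXES)),    # typeof
    (subject, expand_curie("rdfs:label", PREFIXES), "Jo Smith"),   # property (first name)
    (subject, expand_curie("foaf:name", PREFIXES), "Jo Smith"),    # property (second name)
    (subject, expand_curie("foaf:title", PREFIXES), "Web hacker"),
]

for triple in triples:
    print(triple)
```

An undeclared prefix raises an error here, which is one way to surface a missing xmlns declaration early.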

Page 24: Semantic Search Summer School2009

Microformats vs. RDFa

• Choose microformats when you find a microformat that fits your needs (and is supported by your favorite tool or search engine)

– Microformats are the first option because they are simple

– We support all major microformats, see the documentation

– It’s a common misconception that RDFa requires XHTML: it doesn’t

• If you find none that perfectly fits your needs, then you need RDFa

– Microformats have a fixed schema: you cannot add your own attributes

• Example: a social networking site with user profiles

– VCard is a good candidate, but, for example, it doesn’t have a way to express the user’s social connections

– You either live without this, or go with RDFa

Page 25: Semantic Search Summer School2009

Keep an eye on HTML5

• Currently under standardization at the W3C

– Last Call in Fall 2009

• Introduces Microdata

– Similar to microformats

• Some predefined vocabularies with central registration

– Some of the flexibility of RDFa

– Introduce new terms using reverse domain names or full URIs

• Semantic HTML elements such as <time>, <video>, <article>…

Page 26: Semantic Search Summer School2009

Microdata example

<div item="http://www.yahoo.com/resource/person">

<p>My name is <span itemprop="name">Neil</span>.</p>

<p>My band is called

<span itemprop="band">Four Parts Water</span>.

I was born on

<time itemprop="birthday" datetime="2009-05-10">May 10th 2009</time>.

<img itemprop="image" src="me.png" alt="me">

</p>

</div>

Page 27: Semantic Search Summer School2009

The process of annotating with RDFa

1. Invest in familiarizing yourself with the RDFa syntax by reading the RDFa Primer

– It is also highly recommended that you read the RDF Primer. RDF is the data model used by RDFa.

2. Choose a vocabulary from the SearchMonkey documentation that fits your needs

– A vocabulary describes a set of types and attributes within a given domain

– If you don’t find a good candidate, extend an existing one or create a new one

3. Annotate your page.

– Before you start, you might want to validate your page for (X)HTML conformance using the W3C’s (X)HTML Validator to reduce the chance of errors. Choose Document Type XHTML + RDFa.

– No specific tool support. If you have an HTML or XML editor that supports DTDs, you will have syntax checking and highlighting.

– Use the RDFa Distiller to check which data can be extracted from your page.

– If you fancy, use the RDF Validator to visualize the resulting RDF graph.

4. Put the annotated page online. The data will be extracted the next time your page is crawled

– No need to explicitly submit anything

– No notification when your site is crawled

• See http://rdfa.info/rdfa-implementations for new tools and APIs

Page 28: Semantic Search Summer School2009

Choosing a vocabulary

• Look at SearchMonkey objects

– Video, Games, Presentations, Events, News, Businesses, Products, Discussion

• Search the Web or ask for advice on mailing lists

[email protected]

[email protected]

• Wikis

– semanticweb.org

– vocamp.org

• Beware of people who claim to have the vocabulary of everything

– Preferably you want something small and targeted

• There is never a 100% fit: you will need to introduce your own vocabulary terms (classes and properties)

– Do not introduce new classes/properties in existing namespaces

– Example: the namespace http://xmlns.com/foaf/0.1/ is used by the FOAF project. Try not to introduce a new term without contacting the owner, i.e. the membership of the FOAF mailing list.

Page 29: Semantic Search Summer School2009

Advanced topic: creating a vocabulary

1. Get advice on methodology

– vocamp.org and semanticweb.org

2. Choose a namespace and a prefix

– Give sensible names, e.g. name it after your site, but don’t call it searchmonkey

– Namespace ends either with a slash or a hash

3. Create an RDF or OWL document describing your classes and properties

• Use an ontology editor such as Protégé 4.0

• Follow naming conventions

4. Publish your vocabulary

– Make sure the URIs of your properties and classes are resolvable

– E.g. myvocab:digicam should resolve to a document containing the definition of myvocab:digicam

• Convince others to adopt your vocabulary

– If you are in fishing, convince other fishing businesses

Page 30: Semantic Search Summer School2009

Exercise (slideshare.net/pmika)

• Explore data on the Web

– Microformats

• Search for pages on Yahoo using searchmonkey:com.yahoo.page.uf.hcard

• Try Operator Firefox Plug-in

• Try Optimus

– RDFa

• Create yourself or search for pages on Yahoo using searchmonkey:com.yahoo.page.rdf.rdfa

• Try RDFa bookmarklet to highlight RDFa

• Try RDFa Distiller to extract RDF from HTML

• Try RDF Validator to visualize your RDF data

• Mark up your webpage using RDFa

– Use RDFa Distiller to test

• Try automated annotation using Zemanta or OpenCalais

Page 31: Semantic Search Summer School2009

Semantic Search

Page 32: Semantic Search Summer School2009

Why semantic search? (P. Raghavan)

• Old battles are won

– Main driver of user perception of search quality used to be: precision of navigational queries

– The technical prowess was about crawling and spam

– The hard core was indexing and retrieval

• Currently, the biggest bottlenecks in IR are not computational, but in modeling user cognition

– If only we could find a computationally expensive way to solve the problem

• then we should be able to make it go faster

Page 33: Semantic Search Summer School2009

Searches that show a ‘semantic gap’

• Ambiguous searches

– Paris Hilton

• Multimedia search

– Images of Paris Hilton

• Imprecise or overly precise searches

– Publications by Jim Hendler

– Find images of strong and adventurous people (Lenat)

• Searches for descriptions

– Search for yourself without using your name

– Product search (ads!)

• Searches that require aggregation

– Size of the Eiffel Tower (Lenat)

– Public opinion on Britney Spears

– World temperature by 2020

Queries that require a deeper understanding of the query, the content and/or the world at large

Page 34: Semantic Search Summer School2009

Not just search

Page 35: Semantic Search Summer School2009

Semantic Search

• Def. matching the user’s query with the Web’s content at a conceptual level, often with the help of world knowledge

– R. Guha, R. McCool: Semantic Search, WWW2003

• Related disciplines

– Semantic Web, IR, Databases, NLP, IE

• As a field

– ISWC/ESWC/ASWC, WWW, SIGIR

– Exploring Semantic Annotations in Information Retrieval (ECIR08, WSDM09)

– Semantic Search Workshop (ESWC08, WWW09)

– Future of Web Search: Semantic Search (FoWS09)

Page 36: Semantic Search Summer School2009

Semantics at every step of the IR process

[Figure: the IR process. Labels: search interface, query processing (q=“bla” * 3), ranking (θ(q,d)), document processing, the IR engine, the Web, the Semantic Web.]

Page 37: Semantic Search Summer School2009

Document processing

• Goal: provide a higher level representation of text in some conceptual space

• Diverse methods

– Document classification

– Information Extraction

• Named-entity recognition, word-sense disambiguation, semantic role labeling, wrapper induction, form filling, etc.

• Previously: open source toolkits such as GATE, OpenNLP

– Require expert user to operate and train

• NEW: Online tools

– For non-expert users

– Embed metadata inside documents and/or link entities in documents to the cloud

– Often using existing metadata as background knowledge

Page 38: Semantic Search Summer School2009

Examples: online NLP

• OpenCalais (Thomson Reuters)

– Online interface and API for named-entity recognition, (partial) disambiguation and relationship extraction

– HTML annotator (microformats), Firefox plug-in, WordPress plug-in, Yahoo Pipes service, etc.

– Works best on the news domain

• Zemanta

– Zemanta is a ‘personal writing assistant’ that recognizes and disambiguates named entities and suggests related content

– A Firefox-plugin that extends the functionality of popular blogging and online email platforms, online interface and API

– Broad coverage but partial recognition

Page 39: Semantic Search Summer School2009

Examples: information extraction from templated HTML

• Intel MashMaker

– Create mashups based on information extracted from the page you are browsing (Firefox plugin)

– Wrapper induction trained using manual annotations

– Export the XSLT to be used in Yahoo’s SearchMonkey

• Dapper

– Semantic Advertising company

– Wrapper induction trained using manual annotations (online tool)

– API to the dynamic extraction

• Glue

– Social interaction around objects on the Web (Firefox plugin)

– API provides access to the information extracted from popular sites

Page 40: Semantic Search Summer School2009

Query Interpretation

• Provide a higher level representation of queries in some conceptual space

– Ideally, the same space in which documents are represented

• Queries may be keywords, questions, semi-structured, structured etc.

• Interpretation treated as a separate step from ranking

– Required for federation, i.e. determine where to send the query

– Choosing between interpretations could be left to the user

– Due to performance requirements

• You cannot execute the query to determine what it means and then query again

• General world knowledge (e.g. DBpedia), domain ontologies or the schema of the actual data can be used
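As a toy illustration of interpretation as a separate step, the Python sketch below maps a keyword query to an entity, its type, and the remaining words, using a small dictionary as the background knowledge. Everything here is an assumption made for illustration: the `ENTITY_TYPES` table stands in for a resource such as DBpedia, and longest-match lookup is just one simple strategy.

```python
# Hypothetical background knowledge: entity label -> type.
ENTITY_TYPES = {
    "apple ipod nano": "product",
    "sony plasma tv": "product",
    "jerry yang": "person",
    "tim berners lee": "person",
}


def interpret(query, entity_types):
    """Return (entity, type, remaining words) for the longest known entity in the query."""
    words = query.lower().split()
    for length in range(len(words), 0, -1):            # try the longest spans first
        for start in range(len(words) - length + 1):
            span = " ".join(words[start:start + length])
            if span in entity_types:
                rest = words[:start] + words[start + length:]
                return span, entity_types[span], " ".join(rest)
    return None, None, " ".join(words)                  # no known entity found


print(interpret("apple ipod nano review", ENTITY_TYPES))
print(interpret("biography tim berners lee", ENTITY_TYPES))
```

The lookup never touches the document index, so it can run before the query is executed or federated.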

Page 41: Semantic Search Summer School2009

Example: Semantic Search Assist

• Observation: the same type of objects often have the same query context

– Users asking for the same aspect of the type

• Could we make query suggestions based on the type of the entity?

– Improvement for infrequent queries

Example queries:

apple ipod nano review

sony plasma tv review

jerry yang biography

biography tim berners lee

tim berners lee blog

peter mika yahoo

britney spears shaves her head

Page 42: Semantic Search Summer School2009

Models

• Desirable properties:

– P1: Fix is frequent within type

– P2: Fix has frequencies well-distributed across entities

– P3: Fix is infrequent outside of the type

• Models:

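[Figure: the query “apple ipod nano review” split into an entity (“apple ipod nano”, type: product) and a fix (“review”). The model formulas are not preserved in this transcript.]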
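Since the slide's model formulas are not preserved here, the Python sketch below is only one plausible way to put numbers on P1 to P3, not the models from the talk. It assumes a query log that has already been split into (entity, type, fix) records; the `LOG` data and the three ratios in `fix_scores` are invented for illustration.

```python
# Hypothetical, pre-interpreted query log: (entity, type, fix).
LOG = [
    ("apple ipod nano", "product", "review"),
    ("sony plasma tv", "product", "review"),
    ("sony plasma tv", "product", "price"),
    ("jerry yang", "person", "biography"),
    ("tim berners lee", "person", "biography"),
    ("tim berners lee", "person", "blog"),
]


def fix_scores(log, fix, etype):
    """Return three ratios in [0, 1], one per desirable property (higher is better)."""
    in_type = [(e, f) for e, t, f in log if t == etype]
    with_fix = [e for e, f in in_type if f == fix]
    total_fix = sum(1 for _, _, f in log if f == fix)
    entities = {e for e, _ in in_type}
    # P1: share of the type's queries that use this fix.
    p1 = len(with_fix) / len(in_type) if in_type else 0.0
    # P2: share of the type's entities that occur with this fix at least once.
    p2 = len(set(with_fix)) / len(entities) if entities else 0.0
    # P3: share of all occurrences of the fix that fall inside the type.
    p3 = len(with_fix) / total_fix if total_fix else 0.0
    return p1, p2, p3


print(fix_scores(LOG, "review", "product"))
print(fix_scores(LOG, "biography", "product"))
```

A fix that scores well on all three for a type is a candidate suggestion for other entities of that type, including infrequent ones.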

Page 43: Semantic Search Summer School2009

Demo

Page 44: Semantic Search Summer School2009

Ranking

• Goal: match the query representation to content representation

– Again, possibly with the use of background knowledge

• Methods depend on

– the actual representation

• Bag of words, NL parse trees, SPARQL…

– the qualities of the queries and content

– the use case

• e.g. time available for ranking may vary from milliseconds to days
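For the simplest representation listed above, bag of words, matching the query to the content can be sketched as cosine similarity between term-count vectors. The Python below is a generic textbook baseline offered as an illustration, not the ranking function of any system mentioned in the talk; the sample documents are made up.

```python
import math
from collections import Counter


def bag(text):
    """Bag-of-words representation: lowercase term -> count."""
    return Counter(text.lower().split())


def cosine(q, d):
    """Cosine similarity between two bags of words."""
    dot = sum(q[w] * d[w] for w in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Order documents by decreasing similarity to the query."""
    qb = bag(query)
    return sorted(docs, key=lambda doc: cosine(qb, bag(doc)), reverse=True)


DOCS = [
    "Peter Mika was born in Budapest",
    "Budapest is the capital of Hungary",
    "Yahoo! Research Barcelona",
]
ranked = rank("capital of Hungary", DOCS)
print(ranked[0])
```

With a richer representation (parse trees, SPARQL over RDF) only `bag` and `cosine` would change; the rank-by-score loop stays the same.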

Page 45: Semantic Search Summer School2009

Search interface

• Goal is to facilitate the interaction between the user and the system

– helping the user to formulate queries

– present the results in an intelligent manner

• Semantic-based representations allow novel interfaces

– Adaptive presentation

• presentation adapts to the kind of query and results presented

– Aggregated search

• Grouping similar items, summarizing results in various ways

• Possibilities for filtering, possibly across different dimensions

– Task completion

• Help the user to fulfill the task by placing the query in a task context

Page 46: Semantic Search Summer School2009

Putting it all together: Semantic Search Engines

• Natural Language search engines

– Hakia

– Powerset (now built into Bing)

– TrueKnowledge

• Structured data search engines

– Searching open web data

• Sindice

• Sigma

– Searching closed world data

• Wolfram Alpha (closed structured data + computation)

• Web search engines

– Yahoo’s SearchMonkey

• BOSS and YQL

– Google’s Rich Snippets

Page 47: Semantic Search Summer School2009

SearchMonkey

• An open platform for using structured data to build more useful and relevant search results

• Creating an ecosystem of publishers, developers and end-users

– Helping publishers to implement semantic annotation and motivating them by allowing them to customize their search results

– Providing tools and APIs for developers to create compelling applications

– Improving search result presentation for our end-users

• Semantic Web technology

– Support for a number of microformats, RDFa

– RDF based data representation

– Industry standard vocabularies (or use any of your own)

Page 48: Semantic Search Summer School2009

[Figure: an Enhanced Result, with callouts for the image, deep links, and name/value pairs or abstract.]

Page 49: Semantic Search Summer School2009

A difficult problem!

• What if one would try to do this automatically?

– Document summarization

– Page structure detection

– Information Extraction

– Image recognition

– Link classification and ranking

• But one of those cases where a little semantics can go a long way…

Page 50: Semantic Search Summer School2009

[Figure: SearchMonkey overview. Data sources: Acme.com’s web pages (RDF/microformat markup, via the index), Acme.com’s database (DataRSS feed), web services, and page extraction.]

1. Site owners/publishers share structured data with Yahoo!.

2. Site owners & third-party developers build SearchMonkey apps.

3. Consumers customize their search experience with Enhanced Results or Infobars.

Page 51: Semantic Search Summer School2009

Example apps

• LinkedIn

– hCard plus feed data

• Creative Commons by Ben Adida

– CC in RDFa

Page 52: Semantic Search Summer School2009

Example apps. II.

• Other me by Dan Brickley

– Google Social Graph API wrapped using a Web Service

Page 53: Semantic Search Summer School2009

Google’s Rich Snippets

• Shares a subset of the features of SearchMonkey

– Encourages publishers to embed certain microformats and RDFa into webpages

• Currently reviews, people, products, business & organizations

– These are used to generate richer search results

• SearchMonkey is customizable

– Developers can develop applications themselves

• SearchMonkey is open

– Wide support for standard vocabularies

– API access

Page 54: Semantic Search Summer School2009

BOSS: Build your Own Search Service

• Ability to re-order results and blend in additional content

• No restrictions on presentation

• No branding or attribution

• Access to multiple verticals (web search, image, news)

• 40+ supported language and region pairs

• Pricing (BOSS)

– Pay-by-usage

– 10,000 queries a day still free

– Serve any ads you want

• For more info, http://developer.yahoo.com/search/boss/

Page 55: Semantic Search Summer School2009

BOSS API to structured data

• Simple HTTP GET calls, no authentication

– You need an Application ID: register at developer.yahoo.com/search/boss/

• http://boss.yahooapis.com/ysearch/web/v1/{query}?appid={appid}&format=xml&view=searchmonkey_feed

• Restrict your query using special words

– searchmonkey:com.yahoo.page.uf.{format}

• {format} is one of hcard, hcalendar, tag, adr, hresume etc.

– searchmonkey:com.yahoo.page.rdf.rdfa
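Putting the URL template and the special words together, a request URL can be assembled as in the Python sketch below. It only builds the string and makes no network call; the endpoint and parameters are copied from the slide, while `boss_url` and the `YOUR_APP_ID` placeholder are hypothetical.

```python
from urllib.parse import quote, urlencode

BOSS_ENDPOINT = "http://boss.yahooapis.com/ysearch/web/v1/"  # endpoint as given on the slide


def boss_url(keywords, appid, uf_format=None, rdfa=False):
    """Build a BOSS web-search URL restricted to pages with structured data."""
    terms = [keywords]
    if uf_format:  # e.g. hcard, hcalendar, tag, adr, hresume
        terms.append(f"searchmonkey:com.yahoo.page.uf.{uf_format}")
    if rdfa:
        terms.append("searchmonkey:com.yahoo.page.rdf.rdfa")
    query = quote(" ".join(terms), safe="")  # the query is part of the URL path
    params = urlencode({"appid": appid, "format": "xml", "view": "searchmonkey_feed"})
    return f"{BOSS_ENDPOINT}{query}?{params}"


url = boss_url("semantic web", "YOUR_APP_ID", uf_format="hresume")
print(url)
```

Because the query sits in the path rather than the query string, it is percent-encoded with `quote` instead of being passed through `urlencode`.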

Page 56: Semantic Search Summer School2009

Demo: resume search

• Search pages with resume data and given keywords

{keyword} searchmonkey:com.yahoo.page.uf.hresume

• Parse the results as DataRSS (XML)

• Extract information and display using YUI

Page 57: Semantic Search Summer School2009

Demo

Page 58: Semantic Search Summer School2009

Yahoo Query Language (YQL)

• Query web APIs as virtual tables

– Mash-up data by joining tables

– Add an API by adding a table definition

• Example: select my friends and sort by nickname

Page 59: Semantic Search Summer School2009

PHP example: select the last 100 photos from Flickr with the word Austin

<?php

$url = "http://query.yahooapis.com/v1/public/yql?q=";

$q = "select * from flickr.photos.search(100) where text='Austin'";

$fmt = "xml";

$x = simplexml_load_file($url.urlencode($q)."&format=$fmt");

foreach($x->attributes('http://www.yahooapis.com/v1/base.rng') as $k=>$v) {

$$k=(string)$v;

}

echo <<<EOB

$count photos fetched from

{$x->diagnostics->url} in

{$x->diagnostics->url['execution-time']} seconds<br>

EOB;

$flickr = "http://static.flickr.com/";

foreach($x->results->photo as $p) {

echo "<img src=\"$flickr{$p['server']}/{$p['id']}_{$p['secret']}_s.jpg\"/>\n";

}

?>

Page 60: Semantic Search Summer School2009

YQL example (source)

Page 61: Semantic Search Summer School2009

That’s all there is to it!

<?php

$root = 'http://query.yahooapis.com/v1/public/yql?q=';

$city = 'Barcelona';

$loc = 'Barcelona';

$yql = 'select * from html where url = \'http://en.wikipedia.org/wiki/'.$city.'\' and xpath="//div[@id=\'bodyContent\']/p" limit 3';

$url = $root . urlencode($yql) . '&format=xml';

$info = getstuff($url);

$info = preg_replace("/.*<results>|<\/results>.*/",'',$info);

$info = preg_replace("/<\?xml version=\"1\.0\"".

" encoding=\"UTF-8\"\?>/",'',$info);

$info = preg_replace("//",'',$info);

$info = preg_replace("/\"\/wiki/",'"http://en.wikipedia.org/wiki',$info);

$yql = 'select * from upcoming.events.bestinplace(5) where woeid in '.

'(select woeid from geo.places where text="'.$loc.'")'.

' | unique(field="description")';

$url = $root . urlencode($yql) . '&format=json';

$events = getstuff($url);

$events = json_decode($events);

foreach($events->query->results->event as $e){

$evHTML.='<li><h3><a href="'.$e->ticket_url.'">'.$e->name.'</a></h3><p>'.

substr($e->description,0,100).'&hellip;</p></li>';

}

$yql = 'select * from flickr.photos.info where photo_id in '.

'(select id from flickr.photos.search where woe_id in '.

'(select woeid from geo.places where text="'.$loc.'")) limit 16';

$url = $root . urlencode($yql) . '&format=json';

$photos = getstuff($url);

$photos = json_decode($photos);

foreach($photos->query->results->photo as $s){

$src = "http://farm{$s->farm}.static.flickr.com/{$s->server}/".

"{$s->id}_{$s->secret}_s.jpg";

$phHTML.='<li><a href="'.$s->urls->url->content.'"><img alt="'.

$s->title.'" src="'.$src.'"></a></li>';

}

$yql='select description from rss where '.

' url="http://weather.yahooapis.com/forecastrss?p=SPXX0015&u=c"';

$url = $root . urlencode($yql) . '&format=json';

$weather = getstuff($url);

$weather = json_decode($weather);

$weHTML = $weather->query->results->item->description;

function getstuff($url){

$curl_handle = curl_init();

curl_setopt($curl_handle, CURLOPT_URL, $url);

curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);

curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);

$buffer = curl_exec($curl_handle);

curl_close($curl_handle);

if (empty($buffer)){

return 'Error retrieving data, please try later.';

} else {

return $buffer;

}

}?>

Page 62: Semantic Search Summer School2009

Exercise

• Build a SearchMonkey application

• Try a few YQL queries

• Find data on the Web using search or by browsing Linked Data

• Build your own search engine using BOSS

Page 63: Semantic Search Summer School2009

Challenges

• Future work in Semantic Web

– (Semi-)automated ways of metadata creation

• How do we go from 5% to 95%?

– Data quality

• Can we trust statements people make about each other’s data?

– Reasoning

• To what extent is reasoning useful?

• For example, how much would entity resolution or taxonomic reasoning help?

– Scale

• How do we exploit cluster computing techniques?

• What is between databases and IR engines?

– Fostering social agreements

• How do we get people to reuse vocabularies?

Page 64: Semantic Search Summer School2009

Challenges

• Future work in IR

– Query interpretation

– Ranking with metadata

– Evaluation of semantic search

– Personalization

– Semantic ads

• Constraints

– Users still want to see a document

• Keyword-based search cannot suffer

• Whole page relevance, monetization can only increase

– Established expectations

• Query entry

• Result presentation

Page 65: Semantic Search Summer School2009

Contact

• Peter Mika

[email protected]

– Come to Barcelona and stop by

– Ask about our internship program

• SearchMonkey

– developer.yahoo.com/searchmonkey/

– mailing lists

• [email protected]

• [email protected]

– forums

• http://suggestions.yahoo.com/searchmonkey

Page 66: Semantic Search Summer School2009

the monkey is out!