Semantic Search Summer School 2009

Making the Web Searchable

Peter Mika

Researcher, Data Architect

Yahoo! Research

- 2 -

Yahoo! Research (research.yahoo.com)

- 3 -

Yahoo! Research Barcelona

• Established January, 2006

• Led by Ricardo Baeza-Yates

• Research areas

– Web Mining

• content, structure, usage

– Distributed Web retrieval

– Multimedia retrieval

– NLP and Semantics

- 4 -

Yahoo! by numbers (April, 2007)

• There are approximately 500 million users of Yahoo! branded services, meaning we reach 50 percent – or 1 out of every 2 users – online, the largest audience on the Internet (Yahoo! Internal Data).

• Yahoo! is the most visited site online with nearly 4 billion visits and an average of 30 visits per user per month in the U.S. and leads all competitors in audience reach, frequency and engagement (comScore Media Metrix, US, Feb. 2007).

• Yahoo! accounts for the largest share of time Americans spend on the Internet with 12 percent (comScore Media Metrix, US, Feb. 2007) and approximately 8 percent of the world’s online time (comScore WorldMetrix, Feb. 2007).

• Yahoo! is the #1 home page with 85 million average daily visitors on Yahoo! homepages around the world, an increase of nearly 5 million visitors in a month (comScore WorldMetrix, Feb. 2007).

• Yahoo!’s social media properties (Flickr, delicious, Answers, 360, Video, MyBlogLog, Jumpcut and Bix) have 115 million unique visitors worldwide (comScore WorldMetrix, Feb. 2007).

• Yahoo! Answers is the largest collection of human knowledge on the Web with more than 90 million unique users and 250 million answers worldwide (Yahoo! Internal Data).

• There are more than 450 million photos in Flickr in total and 1 million photos are uploaded daily. 80 percent of the photos are public (Yahoo! Internal Data).

• Yahoo! Mail is the #1 Web mail provider in the world with 243 million users (comScore WorldMetrix, Feb. 2007) and nearly 80 million users in the U.S. (comScore Media Metrix, US, Feb. 2007)

• Interoperability between Yahoo! Messenger and Windows Live Messenger has formed the largest IM community approaching 350 million user accounts (Yahoo! Internal Data).

• Yahoo! Messenger is the most popular in time spent with an average of 50 minutes per user, per day (comScore WorldMetrix, Feb. 2007).

• Nearly 1 in 10 Internet users is a member of a Yahoo! Group (Yahoo! Internal Data).

• Yahoo! is one of only 26 companies to be on both the Fortune 500 list and Fortune’s “Best Place to Work” list (2006).

- 5 -

Agenda

• Publishing metadata on the Semantic Web

– A brief history of publishing metadata in HTML

• Semantic Search research and applications

- 6 -

The many faces of the Semantic Web

• Six ways of publishing RDF

1. Linked RDF files (linked data)

2. Metadata inside webpages

3. SPARQL endpoints

4. Feeds

5. XSLT/GRDDL

6. Automated tools

• Non-exclusive, but in practice most publishers choose one

- 7 -

Option 1: Standalone RDF documents

• RDF documents linked to other RDF documents

– Use rdfs:seeAlso to point to a related document

• It says: Go and look at that document if you want to know more

• Advantages:

– No change to the publishing of the HTML documents

– Data can be published by third party

• Tools

– RDB-to-RDF mappers such as D2RQ or Triplify

– Linked Data browsers

• Examples: Most datasets in the Linked Data cloud
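The idea of following rdfs:seeAlso from one RDF document to the next can be sketched in a few lines of Python. This is only an illustration: the document URLs under example.org are hypothetical, the "web" is an in-memory dictionary rather than real HTTP, and the triples are one plausible reading of the Peter Mika / Budapest graph shown on this slide.

```python
from collections import deque

SEE_ALSO = "rdfs:seeAlso"

# Hypothetical in-memory "web": document URL -> list of (subject, predicate, object).
# The URLs and the split of triples across documents are assumptions for illustration.
DOCUMENTS = {
    "http://example.org/peter.rdf": [
        ("#PeterM", "label", "Peter Mika"),
        ("#PeterM", "born", "#Bud"),
        ("#PeterM", SEE_ALSO, "http://example.org/budapest.rdf"),
    ],
    "http://example.org/budapest.rdf": [
        ("#Bud", "label", "Budapest"),
        ("#Bud", "capital-of", "#Hun"),
        ("#Bud", "population", "2,000,000"),
    ],
}


def crawl(start_url, documents):
    """Collect triples from start_url and every document reachable via rdfs:seeAlso."""
    seen = set()
    queue = deque([start_url])
    triples = []
    while queue:
        url = queue.popleft()
        if url in seen or url not in documents:
            continue  # already visited, or the link cannot be resolved
        seen.add(url)
        for subject, predicate, obj in documents[url]:
            if predicate == SEE_ALSO:
                queue.append(obj)  # "go and look at that document"
            else:
                triples.append((subject, predicate, obj))
    return triples


triples = crawl("http://example.org/peter.rdf", DOCUMENTS)
for triple in triples:
    print(triple)
```

A real Linked Data client would dereference the URLs over HTTP and parse RDF/XML or another serialization; the breadth-first traversal with a visited set stays the same.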

[Figure: the same RDF graph shown in three linked RDF documents. Nodes: #PeterM, #Bud, #Hun. Edge labels: born, label, capital-of, population. Literals: “Peter Mika”, “Budapest”, “2,000,000”.]

- 8 -

Option 1: cntd.

• For discovery, the metadata is often linked from HTML pages

<link rel="meta" type="application/rdf+xml" title="FOAF" href="http://www.cs.vu.nl/~pmika/foaf.rdf" />

• Additional advantages:

– Discovery from the webpage

– It’s clear that the metadata is a machine representation of the human-targeted content of the page

• Examples: FOAF profiles, BestBuy
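A consumer can discover the linked metadata by scanning the page for such link elements. The sketch below uses Python's standard html.parser; the class and function names (RdfLinkFinder, find_rdf_links) are made up for this example, and the matching rule (rel contains "meta", type is application/rdf+xml) is an assumption based on the link element shown above.

```python
from html.parser import HTMLParser


class RdfLinkFinder(HTMLParser):
    """Collects the href of every <link rel="meta" type="application/rdf+xml">."""

    def __init__(self):
        super().__init__()
        self.rdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "meta" in rels and attr.get("type") == "application/rdf+xml" and attr.get("href"):
            self.rdf_links.append(attr["href"])


def find_rdf_links(html):
    finder = RdfLinkFinder()
    finder.feed(html)
    return finder.rdf_links


PAGE = (
    '<html><head>'
    '<link rel="meta" type="application/rdf+xml" title="FOAF" '
    'href="http://www.cs.vu.nl/~pmika/foaf.rdf" />'
    '</head><body>Peter Mika was born in Budapest.</body></html>'
)

links = find_rdf_links(PAGE)
print(links)
```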

[Figure: an HTML page with the text “Peter Mika was born in Budapest.” linked to an RDF document. Graph nodes: #PeterM, #Bud, #Hun. Edge labels: born, label, capital-of, population. Literals: “Peter Mika”, “Budapest”, “2,000,000”.]

- 9 -

Option 2: Metadata inside web pages

• Using microformats, RDFa, MicroData (more later)

• Advantages:

– No content negotiation required

– No separate database export required

– Browser plug-in friendly

– Search engine friendly

– Copy-paste friendly

• Tools:

– XML editors (e.g. Oxygen)

– Triplr

– RDFa Distiller

– RDFa bookmarklet

– Ubiquity RDFa plugin

– Optimus microformat parser

• Examples: many, including SlideShare, YouTube, LinkedIn, Digg, Myspace, Facebook…

[Figure: web pages with the text “Peter Mika was born in Budapest.” carrying the RDF graph inside the page. Graph nodes: #PeterM, #Bud, #Hun. Edge labels: born, label, capital-of, population. Literals: “Peter Mika”, “Budapest”, “2,000,000”.]

Metadata in HTML

- 16 -

Brief history of the Annotated Web

• 1995: HTML meta tags

• 1996: Simple HTML Ontology Extensions (SHOE)

• 1998: RDF/XML

– RDF/XML in HTML

– RDF linked from HTML

• 2003: Web 2.0

– Tagging

– Microformats

– Metadata in Wikipedia

– Machine tags in Flickr

• 2005: eRDF

• 2008: RDFa

• 2009: Microdata (?)

- 17 -

HTML meta tags

<HTML>
<HEAD profile="http://dublincore.org/documents/dcq-html/">
  <META name="DC.author" content="Peter Mika">
  <LINK rel="DC.rights copyright" href="http://www.example.org/rights.html" />
  <LINK rel="meta" type="application/rdf+xml" title="FOAF" href="http://www.cs.vu.nl/~pmika/foaf.rdf">
</HEAD>
…
</HTML>

- 18 -

SHOE example (Heflin & Hendler, 1996)

<ONTOLOGY "our-ontology" VERSION="1.0">
  <ONTOLOGY-EXTENDS "organization-ontology" VERSION="2.1" PREFIX="org" URL="http://www.ont.org/orgont.html">
  <ONTDEF CATEGORY="Person" ISA="org.Thing">
  <ONTDEF RELATION="lastName" ARGS="Person STRING">
  <ONTDEF RELATION="firstName" ARGS="Person STRING">
  <ONTDEF RELATION="marriedTo" ARGS="Person Person">
  <ONTDEF RELATION="employee" ARGS="org.Organization Person">
</ONTOLOGY>

<HEAD>
  <META HTTP-EQUIV="Instance-Key" CONTENT="http://www.cs.umd.edu/~george">
  <USE-ONTOLOGY "our-ontology" VERSION="1.0" PREFIX="our" URL="http://ont.org/our-ont.html">
</HEAD>
<BODY>
  <CATEGORY "our.Person">
  <RELATION "our.marriedTo" TO="http://www.cs.umd.edu/~helena">
  <RELATION "our.employee" FROM="http://www.cs.umd.edu">
  My name is
  <ATTRIBUTE "our.firstName"> George </ATTRIBUTE>
  <ATTRIBUTE "our.lastName"> Cook </ATTRIBUTE> and I live at...

- 19 -

SHOE system

- 20 -

SHOE Text-based query interface

- 21 -

SHOE Graphical Query Interface

- 22 -

Example: Creative Commons

Embedding CC license in HTML (now deprecated):

<HTML><HEAD>… </HEAD><BODY>…

<!--
<rdf:RDF xmlns="http://creativecommons.org/ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <Work rdf:about="http://www.yergler.net/averages/">
    <dc:title>The Law of Averages</dc:title>
    <dc:description>...because eventually i'll be right...</dc:description>
    <license rdf:resource="http://creativecommons.org/licenses/by-nc/1.0/" />
  </Work>
  <License rdf:about="http://creativecommons.org/licenses/by-nc/1.0/">
    <requires rdf:resource="http://web.resource.org/cc/Notice" />
    <permits rdf:resource="http://web.resource.org/cc/Reproduction" />
    <permits rdf:resource="http://web.resource.org/cc/Distribution" />
    <prohibits rdf:resource="http://web.resource.org/cc/CommercialUse" />
  </License>
</rdf:RDF>
-->

- 23 -

Example: Creative Commons

• Current: rel attribute (HTML4)

This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/3.0/us/">Creative Commons Attribution 3.0 United States License</a>.

• Use of the “rel” attribute for semantic annotation is the birth of the microformat…

- 24 -

Microformats (μf)

• Community centered around microformats.org

– Specifications and discussions are hosted there

• Agreements on the way to encode certain kinds of metadata in HTML

– Reuse of semantic-bearing HTML elements

– Based on existing standards

– Minimality

• Microformats exist for a limited set of objects

– hCard (persons and organizations)

– hCalendar (events)

– hResume

– hProduct

– hRecipe

• Varying degrees of support and stability

– hCard and rel-tag are widely supported

- 25 -

Example: microformats

<cite class="vcard">

<a class="fn url" rel="friend colleague met"

href="http://meyerweb.com/">Eric Meyer</a>

</cite> wrote a post (<cite>

<a href="http://meyerweb.com/eric/thoughts/2005/12/16/tax-relief/">

Tax Relief</a></cite>) about an unintentionally humorous letter

he received from the <span class="vcard">

<a class="fn org url" href="http://irs.gov/">

Internal Revenue Service</a> </span>.

<div class="vcard">

<a class="email fn" href="mailto:[email protected]">Joe Friday</a>

<div class="tel">+1-919-555-7878</div>

<div class="title">Area Administrator, Assistant</div>

</div>
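To show how a consumer might read such markup, here is a rough hCard reader built on Python's standard html.parser. It is a simplified sketch and not the microformats.org parsing rules: it assumes well-nested markup without void elements inside the card, handles only a handful of properties, and the address joe@example.org is a placeholder invented for the example.

```python
from html.parser import HTMLParser

TEXT_FIELDS = {"fn", "org", "tel", "title"}  # value is the element's text
HREF_FIELDS = {"url", "email"}               # value is taken from href


class HCardParser(HTMLParser):
    """Very small hCard reader: one dict per element with class="vcard"."""

    def __init__(self):
        super().__init__()
        self.cards = []
        self._card = None
        self._stack = []  # one (is_vcard_root, text_fields) entry per open element

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = set((attr.get("class") or "").split())
        is_root = self._card is None and "vcard" in classes
        if is_root:
            self._card = {}
        text_fields = set()
        if self._card is not None:
            href = attr.get("href")
            for name in classes & HREF_FIELDS:
                if href:
                    self._card[name] = href[7:] if href.startswith("mailto:") else href
            text_fields = classes & TEXT_FIELDS
            for name in text_fields:
                self._card.setdefault(name, "")
        self._stack.append((is_root, text_fields))

    def handle_data(self, data):
        if self._card is None:
            return
        # Text belongs to every open element that carries a text property.
        for _, fields in self._stack:
            for name in fields:
                self._card[name] += data

    def handle_endtag(self, tag):
        if not self._stack:
            return
        is_root, fields = self._stack.pop()
        for name in fields:
            self._card[name] = " ".join(self._card[name].split())
        if is_root:
            self.cards.append(self._card)
            self._card = None


HTML = """
<div class="vcard">
  <a class="email fn" href="mailto:joe@example.org">Joe Friday</a>
  <div class="tel">+1-919-555-7878</div>
  <div class="title">Area Administrator, Assistant</div>
</div>
"""

parser = HCardParser()
parser.feed(HTML)
cards = parser.cards
print(cards)
```

Note how the meaning of each class name is hard-coded in the parser: this is the "no shared syntax" limitation discussed on the next slide.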

- 26 -

Microformats: limitations

• No shared syntax

– Each microformat has a separate syntax tailored to the vocabulary

• No formal schemas

– Limited reuse, extensibility of schemas

– Unclear which combinations are allowed

• No datatypes

• No namespaces, unique identifiers (URIs)

– no interlinking

– mapping between instances is required

• Relationship to page context is often unclear

- 27 -

RDFa

• RDF-in-attributes

– World Wide Web Consortium (W3C) recommendation for encoding RDF triples in HTML

– Full RDF support

• Recommendation specifies the algorithm for parsing the triples out of HTML

– Requires XHTML in principle

• In practice, no one cares

- 28 -

RDFa in a slide

<p xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
   xmlns:foaf="http://xmlns.com/foaf/0.1/"
   typeof="foaf:Person"
   about="http://example.org/staff/jo">
  <span property="rdfs:label foaf:name">Jo Smith</span>.
  <span property="foaf:title">Web hacker</span> at
  <a rel="vcard:org" property="foaf:name" href="http://example.org"> Acme Corp</a>.
  You can contact me <a rel="foaf:mbox" href="mailto:[email protected]">via email</a>.
</p> ...

Assign the prefixes rdfs and foaf to the RDFS and FOAF namespaces (as in XML, RDF/XML etc.)

Create a new resource of type foaf:Person

Assign a value to a property

Give it a URI

Link to another resource and assign a name to it
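The prefix mechanism used in the attributes above can be illustrated with a small Python sketch that expands prefixed names such as foaf:name into full URIs. It covers only this one step, not the full RDFa processing algorithm, and the function names expand_curie and expand_all are invented for the example.

```python
# Prefix declarations as they appear in the xmlns attributes of the example.
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def expand_curie(curie, prefixes):
    """Turn 'prefix:local' into a full URI using the declared prefixes."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise KeyError(f"undeclared prefix in {curie!r}")
    return prefixes[prefix] + local


def expand_all(attribute_value, prefixes):
    """An attribute such as property may hold several space-separated names."""
    return [expand_curie(name, prefixes) for name in attribute_value.split()]


names = expand_all("rdfs:label foaf:name", PREFIXES)
print(names)
```

With only these two declarations, expand_curie("vcard:org", PREFIXES) raises KeyError: the slide's markup uses the vcard prefix without declaring it, so a page like this would also need an xmlns:vcard declaration.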

- 29 -

Microformats vs. RDFa

• Choose microformats when you find a microformat that fits your needs (and it is supported by your favorite tool or search engine)

– Microformats are the first option because they are simple

– We support all major microformats, see the documentation

– It’s a common misconception that RDFa requires XHTML: it doesn’t

• If you find none that perfectly fits your needs, then you need RDFa

– Microformats have a fixed schema: you cannot add your own attributes

• Example: a social networking site with user profiles

– vCard is a good candidate, but, for example, it doesn’t have a way to express the user’s social connections

– You either live without this, or go with RDFa

- 30 -

Keep an eye on HTML5

• Currently under standardization at the W3C

– Last Call in Fall 2009

• Introduces Microdata

– Similar to microformats

• Some predefined vocabularies with central registration

– Some of the flexibility of RDFa

– Introduce new terms using reverse domain names or full URIs

• Semantic HTML elements such as <time>, <video>, <article>…

- 31 -

Microdata example

<div item="http://www.yahoo.com/resource/person">
  <p>My name is <span itemprop="name">Neil</span>.</p>
  <p>My band is called
    <span itemprop="band">Four Parts Water</span>.
    I was born on
    <time itemprop="birthday" datetime="2009-05-10">May 10th 2009</time>.
    <img itemprop="image" src="me.png" alt="me">
  </p>
</div>

- 32 -

The process of annotating with RDFa

1. Invest time in familiarizing yourself with the RDFa syntax by reading the RDFa Primer

– It is also highly recommended that you read the RDF Primer. RDF is the data model used by RDFa.

2. Choose a vocabulary from the SearchMonkey documentation that fits your needs

– A vocabulary describes a set of types and attributes within a given domain

– If you don’t find a good candidate, extend an existing one or create a new one

3. Annotate your page

– Before you start, you might want to validate your page for (X)HTML conformance using the W3C’s (X)HTML Validator to reduce the chance of errors. Choose Document Type XHTML + RDFa.

– No specific tool support. If you have an HTML or XML editor that supports DTDs, you will have syntax checking and highlighting.

– Use the RDFa Distiller to check which data can be extracted from your page.

– If you like, use the RDF Validator to visualize the resulting RDF graph.

4. Put the annotated page online. The data will be extracted the next time your page is crawled

– No need to explicitly submit anything

– No notification when your site is crawled

• See http://rdfa.info/rdfa-implementations for new tools and APIs

- 33 -

Choosing a vocabulary

• Look at SearchMonkey objects

– Video, Games, Presentations, Events, News, Businesses, Products, Discussion

• Search the Web or ask for advice on mailing lists

– [email protected]


• Wikis

– semanticweb.org

– vocamp.org

• Beware of people who claim to have the vocabulary of everything

– Preferably you want something small and targeted

• There is never a 100% fit: you will need to introduce new vocabulary terms (classes and properties)

– Do not introduce new classes/properties in existing namespaces

– Example: the namespace http://xmlns.com/foaf/0.1/ is used by the FOAF project. Try not to introduce a new term without contacting the owner, i.e. the membership of the FOAF mailing list.

- 34 -

Advanced topic: creating a vocabulary

1. Get advice on methodology

– vocamp.org and semanticweb.org

2. Choose a namespace and a prefix

– Give sensible names, e.g. name it after your site, but don’t call it searchmonkey

– The namespace ends with either a slash or a hash

3. Create an RDF or OWL document describing your classes and properties

– Use an ontology editor such as Protégé 4.0

– Follow naming conventions

4. Publish your vocabulary

– Make sure the URIs of your properties and classes are resolvable

• E.g. myvocab:digicam should resolve to a document containing the definition of myvocab:digicam

5. Convince others to adopt your vocabulary

– If you are in fishing, convince other fishing businesses
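Why the namespace should end with a slash or a hash becomes clear once term URIs are built by plain concatenation, as in this small Python sketch. The namespace http://example.org/myvocab# is a hypothetical stand-in for the myvocab prefix, and term_uri is a helper invented for the example.

```python
def term_uri(namespace, local_name):
    """Build the full URI of a vocabulary term by concatenation."""
    if not namespace.endswith(("/", "#")):
        # Without a separator, namespace and term name would run together.
        raise ValueError("namespace should end with '/' or '#'")
    return namespace + local_name


# Hypothetical namespace for the 'myvocab' prefix.
MYVOCAB = "http://example.org/myvocab#"

uri = term_uri(MYVOCAB, "digicam")
print(uri)
```

The resulting URI is the address that should resolve to the definition of the term, as step 4 asks.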

- 35 -

Exercise (slideshare.net/pmika)

• Explore data on the Web

– Microformats

• Search for pages on Yahoo using searchmonkey:com.yahoo.page.uf.hcard

• Try Operator Firefox Plug-in

• Try Optimus

– RDFa

• Create yourself or search for pages on Yahoo using searchmonkey:com.yahoo.page.rdf.rdfa

• Try RDFa bookmarklet to highlight RDFa

• Try RDFa Distiller to extract RDF from HTML

• Try RDF Validator to visualize your RDF data

• Mark up your webpage using RDFa

– Use RDFa Distiller to test

• Try automated annotation using Zemanta or OpenCalais

Semantic Search

- 37 -

Why semantic search? (P. Raghavan)

• Old battles are won

– Main driver of user perception of search quality used to be: precision of navigational queries

– The technical prowess was about crawling and spam

– The hard core was indexing and retrieval

• Currently, the biggest bottlenecks in IR are not computational, but in modeling user cognition

– If only we could find a computationally expensive way to solve the problem

• then we should be able to make it go faster

- 38 -

Searches that show a ‘semantic gap’

• Ambiguous searches

– Paris Hilton

• Multimedia search

– Images of Paris Hilton

• Imprecise or overly precise searches

– Publications by Jim Hendler

– Find images of strong and adventurous people (Lenat)

• Searches for descriptions

– Search for yourself without using your name

– Product search (ads!)

• Searches that require aggregation

– Size of the Eiffel Tower (Lenat)

– Public opinion on Britney Spears

– World temperature by 2020

Queries that require a deeper understanding of the query, the content and/or the world at large

- 39 -

Not just search

- 40 -

Semantic Search

• Def. matching the user’s query with the Web’s content at a conceptual level, often with the help of world knowledge

– R. Guha, R. McCool: Semantic Search, WWW2003

• Related disciplines

– Semantic Web, IR, Databases, NLP, IE

• As a field

– ISWC/ESWC/ASWC, WWW, SIGIR

– Exploring Semantic Annotations in Information Retrieval (ECIR08, WSDM09)

– Semantic Search Workshop (ESWC08, WWW09)

– Future of Web Search: Semantic Search (FoWS09)

- 41 -

Semantics at every step of the IR process

[Figure: the IR process, with semantics at each step. Labels: search interface, query processing, document processing, ranking θ(q,d), the IR engine, the Web, the Semantic Web.]

- 42 -

Document processing

• Goal: provide a higher level representation of text in some conceptual space

• Diverse methods

– Document classification

– Information Extraction

• Named-entity recognition, word-sense disambiguation, semantic role labeling, wrapper induction, form filling, etc.

• Previously: open source toolkits such as GATE, OpenNLP

– Require expert user to operate and train

• NEW: Online tools

– For non-expert users

– Embed metadata inside documents and/or link entities in documents to the cloud

– Often using existing metadata as background knowledge

- 43 -

Examples: online NLP

• OpenCalais (Thomson Reuters)

– Online interface and API for named-entity recognition, (partial) disambiguation and relationship extraction

– HTML annotator (microformats), Firefox plug-in, WordPress plug-in, Yahoo Pipes service, etc.

– Works best on the news domain

• Zemanta

– Zemanta is a ‘personal writing assistant’ that recognizes and disambiguates named entities and suggests related content

– A Firefox plug-in that extends the functionality of popular blogging and online email platforms, online interface and API

– Broad coverage but partial recognition

- 44 -

Examples: information extraction from templated HTML

• Intel MashMaker

– Create mashups based on information extracted from the page you are browsing (Firefox plugin)

– Wrapper induction trained using manual annotations

– Export the XSLT to be used in Yahoo’s SearchMonkey

• Dapper

– Semantic Advertising company

– Wrapper induction trained using manual annotations (online tool)

– API to the dynamic extraction

• Glue

– Social interaction around objects on the Web (Firefox plugin)

– API provides access to the information extracted from popular sites

- 45 -

Query Interpretation

• Provide a higher level representation of queries in some conceptual space

– Ideally, the same space in which documents are represented

• Queries may be keywords, questions, semi-structured, structured etc.

• Interpretation treated as a separate step from ranking

– Required for federation, i.e. determine where to send the query

– Choosing between interpretations could be left to the user

– Due to performance requirements

• You cannot execute the query to determine what it means and then query again

• General world knowledge (e.g. DBpedia), domain ontologies or the schema of the actual data can be used

- 46 -

Example: Semantic Search Assist

• Observation: objects of the same type often share the same query context

– Users ask for the same aspect of the type

• Could we make query suggestions based on the type of the entity?

– Improvement for infrequent queries

Example queries:

– apple ipod nano review

– sony plasma tv review

– jerry yang biography

– biography tim berners lee

– tim berners lee blog

– peter mika yahoo

– britney spears shaves her head

- 47 -

Models

• Desirable properties:

– P1: Fix is frequent within type

– P2: Fix has frequencies well-distributed across entities

– P3: Fix is infrequent outside of the type

• Models:

Example: in the query “apple ipod nano review”, the entity is “apple ipod nano” (type: product) and the fix is “review”.
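The three desirable properties can be made concrete with a toy computation. The Python sketch below is an illustration only and not one of the models from the talk: the query log is hand-made, and the three ratios are simple stand-ins chosen here for P1 (frequency within the type), P2 (spread across entities of the type) and P3 (share of the fix's occurrences outside the type, where lower is better).

```python
# Hypothetical, hand-made query log: (entity, entity type, fix).
LOG = [
    ("apple ipod nano", "product", "review"),
    ("sony plasma tv", "product", "review"),
    ("sony plasma tv", "product", "price"),
    ("jerry yang", "person", "biography"),
    ("tim berners lee", "person", "biography"),
    ("tim berners lee", "person", "blog"),
]


def fix_scores(log, fix, entity_type):
    """Return simple ratios standing in for P1, P2 and P3."""
    in_type = [(entity, f) for entity, t, f in log if t == entity_type]
    hits = [entity for entity, f in in_type if f == fix]
    total_with_fix = sum(1 for _, _, f in log if f == fix)
    if not in_type or not total_with_fix:
        return 0.0, 0.0, 0.0
    p1 = len(hits) / len(in_type)                                  # frequent within the type
    p2 = len(set(hits)) / len({entity for entity, _ in in_type})   # spread across entities
    p3 = 1 - len(hits) / total_with_fix                            # share outside the type
    return p1, p2, p3


p1, p2, p3 = fix_scores(LOG, "review", "product")
print(p1, p2, p3)
```

In this toy log, "review" occurs with every product entity and never outside the type, which is the profile a good suggestion for products would have.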

- 48 -

Demo

- 49 -

Ranking

• Goal: match the query representation to content representation

– Again, possibly with the use of background knowledge

• Methods depend on

– the actual representation

• Bag of words, NL parse trees, SPARQL…

– the qualities of the queries and content

– the use case

• e.g. time available for ranking may vary from milliseconds to days
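For the simplest representation named above, bag of words, matching a query against content can be sketched as cosine similarity over term counts. This is a generic textbook scoring function used here as an example of θ(q,d), not the ranking function of any particular engine, and the three documents are made up.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-cased term counts; whitespace tokenization only."""
    return Counter(text.lower().split())


def cosine(q, d):
    """Cosine similarity between two term-count vectors."""
    dot = sum(q[term] * d[term] for term in q)  # Counter returns 0 for missing terms
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Order documents by decreasing similarity to the query."""
    q = bag_of_words(query)
    return sorted(docs, key=lambda doc: cosine(q, bag_of_words(doc)), reverse=True)


DOCS = [
    "budapest is the capital of hungary",
    "peter mika was born in budapest",
    "paris hilton",
]

ranked = rank("peter mika", DOCS)
print(ranked[0])
```

A semantic representation would replace the word counts with concepts or entities, while the match-and-sort structure stays the same.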

- 50 -

Search interface

• Goal is to facilitate the interaction between the user and the system

– helping the user to formulate queries

– present the results in an intelligent manner

• Semantic-based representations allow novel interfaces

– Adaptive presentation

• presentation adapts to the kind of query and results presented

– Aggregated search

• Grouping similar items, summarizing results in various ways

• Possibilities for filtering, possibly across different dimensions

– Task completion

• Help the user to fulfill the task by placing the query in a task context

- 51 -

Putting it all together: Semantic Search Engines

• Natural Language search engines

– Hakia

– Powerset (now built into Bing)

– TrueKnowledge

• Structured data search engines

– Searching open web data

• Sindice

• Sigma

– Searching closed world data

• Wolfram Alpha (closed structured data + computation)

• Web search engines

– Yahoo’s SearchMonkey

• BOSS and YQL

– Google’s Rich Snippets

- 52 -

SearchMonkey

• An open platform for using structured data to build more useful and relevant search results

• Creating an ecosystem of publishers, developers and end-users

– Helping publishers to implement semantic annotation and motivating them by allowing them to customize their search results

– Providing tools and APIs for developers to create compelling applications

– Improving search result presentation for our end-users

• Semantic Web technology

– Support for a number of microformats, RDFa

– RDF based data representation

– Industry standard vocabularies (or use any of your own)

- 53 -

Enhanced Result

[Figure: an enhanced search result, with callouts for the image, deep links, and name/value pairs or abstract.]

- 54 -

A difficult problem!

• What if one would try to do this automatically?

– Document summarization

– Page structure detection

– Information Extraction

– Image recognition

– Link classification and ranking

• But one of those cases where a little semantics can go a long way…

- 55 -

SearchMonkey

[Figure: SearchMonkey data flow. Labels: Acme.com’s database, Acme.com’s Web pages, DataRSS feed, Web Services, RDF/Microformat Markup, Page Extraction, Index.]

1. Site owners/publishers share structured data with Yahoo!.

2. Site owners & third-party developers build SearchMonkey apps.

3. Consumers customize their search experience with Enhanced Results or Infobars.

- 56 -

Example apps

• LinkedIn

– hCard plus feed data

• Creative Commons by Ben Adida

– CC in RDFa

- 57 -

Example apps. II.

• Other me by Dan Brickley

– Google Social Graph API wrapped using a Web Service

- 58 -

Google’s Rich Snippets

• Shares a subset of the features of SearchMonkey

– Encourages publishers to embed certain microformats and RDFa into webpages

• Currently reviews, people, products, business & organizations

– These are used to generate richer search results

• SearchMonkey is customizable

– Developers can develop applications themselves

• SearchMonkey is open

– Wide support for standard vocabularies

– API access

- 59 -

BOSS: Build your Own Search Service

• Ability to re-order results and blend in additional content

• No restrictions on presentation

• No branding or attribution

• Access to multiple verticals (web search, image, news)

• 40+ supported language and region pairs

• Pricing (BOSS)

– Pay-by-usage

– 10,000 queries a day still free

– Serve any ads you want

• For more info, http://developer.yahoo.com/search/boss/

- 60 -

BOSS API to structured data

• Simple HTTP GET calls, no authentication

– You need an Application ID: register at developer.yahoo.com/search/boss/

• http://boss.yahooapis.com/ysearch/web/v1/{query}?appid={appid}&format=xml&view=searchmonkey_feed

• Restrict your query using special words

– searchmonkey:com.yahoo.page.uf.{format}

• {format} is one of hcard, hcalendar, tag, adr, hresume etc.

– searchmonkey:com.yahoo.page.rdf.rdfa
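Assembling such a request URL from the template above can be sketched with Python's urllib.parse. The helper name boss_url and the placeholder application ID are assumptions for the example, and the code only builds the URL string without sending a request.

```python
from urllib.parse import quote, urlencode

# Endpoint taken from the URL template on this slide.
BOSS_ROOT = "http://boss.yahooapis.com/ysearch/web/v1/"


def boss_url(keywords, appid, restriction=None):
    """Build a BOSS web search URL that asks for the SearchMonkey feed view."""
    query = keywords if restriction is None else f"{keywords} {restriction}"
    params = urlencode({"appid": appid, "format": "xml", "view": "searchmonkey_feed"})
    # The query is part of the path, so every reserved character is escaped.
    return BOSS_ROOT + quote(query, safe="") + "?" + params


# "YOUR_APP_ID" is a placeholder for a registered Application ID.
url = boss_url("semantic web", "YOUR_APP_ID", "searchmonkey:com.yahoo.page.uf.hresume")
print(url)
```

The restriction word is simply appended to the keywords, which is how the slide describes limiting results to pages that carry a given microformat or RDFa.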

- 61 -

Demo: resume search

• Search pages with resume data and given keywords

{keyword} searchmonkey:com.yahoo.page.uf.hresume

• Parse the results as DataRSS (XML)

• Extract information and display using YUI

- 62 -

Demo

- 63 -

Yahoo Query Language (YQL)

• Query web APIs as virtual tables

– Mash-up data by joining tables

– Add an API by adding a table definition

• Example: select my friends and sort by nickname

- 64 -

PHP example: select the last 100 photos from Flickr with the word Austin

<?php

$url = "http://query.yahooapis.com/v1/public/yql?q=";

$q = "select * from flickr.photos.search(100) where text='Austin'";

$fmt = "xml";

$x = simplexml_load_file($url.urlencode($q)."&format=$fmt");

foreach($x->attributes('http://www.yahooapis.com/v1/base.rng') as $k=>$v) {

$$k=(string)$v;

}

echo <<<EOB

$count photos fetched from

{$x->diagnostics->url} in

{$x->diagnostics->url['execution-time']} seconds<br>

EOB;

$flickr = "http://static.flickr.com/";

foreach($x->results->photo as $p) {

echo "<img src=\"$flickr{$p['server']}/{$p['id']}_{$p['secret']}_s.jpg\"/>\n";

}

?>

- 65 -

YQL example (source)

- 66 -

That’s all there is to it!

<?php

$root = 'http://query.yahooapis.com/v1/public/yql?q=';

$city = 'Barcelona';

$loc = 'Barcelona';

$yql = 'select * from html where url = \'http://en.wikipedia.org/wiki/'.$city.'\' and xpath="//div[@id=\'bodyContent\']/p" limit 3';

$url = $root . urlencode($yql) . '&format=xml';

$info = getstuff($url);

$info = preg_replace("/.*<results>|<\/results>.*/",'',$info);

$info = preg_replace("/<\?xml version=\"1\.0\"".

" encoding=\"UTF-8\"\?>/",'',$info);

$info = preg_replace("//",'',$info);

$info = preg_replace("/\"\/wiki/",'"http://en.wikipedia.org/wiki',$info);

$yql = 'select * from upcoming.events.bestinplace(5) where woeid in '.

'(select woeid from geo.places where text="'.$loc.'")'.

' | unique(field="description")';

$url = $root . urlencode($yql) . '&format=json';

$events = getstuff($url);

$events = json_decode($events);

foreach($events->query->results->event as $e){

$evHTML.='<li><h3><a href="'.$e->ticket_url.'">'.$e->name.'</a></h3><p>'.

substr($e->description,0,100).'…</p></li>';

}

$yql = 'select * from flickr.photos.info where photo_id in '.

'(select id from flickr.photos.search where woe_id in '.

'(select woeid from geo.places where text="'.$loc.'")) limit 16';

$url = $root . urlencode($yql) . '&format=json';

$photos = getstuff($url);

$photos = json_decode($photos);

foreach($photos->query->results->photo as $s){

$src = "http://farm{$s->farm}.static.flickr.com/{$s->server}/".

"{$s->id}_{$s->secret}_s.jpg";

$phHTML.='<li><a href="'.$s->urls->url->content.'"><img alt="'.

$s->title.'" src="'.$src.'"></a></li>';

}

$yql='select description from rss where '.

' url="http://weather.yahooapis.com/forecastrss?p=SPXX0015&u=c"';

$url = $root . urlencode($yql) . '&format=json';

$weather = getstuff($url);

$weather = json_decode($weather);

$weHTML = $weather->query->results->item->description;

function getstuff($url){

$curl_handle = curl_init();

curl_setopt($curl_handle, CURLOPT_URL, $url);

curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);

curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);

$buffer = curl_exec($curl_handle);

curl_close($curl_handle);

if (empty($buffer)){

return 'Error retrieving data, please try later.';

} else {

return $buffer;

}

}?>

- 67 -

Exercise

• Build a SearchMonkey application

• Try a few YQL queries

• Find data on the Web using search or by browsing Linked Data

• Build your own search engine using BOSS

- 68 -

Challenges

• Future work in Semantic Web

– (Semi-)automated ways of metadata creation

• How do we go from 5% to 95%?

– Data quality

• Can we trust statements people make about each other’s data?

– Reasoning

• To what extent is reasoning useful?

• For example, how much would entity resolution or taxonomic reasoning help?

– Scale

• How do we exploit cluster computing techniques?

• What is between databases and IR engines?

– Fostering social agreements

• How do we get people to reuse vocabularies?

- 69 -

Challenges

• Future work in IR

– Query interpretation

– Ranking with metadata

– Evaluation of semantic search

– Personalization

– Semantic ads

• Constraints

– Users still want to see a document

• Keyword-based search cannot suffer

• Whole page relevance, monetization can only increase

– Established expectations

• Query entry

• Result presentation

- 70 -

Contact

• Peter Mika


– Come to Barcelona and stop by

– Ask about our internship program

• SearchMonkey

– developer.yahoo.com/searchmonkey/

– mailing lists

• [email protected]

• [email protected]

– forums

• http://suggestions.yahoo.com/searchmonkey

- 71 -

the monkey is out!
