Identifying Information Needs by Modelling Collective Query Patterns
K. Elbedweihy, S. Mazumdar, A.E. Cano, S.N. Wrigley, F. Ciravegna
OAK Research Group, Department of Computer Science, University of Sheffield
Jan 21, 2015
Information Needs
Information needs: “the set of concepts and properties users refer to while using SPARQL queries.”
Information Needs (Cont’d)
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?manufacturer WHERE {
  <http://dbpedia.org/resource/Acura_ZDX> dbo:manufacturer ?manufacturer .
}
• User’s information needs:
  concept: “http://dbpedia.org.../Automobile”
  property: “dbo:manufacturer”
[Diagram labels: query, type]
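A minimal sketch of how a concept and property could be read off a query like the one above. It is an illustration, not the authors' implementation: the regular expression is a toy pattern rather than a SPARQL parser, `INSTANCE_TYPES` is a hypothetical stand-in for asking the dataset for the type of an instance, and the `http://example.org/Automobile` URI is a placeholder for the concept URI that the slide abbreviates.

```python
import re

# Hypothetical stand-in for a type lookup against the dataset.
# The concept URI is a placeholder, not the real DBpedia class URI.
INSTANCE_TYPES = {
    "http://dbpedia.org/resource/Acura_ZDX": "http://example.org/Automobile",
}

QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?manufacturer WHERE {
  <http://dbpedia.org/resource/Acura_ZDX> dbo:manufacturer ?manufacturer .
}
"""

# Toy pattern for one "<subject IRI> prefix:local ?variable" triple pattern.
TRIPLE = re.compile(r"<([^>]+)>\s+([A-Za-z][\w-]*:[\w-]+)\s+\?\w+")


def information_needs(query, instance_types):
    """Return (concept, property) pairs for simple triple patterns in `query`."""
    needs = []
    for subject, predicate in TRIPLE.findall(query):
        # The concept is the type of the queried instance (None if unknown).
        needs.append((instance_types.get(subject), predicate))
    return needs


needs = information_needs(QUERY, INSTANCE_TYPES)
print(needs)
```

A real pipeline would use a proper SPARQL parser and resolve types by querying the dataset; the dictionary lookup only marks where that step would go.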
Motivation
Saracevic [1997]: “The success or failure of any interactive system and technology is contingent on the extent to which user issues, the human factors, are addressed right from the beginning to the very end…”
Peter Mika [2009]: “Considering the information needs of end users is critical to the success of Semantic Search”
Motivation
• Understand how to use logs of queries
• Identify information needs
• Consume such analysis
• Better understanding and insight into the data usage
Outline
• Introduction
• Related Work
• Approach
  - Formalising Query Logs
  - Analysing Query Logs
  - Consuming Query Log Analyses
• Dataset & Findings
[Figure: Linking Open Data cloud diagram, as of September 2011. Datasets are grouped by domain: Media, Geographic, Publications, Government, Cross-domain, Life sciences, User-generated content.]
Introduction
295 datasets, 31 billion RDF triples (September 2011)
Introduction
Semantic Query Logs
Related Work
Analysis for the Web of Documents
• Studying the search behavior of Web users [Silverstein et al. (1999), Jansen and Spink (2005), Jansen et al. (2005) and Spink et al. (2002)].
• Improving the search experience of Web users:
  - Query Recommendations [Baeza-Yates et al. (2004) and Wen et al. (2001)]
  - Query Expansion [Cui et al. (2002a)]
Related Work (Cont’d)
Analysis for the Web of Data
• Moller et al. [10] identified patterns of Linked Data usage with respect to different types of agents.
• Arias et al. [1] analyzed the structure of the SPARQL queries to identify the most frequent language elements.
• Kirchberg et al. [8] introduced a new notion of ‘relevance of a LD resource’ as the ‘relationship between traffic and the resource and whether it changes over time windows’.
Related Work (Cont’d)
How our work is different:
Our focus is on identifying information needs by modelling query patterns of Linked Data users.
• An approach to formalize semantic query log analysis
• A set of methods for extracting patterns in the query logs
• Visualization of information needs
APPROACH
Formalizing Query Logs
• Proposed ontology ‘Qlog’ used to represent the main concepts and relations extracted from a query log entry.
• A log entry follows the Combined Log Format (CLF):
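The slide's own example entry is not reproduced in this transcript. As a rough sketch, an entry in Apache's combined format can be split into its fields and the SPARQL query recovered from the request URL as shown below. The log line, the `/sparql` path and the `query` parameter name are invented for illustration and are not taken from the USEWOD data.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Fields of the combined format: host, ident, user, [time], "request",
# status, size, "referer", "user agent".
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_entry(line):
    """Parse one log line into a dict; return None if it does not match."""
    match = CLF.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # The request field has the shape "METHOD URL PROTOCOL".
    parts = entry["request"].split()
    url = parts[1] if len(parts) >= 2 else ""
    # Assumption: the SPARQL text travels in a "query" URL parameter.
    params = parse_qs(urlsplit(url).query)
    entry["query"] = params.get("query", [None])[0]
    return entry


# Invented example line, not real log data.
LINE = (
    '192.0.2.1 - - [21/Jan/2011:10:15:32 +0000] '
    '"GET /sparql?query=SELECT+%3Fs+WHERE+%7B+%3Fs+a+'
    '%3Chttp%3A%2F%2Fexample.org%2FThing%3E+%7D HTTP/1.1" '
    '200 512 "-" "ExampleAgent/1.0"'
)

entry = parse_entry(LINE)
print(entry["host"], entry["time"], entry["query"])
```

The parsed fields (host, timestamp, user agent, query text) are the kind of values that the log entry concepts of an ontology such as Qlog would then describe.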
Qlog Ontology
[Figure: the Qlog ontology, in two parts: Log Entry Concepts and Query Logs Analysis Concepts]
Analyzing Query Logs
Consuming Query Logs Analysis
• How to consume the query logs analysis?
  - Automatic query suggestions
  - Recommender systems
  - Search tools (disambiguation and ranking results)
  - Visualizations (to gain understanding of dataset usage)
    1. Concept Graph
    2. Predicate sequence tree
Consuming Query Logs Analysis: Visualizations
[Figures: Concept Graph; Predicate sequence tree]
Steps for Consuming Query Logs Analysis
[Figure: two pipelines over the Query Logs Knowledge Base (KB)]
A1 Identify Instance Types → A2 Gather Class Size → A3 Build Vis Tables → A4 Render Vis
B1 Identify Predicate Sequence → B2 Build Transition Matrix → B3 Build Vis Tables → B4 Render Vis
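The slides name these steps without defining them, so the following is only a sketch under stated assumptions: a "predicate sequence" is taken to be the ordered list of predicates in one query, and the transition matrix is taken to hold first-order probabilities of one predicate following another. All identifiers, the `ex:` names and the class sizes are made up, and plain Python collections stand in for the Query Logs Knowledge Base.

```python
from collections import Counter, defaultdict

# Hypothetical inputs standing in for the Query Logs Knowledge Base.
queried_instances = ["ex:Acura_ZDX", "ex:Sheffield", "ex:Honda_Civic"]
instance_types = {
    "ex:Acura_ZDX": "ex:Automobile",
    "ex:Honda_Civic": "ex:Automobile",
    "ex:Sheffield": "ex:City",
}
class_sizes = {"ex:Automobile": 1000, "ex:City": 5000}  # made-up sizes
predicate_sequences = [
    ["ex:manufacturer", "ex:location"],
    ["ex:manufacturer", "ex:foundedBy"],
    ["ex:manufacturer", "ex:location"],
]


def concept_table(instances, types, sizes):
    """Count queried instances per concept and attach each class size."""
    counts = Counter(types[i] for i in instances if i in types)
    return {
        concept: {"queries": n, "class_size": sizes.get(concept)}
        for concept, n in counts.items()
    }


def transition_matrix(sequences):
    """Row-normalised counts of predicate -> next-predicate transitions."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return {
        current: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for current, row in counts.items()
    }


concepts = concept_table(queried_instances, instance_types, class_sizes)
matrix = transition_matrix(predicate_sequences)
print(concepts)
print(matrix)
```

Tables of this shape could feed a concept graph (node per concept, sized or coloured by query count and class size) and a predicate sequence tree (edges weighted by transition probability). How the original visualization tables were built is not stated in the slides.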
CASE STUDY
Dataset
• The data used in this study was made available by the USEWOD2011 data challenge.
• The logs contained around 5 million queries issued to DBpedia over a time period of almost 4 months.

Number of analyzed queries         4951803
Number of unique triple patterns   2641098
Number of unique subjects          1168945
Number of unique predicates           2003
Number of unique objects            196221
Number of unique vocabularies          323
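Counts of this kind could be derived from the extracted triple patterns roughly as follows. This is a sketch under assumptions the slides do not confirm: variables (terms starting with `?`) are ignored, and a "vocabulary" is taken to be the namespace of a predicate IRI, cut at the last `#` or `/`. The three toy queries are invented.

```python
def namespace(iri):
    """Assumed vocabulary: the IRI up to and including the last '#' or '/'."""
    cut = max(iri.rfind("#"), iri.rfind("/"))
    return iri[: cut + 1]


def is_variable(term):
    return term.startswith("?")


def log_statistics(queries):
    """Summarise a list of queries, each given as a list of (s, p, o) triples."""
    patterns = {triple for query in queries for triple in query}
    subjects = {s for s, _, _ in patterns if not is_variable(s)}
    predicates = {p for _, p, _ in patterns if not is_variable(p)}
    objects = {o for _, _, o in patterns if not is_variable(o)}
    return {
        "analyzed_queries": len(queries),
        "unique_triple_patterns": len(patterns),
        "unique_subjects": len(subjects),
        "unique_predicates": len(predicates),
        "unique_objects": len(objects),
        "unique_vocabularies": len({namespace(p) for p in predicates}),
    }


# Invented toy queries, not taken from the USEWOD logs.
toy_queries = [
    [("http://example.org/r/A", "http://example.org/o/p1", "?x")],
    [
        ("http://example.org/r/A", "http://example.org/o/p1", "?x"),
        ("?x", "http://example.org/v#p2", "http://example.org/r/B"),
    ],
    [("http://example.org/r/C", "http://example.org/o/p3", "?y")],
]

stats = log_statistics(toy_queries)
print(stats)
```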
Analyzing DBpedia usage patterns
Analyzing DBpedia usage patterns (Cont’d)
Questions?