A Semantic Knowledge Base for the UK Government Web Archive Tom Storrar & Claire Newing

Jan 13, 2016

Applying records management processes principles to the open government record.
Transcript
Page 1

A Semantic Knowledge Base for the UK Government Web Archive

Tom Storrar & Claire Newing

Page 2

Overview

• The National Archives’ Digital Strategy:

• An overview of the SKB project, including:

1. The Problem

2. The Solution

3. Next Steps

Page 3

Introducing the UK Government Web Archive

• More than 18,000 crawls of over 3,000 websites from 1996-2014

• Approximately 90 TB of data, 3.5 billion resources

• More than 875,000 ARC files

• More than 20 million pageviews and 2-3 million visits per month

Page 4

Page 5

Page 6

Who are our users and what do they want?

• User surveys on the website: all banners and index pages

• Established that UKGWA is regularly visited by a great variety of users.

• The biggest area of dissatisfaction was the existing search functions.

• We constructed user stories so we could test the improvements.

Page 7

Full Text Search – its limitations

Our full text search is very useful and heavily used, but it is:

• limited by the state of the live sites at crawl time

• noisy, as it contains much duplicate or near-duplicate material

• reliant on keyword matching

• most useful when combined with specialist knowledge
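The "noisy" point above can be made concrete with a small sketch. This is not how the UKGWA search handles duplicates; it is one common technique (word shingles compared by Jaccard similarity), and the function names and the 0.8 threshold are assumptions chosen for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def collapse_near_duplicates(pages, threshold=0.8):
    """Greedily group page ids whose shingle sets are at least `threshold` similar.

    `pages` maps a page id to its text. Each page is compared with the first
    member of every existing group and joins the first group that matches.
    """
    groups = []  # list of (representative shingle set, member ids)
    for page_id, text in pages.items():
        sh = shingles(text)
        for representative, members in groups:
            if jaccard(sh, representative) >= threshold:
                members.append(page_id)
                break
        else:
            groups.append((sh, [page_id]))
    return [members for _, members in groups]
```

Grouping repeated crawls of a barely changed page this way would let a results list show one entry per group rather than one per capture.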

Page 8

Semantic Search – What it allows

• Aim was to improve access to information in the UKGWA by providing far richer information about what it contains

• The semantic web is a start towards tackling a limitation of the web

• Becomes a dataset in its own right

• Borrows from and contributes to the web

• The technology is open and machine-readable. APIs allow the data to be easily queried and integrated with other services

• Awarded to a consortium led by Ontotext AD, the University of Sheffield and System Simulation

Page 9

UKGWA: a good candidate for semantic search?

• Each resource already has a persistent HTTP URI

• UKGWA is both limited and diverse

• Generic and domain-specific meanings can be attributed to otherwise loose terms, e.g:

• Facts can be modelled and refined to show the linkages between entities and how they change over time

• The 2010 general election was an opportunity to demonstrate the concept
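The idea of facts that link entities and change over time can be sketched as triples with a validity interval. The class, the `ex:` identifiers and the dates below are all hypothetical placeholders, not data or vocabulary from the SKB.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    """A subject-predicate-object statement that is true only for a period."""
    subject: str
    predicate: str
    obj: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the fact still holds

    def holds_on(self, day):
        # Half-open interval: valid_from is included, valid_to is excluded.
        return self.valid_from <= day and (self.valid_to is None or day < self.valid_to)


def facts_on(facts, subject, predicate, day):
    """Return the objects linked to `subject` by `predicate` on a given day."""
    return [
        f.obj
        for f in facts
        if f.subject == subject and f.predicate == predicate and f.holds_on(day)
    ]


# Hypothetical example: who heads a department before and after a changeover.
FACTS = [
    Fact("ex:DepartmentA", "ex:hasMinister", "ex:PersonX", date(2007, 6, 28), date(2010, 5, 12)),
    Fact("ex:DepartmentA", "ex:hasMinister", "ex:PersonY", date(2010, 5, 12)),
]
```

Asking the same question for two different days then gives two different answers, which is what lets a search relate an archived page to the entities as they stood at crawl time.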

Page 10

Making UKGWA semantic – How?

Image: Ontotext AD, University of Sheffield and System Simulation.

Page 11

What we learned and next steps

• We will deliver it as an internal system to develop further

• It’s not AI! 60-70% annotation accuracy is not bad at this scale!

• The concept can be difficult to explain, and even harder for those unfamiliar with computer science to use (SPARQL, etc.)

prefix skb: <http://proton.semanticweb.org/skb-ont#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
select distinct ?URL ?title where
{
  ?page <http://ordi.ontotext.com/sar#hasFeature> ?doc_feature .
  ?doc_feature <http://ordi.ontotext.com/sar#hasValue> ?URL .
  ?doc_feature <http://ordi.ontotext.com/sar#hasKey> "WEBARCHIVEURL" .
  ?page <http://proton.semanticweb.org/2006/05/protont#title> ?title .
  FILTER regex(str(?title), "Foot and Mouth", "i") .
  FILTER regex(str(?title), "Prime Minister", "i") .
  ?page <http://proton.semanticweb.org/2006/05/protont#hasDate>
}

• So, integrating the system with other services is a must.
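One way another service could hide SPARQL from its users is to build the query from plain search phrases and send it over the standard SPARQL 1.1 Protocol. The sketch below uses only the Python standard library. The endpoint URL is a hypothetical placeholder (the slides do not give one), the function names are illustrative, and the incomplete hasDate pattern from the slide is left out.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint: the slides do not give the SKB's SPARQL endpoint URL.
ENDPOINT = "http://example.org/skb/sparql"

# Namespaces taken from the query shown on the slide.
SAR = "http://ordi.ontotext.com/sar#"
PROTONT = "http://proton.semanticweb.org/2006/05/protont#"


def build_title_query(phrases):
    """Build a query like the slide's: pages whose title matches every phrase.

    Each phrase is used as a case-insensitive regex pattern; only the
    characters that would break a SPARQL string literal are escaped.
    """
    filters = "\n".join(
        '  FILTER regex(str(?title), "%s", "i") .'
        % p.replace("\\", "\\\\").replace('"', '\\"')
        for p in phrases
    )
    return f"""SELECT DISTINCT ?URL ?title WHERE {{
  ?page <{SAR}hasFeature> ?doc_feature .
  ?doc_feature <{SAR}hasValue> ?URL .
  ?doc_feature <{SAR}hasKey> "WEBARCHIVEURL" .
  ?page <{PROTONT}title> ?title .
{filters}
}}"""


def build_request(query, endpoint=ENDPOINT):
    """Prepare a SPARQL 1.1 Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def parse_bindings(payload):
    """Turn a SPARQL JSON results document into a list of {variable: value} dicts."""
    doc = json.loads(payload)
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in doc["results"]["bindings"]
    ]


def search_titles(phrases, endpoint=ENDPOINT):
    """Run the title search against an endpoint (performs a network call)."""
    request = build_request(build_title_query(phrases), endpoint)
    with urllib.request.urlopen(request) as response:
        return parse_bindings(response.read().decode("utf-8"))
```

With a real endpoint substituted, `search_titles(["Foot and Mouth", "Prime Minister"])` would issue the slide's search without the caller writing any SPARQL.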

Page 12

Any Questions?

Contact us: [email protected]

Visit: nationalarchives.gov.uk/webarchive