Digging Up Data: The Archaeotools project, Faceted Classification and Natural Language Processing in an archaeological context. Stuart Jeffrey, Julian Richards, Fabio Ciravegna, Stewart Waller, Sam Chapman, Ziqi Zhang, Tony Austin. UK e-Science All Hands Meeting, Edinburgh, 9th September 2008
Transcript
Page 1: Digging Up Data:

Digging Up Data: The Archaeotools project, Faceted Classification and Natural Language Processing in an archaeological context.

Stuart Jeffrey, Julian Richards, Fabio Ciravegna, Stewart Waller, Sam Chapman, Ziqi Zhang, Tony Austin. UK e-Science All Hands Meeting, Edinburgh, 9th September 2008

Page 2: Digging Up Data:

AHRC-EPSRC-JISC eScience research grants scheme:

AIM: To allow archaeologists to discover, share and analyse datasets and legacy publications which have hitherto been very difficult to integrate into existing digital frameworks

BUILDS UPON: Common Information Environment Enhanced Geospatial browser

PARTNERS: Natural Language Processing Research Group, Department of Computer Science, University of Sheffield

Joint Information Systems Committee

Page 3: Digging Up Data:

Three distinct Workpackages:

• Workpackage 1 – Advanced Faceted Classification / Geo-spatial browser – 1m+ records; 4 primary facets (What, Where, When and Media).

• Workpackage 2 – Natural language processing / Data-mining of Grey Literature; plus tagging

• Workpackage 3 – Data-mining of Historic Literature; plus geoXwalk

Page 4: Digging Up Data:

• Datasets include:
– National Monuments Records (Scotland, Wales, England)
– Excavation Index (EH)
– Archive Holdings
– Local Authority Historic Environment Records

• Thesauri include:
– Thesaurus of Monument Types (TMT)
– Thesaurus of Object Types
– MIDAS Period list
– UK Government list of administrative areas, County, District, Parish (CDP) – Not MIDAS

Page 5: Digging Up Data:

[Architecture diagram. Labelled components: Oracle RDBMS (input); MIDAS XML record (input); XML docs of thesaurus; information extraction; RDF resource; knowledge triple store; When, Where, What ontologies as entries to faceted index; query; user interface.]
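The idea of a triple store feeding a faceted index can be illustrated with a minimal sketch. It assumes a toy in-memory store (a Python set of tuples) rather than the store the project actually used; the record identifiers, the facet names `what`, `where`, `when` as predicates, the terms, and both function names are hypothetical.

```python
from collections import Counter

# Hypothetical triples: (record_id, facet, term). All values are made up.
TRIPLES = {
    ("rec1", "what", "BARROW"),
    ("rec1", "where", "YORK"),
    ("rec1", "when", "BRONZE AGE"),
    ("rec2", "what", "BARROW"),
    ("rec2", "where", "LEEDS"),
    ("rec3", "what", "VILLA"),
    ("rec3", "where", "YORK"),
    ("rec3", "when", "ROMAN"),
}

def select(triples, **constraints):
    """Return the ids of records that match every facet=term constraint."""
    ids = {s for s, _, _ in triples}
    for facet, term in constraints.items():
        ids &= {s for s, p, o in triples if p == facet and o == term}
    return ids

def facet_counts(triples, facet, within):
    """Count the terms of one facet among the selected records."""
    return Counter(o for s, p, o in triples if p == facet and s in within)

hits = select(TRIPLES, where="YORK")
print(sorted(hits))
print(facet_counts(TRIPLES, "what", hits))
```

Narrowing by one facet and then counting the terms of another is what lets a faceted browser show how many records remain under each term after a selection.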

Page 6: Digging Up Data:
Page 7: Digging Up Data:
Page 8: Digging Up Data:
Page 9: Digging Up Data:
Page 10: Digging Up Data:
Page 11: Digging Up Data:

“WHAT”

• Records that have no subject information

• Records that use terms not found in TMT, so these records cannot be indexed (6,442 unique terms)

Records (1,001,407): 19,269 records (2%)

Records (1,001,407): 101,507 records (10.1%)
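The two failure cases on this slide (no subject term at all, or only terms missing from the thesaurus) can be sketched as a simple lookup. This is an illustration under stated assumptions, not the project's code: the three-term `THESAURUS` is a hypothetical stand-in for TMT, the labels are invented, and treating a record as indexable when at least one of its terms is found is an assumption.

```python
from collections import Counter

# Hypothetical stand-in for the thesaurus; the real TMT is far larger.
THESAURUS = {"BARROW", "VILLA", "HILLFORT"}

def classify(subject_terms, thesaurus=THESAURUS):
    """Label a record as 'no_subject', 'unindexable' or 'indexed'."""
    # Normalise case and drop empty strings before the lookup.
    terms = [t.strip().upper() for t in subject_terms if t and t.strip()]
    if not terms:
        return "no_subject"
    # Assumption: one recognised term is enough to index the record.
    if not any(t in thesaurus for t in terms):
        return "unindexable"
    return "indexed"

# Made-up records, each a list of subject terms.
records = [["barrow"], [], ["Villa", "bath house"], ["earthwork?"], [""]]
summary = Counter(classify(r) for r in records)
for label, n in summary.items():
    print(f"{label}: {n} ({100 * n / len(records):.1f}%)")
```

The percentages on the slide follow the same arithmetic: 19,269 / 1,001,407 is about 1.9%, shown as 2%, and 101,507 / 1,001,407 is about 10.1%.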

Page 12: Digging Up Data:

“WHEN”

• Records that have no temporal information

• Records that use period terms not found in MIDAS so these records cannot be indexed (457 types of irresolvable dates)

Records (1,001,407): 292,793 records (29.2%)

Records (1,001,407): 114,505 records (11.4%)

1066, 1001-1100, 11th Centuary, C11, 11C, Eleventh Century
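The variant date forms listed above could, in principle, be reduced to a common year range before indexing. The following is a minimal sketch of that idea, not the method used in Archaeotools: the function names, the regular expressions, and the reading of "C11" and "11C" as "11th century" are all assumptions, and the loose `cent\w*` match is only there to tolerate misspellings such as "Centuary".

```python
import re

# Ordinal words for the 1st to 20th centuries.
ORDINALS = {
    word: n
    for n, word in enumerate(
        "first second third fourth fifth sixth seventh eighth ninth tenth "
        "eleventh twelfth thirteenth fourteenth fifteenth sixteenth "
        "seventeenth eighteenth nineteenth twentieth".split(),
        start=1,
    )
}

def century_span(n):
    """Return (start_year, end_year) of the nth century AD: 11 -> (1001, 1100)."""
    return (100 * (n - 1) + 1, 100 * n)

def normalise(text):
    """Map a date expression to an inclusive (start, end) year range, or None."""
    s = text.strip().lower()
    m = re.fullmatch(r"(\d{1,4})\s*-\s*(\d{1,4})", s)    # "1001-1100"
    if m:
        return (int(m.group(1)), int(m.group(2)))
    if re.fullmatch(r"\d{3,4}", s):                       # "1066"
        return (int(s), int(s))
    m = re.fullmatch(r"c(\d{1,2})|(\d{1,2})c", s)         # "C11" or "11C"
    if m:
        return century_span(int(m.group(1) or m.group(2)))
    m = re.fullmatch(r"(\d{1,2})(?:st|nd|rd|th)\s+cent\w*", s)  # "11th Centuary"
    if m:
        return century_span(int(m.group(1)))
    m = re.fullmatch(r"([a-z]+)\s+cent\w*", s)            # "Eleventh Century"
    if m and m.group(1) in ORDINALS:
        return century_span(ORDINALS[m.group(1)])
    return None  # unresolved: left for manual review

for raw in ["1066", "1001-1100", "11th Centuary", "C11", "11C", "Eleventh Century"]:
    print(raw, "->", normalise(raw))
```

Returning `None` for anything unrecognised keeps irresolvable expressions visible instead of silently guessing a period.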

Page 13: Digging Up Data:

“WHERE”

• Records that have no spatial information

• Records that use terms not found in CDP, so these records cannot be indexed.

Records (1,001,407): 11,126 records (1.1%)

Records (1,001,407): 245,601 records (24.5%)

Page 14: Digging Up Data:
Page 15: Digging Up Data:


Page 16: Digging Up Data:

Three distinct Workpackages:

• Workpackage 1 – Advanced Faceted Classification / Geo-spatial browser – 1m+ records; 4 primary facets (What, Where, When and Media).

• Workpackage 2 – Natural language processing / Data-mining of Grey Literature; plus tagging

• Workpackage 3 – Data-mining of Historic Literature; plus geoXwalk

Page 17: Digging Up Data:
Page 18: Digging Up Data:
Page 19: Digging Up Data:
Page 20: Digging Up Data:

XML tagging of semantic content

CIDOC: CRM

Page 21: Digging Up Data:

Information Extraction in Archaeotools

• What (subject)
• Where (place name)
• When (temporal info)
• Grid reference (easting and northing)
• Report title
• Report creator
• Report publisher
• Report publisher contact
• Report publication date
• Event date
• Bibliography & references

Page 22: Digging Up Data:

Example annotations in highlighted colours are positive examples. Un-annotated texts are negative examples.

Features of this annotation:
• first_letter_capitalised: true
• word_found_in_gazetteer: true
• preceded_by: the
• followed_by: period
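The four features named on this slide can be computed for any token from the token itself and its immediate neighbours. The sketch below is illustrative only: the feature names come from the slide, but the `GAZETTEER` contents, the function name and the example sentence are hypothetical.

```python
# Hypothetical gazetteer of period names; a real one would come from a thesaurus.
GAZETTEER = {"roman", "medieval", "neolithic"}

def token_features(tokens, i, gazetteer=GAZETTEER):
    """Build the feature dict for tokens[i] from the token and its neighbours."""
    word = tokens[i]
    return {
        "first_letter_capitalised": word[:1].isupper(),
        "word_found_in_gazetteer": word.lower() in gazetteer,
        # Neighbouring tokens, or None at the start or end of the sentence.
        "preceded_by": tokens[i - 1].lower() if i > 0 else None,
        "followed_by": tokens[i + 1].lower() if i + 1 < len(tokens) else None,
    }

tokens = "Pottery of the Roman period was found".split()
print(token_features(tokens, 3))
# {'first_letter_capitalised': True, 'word_found_in_gazetteer': True,
#  'preceded_by': 'the', 'followed_by': 'period'}
```

A learner would receive one such feature dict per token, labelled positive if the token was annotated and negative otherwise.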

Page 23: Digging Up Data:

Rule-based systems are good at extracting information that matches simple patterns and/or occurs in regular contexts, and are therefore applied to:

• Grid reference (easting and northing)
• Report title*
• Report creator*
• Report publisher*
• Report publication date*
• Report publisher contact
• Bibliography & references

Machine learning is good at extracting information that cannot be matched by patterns, that occurs in irregular contexts, or that occurs in large amounts, and is therefore applied to:

• What (subject)
• Where (place name)
• When (temporal info)
• Event date
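Grid references are a natural fit for rules because they follow a fixed pattern. The sketch below assumes Ordnance Survey style references (two grid letters followed by equal-length easting and northing digits, written together or space-separated); the regular expression and function name are hypothetical and are not taken from the project.

```python
import re

# Two grid letters (the letter I is not used), then digits written either
# together ("TQ30008000") or as two space-separated groups ("SE 603 518").
CANDIDATE = re.compile(r"\b([A-HJ-Z]{2})\s?(\d{2,10})(?:\s(\d{2,5}))?\b")

def find_grid_refs(text):
    """Return (square, easting, northing) tuples for plausible grid references."""
    refs = []
    for m in CANDIDATE.finditer(text):
        square, first, second = m.groups()
        if second is None:
            # Unspaced form: the digits must split evenly into two halves.
            if len(first) % 2:
                continue
            half = len(first) // 2
            first, second = first[:half], first[half:]
        # Easting and northing must have the same precision, at most 5 digits.
        if len(first) != len(second) or len(first) > 5:
            continue
        refs.append((square, first, second))
    return refs

print(find_grid_refs("Site at SE 603 518 near York; also NGR TQ30008000."))
```

A pattern alone over-generates: a string such as "AD 1066" also fits it, which is why the surrounding context matters even for rule-based extraction.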

Page 24: Digging Up Data:

Three distinct Workpackages:

• Workpackage 1 – Advanced Faceted Classification / Geo-spatial browser – 1m+ records; 4 primary facets (What, Where, When and Media).

• Workpackage 2 – Natural language processing / Data-mining of Grey Literature; plus tagging

• Workpackage 3 – Data-mining of Historic Literature; plus geoXwalk

Page 25: Digging Up Data:
Page 26: Digging Up Data:
Page 27: Digging Up Data:

http://ads.ahds.ac.uk/project/archaeotools/