CODA – CATCHPlus Open Document Annotation
Hennie Brugman
OAC II Project Review meeting
Chicago – July 26-27, 2012
Annotation context
• Audiovisual – ASR, language, gesture, oral history
• Text – Semantic annotation
• Music – lyrics, music notation
• Linguistic Annotation – named entities
• Image annotation
• Programs: CATCH, CATCHPlus, CLARIN
CODA main use cases
• Queen’s Cabinet (Henny van Schie/National Archive, Lambert Schomaker/Univ Groningen)
– Line strip and word zone annotations
– ML: search in manuscript images
– Add Named Entity annotations
• Sailing Letters (Nicoline van der Sijs/Meertens + consortium, Lambert Schomaker)
– Support manual annotation
– Line strip detection service
Line annotation tools (catchplus)
<txt>godefroit</txt>
<id>navis-SAL7316_0195-line-026-y1=2094-y2=2317-zone-HUMAN-x=1145-y=105-w=315-h=116-unshear=0.0-version=ortho</id>
<user>mceunen</user>
<time>Wed Jan 26 16:37:01 2011</time>
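The identifier in the record above packs several fields into one string. A minimal sketch of unpacking it follows, assuming (this is not stated on the slide) that `key=value` pairs are separated by hyphens, that values are never negative, and that `x`, `y`, `w`, `h` describe the word zone box. The function name `parse_line_id` is hypothetical.

```python
import re

def parse_line_id(identifier):
    """Split a line-annotation id into its parts (field meanings are assumed)."""
    # Collect every hyphen-delimited key=value pair, e.g. "y1=2094" or "w=315".
    fields = dict(re.findall(r"(?<=-)([A-Za-z]\w*)=([^-]+)", identifier))
    line = re.search(r"-line-(\d+)", identifier)
    zone = re.search(r"-zone-([A-Za-z]+)", identifier)
    return {
        "line": int(line.group(1)) if line else None,
        "zone_source": zone.group(1) if zone else None,
        # Assumed to be the word zone box: left, top, width, height.
        "box": tuple(int(fields[k]) for k in ("x", "y", "w", "h")),
        "fields": fields,
    }

example_id = ("navis-SAL7316_0195-line-026-y1=2094-y2=2317-zone-HUMAN"
              "-x=1145-y=105-w=315-h=116-unshear=0.0-version=ortho")
parsed = parse_line_id(example_id)
```

A real converter would need the tool's own documentation of the id scheme; the regular expressions here only cover the one example shown.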
OAC representation
[Diagram: OAC graph relating an ImageAnnotation and TextAnnotations on Canvas1. Node labels: imageScan.jpg, linestrip.jpg, ia:1, ia:2, page:0, zone:2, line:1, ct:1, ct:2, cb:1, cb:2, ib:0, Named Entity, and the inline text (cnt:chars) “Dit is een beschrijving van DenHaag. En dit is een tweede zin.” Edge labels: hasBody, hasTarget, constrains.]
OAC representation – Named Entities
[Diagram: OAC graph relating an ImageAnnotation, TextAnnotations and an EntityAnnotation on Canvas1. Node labels: imageScan.jpg, ia:1, ta:0, ta:1, ta:2, ea:1, ct:1, ct:2, ct:3, ct:4, cb:1, cb:2, ib:0, ib:1, the inline text (cnt:chars) “Dit is een beschrijving van DenHaag. En dit is een tweede zin.”, and the inline text (cnt:chars) “location”. Edge labels: hasBody, hasTarget, constrains.]
! Annotation of annotations?
! Annotation of segments of inline text?
InlineTextConstraint:
<rdf:Description rdf:about="urn:uuid:533624bb-d565-40ba-a14a-2e95c19c20df">
  <rdf:type rdf:resource="http://www.openannotation.org/ns/ConstrainedTarget"/>
  <constrains xmlns="http://www.openannotation.org/ns/"
      rdf:resource="http://oas.dev.seecr.nl:8000/resolve/urn%3Auuid%3Ad8741024-18bf-40a8-a648-2cd5ebb9acfd"/>
  <constrainedBy xmlns="http://www.openannotation.org/ns/"
      rdf:resource="urn:uuid:4f6b7d34-2329-4ab6-be89-a0feec9e7208"/>
</rdf:Description>
<rdf:Description rdf:about="urn:uuid:4f6b7d34-2329-4ab6-be89-a0feec9e7208">
  <rdf:type rdf:resource="http://www.openannotation.org/ns/Constraint"/>
  <rdf:type rdf:resource="http://www.catchplus.nl/annotation/InlineTextConstraint"/>
  <rdf:type rdf:resource="http://www.w3.org/2008/content#ContentAsText"/>
  <chars xmlns="http://www.w3.org/2008/content#">"<textsegment offset="279" range="2"/>"</chars>
  <characterEncoding xmlns="http://www.w3.org/2008/content#">UTF-8</characterEncoding>
</rdf:Description>
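To make the `textsegment` constraint concrete, here is a minimal sketch of applying one to an inline text. It assumes that `offset` is a zero-based character index and that `range` is a length in characters; the slides do not define either, and the helper name `apply_text_segment` is hypothetical. The sample sentence is the one used in the diagrams.

```python
import xml.etree.ElementTree as ET

def apply_text_segment(text, segment_xml):
    """Return the substring selected by a <textsegment offset=".." range=".."/> element."""
    segment = ET.fromstring(segment_xml)
    offset = int(segment.get("offset"))   # assumed: zero-based start index
    length = int(segment.get("range"))    # assumed: number of characters
    if offset < 0 or offset + length > len(text):
        raise ValueError("segment falls outside the text")
    return text[offset:offset + length]

inline_text = "Dit is een beschrijving van DenHaag. En dit is een tweede zin."
selected = apply_text_segment(inline_text, '<textsegment offset="28" range="7"/>')
```

Under these assumptions the segment above selects the place name in the first sentence, which is the kind of span a Named Entity annotation would target.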
KdK-2-OAC conversion
• Implicit line and page text
• Word and line order
• Text offsets and ranges
• Spatial information
• Identifiers and ‘annotatability’
• Redundant text for searchability
! Need for explicit representation of Sequence?
! Search on text of ConstrainedTarget/Body?
KdK2OAC conclusions
• Bidirectional mapping is possible
• Compatible with SharedCanvas model
• OAC + Canvas links everything together
• Implicit information made explicit
• Supports alternative text segmentations
• OAC representation is extremely verbose
! For many annotation tasks OA may be overkill
Open Annotation Service (OAS)
• Upload annotation RDF using SRU/Update
• Inlines external text and XML Bodies and authors
• Indexes OA and DC properties
• Assigns resolvable http URIs and resolves those
• Implementation: RDF store in combination with Solr, built from production-quality software components (Meresco)
• Built-in OAI-PMH data provider and harvester for ‘annotation sets’
• Query: SRU/CQL, SPARQL, OAI-PMH
• Simple management dashboard (authentication and authorization, collection management, harvesting)
• Easy installation and Open Source
! Model does not support Annotation “sets”
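As an illustration of the SRU/CQL query route listed above, the sketch below only builds a searchRetrieve request URL. The endpoint `http://oas.example.org/sru`, the SRU version, and the CQL index name `dc.creator` are assumptions for illustration and are not taken from the OAS itself; `sru_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def sru_search_url(base_url, cql_query, maximum_records=10):
    """Build an SRU searchRetrieve URL for a CQL query (no request is sent)."""
    params = {
        "version": "1.1",                 # assumed SRU version
        "operation": "searchRetrieve",
        "query": cql_query,
        "maximumRecords": str(maximum_records),
    }
    return base_url + "?" + urlencode(params)

# Hypothetical endpoint and index name.
url = sru_search_url("http://oas.example.org/sru", 'dc.creator = "mceunen"')
query_back = parse_qs(urlsplit(url).query)
```

The index names that can actually be queried depend on which OA and DC properties a given installation indexes.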
OAS: issues
• Annotation publication
• Searchability: ‘harvest and index’
• Text search on external bodies
• Annotation boundaries
• ‘Bypassing’ oac:constrains
! In RDF, what are the boundaries of an annotation?
Entity Recognition service
[Diagram: components labelled service, frog, converter and OAS; flows labelled URL or text, resolve, source_text, FoLiA_document, URL or ID, and entity annotations.]
‘frog’ and FoLiA
• ‘Frog’ tool generates FoLiA XML document with
– Segmentation of text in paragraphs, sentences and words (tokens) – XML hierarchy
– Part of speech, lemma, morphology, chunking, dependency structure and named entities
• Mix of inline and standoff annotation
– ‘Frog’ does not keep track of character offsets
– Explicit ordering: numbering system in ids
• Trained for Dutch
• Widely used for Dutch corpora
• Made available by: ILK @ Tilburg University
FoLiA-2-OAC conversion
• Reconstruct character offsets after tokenization
• Operates on inline text as published by OAS
• Construct and add entity text from tokens + sequence (the+hague != hague+the)
• Two approaches
1. Minimal: extract entity annotations and tokens, and convert to OAC
2. Maximal: full conversion to OAC
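The offset reconstruction step can be sketched as a left-to-right alignment of the tokens against the source text. This is only an illustration, assuming every token occurs verbatim in the text and in tokenizer order; it is not the converter's actual code, and the names `align_tokens` and `entity_span` are hypothetical.

```python
def align_tokens(text, tokens):
    """Return (offset, length) for each token by scanning the text left to right."""
    spans = []
    cursor = 0
    for token in tokens:
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found after position {cursor}")
        spans.append((start, len(token)))
        cursor = start + len(token)  # never match earlier text again: order matters
    return spans

def entity_span(spans, first, last):
    """Offset and length covering tokens first..last (inclusive), in sequence."""
    start = spans[first][0]
    end = spans[last][0] + spans[last][1]
    return start, end - start

text = "Hij woont in Den Haag."
tokens = ["Hij", "woont", "in", "Den", "Haag", "."]
spans = align_tokens(text, tokens)
offset, length = entity_span(spans, 3, 4)   # tokens "Den" and "Haag"
entity_text = text[offset:offset + length]
```

Because the scan only moves forward, the token sequence determines the result: the same tokens in a different order would either fail to align or select different text.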
Linguistic Annotation
! Mix-in domain semantics as subtypes/subproperties?
! Maximal OA mapping or embed linguistic standards?
! Layers, hierarchies (syntax) and Documents
! Sequence (e.g. entities, morpheme breakup)
Synchronized viewing client demo
• Demo/screenshot
Summary of OA issues
! Annotation of annotations?
! Annotation of segments of inline text?
! Need for explicit representation of Sequence?
! Search on ConstrainedTarget/Body?
! For many annotation tasks OA may be overkill
! Model does not support Annotation sets
! In RDF, what are the boundaries of an annotation?
Future work
• Finalize and integrate software (with web services)
• Upgrade to new OA spec (incl OAS)
• Line strip detection web service
• Possible applications
– AV annotation in CATCHPlus
– Nederlab
Questions?