Dynamic Data in the humanities
Marc Kemps-Snijders
EUDAT Dynamic Data
Amsterdam
September 25th 2014
Dynamic data approach
[Figure axes: observation time and archive ingest time, with a 45° diagonal]
Ideally, data is stored the moment it is observed
Usually, data arrives late
... or never at all
From:
EUDAT meeting
September 2013
Barcelona
Over 2 M pages in 10,000 books
from the period 1781 to 1800
Over 84 M unique articles
from 1618 to 1996
20 B words
Digitization started around 2000
- Scientists and general public
Provide accurately dated title, author and geographical information
85,957 titles
92,276 authors
157,432 dependent titles
Creating uniformity and standardization across
heterogeneous collections
Data behaviour in a humanities
Virtual Research Environment
[Figure axes: time and archive ingest time]
Book
1623
Sept 2013
SAME record
Phenomena are recorded in a single record
(metadata)
Author
1587-1679
Data arrives VERY late
Records are often related,
e.g. books and authors
Data needs to be curated ...
Metadata curation example 1.
Sometimes authors appear twice in our system, e.g. due to spelling
variants or name variants.
In the 16th century authors sometimes published under their motto
rather than their own name
Example:
• “Liefde verwinnet al” (Love conquers all)
• “door Eén is 't nu voldaen” (by One it is all done now)
Joost van den Vondel
17 November 1587
5 February 1679 (aged 91)
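A minimal sketch of how such duplicates might be merged, assuming a hand-curated alias table that maps every known spelling variant or motto to one canonical author record. The identifier `vondel`, the helper names, and the assumption that both mottos above belong to Vondel (as the slide suggests) are illustrative and are not taken from the actual system.

```python
import unicodedata
from typing import Optional


def normalize(name: str) -> str:
    """Lower-case, drop accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    kept = [c for c in decomposed if c.isalnum() or c.isspace()]
    return " ".join("".join(kept).lower().split())


# Hypothetical alias table: every variant points at one canonical author id.
ALIASES = {
    normalize(alias): "vondel"
    for alias in (
        "Joost van den Vondel",
        "J. van den Vondel",
        "Liefde verwinnet al",
        "door Eén is 't nu voldaen",
    )
}


def resolve_author(name: str) -> Optional[str]:
    """Return the canonical author id, or None if the name is unknown."""
    return ALIASES.get(normalize(name))
```

Records whose author field resolves to the same id can then be linked to a single author record; names that resolve to `None` are left for manual curation.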
Versioning and reproducibility
[Figure axes: time and archive ingest time, with a 45° diagonal]
Het lof der zeevaert
Poem
1623
Oct 2013
Jan 2014
Joost van den Vondel
1587-1679
Lucifer
Drama
1654
Reproducibility prevents objects
from being thrown away
Query 1: How many titles are available for Vondel?
Answer: one
Query 1: How many titles are available for Vondel?
Answer: two
Query 1 is not reproducible
Add Archive Ingest time stamp
Add expiration time stamp
Select title where ArchiveIngestTime(title) < ArchiveIngestTime(query)
and ExpirationTime(title) > ArchiveIngestTime(query)
AIT:Oct 2013
Exp: Jan 2014
AIT:Jan 2013
Exp: -
AIT: Nov 2013
Exp: -
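The time-stamp selection above can be sketched as follows. This is an illustration under assumptions, not the system's implementation: the class and function names are hypothetical, the dates are illustrative (loosely following the Oct 2013 and Jan 2014 ingest dates in the figure), and a missing expiration ("Exp: -") is modelled as `None`, meaning the version is still current.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class TitleVersion:
    title: str
    ingested: date                  # archive ingest time (AIT)
    expired: Optional[date] = None  # None: no expiration set


def titles_as_of(versions: List[TitleVersion], query_time: date) -> List[str]:
    """Titles ingested before query_time and not yet expired at query_time."""
    return [
        v.title
        for v in versions
        if v.ingested < query_time
        and (v.expired is None or v.expired > query_time)
    ]


# Illustrative archive contents (day-of-month values are assumptions).
ARCHIVE = [
    TitleVersion("Het lof der zeevaert", date(2013, 10, 1)),
    TitleVersion("Lucifer", date(2014, 1, 1)),
]

first_query = date(2013, 11, 15)  # when Query 1 was first asked
later_query = date(2014, 2, 1)    # when it was asked again
```

Re-running the query with its original time stamp (`first_query`) against the grown archive still returns one title, while the same query stamped `later_query` returns two. Storing the query's time stamp alongside the query is what makes the earlier answer reproducible.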
Data curation example 2.
Editions are to be split
up into source texts
and editorial paratexts
Published 1987
Published 1623
Published 1613
J. Van den Vondel
Twee zeevaart gedichten
Marijke Spies
Joost van den Vondel
Lof der zee-vaert
Hymnus…..
Editions provide an additional challenge
• Recently published
• Consist of fragments of modern
and old Dutch
Data curation example 3.
OCR digitized newspaper articles
sometimes prove to be of poor
quality, e.g.
• Older articles
• WW II articles
Published June 14th 1618
VVt Venetien den 1.Iunij, Anno 1618.
'sx En 25 Mssaro is 3^ adviseert wozdel,/van H^ .,et gcyor uerracc aihi. r / twclcn
l> / zynde vele d« r srlver gtlustlreert duer onder eeulghe Franc0i>scn/die stch,net
deSpaellschcn cndc eenlghen d.ftr «lödellupden verdondcn dcse Stadt aen 50
ende meer in bzam lc stchen/ ende re plunderen ghelncllmendanaense» her plaetsen
de met vicrwerr heest glMonden/het w l̂ctle ccnc hunner mede gesellen el n deser
mlldccllr heeft / den welc- Kcn sp 2f.duuscnt ducaten hebben vereen: Als sulckr
hebben vernomen/znnderbp 70l>.wechghtloopcn Doch vanglvanzihcn / ende dcse
40. uan V^vua al hier ghcdzacht^oock noch dagnelhcnr van daer ende Verona/
Bcrgamo / en andere plaetsen ghevanckellicn gcvzacht werden: dese ol>ser Salien
dledacr toe gheholpen/zijn des nachts van wegen harcr grooter vrienden ver»
wo)5en/cut>c Komen daghclc)cllt noch Wouderlycne sanen aen den dach / sonderltjcnen
dat deSpaensche dele Stadt alsomncme wilde
Crowdsourcing projects are underway
to provide accurate
transcriptions
Collaboration with Royal Library
VVt Venetien den I.Iunij, Anno 1618.
DEn 25. Passato is geadviseert worden, van het groot verraet alhier, 't welck
ontdeckt is, zijnde vele der selver gerusticeert daer onder eenighe Francoysen
die sich met de Spaenschen ende eenighen deser Edelluyden verdonden dese
Stadt aen 50 plaetsen ende meer in brant te steken, ende te plonderen,
ghelijck men dan aen seker plaetsen by de 50. potten met vierwerc heeft
ghevonden, het welcke eene hunner mede gesellen aen deser Seign. ontdeckt
heeft, den welcken sy 25. duysent ducaten hebben vereert: Als sulckx die
andere hebben vernomen, zijnder by 700; Wech gheloopen. Doch 20. daer van
gevanghen, ende dese daghen 40. van Padua al-hier ghebracht, oock noch
daghelijckx van daer ende Verona, Vicenza, Bergamo, ende andere plaetsen
ghevanckelijck gebracht werden: dese onser Natien alhier die daer toe
gheholpen, zijn des nachts van wegen harer grooter vrienden verdroncken
worden, ende komen daghelijckx noch wonderlijcke saken aen den dach,
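One common way to quantify how far an OCR result is from a corrected transcription is the character error rate: the edit distance between the two texts divided by the length of the reference. The sketch below is an assumption about how this could be measured, not a description of the project's tooling; the function names are hypothetical. It uses the dateline of the article above, which is the least damaged line of the OCR output.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(ocr: str, reference: str) -> float:
    """Edit distance relative to the length of the reference text."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(ocr, reference) / len(reference)


ocr_line = "VVt Venetien den 1.Iunij, Anno 1618."
reference_line = "VVt Venetien den I.Iunij, Anno 1618."
print(character_error_rate(ocr_line, reference_line))
```

A measure like this could help decide which articles to prioritise for manual transcription.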
Annotations
Linguistic annotations are at the heart of scientific data processing,
e.g. Part of Speech tagging, Named Entity Recognition, Syntactic
analysis, Coreference, Semantic Role Labeling.
Ga er nog eens op uit in Amsterdam!
1 Ga gaan [ga] WW(pv,tgw,ev) 0.993151 0 ROOT
2 er er [er] VNW(aanw,adv-pron,stan,red,3,getal) 0.972222 1 mod
3 nog_eens nog_eens [nog]_[eens] BW()_BW() 0.980727 1 mod
4 op op [op] VZ(fin) 0.920000 1 pc
5 uit uit [uit] VZ(fin) 0.936170 4 hdf
6 in in [in] VZ(init) 0.998321 1 mod
7 Amsterdam Amsterdam [Amsterdam] SPEC(deeleigen) 1.000000 0 ROOT
8 ! ! [!] LET() 0.995005
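A short sketch of how rows like these might be read into a structure for later correction. The column interpretation (index, word, lemma, morphology, POS tag, confidence, head, dependency relation) is an assumed reading of the output shown above, and the review threshold of 0.95 is an arbitrary illustrative value.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    index: int
    word: str
    lemma: str
    morphology: str
    pos: str
    confidence: float
    head: Optional[int]      # missing on the final punctuation row
    relation: Optional[str]


def parse_row(row: str) -> Token:
    """Parse one whitespace-separated annotation row (assumed column order)."""
    fields = row.split()
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 columns, got {len(fields)}")
    head = int(fields[6]) if len(fields) > 6 else None
    relation = fields[7] if len(fields) > 7 else None
    return Token(int(fields[0]), fields[1], fields[2], fields[3],
                 fields[4], float(fields[5]), head, relation)


def needs_review(tokens: List[Token], threshold: float = 0.95) -> List[Token]:
    """Tokens whose tagger confidence falls below the (assumed) threshold."""
    return [t for t in tokens if t.confidence < threshold]


ROWS = [
    "1 Ga gaan [ga] WW(pv,tgw,ev) 0.993151 0 ROOT",
    "4 op op [op] VZ(fin) 0.920000 1 pc",
    "5 uit uit [uit] VZ(fin) 0.936170 4 hdf",
    "8 ! ! [!] LET() 0.995005",
]
tokens = [parse_row(row) for row in ROWS]
```

Sorting out low-confidence tokens first is one possible way to focus manual correction effort.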
Most tools need to be trained or are designed to deal with specific
language periods (commonly modern language).
The result often needs to be manually corrected.
Interoperability across tools is often an issue (tagsets and processing methods).
Lemma="Amsterdam"
Frog: Postag=SPEC(deeleigen)
Alpino: Postag=N(eigen,ev,basis,onz,stan)
Word="Ga"
Frog: Lemma="ga"
Alpino: Lemma="uit_gaan"
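To compare output across tools, the tags can be split into a head category and a feature list and then mapped onto a shared coarse category. The sketch below assumes the `HEAD(feature,...)` tag shape seen above. The `COARSE` table is a hypothetical two-entry mapping that treats both tags for "Amsterdam" as a proper name, not an established tagset conversion.

```python
import re
from typing import List, Optional, Tuple

TAG_PATTERN = re.compile(r"^(?P<head>[A-Z]+)\((?P<features>[^()]*)\)$")


def split_tag(tag: str) -> Tuple[str, List[str]]:
    """Split 'N(eigen,ev)' into ('N', ['eigen', 'ev'])."""
    match = TAG_PATTERN.match(tag.strip())
    if match is None:
        raise ValueError(f"unrecognised tag format: {tag!r}")
    features = [f for f in match.group("features").split(",") if f]
    return match.group("head"), features


# Hypothetical harmonization table keyed on (head, first feature).
COARSE = {
    ("SPEC", "deeleigen"): "proper name",
    ("N", "eigen"): "proper name",
}


def coarse_category(tag: str) -> Optional[str]:
    """Map a tool-specific tag to a shared category, or None if unmapped."""
    head, features = split_tag(tag)
    first = features[0] if features else ""
    return COARSE.get((head, first))
```

With this mapping, `coarse_category("SPEC(deeleigen)")` and `coarse_category("N(eigen,ev,basis,onz,stan)")` agree, even though the tool-specific tags differ. Differences in lemmatization, such as "ga" versus "uit_gaan", stem from different processing methods and would need separate handling.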
Annotations
Ideally produce
• Training corpora (manually corrected)
• Preprocessed annotated data (sometimes using different tools)
• (Manually) corrected annotated data
Used training corpus
Processed resource
Based on
Manually corrected
Book
Training corpus
BookAnnotation
Annotation
Manually corrected Training corpus
e.g. from the same time period
Nederlab Virtual Research Environment
With over 37.5 M documents and 1,277,188,758 words currently
available in the environment, this becomes quite a difficult
process to manage.
And we have ongoing discussions on acceptable methods for
maintaining this environment over prolonged periods of time.
• How to handle dynamic behaviour of data?
• Under which conditions can data be phased out?
• Should ALL data be integrated into the environment?
At least for metadata management, a separate editorial environment
has been set up to limit the number of potential updates (and
versions) in the system.
Nederlab Virtual Research Environment
Over 2 M pages in 10,000 books
from the period 1781 to 1800
Harmonization tool
Metadata editor
VRE
Concluding remarks
Efficient versioning appears to be the key to dynamic data
management
• Maintain version history
• Assign appropriate time stamps
• When dealing with large quantities of data, decide on criteria
for phasing data out
• When dealing with heterogeneous collections from different
sources, including automated enrichment processes, great care
must be taken to maintain overall data integrity
– Both data and metadata may be affected
– Must be evaluated on a case by case basis
– In our domain, data dynamics are not limited to a single project or
organization! Data may originate from different overlapping sources, and
different approaches may have been applied (e.g. data enrichment
processes)
Thank you for your attention