S. Haaf: Text Type Classification for the Historical DTA Corpus
Text Type Classification for the Historical DTA Corpus
Susanne Haaf
Deutsches Textarchiv, BBAW – Berlin
NeDiMAH-CLARIN-Workshop Exploring Historical Sources with Language Technology:
Results and Perspectives
About the Project
• Deutsches Textarchiv / German Text Archive (DTA)
• Funding:
• Partner:
• Duration: 2007-2014/15
• Goal:
– Provide the basis for a reference corpus for the development of the New High German language (17th to 19th century)
About the Project
• Ca. 1,500 texts from different disciplines and text types
• Automatic linguistic analysis (lemmatization, tokenization, POS-tagging, orthographic normalization)
• DTA 'Base Format'
• Guidelines for transcription close to the source
• Structural XML annotation according to TEI P5
• Guidelines for metadata entry
• Web-based quality assurance
• DTA-Extensions
• Integration of historical text data from other project contexts
• Curation and collection of diverse text resources
The DTA Bibliography
• Selection of works for the DTA core corpus: fixed bibliography
• The bibliography was created with the help of BBAW members, i.e. experts in (the history of) different (scientific) disciplines
• Requirements for the Selection
  – reflect the diversity of text types …
  – … at different points in time
  – represent works which were
    • Important for the scientific field
    • Or: Widely recognised (i.e. of huge public influence)
    • Or even: Not very influential
Genuinely lexicographic approach
• Phase 3: New selection of another 200 works
  Filling gaps …
  – … considering time
  – … considering text type
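One way to picture "filling gaps considering time and text type" is to count texts per (century, text type) cell and list the cells that are still empty. The following is only an illustrative sketch, not the DTA's actual selection procedure: the function `find_gaps`, the field names `century` and `text_type`, and the labels `type_a` and `type_b` are all hypothetical.

```python
from collections import Counter
from itertools import product


def find_gaps(texts, periods, text_types, minimum=1):
    """Return (period, text_type) cells holding fewer than `minimum` texts."""
    counts = Counter((t["century"], t["text_type"]) for t in texts)
    return [cell for cell in product(periods, text_types) if counts[cell] < minimum]


# Hypothetical mini-corpus; "type_a" and "type_b" are placeholder labels,
# not categories from the DTA classification.
corpus = [
    {"century": 17, "text_type": "type_a"},
    {"century": 18, "text_type": "type_a"},
    {"century": 18, "text_type": "type_b"},
]

gaps = find_gaps(corpus, periods=[17, 18, 19], text_types=["type_a", "type_b"])
print(gaps)
```

Cells reported by such a check would be candidates for an additional selection round; the `minimum` threshold is an assumption added for illustration.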
Text Type Classification for the DTA
• Created in a data-driven way, i.e.:
  New book in the DTA corpus → Is there an existing category that fits?
  – Yes? Assign the fitting existing category!
  – No? Create new category!
• Based on the classification of the DWDS (Digital Dictionary of the German Language), which was continually extended
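The data-driven decision above (assign a fitting existing category, otherwise create a new one) can be sketched as a small function. This is a minimal sketch under stated assumptions: in the project the "does it fit?" judgement is made by people, so it is modelled here as a caller-supplied function. The names `classify`, `fits`, `propose`, `editor_label`, and the labels `type_a` and `type_b` are hypothetical.

```python
from typing import Callable


def classify(
    book: dict,
    categories: list[str],
    fits: Callable[[dict, str], bool],
    propose: Callable[[dict], str],
) -> str:
    """Assign an existing category if one fits, otherwise create a new one."""
    for category in categories:
        if fits(book, category):
            return category          # Yes: assign the fitting existing category
    new_category = propose(book)     # No: create a new category
    categories.append(new_category)
    return new_category


# Hypothetical stand-ins for the human judgement: the editor's label decides.
def editor_says_fits(book: dict, category: str) -> bool:
    return book["editor_label"] == category


def editor_proposes(book: dict) -> str:
    return book["editor_label"]


categories = ["type_a"]  # placeholder label, not a DTA category
first = classify({"editor_label": "type_a"}, categories, editor_says_fits, editor_proposes)
second = classify({"editor_label": "type_b"}, categories, editor_says_fits, editor_proposes)
print(first, second, categories)
```

The category list grows only when no existing category fits, which mirrors how a classification built this way extends itself as new books enter the corpus.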
Text Type Classification for the DTA
3 main (super-)categories:
2 levels: super- & sub-categories