The cadastral register of the town Wismar (1677 – 1838) Structure Mining in historical documents Manja Nelius Meike Klettke Department of Computer Science Database Research Group University of Rostock

Mar 26, 2015

Page 1

The cadastral register of the town Wismar (1677 – 1838)

Structure Mining in historical documents

Manja Nelius, Meike Klettke

Department of Computer Science Database Research Group

University of Rostock

Page 2

Dagstuhl, 4.-8.12.2006 Manja Nelius, Meike Klettke

University of Rostock, Slide 2

Overview

1. Available documents
◦ Cadastral register
◦ Additional information: index of persons, list of abbreviations

2. Algorithms
◦ Overview of the approach
◦ Analysis of (implicit) structure, layout and of the regular parts
◦ Association of markup in the historical texts
◦ Rule-based semantic analysis
◦ Structuring of information

3. Results

4. Conclusion & Literature

Page 3

Motivation

In the Department of History at the University of Rostock (Prof. Papay):
◦ Historical geographical information systems are being developed, together with their underlying databases
◦ Cadastral registers contain a lot of information (in texts) for these geographical information systems

Characteristics of historical texts:
– No uniform orthography
– Varying spelling (over time and by region); the spelling of proper names varies as well
– Influences from different languages
– Use of Latin words (for instance saints' feast days (Heiligenfeiertage) for dates)
– Varying use of capitalisation

◦ Inserting the data into the databases has so far been a manual process
◦ A diploma thesis tries to automate this process

Page 4

Available documents

Cadastral register of the town Wismar (1677 – 1838)
◦ Contains information about the history of the town
◦ Reflects changes in the building development
◦ Shows the ownership structure

Similar cadastral registers are available for the towns Rostock and Stralsund
◦ Different categories and different structures in the registers
◦ The mayor (burgomaster) of the time fixed the structure, which is why it differs in each town

In 2002 all cadastral registers were digitised (manually); aim: preservation of the sources
◦ That means no changes to the original texts
◦ but the addition of an index of persons and a list of common abbreviations

Page 5

Street:

Alt Wismarsche Strasse vom thor her. Norderseite

Categories:

1 Grundbuch Nr. 16 ( 16 ), fol. 15v , neues Stadtbuch Nr. 196
2 siehe Grundbuch Nr. 15.
3 Haus
4 jus protimiseos an dem dahinten belegenen Garten. Medard. 1673.
5 ut No. praeced. siehe Grundbuch Nr. 15.
  Peter Koppe. adjud. et aedif. Tr. Regum 1657.
  Georg Gammelkern. empt. Medard. 1673.
  […]
6 400 Rthlr. Joh. Christoff Müller. Clem. 1677. delirt Mattaei 1688.
  600 m. l. die Stadt Cämmerey t. Joh. mit 5 procent zu vertzinsen.
  […]

Neighbour objects:

Contig. Oldeböterstr. Ost Grundbuch Nr. 310.

Page 6

List of persons

Information about a person:
◦ Different alternative spellings of a family name
◦ References to another spelling of a family name
◦ Family name with additional information about a person (first name) and the cadastral numbers (where the person is mentioned)

Examples:

Gagzow, David 1207
Gahde siehe Gade
Gahrtz, Gehrtz-, Agneta Cristiane Hedwig (Witwe, geborene Haase) 22, 29, 263
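The three kinds of index lines above are regular enough to be parsed mechanically. The following is a minimal Python sketch, not the authors' implementation. The field names, the reading of a trailing hyphen ("Gehrtz-") as an alternative spelling, and the treatment of the parenthesised text as a free-form note are assumptions made for illustration.

```python
import re

# Trailing list of cadastral numbers, e.g. "22, 29, 263" (assumed line layout).
NUMBERS = re.compile(r"^(?P<head>.*?)\s+(?P<numbers>\d+(?:,\s*\d+)*)$")
# Parenthesised remark such as "(Witwe, geborene Haase)".
NOTE = re.compile(r"\(([^)]*)\)")


def parse_index_line(line):
    """Parse one line of the index of persons into a dictionary."""
    line = line.strip()
    if " siehe " in line:  # cross-reference to another spelling
        name, target = line.split(" siehe ", 1)
        return {"kind": "reference", "name": name, "see": target}

    match = NUMBERS.match(line)
    if match is None:
        return {"kind": "unparsed", "text": line}

    head = match.group("head")
    numbers = [int(n) for n in match.group("numbers").split(",")]

    note_match = NOTE.search(head)
    note = note_match.group(1) if note_match else None
    if note_match:
        head = head[:note_match.start()].strip()

    parts = [p.strip() for p in head.split(",")]
    return {
        "kind": "person",
        "family_name": parts[0],
        # Assumption: a trailing hyphen marks an alternative spelling.
        "variants": [p.rstrip("-") for p in parts[1:] if p.endswith("-")],
        "first_names": " ".join(p for p in parts[1:] if not p.endswith("-")),
        "note": note,
        "numbers": numbers,
    }


if __name__ == "__main__":
    for text in (
        "Gagzow, David 1207",
        "Gahde siehe Gade",
        "Gahrtz, Gehrtz-, Agneta Cristiane Hedwig (Witwe, geborene Haase) 22, 29, 263",
    ):
        print(parse_index_line(text))
```

Lines that do not fit either pattern are returned as "unparsed" instead of being guessed at, so they can be checked by hand.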

Page 7

List of abbreviations

Abbreviations of phrases

Alternative abbreviations Explanations

infr.: infra (unten)
Inn., Innoc., Innocent. Puer. : (dies) Innocentum Puerorum (28. Dezember)
intab., intabul.: intabulatus (intabuliert, in das Grundbuch eingetragen)
inter., interim. : interimistisch
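Each line of the list has the shape "alternative abbreviations : explanation", which suggests a simple lookup table. The sketch below is an illustration under that assumption; the function names are hypothetical and it is not taken from the thesis.

```python
def parse_abbreviation_line(line):
    """Map every alternative abbreviation on a line to its explanation."""
    variants, separator, explanation = line.partition(":")
    if not separator:
        raise ValueError(f"no ':' found in line: {line!r}")
    keys = [v.strip() for v in variants.split(",") if v.strip()]
    return {key: explanation.strip() for key in keys}


def build_lookup(lines):
    """Merge all lines into one dictionary: abbreviation -> explanation."""
    lookup = {}
    for line in lines:
        lookup.update(parse_abbreviation_line(line))
    return lookup


if __name__ == "__main__":
    sample = [
        "infr.: infra (unten)",
        "Inn., Innoc., Innocent. Puer. : (dies) Innocentum Puerorum (28. Dezember)",
        "intab., intabul.: intabulatus (intabuliert, in das Grundbuch eingetragen)",
        "inter., interim. : interimistisch",
    ]
    for abbreviation, meaning in build_lookup(sample).items():
        print(abbreviation, "->", meaning)
```

Splitting at the first colon only keeps explanations intact even if they contain further punctuation.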

Page 8

Overview

Combining analysis of layout, structure and texts

Stepwise enrichment of the original documents with markup (the markup contains the meaning of the text fragments)

Generation of XML documents (bottom up)

Storage in databases

Page 9

[Figure: processing pipeline]

Input documents: list of abbreviations, list of persons, cadastral register (formats shown: HTML, TXT, DOC, RTF)

Processing steps: transformation of the input format (to XML), analysis of layout and structure, exploitation of regular expressions, analysis of full texts, semantic analysis, structuring of information, storage in database tables

Resources shown: grammars, dictionaries, normalisation, rules for replacement, semantic rules, rules for structuring, mapping rules

Page 10

Analysis of layout and structure

Usage of layout characteristics
◦ Bold and italic fonts (HTML markup)
◦ Usage of X-Fetch wrappers

Analysis of structural characteristics
◦ Numbers at the beginning of rows determine the categories
◦ Each cadastral item is divided into its categories
◦ The implementation uses the parser generator ANTLR

Result: XML markup is added to the cadastral items
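The structural step, splitting a cadastral item at the numbers that start a row, can be illustrated in a few lines. The slides name ANTLR for the actual implementation; the sketch below swaps in a plain regular expression instead, and the element name Kategorie and its attribute nr are hypothetical (only Eintrag appears in the result shown later).

```python
import re
import xml.etree.ElementTree as ET

# A category row starts with a single digit followed by whitespace
# (assumption based on the example entry; "600 m. l." does not match).
CATEGORY_ROW = re.compile(r"^(?P<number>\d)\s+(?P<text>.*)$")


def split_item(rows):
    """Wrap the rows of one cadastral item in category elements."""
    entry = ET.Element("Eintrag")
    current = None
    for row in rows:
        match = CATEGORY_ROW.match(row)
        if match:
            # New category: hypothetical element <Kategorie nr="...">
            current = ET.SubElement(entry, "Kategorie", nr=match.group("number"))
            current.text = match.group("text")
        elif current is not None:
            # Row without a leading category number continues the previous one.
            current.text += "\n" + row
    return entry


if __name__ == "__main__":
    item = [
        "1 Grundbuch Nr. 16 ( 16 ), fol. 15v , neues Stadtbuch Nr. 196",
        "2 siehe Grundbuch Nr. 15.",
        "3 Haus",
        "6 400 Rthlr. Joh. Christoff Müller. Clem. 1677. delirt Mattaei 1688.",
        "600 m. l. die Stadt Cämmerey t. Joh. mit 5 procent zu vertzinsen.",
    ]
    print(ET.tostring(split_item(item), encoding="unicode"))
```

A continuation row that happens to start with a single digit and a space would be misread as a new category; a grammar-based parser can handle such cases with more context.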

Page 11

Exploitation of regular expressions

Some parts of the input documents are regular (list of abbreviations, list of persons, category 1 of cadastral items)

Example:

Grundbuch Nr. 16 ( 16 ), fol. 15v , neues Stadtbuch Nr. 196

A grammar describes these expressions

Result: further XML markup is added to the cadastral items
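For the example line, a single regular expression is enough to show the idea. This is a sketch, not the grammar used in the thesis; the element names GbNr, SchNr, FolNr and SbNr are taken from the result slide, while the exact pattern (optional spaces, folio suffix r or v) is an assumption.

```python
import re
import xml.etree.ElementTree as ET

# Assumed pattern for category 1 of a cadastral item.
CATEGORY_1 = re.compile(
    r"Grundbuch Nr\.\s*(?P<GbNr>\d+)\s*"
    r"\(\s*(?P<SchNr>\d+)\s*\)\s*,\s*"
    r"fol\.\s*(?P<FolNr>\d+[rv]?)\s*,\s*"
    r"neues Stadtbuch Nr\.\s*(?P<SbNr>\d+)"
)


def mark_up_category_1(text):
    """Return a <Grundbuchnummer> element, or None if the text does not match."""
    match = CATEGORY_1.search(text)
    if match is None:
        return None
    element = ET.Element("Grundbuchnummer")
    for name, value in match.groupdict().items():
        ET.SubElement(element, name, value=value)
    return element


if __name__ == "__main__":
    line = "Grundbuch Nr. 16 ( 16 ), fol. 15v , neues Stadtbuch Nr. 196"
    print(ET.tostring(mark_up_category_1(line), encoding="unicode"))
```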

Page 12

[Figure: processing pipeline overview, same diagram as on page 9]

Page 13

Analysis of the historical full texts /1

Usage of 23 different dictionaries
◦ Some of them generated from the list of persons (first names, family names)
◦ Some created by hand (streets, saints' feast days, professions, stop words)

Same method as used in the GETESS project, developed by the DFKI Saarbrücken, AG Prof. Uszkoreit

Determining the similarity between terms in the cadastral register and in the dictionaries
◦ with phonetic encoding and
◦ phonetic similarity search

Page 14

Analysis of the historical full texts /2

[Figure: a term from the cadastral item (Term 1) and a term from the dictionaries (Term 2) are each phonetically encoded (Term 1', Term 2'); the word distance between the encoded terms is computed with the Levenshtein distance]

Phonetic encoding normalises all spellings that sound similar

Example: Friedrich → VRYEDRYCH

Substitution rules for this were defined specifically (because none were available for historical texts)
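The comparison of encoded terms can be sketched as follows. The actual substitution rules are not given on the slides; the rule list below is hypothetical and was chosen only so that it reproduces the one documented example (Friedrich → VRYEDRYCH). The function names and the distance threshold are assumptions as well.

```python
# Hypothetical substitution rules, applied in order to the upper-cased term.
SUBSTITUTIONS = [("PH", "V"), ("F", "V"), ("I", "Y"), ("CK", "K"), ("TH", "T")]


def phonetic_encode(term):
    """Normalise a spelling so that similar-sounding variants converge."""
    code = term.upper()
    for old, new in SUBSTITUTIONS:
        code = code.replace(old, new)
    return code


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def best_match(term, dictionary, max_distance=1):
    """Return the dictionary entry whose code is closest to the term's code."""
    if not dictionary:
        return None
    code = phonetic_encode(term)
    distance, entry = min(
        (levenshtein(code, phonetic_encode(candidate)), candidate)
        for candidate in dictionary
    )
    return entry if distance <= max_distance else None


if __name__ == "__main__":
    first_names = ["Friedrich", "Georg", "Peter"]
    print(phonetic_encode("Friedrich"))
    print(best_match("Fridrich", first_names))
```

Comparing the encoded forms rather than the raw spellings means that variants differing only in letters the rules treat as equivalent have distance zero.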

Page 15

[Figure: processing pipeline overview, same diagram as on page 9]

Page 16

Semantic analysis with rules

Association of terms to dictionaries:
◦ unique
◦ ambiguous (two or more meanings) → semantic rules
◦ no association found → semantic rules

Different semantic rules (with priorities) are applied

In the rules, information about the context is used (predecessor, successor)

Example (a rule in informal description):
◦ token: number, has no associated meaning
◦ predecessor token: date
→ the meaning "year" is associated with the token

<Feiertag value="Jacobi" pos="6"/> <Jahr value="1693" pos="7"/>
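The informal rule above can be written down as a small context-sensitive function. This is an illustrative sketch only: the token representation, the priority scheme and the second rule (first name before family name) are assumptions; the meaning names Feiertag, Jahr, Vorname and Nachname come from the XML on the slides.

```python
def year_after_date(tokens, i):
    """A number without meaning whose predecessor is a feast day is a year."""
    token = tokens[i]
    if token["meanings"] or not token["value"].isdigit():
        return None
    if i > 0 and tokens[i - 1]["meanings"] == ["Feiertag"]:
        return "Jahr"
    return None


def first_name_before_family_name(tokens, i):
    """Hypothetical rule: an ambiguous token followed by a family name is a first name."""
    token = tokens[i]
    if "Vorname" not in token["meanings"]:
        return None
    if i + 1 < len(tokens) and tokens[i + 1]["meanings"] == ["Nachname"]:
        return "Vorname"
    return None


# (priority, rule); a lower number is tried first (assumed convention).
RULES = [(1, year_after_date), (2, first_name_before_family_name)]


def apply_rules(tokens):
    """Resolve tokens that have no meaning or more than one meaning."""
    ordered = [rule for _, rule in sorted(RULES, key=lambda pair: pair[0])]
    for i, token in enumerate(tokens):
        if len(token["meanings"]) == 1:
            continue  # already unique
        for rule in ordered:
            meaning = rule(tokens, i)
            if meaning is not None:
                token["meanings"] = [meaning]
                break
    return tokens


if __name__ == "__main__":
    tokens = [
        {"value": "Jacobi", "meanings": ["Feiertag"]},
        {"value": "1693", "meanings": []},
    ]
    print(apply_rules(tokens))
```

Tokens for which no rule fires keep their empty or ambiguous list of meanings, which corresponds to the "unbekannt" and "Zahl" elements that remain in the result.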

Page 17

[Figure: processing pipeline overview, same diagram as on page 9]

Page 18

Structuring of information

Up to now: meanings are associated with terms of the cadastral register

Now: bottom-up structuring of information

Example:

<Eigentuemer>
  <Person>
    <Vorname value="Peter" pos="1"/>
    <Nachname value="Koppe" pos="2"/>
  </Person>
  <Erwerbsart> … </Erwerbsart>
</Eigentuemer>

The process is based on rules
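A bottom-up structuring rule of this kind can be sketched as follows. The rule "a first name directly followed by a family name forms a Person" and the tuple representation of tokens are assumptions for illustration; the element names follow the example above.

```python
import xml.etree.ElementTree as ET


def structure_owner(tokens):
    """Group flat (tag, value, pos) tokens into an <Eigentuemer> element."""
    owner = ET.Element("Eigentuemer")
    i = 0
    while i < len(tokens):
        tag, value, pos = tokens[i]
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        if tag == "Vorname" and following is not None and following[0] == "Nachname":
            # Assumed rule: Vorname + Nachname are combined into one <Person>.
            person = ET.SubElement(owner, "Person")
            for t, v, p in (tokens[i], following):
                ET.SubElement(person, t, value=v, pos=str(p))
            i += 2
        else:
            # Everything else stays a direct child of <Eigentuemer>.
            ET.SubElement(owner, tag, value=value, pos=str(pos))
            i += 1
    return owner


if __name__ == "__main__":
    flat = [
        ("Vorname", "Peter", 1),
        ("Nachname", "Koppe", 2),
        ("Erwerbsart", "adjudicatio", 3),
    ]
    print(ET.tostring(structure_owner(flat), encoding="unicode"))
```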

Page 19

[Figure: processing pipeline overview, same diagram as on page 9]

Page 20

Result of this process

<Eintrag>
  <Grundbuchnummer>
    <GbNr value="16"/>
    <SchNr value="16"/>
    <FolNr value="15v"/>
    <SbNr value="196"/>
    ...
  </Grundbuchnummer>
  ...
  <GegQualitaet>
    <Bauwerk value="Haus" pos="1"/>
  </GegQualitaet>
  ...
  <Eigentuemer>
    <Person>
      <Vorname value="Peter" pos="1"/>
      <Nachname value="Koppe" pos="2"/>
    </Person>
    <Erwerbsart value="adjudicatio" pos="3"/>
    <Stoppwort value="et" pos="4"/>
    <Erwerbsart value="aedificatio" pos="5"/>
    <unbekannt value="Tr" pos="6"/>
    <unbekannt value="Regum" pos="7"/>
    <Zahl value="1657" pos="8"/>
  </Eigentuemer>
  ...
  <Eigentuemer>
    <Person>
      <Vorname value="Georg" pos="1"/>
      <Nachname value="Gammelkern" pos="2"/>
    </Person>
    <Erwerbsart value="emptio" pos="3"/>
    <unbekannt value="Medard" pos="4"/>
    <Zahl value="1673" pos="5"/>
  </Eigentuemer>
  <Eigentuemer>
    <Person>
      <Vorname value="Johan" pos="1"/>
      <Nachname value="Faber" pos="2"/>
    </Person>
    <Erwerbsart value="emptio" pos="3"/>
    <Zeitangabe>
      <Wochentag value="Veneris" pos="4"/>
      <Feiertag value="Jacobi" pos="6">
        <Zeitbezug value="ante" pos="5"/>
      </Feiertag>
      <Jahr value="1693" pos="7"/>
    </Zeitangabe>
  </Eigentuemer>
  ...
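The last pipeline step, storage in database tables via mapping rules, is not detailed on the slides. The sketch below shows one way such a mapping could look, using SQLite from the Python standard library; the table name, the column names and the path-based mapping are all hypothetical, and the XML fragment is a shortened, well-formed excerpt of the result above.

```python
import sqlite3
import xml.etree.ElementTree as ET

FRAGMENT = """<Eintrag>
  <Eigentuemer>
    <Person>
      <Vorname value="Georg" pos="1"/>
      <Nachname value="Gammelkern" pos="2"/>
    </Person>
    <Erwerbsart value="emptio" pos="3"/>
  </Eigentuemer>
</Eintrag>"""

# Hypothetical mapping rules: column name -> path below <Eigentuemer>.
MAPPING = {
    "first_name": "Person/Vorname",
    "family_name": "Person/Nachname",
    "acquisition": "Erwerbsart",
}


def store_owners(xml_text, connection):
    """Insert one row per <Eigentuemer> element into the table 'owner'."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS owner "
        "(first_name TEXT, family_name TEXT, acquisition TEXT)"
    )
    root = ET.fromstring(xml_text)
    for owner in root.iter("Eigentuemer"):
        row = []
        for path in MAPPING.values():
            node = owner.find(path)  # first matching element, or None
            row.append(node.get("value") if node is not None else None)
        connection.execute("INSERT INTO owner VALUES (?, ?, ?)", row)
    connection.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_owners(FRAGMENT, conn)
    print(conn.execute("SELECT * FROM owner").fetchall())
```

Elements without a resolved meaning (unbekannt, Zahl) are simply not mapped here; a real mapping would need a decision on how to keep them for later checking.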

Page 21

Instead of a benchmark

[Figure: bar chart, axis from 0 to 80, with the groups "tokens with unique markup", "tokens without markup" and "tokens with two or more meanings", each shown after full text analysis and after application of semantic rules]

Semantic rules assign a meaning to numbers (most numbers are four-digit numbers: years or cadastral numbers)

They also resolve ambiguous meanings (first name vs. date, family name vs. profession)

Page 22

Conclusion

The approach is based on
◦ extensible rules and
◦ dictionaries
→ flexible tool

Extraction of information is realised as a stepwise process

A check mechanism supports error detection

An evaluation of the results is difficult because we cannot compare the results of the process with correct results

All methods presented here were developed in the diploma thesis of Manja Nelius

Future work:
◦ Semantic analysis and information structuring in one step
◦ Matching between XML documents with incomplete markup and a schema

Page 23

Literature

Ernst Münch: Das Wismarer Grundbuch ( 1677/80 - 1838 ), Verlag Schmidt-Römhild, Rostock, 2002

Hans-Jürgen Martin, Geschichtlicher Abriß der Rechtschreibung, http://www.schriftdeutsch.de/orth-his.htm, 2004

Justin Zobel and Philip W. Dart: Phonetic string matching: lessons from information retrieval, Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, 1996

Justin Zobel, Philip W. Dart, Finding Approximate Matches in Large Lexicons, Software --- Practice and Experience, Volume 25, Number 3, 1995

Norbert Fuhr: Regelbasierte Suche in Textdatenbanken mit nichtstandardisierter Rechtschreibung, http://www.is.informatik.uni-duisburg.de/projects/rsnsr/, 2005

M. Abolhassani and N. Fuhr and N. Gövert, Information Extraction and Automatic Markup for XML documents, in Intelligent Search on XML Data. Applications, Languages, Models, Implementations, and Benchmarks, ed. Henk M. Blanken and Torsten Grabs and Hans-Jörg Schek and Ralf Schenkel and Gerhard Weikum, Lecture Notes in Computer Science, Vol 2818, 2003

Arnaud Sahuguet, Fabien Azavant: Building Light-Weight Wrappers for Legacy Web Data-Sources using W4F, 25th Conference on Very Large Database Systems, Edinburgh, UK, 1999

Terence Parr: ANTLR -- ANother Tool for Language Recognition, http://www.antlr.org, 2004

X-Fetch Suite, Republica Corporation, www.x-fetch.com

Kai-Uwe Sattler, Stefan Conrad, Gunter Saake, Datenintegration und Mediatoren, In Web & Datenbanken, d.punkt Verlag, Heidelberg, 2003