Dialogue 2014, Bekasovo Anastasia Bonch-Osmolovskaya NRU HSE Semantic tagging of Leo Tolstoy

Dec 31, 2015

Transcript
Page 1: Dialogue 2014, Bekasovo Anastasia Bonch-Osmolovskaya NRU HSE.

Dialogue 2014, Bekasovo

Anastasia Bonch-Osmolovskaya

NRU HSE

Semantic tagging of Leo Tolstoy

Page 2:

Association of Digital Humanities Organizations (Europe, America, Australasia, Japan)

Digital humanities project

Page 3:

What is Digital Humanities

Page 4:

Digital humanities: state of the art

Scholarly dissemination

Big Data for Humanities

Distant reading and complex network analysis and visualisation

Linking cultural data: building standardized resources and interoperability

Page 5:

Republic of letters

Page 6:

Leo Tolstoy's 90-volume complete edition

a stack of books 2.7 m high

includes all published works, variants, unpublished drafts, diaries, letters, fragments

13 volumes of diaries

31 volumes, or 8,500 letters

about 14.5 million tokens

commentaries, indexes

Page 7:

Open cultural heritage

Page 8:

All Tolstoy in one click

Page 9:

A Crowdsourcing Wonder

A project to digitise the entire works of Leo Tolstoy – named All of Tolstoy in One Click – making them available for tablets and smartphones, turned out to be lighter work than expected for the Tolstoy Museum in Moscow, when thousands of readers from all over the world responded to a call for volunteers. (The Guardian)

Now, thanks largely to the efforts of these volunteers, nearly all of the great Russian writer’s massive body of work, including novels, diaries, letters, religious tracts, philosophical treatises, travelogues, and childhood memories, will soon be available online, in a form that can be easily downloaded, free of charge. (The New Yorker)

Page 10:

Semantic Tolstoy

The idea of contemporary standards of cultural heritage web publishing

Tagging relevant structural elements of the text and textual data

Linking elements inside and outside the text

Project participants:
Tolstoy Museum (Fekla Tolstaya)
Higher School of Economics, philology department (Boris Orekhov, Anastasia Bonch-Osmolovskaya)
Tartu University (Roman Leibov)
ABBYY Compreno (Anatoly Starostin)
students of the philology department, HSE

Page 11:

Semantic Tolstoy: how to start

What should be tagged?
What tags should be used?
Should we do it manually or automatically?
Do we represent the book or the text? (Do we tag non-Tolstoy texts?)

Page 12:

Semantic Tolstoy: how to start

What should be tagged? Everything that can be tagged with TEI
What tags should be used? The TEI scheme
Should we do it manually or automatically? It depends
Do we represent volumes or texts? Texts

Page 13:

Text Encoding Initiative

XML standard scheme for encoding books: http://www.tei-c.org

wiki, manuals, tutorials, events, discussions, interest groups

ROMA (http://www.tei-c.org/Roma/): customization generator for the TEI scheme
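To make the "XML standard scheme" concrete, here is a minimal sketch of a TEI document skeleton built with Python's standard library. The function name `build_minimal_tei` and the placeholder header texts are assumptions made for illustration; the element names (`TEI`, `teiHeader`, `fileDesc`, `titleStmt`, `publicationStmt`, `sourceDesc`, `text`, `body`) and the namespace are those of TEI P5. It is not the project's actual schema, which would be customized (for example with ROMA).

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # TEI P5 namespace


def q(tag: str) -> str:
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def build_minimal_tei(title: str, paragraphs: list[str]) -> ET.Element:
    """Build a bare TEI skeleton: a header with the three required
    fileDesc children, and a body holding one <p> per paragraph."""
    ET.register_namespace("", TEI_NS)
    tei = ET.Element(q("TEI"))

    header = ET.SubElement(tei, q("teiHeader"))
    file_desc = ET.SubElement(header, q("fileDesc"))
    title_stmt = ET.SubElement(file_desc, q("titleStmt"))
    ET.SubElement(title_stmt, q("title")).text = title
    publication = ET.SubElement(file_desc, q("publicationStmt"))
    ET.SubElement(publication, q("p")).text = "Illustrative sketch"  # placeholder
    source = ET.SubElement(file_desc, q("sourceDesc"))
    ET.SubElement(source, q("p")).text = "Source not specified"  # placeholder

    text = ET.SubElement(tei, q("text"))
    body = ET.SubElement(text, q("body"))
    for paragraph in paragraphs:
        ET.SubElement(body, q("p")).text = paragraph
    return tei


if __name__ == "__main__":
    doc = build_minimal_tei("Resurrection", ["First paragraph.", "Second paragraph."])
    print(ET.tostring(doc, encoding="unicode"))
```

Everything specific to a text type (verse lines, letters, diary entries) would be added inside `body` using the corresponding TEI modules.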

Page 14:

TEI scheme modules

critical apparatus: readings, variants
names, dates, places
tables, formulae, graphics, notated music
language corpora
dictionaries
linking, segmentation, alignment
linguistic annotation: POS tagging
certainty, precision, responsibility

Page 15:

Types of texts

documentary texts
literary texts: prose, verse, performance texts
spoken texts: transcriptions of speech
manuscripts
ancient texts: on papyri, stone
medieval texts: illuminated manuscripts
modern texts: variorum, handwritten, typewritten

Page 16:

Corrections

Page 17:

Normalization

<l>Я просыпаюсь. Я <choice><orig>об'ят</orig><reg>объят</reg></choice></l>
<l>Открывшимся. Я на <choice><orig>учете</orig><reg>учете.</reg></choice></l>
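The point of `<choice>` is that one encoded line can be rendered either in its original or in its regularized form. A minimal sketch of that idea follows, using the two-line sample from this slide. The wrapping `<lg>` (line group) element, the absence of a namespace, and the helper name `render` are assumptions made so the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

# Two verse lines with original and regularized readings.
# The <lg> wrapper is added here only to give the fragment a single root.
SAMPLE = (
    "<lg>"
    "<l>Я просыпаюсь. Я <choice><orig>об'ят</orig><reg>объят</reg></choice></l>"
    "<l>Открывшимся. Я на <choice><orig>учете</orig><reg>учете.</reg></choice></l>"
    "</lg>"
)


def render(elem: ET.Element, reading: str) -> str:
    """Return the text of `elem`, keeping only the `reading` child
    ("orig" or "reg") of every <choice> element."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "choice":
            chosen = child.find(reading)
            if chosen is not None:
                parts.append(render(chosen, reading))
        else:
            parts.append(render(child, reading))
        parts.append(child.tail or "")  # text following the child element
    return "".join(parts)


root = ET.fromstring(SAMPLE)
original = [render(line, "orig") for line in root.iter("l")]
regularized = [render(line, "reg") for line in root.iter("l")]

if __name__ == "__main__":
    print(original)
    print(regularized)
```

The same traversal works for other paired readings, such as `<sic>`/`<corr>`, by checking for those child names instead.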

Page 18:

Preliminary work

create a volume/text-type matrix
select TEI schemes for different text types
use modified XML from ABBYY FineReader for structural elements
parse indexes and link them to the text
define intertextual links
make a Semantic Tolstoy cookbook

Page 19:

TEI for Tolstoy (cookbook)

Улыбка <forename>Аграфены Петровны</forename> означала, что письмо было от <roleName>княжны</roleName> <surname>Корчагиной</surname>, на которой, по мнению <forename>Аграфены Петровны</forename>, <surname>Нехлюдов</surname> собирался жениться. И это предположение, выражаемое улыбкой <forename>Аграфены Петровны</forename>, было неприятно <surname>Нехлюдову</surname>.
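Once names are tagged this way, they can be pulled out of the text mechanically. The sketch below collects name elements from the first sentence of the sample and counts them per tag. The `<p>` wrapper, the helper name `collect_names`, and the set `NAME_TAGS` are assumptions for illustration; `roleName` is spelled as in the TEI guidelines, and XML element names are case-sensitive.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# TEI name-part elements used in the cookbook sample.
NAME_TAGS = {"forename", "surname", "roleName"}

# First sentence of the sample, wrapped in <p> so it parses as one element.
SAMPLE = (
    "<p>Улыбка <forename>Аграфены Петровны</forename> означала, что письмо было от "
    "<roleName>княжны</roleName> <surname>Корчагиной</surname>, на которой, по мнению "
    "<forename>Аграфены Петровны</forename>, <surname>Нехлюдов</surname> "
    "собирался жениться.</p>"
)


def collect_names(xml_fragment: str) -> list[tuple[str, str]]:
    """Return (tag, surface form) pairs for every name element, in document order."""
    root = ET.fromstring(xml_fragment)
    return [
        (elem.tag, "".join(elem.itertext()).strip())
        for elem in root.iter()
        if elem.tag in NAME_TAGS
    ]


names = collect_names(SAMPLE)
counts = Counter(tag for tag, _ in names)

if __name__ == "__main__":
    for tag, surface in names:
        print(tag, surface)
    print(counts)
```

The surface forms are inflected (for example genitive "Аграфены Петровны"), so linking them to a single person would still need lemmatization or an index lookup, which this sketch does not attempt.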

Page 20:

Automatic date extraction (M. Kolbasov, HSE student)

Direct, full (Прямой полный): 17 марта 1847 года ("17 March 1847")
<date when="1847-03-17">17 марта 1847 года</date>

Direct, partial (Прямой неполный): Числа 22 ("the 22nd")
<date when="1847-03-22">Числа 22</date>

Ray, backward (Лучевой задний): Вот уже шестой день ("It is already the sixth day")
Вот уже <date from="1847-04-24" to="1848-01-01">шестой день</date>

Interval, present (Отрезковый наст.): Эту неделю я сижу дома ("This week I am staying at home")
Эту <date from="1847-04-19" to="1847-04-25">неделю</date> я сижу дома

Point, past (Точечный прош.): Я совершенно доволен собою за вчерашний день ("I am entirely pleased with myself for yesterday")
Я совершенно доволен собою за <date when="1847-04-23">вчерашний день</date>
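A minimal sketch of how two of these categories could be handled: full direct dates, and "yesterday" resolved against the date of the diary entry. This is not the student's extractor. The function names, the regular expression, and the entry date of 24 April 1847 used in the usage lines are assumptions for illustration. The `when` value is written as YYYY-MM-DD, the form TEI expects.

```python
import re
from datetime import date, timedelta

# Genitive month names as they appear in Russian dates.
MONTHS = {
    "января": 1, "февраля": 2, "марта": 3, "апреля": 4,
    "мая": 5, "июня": 6, "июля": 7, "августа": 8,
    "сентября": 9, "октября": 10, "ноября": 11, "декабря": 12,
}

# Day, month name, four-digit year, optionally followed by "года".
FULL_DATE = re.compile(
    r"(\d{1,2})\s+(" + "|".join(MONTHS) + r")\s+(\d{4})(?:\s+года)?"
)


def tag_full_dates(text: str) -> str:
    """Wrap every full direct date in <date when="YYYY-MM-DD">."""
    def to_tei(match: re.Match) -> str:
        day = int(match.group(1))
        month = MONTHS[match.group(2)]
        year = int(match.group(3))
        iso = date(year, month, day).isoformat()  # ValueError on impossible dates
        return f'<date when="{iso}">{match.group(0)}</date>'
    return FULL_DATE.sub(to_tei, text)


def tag_yesterday(text: str, entry_date: date) -> str:
    """Resolve "вчерашний день" (yesterday) relative to the entry date."""
    iso = (entry_date - timedelta(days=1)).isoformat()
    return text.replace(
        "вчерашний день", f'<date when="{iso}">вчерашний день</date>'
    )


if __name__ == "__main__":
    print(tag_full_dates("17 марта 1847 года"))
    # The entry date below is assumed, not taken from the diary.
    print(tag_yesterday("Я совершенно доволен собою за вчерашний день",
                        date(1847, 4, 24)))
```

Partial dates and the ray and interval types need more context (the current month, the direction of the span), so they are left out of this sketch.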

Page 21:

Old2New orthography transliterator (M. Kartysheva, E. Sidorova, D. Kolomeytsev, HSE students)
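To illustrate what such a transliterator has to do, here is a deliberately simplified sketch covering only the best-known changes of the 1918 reform: the letters ѣ, і, ѳ, ѵ are replaced by е, и, ф, и, and the word-final hard sign after a consonant is dropped. It is not the students' tool; the function name `old_to_new` is an assumption, and grammatical endings (for example -аго, -яго) and other special cases are not handled.

```python
import re

# Letters removed by the 1918 reform and their modern replacements.
LETTER_MAP = str.maketrans({
    "ѣ": "е", "Ѣ": "Е",
    "і": "и", "І": "И",
    "ѳ": "ф", "Ѳ": "Ф",
    "ѵ": "и", "Ѵ": "И",
})

# Hard sign at the end of a word, directly after a consonant.
FINAL_HARD_SIGN = re.compile(
    r"(?<=[бвгджзклмнпрстфхцчшщ])ъ\b", re.IGNORECASE
)


def old_to_new(text: str) -> str:
    """Convert pre-reform spelling to modern spelling (simplified rules only)."""
    text = FINAL_HARD_SIGN.sub("", text)
    return text.translate(LETTER_MAP)


if __name__ == "__main__":
    for word in ["миръ", "въ домѣ", "исторія", "объят"]:
        print(word, "->", old_to_new(word))
```

The hard sign inside a word, as in "объят", is kept because the pattern only matches at a word boundary.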

Page 22:

Accompanying projects

Student projects:
Old2New orthography transliterator
Tolstoy corpus for ruscorpora
Universal index parser

Together with Compreno:
Named entity extraction
Evaluation of NE merging (indexes as a gold standard)
Fact extraction
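Using the printed indexes as a gold standard suggests a comparison along the following lines. This is a simplified sketch that scores a set of extracted entity names against a set of index entries with precision, recall and F1; it is a stand-in measure, not the evaluation actually used for NE merging, and the function name and the sample names are hypothetical.

```python
def precision_recall_f1(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Score extracted entities against gold-standard index entries."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical data: normalized names from an index vs. an extractor's output.
    gold_index = {"Нехлюдов", "Корчагина", "Аграфена Петровна"}
    extracted = {"Нехлюдов", "Корчагина", "Маслова"}
    p, r, f = precision_recall_f1(extracted, gold_index)
    print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Both sets must hold names in the same normalized form (for example nominative case), otherwise correct extractions are counted as misses.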

Page 23:

“The coolest thing to do with your data will be thought of by someone else.”

Rufus Pollock, Co-Founder and Director, Open Knowledge Foundation