
eDictor: (a chronology)


Roundtable: e-dictor, Advances and Perspectives.

Workshop: Construction and use of large annotated corpora

Campinas, Sept. 9, 2013.


2004-2006: Preliminary Ideas

The preliminary ideas that would result in the development of eDictor in 2007 started in 2004 with a project that aimed at restructuring the text-preparation system at the Tycho Brahe Corpus.



Essentially, the idea was that the Corpus would consist of single-source documents that could contain all relevant annotations (textual, philological, linguistic).


This was achieved in partnership with computer scientist Thorsten Trippel, from the University of Bielefeld.

He suggested that we use XML to re-encode the Corpus, and XSLT to transform each document into different presentations of the encoded information.



Our central idea was to encapsulate editorial interventions at the word level, i.e. for each token in the corpus, pairing the original form with its edited form, so that each element of the pair would be available to different analysis modules.
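The word-level idea above can be sketched in Python as follows. This is only an illustration, assuming that the "pair" is an original form plus an edited form: the element and attribute names (`w`, `o`, `e`, `t`) and the sample spelling are hypothetical, not the actual Tycho Brahe schema.

```python
import xml.etree.ElementTree as ET


def make_token(original, edited=None, edit_type=None):
    """Build a <w> element with the original form and, optionally, an edited form."""
    w = ET.Element("w")
    o = ET.SubElement(w, "o")  # hypothetical tag for the original form
    o.text = original
    if edited is not None:
        # hypothetical tag for the edited form; 't' records the kind of edition
        e = ET.SubElement(w, "e", {"t": edit_type or "unspecified"})
        e.text = edited
    return w


def form(token, view):
    """Return the token's form for the 'original' or the 'modernized' view."""
    edited = token.find("e")
    if view == "modernized" and edited is not None:
        return edited.text
    return token.find("o").text


# Invented example: one token carries an editorial intervention, two do not.
tokens = [
    make_token("muito"),
    make_token("illustre", "ilustre", "spelling"),
    make_token("senhor"),
]

original_view = " ".join(form(t, "original") for t in tokens)
modernized_view = " ".join(form(t, "modernized") for t in tokens)

print(original_view)
print(modernized_view)
```

Because both forms live inside the same token, any downstream module can pick whichever element of the pair it needs without a second copy of the text.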


This first idea was applied to a few pilot texts and presented as a poster at the annual conference of the ALLC in 2004:

PAIXÃO DE SOUSA, M. C.; TRIPPEL, T. Single source processing of Historic corpora for diverse uses. In: Proceedings of the Association for Literary and Linguistic Computing (ALLC) Annual Conference, 2004.



In 2005, the Corpus went through a complete re-encoding process.



The restructured Corpus was composed of XML documents that, via XSLT transformations, would render different (HTML and TXT) versions, suited to different visualization and processing needs, as we had originally planned.
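The rendering step can be sketched as follows. Note that this swaps in plain Python for the XSLT stylesheets the project actually used, and the element names (`text`, `s`, `w`, `o`) are hypothetical; only the sentence identifier format (`g_008_s_43`) follows the Corpus's plain-text view.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical single-source document: sentences (<s>) made of tokens (<w>).
SAMPLE = """<text>
  <s id="g_008_s_43"><w><o>Neste</o></w><w><o>pequeno</o></w><w><o>serviço</o></w></s>
  <s id="g_008_s_44"><w><o>E</o></w><w><o>isto</o></w><w><o>assim</o></w></s>
</text>"""


def sentence_words(sentence):
    """Collect the word forms of one sentence element."""
    return [w.findtext("o") for w in sentence.findall("w")]


def to_txt(root):
    """Plain-text version: each sentence is prefixed by its identifier in brackets."""
    parts = []
    for s in root.iter("s"):
        parts.append("[{}] {}".format(s.get("id"), " ".join(sentence_words(s))))
    return " ".join(parts)


def to_html(root):
    """HTML version: one paragraph per sentence, keeping the identifier as an id."""
    paragraphs = []
    for s in root.iter("s"):
        paragraphs.append('<p id="{}">{}</p>'.format(
            escape(s.get("id")), escape(" ".join(sentence_words(s)))))
    return "\n".join(paragraphs)


root = ET.fromstring(SAMPLE)
txt_version = to_txt(root)
html_version = to_html(root)

print(txt_version)
print(html_version)
```

The point of the design is that both outputs are derived from the same source document, so a correction made once in the XML reaches every version.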


The Tycho Brahe Corpus, restructured (XML base)


The Tycho Brahe Corpus, restructured (“catalogue” view)

The Tycho Brahe Corpus, restructured (“original” view)

The Tycho Brahe Corpus, restructured (“modernized” view)

The Tycho Brahe Corpus, restructured (simple text for further processing)

[ prologue (author: P.M. Gandavo)] [ title: AO MUITO ILUSTRE SENHOR DOM LIONIS PEREIRA, Epístola de Pero de Magalhães. ][g_008_s_43] Neste pequeno serviço (muito ilustre senhor ) que ofereço a Vossa Mercê das primícias de meu fraco entendimento, poderá em alguma maneira conhecer os desejos que tenho de pagar com minha possibilidade alguma parte do muito que se deve à ínclita fama de vosso heróico nome. [g_008_s_44] E isto assim pelo merecimento do nobilíssimo sangue e clara progênie de onde traz sua origem, como pelos troféus das grandes vitórias , e casos bem afortunados que lhe hão sucedido nessas partes do Oriente em que Deus o quis favorecer com tão larga mão, que não cuido ser toda minha vida bastante para satisfazer à menor parte de seus louvores . [g_008_s_45] E como todas estas razões me ponham em tanta obrigação , e eu entenda que outra nenhuma coisa deve ser mais aceita a pessoas de altos ânimos que a lição das escrituras , por cujos meios se alcançam os segredos de todas as ciências , e os homens vêm a ilustrar seus nomes e perpetuar os na terra com fama imortal , determinei escolher a Vossa Mercê entre os mais senhores da terra , e dedicar lhe esta breve história . [g_008_s_46] A qual espero que folgue de ver com atenção e receber me a benignamente debaixo de seu amparo : assim por ser coisa nova , e eu a escrever como testemunha de vista : como por saber quão particular afeição Vossa Mercê tem às coisas do engenho , e que por esta causa lhe não será menos aceito o exercício das escrituras , que o das armas. [g_008_s_47] Por onde com muita razão favorecido desta confiança possa seguramente sair a luz com esta pequena empresa e divulgar a pela terra sem nenhum receio , tendo por defensor dela a Vossa Mercê Cuja muito ilustre pessoa nosso Senhor guarde e acrescente sua vida e estado por longos e felizes anos . [ end prologue ]

Along with the application of the new single-source system to the Corpus, new ideas started to pop up.

Some of them were carried on, some were not.



The main thing that we wanted to do back then and still have not done is...

... to integrate syntactic annotation into this same, single-source system...



Other ideas were a little more fruitful: the integration of other, less complex levels of linguistic annotation (such as items of lexicological interest); and the expansion of the system to include the possibility of critical editions, in which more than one version of the same text could be compared.



PAIXÃO DE SOUSA, M. C. A Anotação da variação de grafia no Corpus Histórico do Português Tycho Brahe: Frentes abertas para estudos do léxico. V Encontro de Corpora: Lingüística de Corpus: a aplicabilidade nos estudos sobre Léxico, São Carlos, 2005.

PAIXÃO DE SOUSA, M. C. Memórias do Texto. Mesa-redonda “Bibliotecas e bancos de dados digitais de literatura”, II Simpósio Nacional de Literatura e Informática, Florianópolis, 2005.

Published in 2006 as:

PAIXÃO DE SOUSA, M. C. Memórias do Texto. Texto Digital (UERJ), v. 1, p. 10, 2006.

PAIXÃO DE SOUSA, M. C. Critical Hipereditions and the new challenges for text-critique. Seminário Internacional Literaturas: Del texto al hipertexto. Madri, Universidade Complutense, setembro de 2006.

Published in 2007 as:

PAIXÃO DE SOUSA, M. C. Digital Text: Conceptual and methodological frontiers. In: Dolores Romero; Amelia Sanz. (Org.). Literatures in the Digital Era: Theory and Praxis. Cambridge: Cambridge Scholarly, 2007.

By 2006 the single-source encoding system was mature; a first manual was prepared and a more complete paper on these results was published.



TRIPPEL, T.; PAIXÃO DE SOUSA, M. C. Metadata and XML standards at work: a corpus repository of Historical Portuguese texts. V International Conference on Language Resources and Evaluation (LREC), 2006.



... as the system was presented to a wider range of potential users outside Tycho Brahe, new challenges emerged.



I Oficina de Anotação – Projeto CorPorA. Salvador, 19-21 de abril, 2006.

The first annotation workshop outside the Tycho Brahe team, in 2006 in Salvador, was an important breakthrough.

It was then that we noticed that the original techniques used to annotate the XML documents ("by hand", in Emacs) and to transform them (by hand-coding XSLT and running it through Saxon) were not adequate for teams with a less computational and more philological background.



After the workshop in 2006 it became clear that if we wanted more teams to use the single-source annotation system, we would have to build software that could perform the annotation and transformation tasks through a user-friendly interface.

In other words... it was then that the idea of eDictor took shape.




2007: eDictor is launched!

eDictor beta 1.0 was developed in 2007 by Prof. Fabio N. Kepler (then a post-graduate student at IME-USP’s computer science program), and was first presented in the same year at the VI Encontro de Linguística de Corpus, at USP.




This first version of eDictor contained the core functions of the original text encoding system: an XML annotation module and the option of exporting through XSLT transformations.



Plus... it included a morphosyntactic tagging function!



Interface of eDictor 1.0 beta 01


2008-2012: Years of growing into new uses

Two important aspects mark the years 2008 to 2012 for the development of eDictor.

The first was the arrival of a new team member, Pablo P. F. Faria, who joined F. Kepler in developing the software after the first version.


The second important aspect was that, while up to 2008 the main application of the single-source system (first manually and later with eDictor) was the restructuring of the Tycho Brahe Corpus, after 2008 the system started to be used beyond Tycho Brahe.



This was important because, since different projects have different aims, the tool started to incorporate new technical features.


For instance, in 2009 eDictor started to be used by the Brasiliana USP team.

One of the main particularities of this context was that eDictor was used to correct the output of optical character recognition (OCR), and new edition categories had to be created.


PAIXÃO DE SOUSA, M. C.; KEPLER, F. N.; FARIA, P. P. F. O Processamento automático de textos antigos: Desafios e Experiências. Workshop de Linguística de Corpus do Projeto Para a História do Português Brasileiro (PHPB), São Paulo, 2010.

One important consequence for eDictor was the possibility of adding new edition categories to the tool's Preferences file.

Some of these developments were presented at the VIII Encontro de Linguística de Corpus in 2009 by Pablo Faria; this presentation would be published as a book chapter in 2010.

PAIXÃO DE SOUSA, M. C.; KEPLER, F. N.; FARIA, P. E-dictor: Novas perspectivas na codificação e edição de corpora de textos históricos. In: VIII Encontro de Linguística de Corpus, 2009, Rio de Janeiro. 2009.

Interface of eDictor in 2009 – Edition Module

Example of changes after 1.0 beta 001: Edition Tab – “edition” became an open category

More importantly, researchers who work with manuscript documents became interested in eDictor.

The special needs of this kind of material led to very important developments in the tool.


The first group of manuscript documents to be edited with the tool was the corpus of 19th-century letters from the PhD thesis of Zenaide Carneiro (2005), now part of the CEDOHS corpus.

The edition of this corpus in XML had been conceived at the time of the 2006 workshop in Salvador, and from the start it brought to the development of eDictor the challenge of dealing with the particular categories and edition needs of manuscripts.


One important example of the developments brought about by the needs of manuscript editors is the facsimile view functionality.

It was developed by Pablo Faria after eDictor started to be used by the team at CEDOHS and by the team led by Célia Lopes at LaborHistórico, at UFRJ.


The CEDOHS corpus, with integrated facsimile view of manuscripts.

This new export format, hypertext with facsimile view, was integrated into later versions of eDictor and is currently used by other projects.

LaborHistórico – Laboratório para a História do Português Brasileiro, Universidade Federal do Rio de Janeiro. Coord. Célia Lopes

Workshop: “Edição Digital e Divulgação de Textos Antigos”, Rio de Janeiro, 3-5 de fevereiro, 2010.

The corpus at LaborHistórico, with integrated facsimile view of manuscripts.

The workshops with the new teams of users, organized between 2010 and 2012, resulted in the development of new builds for eDictor beta 1.0. Also, thanks to the growing number of users, in 2010 we finally got to make a first version of eDictor's manual (actually, the only version so far).

As a result of this expansion, ten builds of eDictor beta 1.0 were made between 2009 and 2012, reflecting the additions that the different user teams pointed out as necessary.


Two important publications were prepared during this period: a poster session at the ACL meeting of 2010, presented by P. Faria, and the chapter for the book "Caminhos da Linguística de Corpus".

In these papers we tried to cover the background to eDictor's creation, the new developments, and the challenges ahead.



FARIA, P. P. F.; PAIXÃO DE SOUSA, M. C.; KEPLER, F. N. An Integrated Tool for Annotating Historical Corpora. The Fourth Linguistic Annotation Workshop (LAW IV) at The 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010), Uppsala, 2010.

PAIXÃO DE SOUSA, M. C.; KEPLER, F. N.; FARIA, P. E-dictor: Novas perspectivas na codificação e edição de corpora de textos históricos. In: Tania Shepherd; Tony Berber Sardinha; Marcia Veirano Pinto. (Org.). Caminhos da linguística de corpus. Campinas: Mercado de Letras, 2010.


2013: And now, what?

eDictor 1.0 beta build 010 is the version currently in use. The main differences from beta 001 are the additions related to facsimile integration (in the transcription module and in the export functionalities) and some bug fixes in the edition module.

But there are still bugs to be busted!


Interface of eDictor 1.0 beta b010



At the end of 2012, a new, web-based version of eDictor was conceived by Luiz Veronesi, and it is currently under construction.

Version 1.0 beta b010 of eDictor is currently being used by seven projects in Brazil and in Portugal:


Corpus Anotado do Português Tycho Brahe (Universidade Estadual de Campinas)

Grupo de Pesquisas Humanidades Digitais (Universidade de São Paulo)

Laboratório de História do Português Brasileiro (Universidade Federal do Rio de Janeiro)

P.S. – Projeto Arquivo Digital de Escrita Quotidiana em Portugal e Espanha na Época Moderna (Universidade de Lisboa)

Corpus Eletrônico de Documentos Históricos do Sertão, CEDOHS (Universidade Federal de Feira de Santana)

Memória Conquistense (Universidade Estadual do Sudoeste da Bahia)

There is still a lot to be done if we want to make eDictor a stable and fully transferable tool.

But of course...

The spirit of this tool has been one of growing into the users’ needs and requests. It will become a better tool if we work together on what we want it to be.


So we are very excited about this workshop!



Here’s one idea of how we could work:


We are launching today (09/09/2013) a new webpage for eDictor, at


We could use these days at the workshop to build more documentation and group it on the page.

That was it. Thank you!

Maria Clara Paixão de Sousa, Universidade de São Paulo
