The Translational English Corpus: A practical approach to corpus building
Outline
• TEC and new developments
– EDT Corpus
– Humanities Corpus
• Corpus design
– Representativeness
– Balance
– Size
• Corpus building
– Identifying material
– Scanning/Converting texts
– Tagging & Annotation
Translational English Corpus
A corpus of contemporary English translations: written texts translated into English from a
variety of source languages
http://www.llc.manchester.ac.uk/ctis/research/english-corpus/
Figure: Number of books in each language for fiction and (auto)biography (bar chart). Languages legible on the axis include French, German, Spanish, Portuguese, Norwegian, Catalan, Latin American Spa..., Slovene, Tamil, Finnish, Hebrew and Vietnamese; counts range from 1 to 24.
TEC Tools
Set of software tools for the investigation of a wide range of issues to do with the language of translated texts.
Header File: contains metadata such as the title of the text, author, publisher, etc.
Text File: contains the actual data to be analysed.
– Sub-corpus Selection: allows you to select particular text files or groups of text files to search.
– Sort Tool: allows you to sort concordances to the left or right, and specify the number of words between the search keywords.
– Corpus Tree Viewer: allows you to “grow” a tree for various keywords. The size of the text reflects frequency of occurrence in the corpus.
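As a rough illustration of what sorting concordances to the left or right involves, here is a minimal Python sketch. It is not the TEC Tools implementation: the function names, the tokenisation rule and the sample sentence are all assumptions made for the example.

```python
import re

def concordance(text, keyword, width=4):
    """Return (left, node, right) context tuples for each hit of keyword."""
    # Simple lower-cased tokenisation; a real tool would be more careful.
    tokens = re.findall(r"\w+(?:'\w+)?", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, tok, right))
    return hits

def sort_hits(hits, side="right", position=1):
    """Sort hits by the word `position` places to the left or right of the node."""
    def key(hit):
        left, _, right = hit
        # Reverse the left context so index 0 is the word nearest the node.
        context = right if side == "right" else left[::-1]
        return context[position - 1] if len(context) >= position else ""
    return sorted(hits, key=key)

# Invented sample text, for illustration only.
sample = ("The translator said the text was hard. "
          "The editor said nothing. The author said the text was easy.")

left_sorted = sort_hits(concordance(sample, "said"), side="left")
right_sorted = sort_hits(concordance(sample, "said"), side="right")

for left, node, right in left_sorted:
    print(" ".join(left).rjust(25), node.upper(), " ".join(right))
```

Sorting on the first word to the left groups the hits by who is doing the saying; sorting to the right groups them by what follows the keyword.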
TEC Database
An electronic database of all material (to be) included in the TEC for the subcorpora of fiction and (auto)biography.
The entry for each book includes not only most of the information that is included in the header file, but also images of the covers of the books.
English Discourses on Translation Corpus
• A corpus of discourses on translation for the investigation of the way in which translation/translators are conceptualised in society at different historical periods.
• No time, language or genre restriction: any material is included as long as it is written in English.
• Two types of material
– Peritextual: material that accompanies the translation, e.g. prefaces, introductions, afterwords, etc.
– Epitextual: published material (broadsheet and mainstream newspapers, literary magazines, etc.)
• Link with TEC
Humanities Corpus
A corpus of translations into English of works by theorists in the humanities, e.g. philosophers, sociologists, literary theorists, etc.
Temporality: translations date from 1900 onwards, but the source texts do not have a time restriction.
* Multiple translations of the same book.
Corpus Design
What is a corpus?
‘A collection of texts held in machine-readable form and capable of being analysed automatically or semi-automatically’ (Baker 1995)…
…and has certain characteristics:
– Representativeness
– Balance
– Size
Representativeness
“a corpus is thought to be representative of the language variety it is supposed to represent, if the findings based on its contents can be generalised to the said language variety” (Leech 1991).
A corpus may focus on a particular genre/language/ author/translator, etc.
Decisions about criteria for selection of texts
TEC Design
Material: English translations (whole texts)
Genres: Fiction, (auto)biography, in-flight magazines, news articles
Time of publication: Late 80s onwards
Place of publication: UK and USA
Balance
“a balanced corpus covers a wide range of texts which are supposed to be representative of the language variety under question” (McEnery et al. 2006).
Also, ‘internal’ balance, e.g.
– Gender balance
– Source language balance
– Genre balance
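One way to check internal balance in practice is to tabulate header metadata by category. The sketch below assumes each text has a header record with `genre`, `source_language` and `translator_gender` fields; these field names and the four sample records are hypothetical and are not taken from the TEC header format.

```python
from collections import Counter

# Hypothetical header records; field names and values are illustrative only.
headers = [
    {"title": "Text A", "genre": "fiction", "source_language": "French", "translator_gender": "female"},
    {"title": "Text B", "genre": "fiction", "source_language": "German", "translator_gender": "male"},
    {"title": "Text C", "genre": "biography", "source_language": "French", "translator_gender": "female"},
    {"title": "Text D", "genre": "fiction", "source_language": "Spanish", "translator_gender": "male"},
]

def distribution(records, field):
    """Return the share of records falling under each value of `field`."""
    counts = Counter(record[field] for record in records)
    total = sum(counts.values())
    return {value: round(n / total, 2) for value, n in counts.items()}

for field in ("genre", "source_language", "translator_gender"):
    print(field, distribution(headers, field))
```

Counting texts is only one option; shares could equally be computed over word counts if the headers record them.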
Corpus Size
A corpus needs to be adequate for the purposes for which it is intended.
A bigger corpus is not necessarily more useful than a smaller one.
Factors that affect corpus size:
– Purpose of the corpus
– Availability of data
– Copyright
• Research questions (purpose of the corpus)
– Specialised corpora and corpora intended for morphosyntactic studies tend to be smaller than general corpora and corpora intended for lexical studies. Static corpora are also smaller than dynamic ones.
• Availability of data
– The availability of suitable data (especially in machine-readable form), as well as the ease with which they can be identified, may affect the size of a corpus.
• Copyright
– Copyright clearance can impede corpus development as well as the accessibility and availability of a corpus to a wide audience.
– Copyright law varies internationally.
– Fair dealing: no permission needed for short extracts not exceeding 400 words for prose (or a total of 800 words in a series of extracts, none exceeding 300 words).
– Out of copyright material: author’s / translator’s lifetime + 70 years (UK).
– If you’re in doubt, seek permission! (McEnery et al. 2006)
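The word limits and the lifetime + 70 rule quoted above can be turned into a quick pre-check before contacting a publisher. This is only a sketch of those figures: the function names are invented, the limits are parameters because copyright law varies internationally, and the assumption that protection runs to the end of the calendar year in which the 70 years expire should be verified for the jurisdiction concerned.

```python
def word_count(extract):
    """Count whitespace-separated words in an extract."""
    return len(extract.split())

def within_fair_dealing(extracts, single_limit=400, series_total=800, series_each=300):
    """Apply the prose limits quoted above to one extract or a series of extracts."""
    counts = [word_count(e) for e in extracts]
    if not counts:
        return True
    if len(counts) == 1:
        return counts[0] <= single_limit
    # Series of extracts: cap on the total and on each individual extract.
    return sum(counts) <= series_total and max(counts) <= series_each

def out_of_copyright(death_years, current_year, term=70):
    """True if the term has expired for every rights holder (author, translator).

    Assumes protection lasts until the end of the calendar year that falls
    `term` years after the latest death.
    """
    return current_year > max(death_years) + term

print(within_fair_dealing(["word " * 300, "word " * 300, "word " * 200]))
print(out_of_copyright([1930, 1950], 2021))
```

A result of `False` from either function simply means permission should be sought.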
Communication with publishers
We're delighted to learn of your interesting project, and pleased to grant you general permission to use all book reviews and blogs on our site. We'll be grateful if you can include a link to the site in the pieces you use.
But also…
…We don't feel comfortable posting the entirety of both titles to your database, but would be willing to make half of both books available to your research center…We typically charge a fee of $150 per title for use of such a large portion.
…University Press is pleased to grant you non-exclusive, English language, world rights to reprint limits of fair use (under 300 words)…
Corpus Building
• Identifying material
• Scanning
• Converting texts
• Corpus tagging and annotation
• Ready to be used
Identifying Material
• Possible sources
– Publishers’ websites
– Search engines e.g. Farrar, Straus and Giroux, NYTimes
– Publishing houses specialising in translation
– Databases
– National databases e.g. Three Percent, LTI Korea
– Internet, archives, etc.
• Problems
– Search engine not well-designed e.g. The Telegraph
– Need for specific material
– In some cases, not indicated whether it is a translation or not
– For reviews: not always related to translation
Scanning and Converting Texts
• Scanning
– Flat-bed scanner – Document feeder
– Paper and print quality
– Scanner settings: Resolution and Colour vs Greyscale
• OCR (Optical Character Recognition) Process
– Language support
– Accuracy
– Font type
– Document format
• Text File
– Spelling errors
– Character recognition errors (e.g. Tm instead of I’m)
– Save as .txt file
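The clean-up step after OCR can be partly automated. The sketch below assumes a small hand-built correction table: only the “Tm” for “I'm” entry comes from the example above, and the function names and output file name are invented for illustration. Automatic replacement complements manual proofreading and does not replace it.

```python
import re
import tempfile
from pathlib import Path

# Illustrative correction table: "Tm" -> "I'm" is the example given above;
# further entries would come from proofreading your own scans.
OCR_FIXES = {"Tm": "I'm"}

def fix_ocr_errors(text, fixes=OCR_FIXES):
    """Replace known OCR misreadings, matching whole words only."""
    for wrong, right in fixes.items():
        # \b restricts the match to whole words, so "Tmesis" is left alone.
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text)
    return text

def save_as_txt(text, path):
    """Write the cleaned text as a UTF-8 plain-text file."""
    Path(path).write_text(text, encoding="utf-8")

raw = "Tm sure the scan is readable."
clean = fix_ocr_errors(raw)

# A temporary directory is used here so the example does not touch real data.
out_file = Path(tempfile.mkdtemp()) / "sample.txt"
save_as_txt(clean, out_file)
print(clean)
```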
Corpus Tagging and Annotation
Adds value to a corpus, makes it easier to extract information and prepares texts for use with corpus software.
Factors that affect the extent of tagging/annotation (Olohan 2004):
• Purpose of the corpus
• Corpus software
• Accessibility of the corpus
• Technical expertise of the researcher
Figures: example Header File and Text File.
Corpus Annotation
• POS (Part-of-Speech) Tagging
– Marks up a word in a corpus as corresponding to a particular part of speech, based on both its definition and its context. E.g. John_NP0 loves_VVZ Mary_NP0 ._.
• Lemmatisation
– Reduces the inflectional variants of words to their respective lemmas, i.e. as they appear in a dictionary. E.g. is, are, am -> BE
• Parsing
– Marks the syntactic structure of each sentence. E.g. (S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))))
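To show how these annotation formats can be read by a program, here is a minimal Python sketch. It assumes the `word_TAG` convention of the tagging example, a toy lemma table limited to the forms of BE listed above, and a bracketed parse in the same style as the parsing example. All function names are invented for illustration and do not correspond to any particular tagger or parser.

```python
def parse_tagged(line):
    """Split 'word_TAG' tokens into (word, tag) pairs."""
    # rsplit on the last underscore so that '._.' becomes ('.', '.').
    return [tuple(token.rsplit("_", 1)) for token in line.split()]

# Toy lemma table limited to the example above; a real lemmatiser
# relies on a full lexicon or morphological rules.
LEMMAS = {"is": "BE", "are": "BE", "am": "BE"}

def lemmatise(word):
    """Return the lemma if known, otherwise the word unchanged."""
    return LEMMAS.get(word.lower(), word)

def find_by_tag(pairs, tag):
    """Return all words carrying the given part-of-speech tag."""
    return [word for word, t in pairs if t == tag]

def brackets_balanced(parse):
    """Check that every opening bracket in a parse has a matching close."""
    depth = 0
    for ch in parse:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

tagged = parse_tagged("John_NP0 loves_VVZ Mary_NP0 ._.")
print(find_by_tag(tagged, "NP0"))
print([lemmatise(w) for w in ("is", "are", "am")])
print(brackets_balanced("(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))))"))
```

Searching by tag rather than by word form is what makes an annotated corpus more useful than raw text: `find_by_tag` retrieves every proper noun without listing the names in advance.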
Searching a tagged corpus
Ready to be Used
• Develop and use your own software
• Use existing corpus tools
– TEC Tools: for more information about how to use TEC Tools with local corpora, you can download the tutorial from the TEC webpage.
– WordSmith Tools: a collection of corpus linguistics tools
– ParaConc: a bilingual or multilingual concordancer
– …
“When a corpus is created, a compromise has often to be reached between ideal design criteria and practical constraints. However, while opportunistic choices may be justified, the limitations and distortions they introduce in the makeup of a corpus should not be forgotten when evaluating the results” (Zanettin 2011).
Thank you!
TEC website
http://www.llc.manchester.ac.uk/ctis/research/english-corpus/
TEC Email Address
References
Baker, Mona (1995) ‘Corpora in Translation Studies: An overview and some suggestions for future research’, Target 7(2): 223-243.
Leech, Geoffrey (1991) ‘The State of the Art in Corpus Linguistics’, in Karin Aijmer and Bengt Altenberg (eds) English Corpus Linguistics: Linguistic studies in honour of Jan Svartvik, London: Longman, pp. 8-29.
McEnery, Tony, Richard Xiao and Yukio Tono (2006) Corpus-based Language Studies, London and New York: Routledge.
Olohan, Maeve (2004) Introducing Corpora in Translation Studies, London and New York: Routledge.
Zanettin, Federico (2011) ‘Translation and Corpus Design’, SYNAPS 26: 14-23.