A corpus-based view of similarity

Jun 01, 2018

  • 8/9/2019 A corpus-based view of similarity

    1/27

    (c) John Benjamins

    Delivered by Ingenta

    on: Mon, 27 Mar 2006 08:54:30 to: Chinese University of Hong Kong IP: 137.189.174.203

    A corpus-based view of similarity and difference in translation

    Mona Baker

    Centre for Translation & Intercultural Studies, University of Manchester

    Corpus-based research throws up a number of methodological challenges.

    Many of these are evident in any type of research which attempts to compare

    authentic data of any kind, but the difficulties are accentuated by the

    availability of vast amounts of data in this case. In particular, questions

    relating to how one selects the features to be compared and, more

    importantly, how the findings may be interpreted, invite us to elaborate our

    methodology far more explicitly than in other types of research. The

    accessibility of the same body of data to other researchers also means that

    (a) the findings can be assessed and challenged in other studies, and (b) other

    researchers can invoke different, and perhaps more plausible, explanations of

    the same findings by appealing to parameters that may have been downplayed

    or ignored in previous studies. These issues have been extensively debated in

    the literature on corpus linguistics, but rarely if ever in the context of

    corpus-based translation studies. A small-scale study involving comparisons

    between corpora of translated and non-translated texts in English in terms of

    frequency and distribution of recurring lexical patterns is used to examine

    some methodological issues in corpus-based translation research and suggest

    different ways in which the same findings may be interpreted depending on

    the variables on which individual researchers choose to focus.

    Keywords: translation, corpus-based translation studies, style, literary

    translators, methodology, lexical patterns

    Introduction

    For a number of years now, I have been involved in a research project which at-

    tempts, among other things, to compare translated and non-translated English

    text on the basis of a computerized collection of translated English

    (Translational English Corpus)1 and a similar computerized collection of

    non-translated English text. The latter is a subset of a much larger corpus

    available commercially, namely the British National Corpus. Research based on

    these two computerized collections of text, or some subset of them, has been

    widely published in various journals and collections, by myself and other

    colleagues (see in particular Laviosa 1998a; Laviosa-Braithwaite 1997; Baker

    2000; Olohan & Baker 2000; Olohan 2001). It has also informed research done by

    various postgraduate students at the University of Manchester and elsewhere,

    some of which is reported in readily available publications (see, for example,

    Kenny 2000a, 2000b, 2001; Bosseaux 2001).

    International Journal of Corpus Linguistics:(),.

    - John Benjamins Publishing Company

    The Translational English Corpus has been used not only to compare

    translated and non-translated English, but also to inform stylistic comparisons

    between individual translators represented in the corpus. Of the four

    subcorpora which constitute TEC (namely fiction, biography, inflight magazines and

    news), fiction and biography taken together constitute what might be broadly

    seen as the narrative subcorpus and lend themselves particularly well to stylis-

    tic analyses. The question of how individual translators behave linguistically is

    an important aspect of the study I will describe shortly.

    Since both corpora are currently fluid in terms of size and hence

    composition,2 it is always important to establish the parameters of the corpora

    which inform each study. Details of the corpora used in the current study as well as

    the composition of the translational corpus in terms of the output of individual

    translators are presented later in this article.

    There are then essentially two broad questions addressed by the type of re-

    search that concerns us here: one is whether the patterning of translated text is

    significantly different from the patterning of non-translated text, and the other

    is whether there are patterns of variation within the corpus of translated text

    in terms of the linguistic behaviour of individual translators. These questions

    are ambitious in scope and difficult to answer reliably at the moment. Initially,

    the difficulty of arriving at reasonably reliable findings had to do with insuffi-

    cient data. When we only had one or two million words and very little of the

    output of any specific translator, there was simply not enough data to allow us

    to conduct descriptive studies of most features. We are now beginning to expe-

    rience a different kind of difficulty, namely that there is so much of some types

    of evidence and data that what we really need at the moment is much more re-

    search time and more researchers to be able to follow up the many threads and

    avenues that this resource is opening up, and in order to come up with plausi-

    ble explanations of the patterns that are emerging from our studies. This is an

    interesting difficulty, to have too much rather than too little to go on,3 and

    no less challenging than the difficulty of not having sufficient data to inform

    our research.

    At any rate, the project requires a certain level of investment in research

    time and funding, both of which are currently in very short supply, at least in

    the context of British academia. Nevertheless, it is important for researchers in

    the current climate to press on with carefully thought out research agendas in

    spite of financial and other obstacles, and so we continue to do our best with

    the current resources.

    Methodology in corpus research

    In what follows, I would like to use a small-scale study on recurring lexical

    patterns to demonstrate the potential of both types of corpus research

    (comparison of translated and non-translated language, and identifying

    patterns of stylistic variation in the work of individual translators) for

    offering insights into various aspects of translational behaviour. I also want

    to use this study to explore some of the methodological challenges and problems

    which are involved in this type of research generally.

    Corpus-based research has become very popular among translation studies

    scholars in recent years, with several groups of researchers working on corpus

    projects in various countries and involving different languages, including

    Finnish, German, Italian, Spanish and Brazilian Portuguese.4 Like any area of

    study that becomes attractive to researchers, there is always a danger of uncrit-

    ical application of the methodology, of applying the approach without being

    cautious enough about how one interprets the findings and about how far a

    particular methodology can take us before we have to switch to other method-

    ologies to complement our research. Although corpus-based research is an ap-

    proach that I have been advocating and believe on the whole to offer a very

    powerful research programme for translation studies, what I would like to do

    here is to explore a number of the more problematic aspects of this methodol-

    ogy and suggest concrete ways in which we can nuance some of the findings of

    this type of research.

    In presenting and questioning the methodology I used in the small-scale

    study I wish to report on here, I will try to highlight a number of things, all of

    which are relevant in any kind of research, but I will be focusing on these issues

    in the context of corpus-based translation research specifically:

    (a) the inherent methodological difficulties of this type of research, bearing

    in mind that all research methods have their difficulties, weaknesses and

    blind spots;

    (b) the complexity of the issues involved and the difficulty of coming up with

    plausible explanations for the patterns we manage to identify;

    (c) the potentially conflicting but also potentially illuminating ways of inter-

    preting the same set of findings, depending on the contextual parameters

    we decide to appeal to in offering an explanation for the patterns we choose

    to foreground.

    I am therefore inviting readers to view this article essentially as an exercise in

    methodology, but I also hope to offer them some insight into what seem at this

    stage to be interesting patterns of difference and variation between translated

    and non-translated English, and among individual translators. How we ulti-

    mately verify these patterns and how we interpret them is the real challenge

    that I want to focus on in the following sections.

    The study

    The two corpora used in this study both consist, broadly, of narrative text (see

    Figure 1).

    Corpus of Translated English (Subset of TEC)
      Fiction
      Biography
      Total Size: 6,613,456 words/tokens
      94 files, all full texts

    Corpus of Non-translated English (Subset of BNC)
      Fiction (imaginative domain)
      Total Size: 6,423,325 words/tokens
      171 files, some full texts but mostly extracts of approximately 40,000 words on average

    Figure 1. Overview of Subcorpora Used in the Study

    TEC = Translated English; BNC = Non-translated English

    There are at least two important methodological issues here, both of which

    concern the basis of comparison.5 The first problem is that although the two

    corpora are very similar in size, the BNC consists largely of extracts while TEC

    consists of full texts (this is why we have far fewer texts in TEC even though

    it is a slightly larger corpus). This has implications for the way we interpret

    some of the findings I will present shortly, and it is therefore important to keep

    this difference in the composition of the two corpora in mind. At the same

    time, we must also recognize that imbalances of this type are inevitable. Fur-

    thermore, they are not specific to corpus-based studies. It is in the nature of

    any type of comparison, any attempt to look for similarities and differences,

    that what is being compared can never be totally balanced in every respect. It is

    also a feature of full-text corpora, and particularly those of literary

    texts, that they cannot be balanced even internally: literary texts vary

    tremendously in their lengths, and if we are to include full texts in our corpus

    (which is desirable for many reasons),6 we have to accept that the individual texts will

    be seriously imbalanced in terms of size. In addition to this internal imbalance

    in the corpus, there is also the imbalance between full-text corpora like TEC

    and those, like the BNC, which are made up of extracts. The first

    methodological difficulty then, which is largely inevitable, concerns the imbalance in size

    between and within the two corpora which provide the basis of comparison.

    The second problem concerns the composition of the corpora in terms of

    genres. The BNC subcorpus on which the current study is based consists of

    fiction only. TEC, on the other hand, consists of fiction and biography,7 which I

    have chosen to group together under the heading of narrative. These are the

    kinds of decisions and compromises that researchers have to make all the time

    in the course of conducting descriptive studies. I could defend my decision

    to include biography in the TEC corpus on the basis that (a) it is narrative

    and in this sense shares many features with fiction as a genre, and (b) the dis-

    tinction between fiction and biography is deliberately being blurred by many

    contemporary authors of biography. Whether or not readers accept this type

    of justification or find it plausible, the important issue to bear in mind here

    is that decisions of this type clearly have implications for the way we interpret

    any findings that we present in our research.

    Even if we accept the decision to include biography in the TEC cor-

    pus, there is still the question of the comparability of fictional texts in BNC

    and TEC: fiction is far too broad a category to ensure a reliable basis of

    comparison.8 The BNC corpus used in the current study consists of a

    selection of texts/extracts from the imaginative domain; these were individually

    scrutinized to match as closely as possible the fictional texts in TEC, in terms

    of type of fiction, year of publication, and so on.

    Overall Number of Translators in the Narrative Subcorpus of TEC
      57 (individual); 4 (team)

    Best Represented Translators (no. of texts/words)
      Giovanni Pontiero (6 texts; 562,292 words)
      Dorothy Blair (6 texts; 462,888 words)
      Peter Bush (5 texts; 296,146 words)
      Lawrence Venuti (4 texts; 214,098 words)

    Figure 2. Translators in the Narrative Subcorpus of TEC

    At any rate, the methodological difficulties I want to stress here concern

    the imbalance between and within the corpora in terms of size and genres. It

    would take far too much space to print a full list of the texts in each corpus, but

    details of many titles, authors and (where applicable) translators featuring in

    each corpus will naturally emerge in the course of presenting some of the find-

    ings later in this article. In addition, it may be helpful at this stage to provide

    some information on the composition of the narrative section of TEC in terms

    of the individual translators represented in it (Figure 2).

    Details of this type are important to bear in mind when attempting to ex-

    plain or evaluate any patterns we might identify in our research, as will become

    apparent shortly.

    Lexical patterns in translated text

    Many claims have been made by translation scholars about translations being

    different in a number of ways from non-translated text. The ways in which

    translations are claimed to be different are often not clearly articulated, but

    they include, for example, the assumption that translators are more conserva-

    tive in their use of language; that they tend to prefer more standard forms of

    the language; that there tends to be a raising of the level of formality in transla-

    tion; that translated text is sanitized (in terms of translators avoiding certain

    features such as regionalisms and irregular spelling); and that translators tend

    to produce more uniform texts, for example by avoiding disruption of tense

    sequences, etc.9 A particularly well known and explicitly articulated claim

    underpins much of Lawrence Venuti's work, namely that translators in the

    Anglo-American world specifically favour fluency because this is the strategy most

    valued by their immediate readership (Venuti 1995).

    However fuzzy some of these notions might appear to be, if there is any

    truth in them, particularly the claim of fluency as the overriding strategy in

    Anglo-American translations, we ought to be able to trace the impact of such

    strategies on the lexical make up of translated English text.10 Given the type of

    corpora now available to researchers, especially TEC, it should then be

    interesting to look at recurring lexical patterns, on the assumption that, for

    example, if Anglo-American translators did favour fluency as an overall strategy,

    then we might reasonably expect to find a higher level of recurrence of fixed or

    semi-fixed lexical phrases in translated as opposed to non-translated English text.

    And it should also be interesting to explore how individual translators respond

    to this type of social pressure to produce fluent (and hence unmarked)

    language: language that does not draw attention to itself at the lexical level.

    This is an important issue, and one which is acknowledged in Venuti's work

    on fluency as a favoured strategy in Anglo-American translations as well as

    Gideon Toury's broader work on norms: whatever the overall pattern might

    prove to be, there will always be individual translators who opt to use different

    strategies, to go against the norm. Hence the interest in moving beyond the

    description of overall patterns to study patterns of variation among individual

    translators.

    What we are looking for then is recurring lexical patterns or lexical phrases

    in translated and non-translated English, and patterns of variation in the use of

    these recurring phrases among individual translators. There are various ways

    in which a researcher can get at this kind of data. I do not propose to offer a

    full-blown description of the mechanics of pulling out repeated patterns and

    comparing their frequency at this stage, though this exercise too throws up

    various methodological issues that are interesting to debate.11 However, it is

    important to point out that the kind of software I have had available to me

    for this study (and I believe is available generally) is quite crude, because it

    only allows the identification of exact repetitions, and only if the researcher

    specifies a very precise number of words.12 For example, if one asks for a list

    of all instances of four-word repetitions, this will throw up phrases such as

    'in the event that' and 'in the event of' but not 'in any event', because this

    phrase consists of three rather than four words. Similarly, a request for a list

    of three-word repetitions will not return phrases such as 'in the event that' or

    'in the event of'. Moreover, the software does not identify discontinuous

    patterns such as 'in the [unlikely] event of'.
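
The behaviour described above can be illustrated with a minimal sketch. It is not the program used in the study: the function `repeated_chains`, its crude tokenizer and the toy text are all assumptions made for illustration. The sketch counts only exact, contiguous sequences of one fixed length, so a request for four-word chains cannot return a three-word phrase, and an intervening word breaks the match.

```python
from collections import Counter
import re


def repeated_chains(text, n, cutoff=2):
    """Count exact, contiguous n-word sequences occurring at least `cutoff` times.

    Hypothetical helper for illustration; not the program used in the study.
    """
    # Lowercase and keep only alphabetic word forms (a crude tokenizer).
    words = re.findall(r"[a-z]+", text.lower())
    # Slide a window of exactly n words over the token list.
    chains = Counter(
        "_".join(words[i:i + n]).upper() for i in range(len(words) - n + 1)
    )
    return {chain: freq for chain, freq in chains.items() if freq >= cutoff}


# Toy text, invented for this example.
sample = (
    "In the event of rain we stay. In the event of snow we stay. "
    "In any event we stay. In any event we wait. "
    "In the unlikely event of fire we leave."
)

four = repeated_chains(sample, 4)   # four-word chains only
three = repeated_chains(sample, 3)  # three-word chains only
print(four)
print(three)
```

On the toy text, the four-word list also contains chains such as WE_STAY_IN_ANY, which are repeated but are not recognizable lexical phrases, and the sentence containing 'in the unlikely event of' adds nothing to the count for IN_THE_EVENT_OF.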

    We are clearly working with fairly crude software at the moment, and

    therefore have to be very cautious in making any claims at this stage. Having

    said that, and to place this exercise in a realistic context, it is also important to

    point out that the software in question is nowhere near as crude as trying to

    find patterns of this type manually.

    To get around the software problem (at least partly) for the purpose of

    this study, I pulled out lists of phrases of various lengths, but mostly 3-word,

    4-word, and 5-word phrases; examples can be seen in Appendix 1, including

    an example of a two-word phrase with a punctuation mark (that is,). I also

    selected some lexical phrases from each list to analyze more closely on the basis

    of full concordances for each corpus.

    Because of the crudeness of the software, the lists generated by the program

    contained a great deal of noise: in this case combinations of words which are

    not recognizable as fixed or semi-fixed lexical phrases, such as 'the monument

    to the battle' and 'the shrine of the lady', which are listed as occurring 32 and 22

    times respectively in TEC. At this stage, no principled way or robust

    methodology suggests itself for selecting specific phrases to analyze in detail, nor for

    systematically weeding out irrelevant patterns. Very broadly, however, I tried to

    follow two principles in selecting from the two sets of lists generated by the soft-

    ware some 50 or so patterns for closer analysis to inform this methodological

    exploration:

    (a) all patterns selected had to be recognizable as recurring lexical phrases of

    English (e.g. in other words, once and for all, at the same time, as a matter of fact)

    rather than phrases that are clearly tied to the theme of a single text. Examples

    of the latter include 'history of the siege of', which occurs 44 times in TEC. This

    is part of the title of a book, The History of the Siege of Lisbon, by José

    Saramago, translated from Portuguese by Giovanni Pontiero. Similarly, the curious

    expression 'I reflected in the wing chair' is the most frequent 6-word pattern in

    TEC, but it was not selected for analysis because all 144 instances occur in the

    same text (Cutting Timber, by Thomas Bernhard, translated from German by

    Ewald Osers).

    (b) phrases relating to temporal and spatial orientation (such as in the middle

    of, for the first time, at the end of, in front of the, the end of the month) were

    ignored, because they occur with very high frequency in both corpora.
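
Principle (a) was applied by hand in this study. One rough way of approximating part of it automatically, offered only as a sketch under stated assumptions, is to keep chains that are attested in more than one text, since a chain tied to the theme of a single text will by definition not pass. The function name `widely_attested`, the threshold `min_texts` and all counts below are invented for illustration. A filter of this kind cannot judge whether a chain is a recognizable lexical phrase of English.

```python
def widely_attested(per_text_counts, min_texts=2):
    """Return chains that occur in at least `min_texts` different texts.

    `per_text_counts` maps a text label to a {chain: frequency} dictionary.
    Hypothetical helper; the threshold is an arbitrary assumption.
    """
    texts_per_chain = {}
    for label, counts in per_text_counts.items():
        for chain, freq in counts.items():
            if freq > 0:
                # Record which texts contain the chain at least once.
                texts_per_chain.setdefault(chain, set()).add(label)
    return sorted(
        chain for chain, labels in texts_per_chain.items()
        if len(labels) >= min_texts
    )


# Invented counts for three hypothetical texts.
toy = {
    "text_a": {"AT_THE_SAME_TIME": 3, "HISTORY_OF_THE_SIEGE": 40},
    "text_b": {"AT_THE_SAME_TIME": 2, "ON_THE_OTHER_HAND": 4},
    "text_c": {"ON_THE_OTHER_HAND": 1},
}

kept = widely_attested(toy)
print(kept)
```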

    At any rate, the material I am about to discuss (some of which is detailed

    in Appendix 1) should not be interpreted as systematic findings, and I stress

    that I am not presenting it as such. What I am trying to do here is explore

    the kind of questions that researchers can sensibly try to address (at least in

    part) using the corpus methodology, and how they might go about refining

    their questions and following up the various research threads that this type of

    resource can bring to their attention.

    The crude nature of the software aside, a number of patterns caught my

    attention as I began to select some phrases and look at their concordances more

    closely. Given the methodological focus of this article, it seems reasonable to

    organize the discussion under headings that might guide a methodology for

    identifying and assessing patterns of this type in general, rather than under the

    individual patterns selected for analysis.

    Overall frequency and number of recurring lexical phrases

    in both corpora

    As a first step, it seems reasonable to establish whether there is a noticeable

    difference between the two corpora in terms of the overall number and fre-

    quencies of the lexical patterns we have chosen to focus on. We may assume,

    for instance, that if translators into English did favour fluency as an overall

    strategy, this preference would be reflected in a higher reliance on recurring,

    familiar lexical phrases of the language: frequent use of recognizable, fixed or

    semi-fixed lexical phrases must be a major way of producing an impression of

    fluency in a text.

    I have already stressed the unreliability of the lists generated by the software

    I have available at the moment, so we cannot rely on an automatic compari-

    son of frequencies of all phrases generated by the program in this particular

    study. Nevertheless, the lists do suggest that a significant difference might ex-

    ist between the two corpora in this respect. At least for the patterns that I se-

    lected and decided to examine more closely, the difference sometimes seems

    staggering. Here are some examples of differences in the overall frequencies of

    different types of lexical phrases that occur in the two corpora:

    Phrase                     TEC    BNC
    at the same time           669    323
    in the middle of the       401    209
    from time to time          394    137
    on the other hand          347    150
    that is,                   288    119
    in other words             161     36
    that is to say             129     31
    once and for all           120     26
    when it comes to            78     35
    at the edge of the          67     46
    I thought to myself         43     12
    in a manner of speaking     40     10
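
Because the two corpora differ slightly in size, raw frequencies are easier to compare once they are expressed as rates. The following is a minimal sketch, assuming that the token totals in Figure 1 are an adequate denominator; the function name `per_million` and the choice of a per-million scale are illustrative assumptions, not part of the study.

```python
# Corpus sizes in tokens, as given in Figure 1.
TEC_SIZE = 6_613_456
BNC_SIZE = 6_423_325

# Raw frequencies (TEC, BNC) copied from the list above.
RAW = {
    "at the same time": (669, 323),
    "in the middle of the": (401, 209),
    "from time to time": (394, 137),
    "on the other hand": (347, 150),
    "that is,": (288, 119),
    "in other words": (161, 36),
    "that is to say": (129, 31),
    "once and for all": (120, 26),
    "when it comes to": (78, 35),
    "at the edge of the": (67, 46),
    "I thought to myself": (43, 12),
    "in a manner of speaking": (40, 10),
}


def per_million(count, corpus_size):
    """Convert a raw frequency to a rate per million tokens."""
    return count / corpus_size * 1_000_000


rates = {
    phrase: (per_million(tec, TEC_SIZE), per_million(bnc, BNC_SIZE))
    for phrase, (tec, bnc) in RAW.items()
}

# Columns: phrase, TEC rate, BNC rate, ratio of the two rates.
for phrase, (tec_rate, bnc_rate) in rates.items():
    print(f"{phrase:<25}{tec_rate:8.1f}{bnc_rate:8.1f}{tec_rate / bnc_rate:7.2f}")
```

Since the corpora are close in size, normalization changes the picture very little here; it matters more when individual texts of very different lengths are compared.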

    This does not mean that there are no differences in the other direction: some

    patterns must occur more frequently in BNC than in TEC, though these are not

    as easy to spot as the patterns listed above, and many others like them.

    Examples of this type, though the difference does not seem so significant, include the

    7-word phrase 'out of the corner of his eye', which occurs 18 times in BNC and 13

    times in TEC, and the 6-word phrase 'on the other side of the', which occurs 126

    times in BNC and 117 times in TEC. What we need is a piece of software that

    can run through both lists and automatically identify significant differences in

    the frequencies of phrases which occur in both corpora.13 We would then have

    to examine these carefully and make some sense of them in terms of the other

    issues I will be tackling next. But the point I'm making here is that overall

    frequency is only one issue to consider in this respect. It is merely a starting point,

    but one we cannot afford to ignore.
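
One statistic widely used in corpus linguistics for comparisons of this kind is the log-likelihood ratio (G2). It is not used in this study and is introduced here purely as an illustration of what such software might compute. The sketch assumes that the token totals from Figure 1 can stand in for the number of opportunities a phrase has to occur, and it treats every occurrence as independent, which ignores the distribution across texts. The function name `log_likelihood` is hypothetical; 3.84 is the standard chi-square critical value for one degree of freedom at p < 0.05.

```python
import math


def log_likelihood(count_a, count_b, size_a, size_b):
    """Two-corpus log-likelihood (G2) for a single item.

    Expected counts assume the item is equally frequent in both corpora.
    """
    total = count_a + count_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


TEC_SIZE = 6_613_456  # tokens, from Figure 1
BNC_SIZE = 6_423_325  # tokens, from Figure 1
CRITICAL_P05 = 3.84   # chi-square critical value, 1 degree of freedom, p < 0.05

# (TEC, BNC) frequencies reported in the text.
examples = {
    "in other words": (161, 36),
    "on the other side of the": (117, 126),
}

for phrase, (tec, bnc) in examples.items():
    g2 = log_likelihood(tec, bnc, TEC_SIZE, BNC_SIZE)
    verdict = "above" if g2 > CRITICAL_P05 else "below"
    print(f"{phrase}: G2 = {g2:.2f} ({verdict} the 3.84 threshold)")
```

A high score says only that the overall frequencies differ more than chance would predict under these assumptions. It says nothing about whether the occurrences are concentrated in one text or one translator.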

    Distribution across texts

    Apart from differences between the two corpora in terms of overall frequencies

    of individual lexical phrases, the next question concerns the distribution of

    individual phrases across the texts which constitute each corpus: irrespective of

    the overall frequency of a given phrase, is it evenly distributed across different

    texts, or does it occur with higher frequency in some texts rather than others?

    The examples in Appendix 1 suggest that the distribution of repeated

    lexical phrases may prove somewhat less even in translated text, with individual

    texts showing what appear to be relatively high levels of repetition of the same

    expression in many cases. The most striking example is the repetition of 'that

    is,' in Shaun Whiteside's translation Notebooks (63 occurrences). Appendix 1

    also includes a full concordance of the 63 instances, which readers may wish to

    examine closely at their leisure.
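
The degree of concentration in a case like this can be put into rough figures. The sketch below uses three numbers reported in this article: 288 occurrences of 'that is,' in TEC, 63 of them in Notebooks, and 94 files in the TEC subcorpus. It assumes that Notebooks corresponds to a single file and, as a deliberate simplification, treats all files as if they were the same length, which they are not. The function name `concentration` is hypothetical.

```python
def concentration(total_in_corpus, count_in_text, n_texts):
    """Return (share, ratio) for one text.

    share: the text's share of all occurrences in the corpus.
    ratio: that share relative to a perfectly even spread over n_texts texts.
    """
    share = count_in_text / total_in_corpus
    even_share = 1 / n_texts
    return share, share / even_share


# 288 occurrences in TEC, 63 in one translation, 94 files (Figure 1).
share, ratio = concentration(total_in_corpus=288, count_in_text=63, n_texts=94)
print(f"share of occurrences: {share:.1%}; {ratio:.1f} times an even spread")
```

A fuller treatment would weight each text by its length, which requires per-text word counts that are not given here.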

    Assuming this pattern of uneven distribution holds as we examine more

    data, the next thing a researcher might find useful to establish is whether there

    is a plausible reason for the high frequency of a specific lexical phrase, or of

    several lexical phrases, in a specific text or indeed in the work of a specific

    translator. For example, is the high frequency partly a function of the length

    of the text? I have already drawn attention to the imbalance both between and

    within the corpora informing this study in terms of text lengths, so this is one

    issue to bear in mind when interpreting some of the emerging patterns.

    It is possible to run the software program I used in this study on individual

    texts as well as a corpus of texts. I have done so with a number of individual

    texts by Giovanni Pontiero and Peter Bush, who are among the best represented

    translators in the corpus. My initial reaction is that the frequency of use of lexi-

    cal phrases is not entirely a function of the length of the text. Even in the longest

    translation by Peter Bush, there seems to be very little repetition of specific lex-

    ical phrases. This is not the case in translations by Giovanni Pontiero, as we can

    see by comparing the top of the lists of 4-word phrases in one of Bush's

    translations (Realms of Strife, by Juan Goytisolo; 96,600 words in total) and one of

    Pontiero's (The History of the Siege of Lisbon, by José Saramago; 125,713 words).

    List of individual four-word chains (Realms of Strife; cut-off point 5):

    17 ON_THE_RUE_POISSONNIERE
    14 AT_THE_END_OF
    14 ON_THE_RUE_DE
    12 IN_THE_COURSE_OF
    11 FOR_THE_FIRST_TIME
    9 AS_A_RESULT_OF
    9 CASA_DE_LAS_AMERICAS
    9 THE_CASA_DE_LAS
    8 A_MEMBER_OF_THE
    8 WITH_A_GROUP_OF
    7 BY_THE_IDEA_OF
    7 FLAT_ON_THE_RUE
    7 IN_ONE_OF_THE
    7 IN_THE_COMPANY_OF
    7 IN_THE_FIELD_OF
    7 ON_THE_EVE_OF
    7 THE_FIRST_TIME_IN
    6 A_FEW_DAYS_LATER
    6 A_GROUP_OF_FRIENDS
    6 FROM_TIME_TO_TIME
    6 IN_RELATION_TO_THE
    6 MY_RETURN_TO_PARIS
    6 ON_MY_RETURN_TO
    6 ON_THE_OTHER_HAND
    6 THE_RUE_DE_BIEVRE
    6 TO_MEET_UP_WITH
    6 TO_THE_POINT_OF
    6 TURNED_OUT_TO_BE
    6 WAS_NOT_AT_ALL
    5 AT_THE_SAME_TIME

    List of individual four-word chains (The History of the Siege

    of Lisbon; cut-off point 5):

    48 HISTORY_OF_THE_SIEGE

    48 THE_SIEGE_OF_LISBON

    44 OF_THE_SIEGE_OF

    36 THE_HISTORY_OF_THE

    25 THE_PORTA_DE_FERRO

    19 THAT_IS_TO_SAY18 AS_IF_HE_WERE

    18 ESCADINHAS_DE_SAO_CRISPIM

    18 IT_SAYS_HERE_THAT

    18 THE_ESCADINHAS_DE_SAO

    15 IT_IS_TRUE_THAT

    14 TO_BE_ABLE_TO

    12 AS_FAR_AS_THE

    12 AT_THE_SAME_TIME

    12 FOR_THE_FIRST_TIME

    12 OF_THE_HISTORY_OF

    10 IF_HE_WERE_TO

    10 OUR_LORD_JESUS_CHRIST

    9 AS_IF_HE_HAD

    9 A_MANNER_OF_SPEAKING

    9 FROM_TIME_TO_TIME

    9 IN_A_MANNER_OF

    8 AND_AT_THAT_MOMENT

    8 IF_WE_WERE_TO

    8 IN_THE_CASE_OF

    8 IN_THE_DIRECTION_OF

    8 IN_THE_PRESENCE_OF

    8 IT_IS_DIFFICULT_TO

    8 MILAGRE_DE_SANTO_ANTONIO

    8 NO_MORE_THAN_A

    8 ON_THE_ESCADINHAS_DE

    8 ON_THE_OTHER_HAND

    8 THE_BISHOP_OF_OPORTO

    8 THE_FACT_IS_THAT8 WERE_IT_NOT_FOR

    8 WHEN_IT_COMES_TO

    7 AT_DEAD_OF_NIGHT

  • 8/9/2019 A corpus-based view of similarity

    13/27

    (c) John BenjaminsDelivered by Ingenta

    on: Mon, 27 Mar 2006 08:54:30to: Chinese University of Hong KongIP: 137.189.174.203

    A corpus-based view of similarity and difference in translation

    7 DO_MILAGRE_DE_SANTO

    7 GUILLAUME_OF_THE_LONG

    7 IN_A_STATE_OF

    7 IN_FRONT_OF_THE

    7 IN_ORDER_TO_BE

    7 OF_DOM_AFONSO_HENRIQUES

    7 OF_THE_PORTA_DE

    7 ONCE_AND_FOR_ALL

    7 ON_THE_OTHER_SIDE

    7 ON_TOP_OF_THE

    7 RUA_DO_MILAGRE_DE

    7 TAKING_INTO_ACCOUNT_THE

    7 THERE_MUST_HAVE_BEEN

    7 THE_MONTE_DA_GRAÇA

    7 THE_RUA_DO_MILAGRE

    7 YOU_ONLY_HAVE_TO

    6 AND_THEN_IT_MIGHT

    6 BEAR_IN_MIND_THAT

    6 DOM_AFONSO_HENRIQUES_WAS

    6 DR_MARIA_SARA_WHO

    6 IN_THE_MIDDLE_OF

    6 IT_MIGHT_BE_ST

    6 IT_WOULD_HAVE_BEEN

    6 MARIA_SARA_AND_RAIMUNDO

    6 NOT_TO_MENTION_THE

    6 OVER_AND_OVER_AGAIN

    6 SARA_AND_RAIMUNDO_SILVA

    6 THAT_HE_SHOULD_HAVE

    6 THEN_IT_MIGHT_BE

    6 THERE_WILL_BE_NO

    6 THE_ARCHBISHOP_OF_BRAGA

    6 THE_CONQUEST_OF_SANTAREM

    6 THE_DIRECTION_OF_THE

    6 THE_FAITHFUL_TO_PRAYER

    6 THE_PROOFS_OF_THE

    6 THE_SHEET_OF_PAPER

    6 WHAT_IS_YOUR_NAME

    5 ALL_THE_MORE_SO
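    Lists of this kind, with a frequency cut-off, can be produced by sliding a window of fixed length over the running words of a text and counting each sequence. The following is a minimal sketch of that idea in Python, not the Chains program itself: the function name, the crude tokenisation rule and the toy sample are all assumptions made for illustration.

```python
from collections import Counter
import re


def recurring_chains(text, n=4, cutoff=5):
    """Count every n-word sequence in `text`; keep those with freq >= cutoff."""
    # Crude tokenisation (an assumption): runs of letters, optional apostrophe part
    words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", text.upper())
    # Moving window: one chain starting at every word position
    counts = Counter(
        "_".join(words[i:i + n]) for i in range(len(words) - n + 1)
    )
    kept = [(chain, freq) for chain, freq in counts.items() if freq >= cutoff]
    # Highest frequency first, then character order within each frequency band
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Toy sample (invented): one phrase repeated five times, another twice
sample = "on the other hand " * 5 + "from time to time " * 2
for chain, freq in recurring_chains(sample, n=4, cutoff=5):
    print(freq, chain)
```

    Note that overlapping chains are counted separately, which is why the lists above contain entries such as HISTORY_OF_THE_SIEGE alongside THE_SIEGE_OF_LISBON.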

    Even if we allow for the difference in overall length of the two texts, there does seem to be a greater tendency to rely on fixed and semi-fixed lexical phrases in Pontiero's translation. This type of strategy becomes easier to identify when we examine several works by the same translator and find that the same lexical phrases are used again and again in different texts by different authors. I will return to the question of individual translators and their stylistic preferences shortly.

    Leaving the question of individual translators and their stylistic preferences aside for a moment, and staying with the question of individual texts, there are other reasons why a lexical phrase or set of phrases might occur with higher frequencies in individual texts, apart from the overall length of the text in question. One such reason has to do with strategies of characterization. I have not found any examples of this in TEC, but the only two instances of noticeably high frequencies of specific lexical phrases in an individual text that I have identified in the BNC subcorpus (for the phrase 'that is to say') can be explained in these terms. As can be seen in Appendix 1, this expression occurs 17 times in the 36,433-word extract from The Remains of the Day by Kazuo Ishiguro. The narrator in this book is a butler; he is portrayed as a very old-fashioned character who is obsessed with accuracy and detail and is therefore constantly rewording what he says in order to be more accurate. The second example is of the same expression being used 6 times in the 43,859-word extract from Nice Work by David Lodge, where the author or narrator himself draws attention to a particular character's use of the expression. In both cases the reuse of a lexical phrase is a conscious, deliberate strategy on the part of the writer. Strategies of this type could in principle also account for higher frequencies of specific phrases in individual texts in TEC.
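    Because the extracts differ so much in length, raw counts such as these are easier to compare once they are expressed as a rate per fixed number of running words. The sketch below does this for the figures quoted above, plus the count for The History of the Siege of Lisbon given in Appendix 1. The base of 10,000 words and the function name are assumptions chosen for illustration; the study itself reports raw frequencies only.

```python
def per_10k(occurrences, extent_in_words):
    """Normalised frequency: occurrences per 10,000 running words."""
    return occurrences / extent_in_words * 10_000


# Counts and extents for 'that is to say' as quoted in the text and Appendix 1
extracts = {
    "The Remains of the Day (BNC extract)": (17, 36_433),
    "Nice Work (BNC extract)": (6, 43_859),
    "The History of the Siege of Lisbon (TEC)": (19, 125_713),
}

for title, (count, extent) in extracts.items():
    print(f"{title}: {per_10k(count, extent):.2f} per 10,000 words")
```

    On these figures the rate in The Remains of the Day is roughly three times that of the other two texts, which is in line with reading the phrase there as a device of characterization rather than an effect of text length.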

    By contrast, if fluency were a favoured strategy in translation then we would not expect the high frequency of a lexical phrase in a given translation to be associated with a particular character. A good example of this is Infanta (by Bodo Kirchhoff, translated by John Brownjohn), where the 21 uses of the expression 'in other words' are spread across the speech of five different characters, as well as the voice of the narrator. This is of course where the distinction between fiction (with its many voices and characters) and biography (where we often have far fewer voices represented) becomes important, and the inclusion of biography in this study may then raise specific problems.

    As far as translations are concerned, another possible explanation for the high frequency of a specific lexical phrase in an individual text could be that it is a direct carrying over of a feature of the source text: it could have nothing to do with a translator's attempt (conscious or otherwise) to use familiar or unmarked lexical phrases to give an impression of fluency. This is clearly one avenue that many translation scholars would be keen to explore, but it is precisely this tendency to refer everything back to the source text that the Translational English Corpus project was designed to counterbalance. I therefore wish to highlight other ways in which we might explore questions of this type on a larger scale, without being unduly restricted to one source text and one target text at a time, even if ultimately we will want to go back to the source text in some cases to seek further or complementary explanations.

    Other interesting patterns might capture our attention as we sift through the data, and these patterns might lead us to seek explanations outside the direct source/target text relationship. For example, the expression 'in other words' occurs 21 times in a biography translated by Carol Maier (Delirium and Destiny: A Spaniard in Her Twenties). Of these 21 instances, two occur in Carol Maier's own afterword. Might this suggest a stylistic quirk of the translator rather than a carrying over of a feature in the source text?

    Distribution across translators

    Next is the issue of distribution across translators. I have already mentioned that even the longest translation by Peter Bush seems sparing in its repetition of individual lexical phrases. In fact, some of the best represented translators in TEC are conspicuous by their absence or very marginal presence in the various concordances I have examined closely. Peter Bush and Lawrence Venuti, for example, do not seem to make heavy use of any particular fixed or semi-fixed lexical phrases. This may or may not be confirmed by further and closer analysis of lexical phrases other than the ones I have managed to study so far. If it were to be confirmed, it would not come as a surprise to those familiar with the translators in question: both are very conscious of their use of language and have repeatedly argued that translators should not pander to the expectations of an Anglo-American readership.

    At any rate, in terms of focusing on individual translators rather than in-

    dividual texts, some of the questions we may wish to address are as follows. If

    an expression occurs with high frequency in the work of a specific translator,

    could it be because:

    (a) it is a favourite expression/quirk of the translator independent of the style of the author? Translators are writers, and like other writers may have their particular favoured expressions. Since TEC is specifically designed to represent several works by the same translator working with different authors, we should be able to establish in most cases whether the frequent use of a set expression is a feature of the output of a particular translator, irrespective of source text author, or whether it only occurs in translations of a specific author.

    (b) it reflects a translation strategy, rather than simply stylistic preference on the part of the translator? For example, I specifically chose to look more closely at three lexical phrases used for glossing/explicating: 'that is,' / 'that is to say' / 'in other words'. Giovanni Pontiero, the best represented translator in the corpus, uses 'that is to say' 47 times and 'in other words' 31 times overall. The concordance of 'that is to say' in Appendix 1 shows that Pontiero uses this type of glossing expression fairly heavily in practically all his translations. Bearing in mind that some of these are translations of the Portuguese author José Saramago and some are translations of the very different Brazilian author Clarice Lispector, this is an interesting pattern which might be worth looking into in more detail.

    (c) Finally, but this would require much more extensive and detailed study of the work of a specific translator, there is the question of whether we can identify an overall tendency for a given translator to rely heavily on fixed or semi-fixed lexical phrases throughout his or her work. This would be an attempt to explore not so much the overall question of whether fluency is a preferred strategy in English translations but whether it is a preferred strategy of a specific translator. We can only explore this question of course, or rather any question relating to a specific translator, if we have several works by the translator in the corpus. This means that, for example, although the repetition of the expression 'that is,' 63 times in Shaun Whiteside's translation (Notebooks) is striking, there is little we can say about this because it is the only translation we currently have by him in TEC.[14]
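    The first of these questions, whether an expression follows the translator or the source author, amounts to cross-tabulating occurrences by translator and author. The sketch below does this for the TEC files listed under 'that is to say' in Appendix 1. It is a rough illustration only: it uses raw counts and covers only the files named in the appendix, and the helper name and the threshold of five occurrences are hypothetical choices, not part of the study.

```python
from collections import defaultdict

# (translator, source author, occurrences) for 'that is to say', from Appendix 1
rows = [
    ("Giovanni Pontiero", "José Saramago", 19),
    ("Giovanni Pontiero", "Clarice Lispector", 15),
    ("Michael Hulse", "Elfriede Jelinek", 12),
    ("Sophie Bennett", "Hoda Barakat", 9),
    ("Terry Hale & Liz Heron", "various", 9),
    ("Giovanni Pontiero", "José Saramago", 7),
]

# Sum occurrences per translator, broken down by source author
by_translator = defaultdict(lambda: defaultdict(int))
for translator, author, count in rows:
    by_translator[translator][author] += count


def spans_authors(translator, minimum=5):
    """True if the phrase occurs >= `minimum` times for two or more authors."""
    authors = [a for a, c in by_translator[translator].items() if c >= minimum]
    return len(authors) >= 2


for translator, authors in by_translator.items():
    print(translator, dict(authors), spans_authors(translator))
```

    A translator who passes this check for several expressions is a candidate for closer study; a translator represented by a single work in the corpus cannot be assessed in this way.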

    The temporal dimension

    It would also be interesting to look into the development of a translator's style over time. I say this because, for instance, there are six translations by Giovanni Pontiero in TEC, the first published in 1986 and the last in 1996, with a spread of 10 years between them. The first translation, The Hour of the Star by Clarice Lispector (28,580 words), is admittedly much shorter than the other five, but it simply does not figure in any of the concordances I have examined, including concordances of glossing expressions such as 'that is to say' and 'in other words', which feature prominently in Pontiero's other translations. It would be interesting to use a resource such as TEC to explore whether there is evidence of a change of style and strategy over time in the case of a particular translator.

    Summary and conclusions

    To conclude this exploratory study, I would like to start by stressing that in corpus work, and generally, figures and frequencies are only a starting point. We need to take a closer look at the data and get a feel for the texts and what is happening in them, as well as the people who produce these texts, in order to move beyond low-level description to situated explanation. The value of raw figures and frequencies is simply that they draw our attention to some features that are likely to be worth investigating in more detail.[15] They offer one rationale for selecting features to focus on, but they cannot offer an interpretation of those features, nor does documenting such quantitative features in itself provide a justification for undertaking the research in the first place. Indeed, in corpus work, as in any other type of research, the real challenge lies in two things: one is how a researcher might select features to focus on, and the other is how he or she might interpret what is found in the data.

    In terms of the first issue, researchers working with corpora must realize that just because the computer appears to be objectively churning out data, this does not mean that the process of selecting what one focuses on is not just as subjective and just as variable as it is in any other type of research. Here, as elsewhere in research, we all create our object of study. Indeed, one thing that is interesting to monitor in this type of research is the way in which one's own perspective as a researcher creates the object of research and contextualizes the findings. For example, of all the wealth of potentially interesting data that an exercise such as this can throw up, I decided to focus on a number of expressions that are typically used for glossing or explicating ('that is', 'that is to say', 'in other words'); another researcher might have chosen to focus on completely different types of expression. Moreover, my attempt to explain the patterns that I saw emerging as I examined the distribution of these phrases in translated and non-translated text focused on the translator and the individual text, where another researcher might have been more inclined to focus on something like the source language (maybe the texts that have a higher incidence of repetitions are translations from a particular source language or languages), or the translator's gender. I chose to focus largely on individual translators, irrespective of their gender or the languages from which they translate.

    Secondly, irrespective of the issue of subjectivity in selecting what to focus on, the question of how one arrives at plausible explanations of whatever one chooses to find is just as complex and elusive in corpus work as it is in all research. The computer can help us locate features, that is, textual features, but it cannot explain them. The onus of interpretation still lies with the researcher. Where the corpus methodology does score highly in my view is in allowing a higher level of transparency. Corpus-based work, if done responsibly, at least has the virtue of being transparent and allowing other researchers not only to check the validity of the basic claims being made but also to offer different interpretations of the same data.

    And finally, I would like to stress again that corpus-based research in principle takes textual material as a starting point, but this does not mean that it necessarily ignores or sets out to downplay the human element. Nor should it be seen as a free-standing methodology that does not need to be complemented by other methods of research. Like any other methodology, it can only take us so far, and no further.

    Acknowledgements

    I am grateful to the following for assistance in undertaking this piece of research. For access to the chains program (authored by Isabel Barth): Professor Michael Stubbs, Universität Trier. For software development and maintenance of TEC: Saturnino Luz, Trinity College Dublin. For computational support: Paul Johnston, UMIST. For administrative assistance on the TEC project: Gabriela Saldanha, former MPhil student at the Centre for Translation & Intercultural Studies, UMIST, currently PhD student at Dublin City University.

    Notes

    1. TEC is held at the Centre for Translation and Intercultural Studies, University of Manchester (http://www.art.man.ac.uk/SML/ctis/research/tec.htm). For a detailed description of this corpus, see Laviosa (1998b); Baker (1999); Olohan & Baker (2000).

    2. TEC is being enlarged on a regular basis. As the size of TEC grows, my colleagues and I have also been adding more texts to the BNC subcorpus that we use in our studies. The sizes of both corpora, and hence their composition, may therefore vary from one study to another, but details of such variation are provided where relevant.

    3. See Sinclair (1986) on the issue of too much evidence.

    4. See Laviosa (1998a) and Olohan (2000) for examples of such studies.

    5. See Kilgarriff (2001) for a detailed discussion of issues relating to the comparison of corpora.

    6. For a very good and accessible discussion of various issues relating to corpus compilation, including the pros and cons of opting for full texts vs text extracts, see Kenny (2001, Chapter 5, especially pp. 105–117).

    7. And biography, in turn, includes a number of sub-genres which the compilers of TEC chose to treat as one broad genre: biographies, autobiographies, memoirs, and books which consist of correspondence between well-known personalities. An example of the latter is The Boulez-Cage Correspondence, translated from French by Robert Samuels and published by Cambridge University Press.

    8. There is of course ultimately no ideal basis of comparison, whatever phenomena we are attempting to compare and whatever standards of comparison we choose to use. Since every phenomenon, every event, and every text is by nature unique, comparability will always remain a relative issue.

    9. For an overview of some of these claims, see Baker (1996, 1999).

    10. As well of course as its syntactic make-up and a host of other features on the discourse level.

    11. For details of computational methods used in capturing the data presented in this study, readers may contact Paul Johnston in the first instance ([email protected]).

    12. The software in question is called Chains (authored by Isabel Barth, February 2001). It is not available commercially. The program identifies chains of words which recur in a text or corpus. A chain is a sequence of word-forms: either two-word pairs (i.e. sequences of two adjacent word-forms) or longer chains of repeated word-forms (e.g. a five-word sequence). The program proceeds through the text, with a moving window, identifying each x-word sequence (as specified by the user). Each chain is then checked against stored sequences, and the program eventually prints out a list of all x-word sequences and their frequency.

    13. Something similar to what I'm envisaging here exists for lists of individual words, but not for phrases. This is the 'compare wordlists' function in Wordsmith Tools. The procedure compares all the words in two lists, already generated by the wordlist program, and reports on all those which appear significantly more often in one than the other, including words which appear more than a minimum number of times in one even if they do not appear at all in the other.

    14. One of the more interesting aspects of corpus work is that it can throw up unexpected patterns of this type, which can then feed back into the design of the corpus itself. In this case, the TEC team is now actively seeking other translations by Shaun Whiteside to include in the corpus.

    15. On the issue of using raw frequencies without recourse to measures of statistical significance, see Danielsson (2001, 2003).

    16. A character who, rather awkwardly for me, doesn't herself believe in the concept of character. That is to say (a favourite phrase of her own), Robyn Penrose, Temporary Lecturer in English Literature at the University of Rummidge, holds that character is a bourgeois myth, an illusion created to reinforce the ideology of capitalism.

    17. What is more, I learned from Mr. Tomero Alarcón that a substantial part of Delirio y destino may well have been dictated to an amanuensis, whose identity is now unknown. In other words, my developing hunch about Zambrano's sentences had been correct: not only was the book written quickly and its narrative mixed with both philosophical thinking and poetic association, parts of it may also have been spoken and simultaneously recorded.

    I have deliberately referred to that goal in terms of an occurrence rather than a product, because what results when delirium (with its precarious and potentially rewarding consequences) and destiny (with its precarious and potentially rewarding possibility) interact is the occurrence I most wanted my translation to convey. In other words, I wanted to translate, above all, Zambrano's razón poética, which is present in Delirio y destino more as an event, a manifestation of what Giles Deleuze has discussed, also in terms of writing, as a possibility of life that invokes the oppressed bastard race that ceaselessly stirs beneath dominations, resisting everything that crushes and oppresses.
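    The note above on the 'compare wordlists' function describes a comparison that exists for single words but not for phrases. A minimal sketch of how such a comparison might be extended to two lists of phrase frequencies is given below. The choice of the log-likelihood statistic, the 3.84 threshold (the chi-square critical value for p < 0.05 with one degree of freedom), the function names and the toy figures are all assumptions made for illustration; the note does not say which test the existing tool applies.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood (G2) for one item observed in two corpora."""
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def compare_lists(list_a, size_a, list_b, size_b, min_freq=5, threshold=3.84):
    """Items with >= min_freq hits in either list whose G2 reaches threshold."""
    results = []
    for item in set(list_a) | set(list_b):
        fa, fb = list_a.get(item, 0), list_b.get(item, 0)
        if max(fa, fb) < min_freq:
            continue  # too rare in both lists to report
        score = log_likelihood(fa, size_a, fb, size_b)
        if score >= threshold:
            results.append((item, fa, fb, round(score, 2)))
    return sorted(results, key=lambda r: -r[3])


# Toy figures (invented): two phrase lists from corpora of 100,000 words each
corpus_a = {"in other words": 40, "rare phrase": 2, "only in a": 12}
corpus_b = {"in other words": 10, "rare phrase": 1}
for row in compare_lists(corpus_a, 100_000, corpus_b, 100_000):
    print(row)
```

    As in the procedure described in the note, a phrase that is absent from one list is still reported if it passes the minimum frequency in the other.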

    References

    Baker, M. (2000). Towards a Methodology for Investigating the Linguistic Behaviour of Professional Translators. Target, 12(2), 241–266.

    Baker, M. (1999). The Role of Corpora in Investigating the Linguistic Behaviour of Professional Translators. International Journal of Corpus Linguistics, 4(2), 281–298.

    Baker, M. (1998). Réexplorer la langue de la traduction: une approche par corpus. Meta, 43(4), 480–485.

    Baker, M. (1996). Corpus-based Translation Studies: the Challenges that Lie Ahead. In H. Somers (Ed.), Terminology, LSP and Translation (pp. 175–186). Amsterdam & Philadelphia: John Benjamins.

    Baker, M. (1995). Corpora in Translation Studies: An Overview and Some Suggestions for Future Research. Target, 7(2), 223–243.

    Baker, M. (1993). Corpus Linguistics and Translation Studies. Implications and Applications. In M. Baker, G. Francis, & E. Tognini-Bonelli (Eds.), Text and Technology: In Honour of John Sinclair (pp. 233–250). Amsterdam: John Benjamins.

    Bosseaux, Ch. (2001). A Study of the Translator's Voice and Style in the French Translations of Virginia Woolf's The Waves. In M. Olohan (Ed.), CTIS Occasional Papers, Volume 1 (pp. 55–75). Manchester: CTIS, UMIST.

    Danielsson, P. (2001). The Automatic Identification of Meaningful Units in Language. Doctoral Dissertation, Department of Swedish, Göteborg University, Sweden.

    Danielsson, P. (2003). Automatic Extraction of Meaningful Units from Corpora: A Corpus-driven Approach Using the Word stroke. International Journal of Corpus Linguistics, 8(1), 109–127.

    Gellerstam, M. (1986). Translationese in Swedish Novels Translated from English. In L. Wollin & H. Lindquist (Eds.), Translation Studies in Scandinavia (pp. 88–95). Lund: CWK Gleerup.

    Kenny, D. (2001). Lexis and Creativity in Translation. Manchester: St. Jerome.

    Kenny, D. (2000a). Lexical Hide-and-Seek: looking for creativity in a parallel corpus. In M. Olohan (Ed.), Intercultural Faultlines. Research Models in Translation Studies I: Textual and Cognitive Aspects (pp. 93–104). Manchester: St. Jerome.

    Kenny, D. (2000b). Translators at Play: Exploitations of Collocational Norms in German-English Translation. In B. Dodd (Ed.), Working with German Corpora (pp. 143–160). Birmingham: University of Birmingham Press.

    Kenny, D. (1997). (Ab)normal Translations: a German-English Parallel Corpus for Investigating Normalization in Translation. In B. Lewandowska-Tomaszczyk & P. J. Melia (Eds.), Practical Applications in Language Corpora. PALC 97 Proceedings (pp. 387–392). Łódź: Łódź University Press.

    Kilgarriff, A. (2001). Comparing Corpora. International Journal of Corpus Linguistics, 6(1), 97–132.

    Laviosa, S. (Ed.). (1998a). L'Approche basée sur le corpus/The Corpus-based Approach. Special Issue of Meta, 43(4).

    Laviosa, S. (1998b). The English Comparable Corpus: A Resource and a Methodology. In L. Bowker, M. Cronin, D. Kenny & J. Pearson (Eds.), Unity in Diversity: Current Trends in Translation Studies (pp. 101–112). Manchester: St. Jerome Publishing.

    Laviosa, S. (1997). How Comparable Can 'Comparable Corpora' Be? Target, 9(2), 289–319.

    Laviosa-Braithwaite, S. (1997). Investigating Simplification in an English Comparable Corpus of Newspaper Articles. In K. Klaudy & J. Kohn (Eds.), Transferre Necesse Est (pp. 531–540). Budapest: Scholastica.

    Laviosa-Braithwaite, S. (1995). Comparable Corpora: Towards a Corpus Linguistic Methodology for the Empirical Study of Translation. In M. Thelen & B. Lewandowska-Tomaszczyk (Eds.), Translation and Meaning (Part 3) (pp. 153–163). Maastricht: Hogeschool Maastricht.

    Olohan, M. (2001). Spelling out the Optionals in Translation: A Corpus Study. UCREL Technical Papers, 13, 423–432.

    Olohan, M. (Ed.). (2000). Intercultural Faultlines. Research Models in Translation Studies I: Textual and Cognitive Aspects. Manchester: St. Jerome Publishing.

    Olohan, M. & Baker, M. (2000). Reporting 'that' in Translated English: Evidence for Subconscious Processes of Explicitation. Across Languages & Cultures, 1(2), 141–158.

    Sinclair, J. (1986). First throw away your evidence. In G. Leitner (Ed.), The English Reference Grammar (pp. 56–65). Tübingen: Max Niemeyer.

    Sinclair, J. (1996). The Search for Units of Meaning. Textus, IX, 75–106.

    Stubbs, M. (1996). Text and Corpus Analysis. Oxford: Basil Blackwell.

    Toury, G. (1995). Descriptive Translation Studies and Beyond. Amsterdam & Philadelphia: John Benjamins.

    Venuti, L. (1995). The Translator's Invisibility. London & New York: Routledge.

    Appendix 1

    Key

    Lexical pattern in large bold italics. Details for BNC (corpus of non-translated English) & TEC (corpus of translated English). Detailed information on files accounting for a high percentage of occurrences in either corpus: filename, followed by number of occurrences of expression in file, total extent of file, name of translator, source language, title of published translation, author of source text.

    that is,

    BNC: Total: 119 instances. Maximum 5 in 1 text.

    TEC: Total: 288 instances. Maximum 63 in 1 text.

    bb000003 (63) Extent: 78,144 words

    Shaun Whiteside, German. Notebooks 1924–1954 (Wilhelm Furtwängler)

    fn000071 (31) Extent: 123,865

    Nancy Roberts, Arabic. Beirut Nightmares (Ghada Samman)

    fn000008 (17) Extent: 72,239

    Sophie Bennett, Arabic. The Stone of Laughter (Hoda Barakat)

    bb000005 (16) Extent: 135,019 words

    Naomi Seidman, Hebrew. Conversations with Dvora. An Experimental Biography of the First Modern Hebrew Woman Writer (Amia Lieblich)

    fn000011 (12) Extent: 27,770

    Margaret Jull Costa, Portuguese. Lucio's Confession (Mario de Sa Carneiro)

    that is to say

    BNC: Total instances: 31; maximum 17 in one file.

    ar3 (17) Extent: 36,433 words

    Kazuo Ishiguro, The Remains of the Day (voice of butler throughout; obsessed with detail and accuracy; part of characterization)

    any (6) Extent: 43,859 words

    David Lodge, Nice Work (5 instances: voice of Robyn Penrose, the boring academic; 1 instance of narrator commenting on the character's use of the expression; again part of characterization)[16]

    TEC: Total instances: 129; maximum 19 in one file.

    fn000005 (19) Extent: 125,713

    Giovanni Pontiero, Portuguese. The History of the Siege of Lisbon (José Saramago)

    fn000006 (15) Extent: 193,720

    Giovanni Pontiero, Brazilian Portuguese. Discovering the World (Clarice Lispector)

    fn000018 (12) Extent: 80,530

    Michael Hulse, German. Wonderful, Wonderful World (Elfriede Jelinek)

    fn000008 (9) Extent: 72,239

    Sophie Bennett, Arabic. The Stone of Laughter (Hoda Barakat)

    fn000024 (9) Extent: 116,273

    Terry Hale & Liz Heron, French. The Dedalus Book of French Horror: The 19th Century (various)

    fn000007 (7) Extent: 142,178

    Giovanni Pontiero, Portuguese. The Gospel According to Jesus Christ (José Saramago)

    in other words

    BNC: Total: 36 instances; maximum 3 in one text.

    TEC: Total: 162 instances; maximum 21 (in two texts).

    bb000011 (21) Extent: 120,643

    Carol Maier, Spanish. Delirium and Destiny: A Spaniard in Her Twenties (Maria Zambrano)

    Note: 2 of the 21 instances in bb000011 are in Carol Maier's own afterword.[17]

    fn000020 (21) Extent: 144,659

    John Brownjohn, German. Infanta (Bodo Kirchhoff)

    Note: multiplicity of voices; not part of characterization.

    bb000012 (12) Extent: 70,521

    Robert Samuels, French. The Boulez-Cage Correspondence (Pierre Boulez and John Cage)

    fn000006 (11) Extent: 193,720

    Giovanni Pontiero, Brazilian Portuguese. Discovering the World (Clarice Lispector)

    fn000058 (11) Extent: 64,103

    Ewald Osers, German. Cutting Timber (Thomas Bernhard)

    once and for all

    BNC: Total instances: 26; maximum 2 in 1 file.

    TEC: Total instances: 120; maximum 10 in one file.

    fn000014 (10) Extent: 80,900

    Ines Rieder and Jill Hannum, German. Violetta (Pieke Bierman)

    fn000005 (7) Extent: 125,713

    Giovanni Pontiero, Portuguese. The History of the Siege of Lisbon (José Saramago)

    fn000007 (6) Extent: 142,178

    Giovanni Pontiero, Portuguese. The Gospel According to Jesus Christ (José Saramago)

    fn000071 (6) Extent: 123,865

    Nancy Roberts, Arabic. Beirut Nightmares (Ghada Samman)
