Corpus Linguistics and Corpora

Post on 21-Dec-2015

Corpus

Corpus, plural corpora: a collection of linguistic data, either compiled as written texts or as a transcription of recorded speech. The main purpose of a corpus is to verify a hypothesis about language - for example, to determine how the usage of a particular sound, word, or syntactic construction varies.

Corpus Linguistics

Corpus linguistics deals with the principles and practice of using corpora in language study. A computer corpus is a large body of machine-readable texts.

(cf. Crystal, David. 1992. An Encyclopedic Dictionary of Language and Languages. Oxford, 85)

Corpus

CORPUS (13c: from Latin corpus, body. The plural is usually corpora) (1) A collection of texts, especially if complete and self-contained: the corpus of Anglo-Saxon verse...

(cf. McArthur, Tom. 1992. "Corpus", The Oxford Companion to the English Language. Oxford, 265-266)

Chomsky 1957

"Any natural corpus will be skewed. Some sentences won't occur because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description [of language based on the corpus] would be no more than a mere list."

Syntactic structures. The Hague, 159

Fillmore 1992

"I have two main observations to make. The first is that I don't think there can be any corpora, however large, that contain information about all of the areas of English lexicon and grammar that I want to explore; all that I have seen are inadequate. The second observation is that every corpus that I've had a chance to examine, however small, has taught me facts that I couldn't imagine finding out about in any other way."

In "Corpus linguistics" or "Computer-aided armchair linguistics", in: Svartvik, Jan (ed.) Directions in Corpus Linguistics. Berlin/New York, 35.

Types of corpus

Monolingual corpora - in which the texts are all in the same language.

Parallel and/or aligned corpora - in which originals and translations are aligned so that both texts are synchronized to appear on the screen together and it is easy to see how the translator has translated the original.

Comparable corpora - in which a selection of original texts has been made in two or more languages dealing with the same subject or genre.

Concurrent corpora - a term used to describe texts taken from newspapers on the same subject on approximately the same dates.

Specialized corpora - texts on specialized subjects. The principal use for these corpora is the extraction of terminology and complementary explanatory material - definitions, explanations, semantic relations, etc.

'Do-it-yourself' corpora - a term coined by those of us using small specialized corpora for the purpose of teaching translation or language.

Disposable corpora - the same as 'do-it-yourself' corpora, but taking into account that such corpora need to be disposed of after use so that their users do not get into trouble with copyright restrictions.

How do you search a corpus?

Concordancing
• Sentence level – see BNC http://www.natcorp.ox.ac.uk
• COMPARA – parallel concordance http://www.linguateca.pt/COMPARA
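
To make the idea of concordancing concrete, here is a minimal keyword-in-context (KWIC) sketch in Python. It is not how the BNC or COMPARA interfaces work internally; the function name `concordance`, the `width` parameter and the sample text are assumptions made for illustration only.

```python
import re

def concordance(text, keyword, width=30):
    """Return keyword-in-context (KWIC) lines for every match of `keyword`."""
    lines = []
    # \b restricts matches to whole words; IGNORECASE also catches capitalized forms
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Pad both sides so the keyword lines up in one column
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines

# Hypothetical sample text, used only to show the output format
sample = (
    "A corpus is a collection of linguistic data. "
    "The main purpose of a corpus is to verify a hypothesis about language."
)
for line in concordance(sample, "corpus"):
    print(line)
```

Each output line shows one hit with a fixed amount of context on either side, which is the layout concordancers typically use so that recurring patterns around the keyword become visible.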

The Survey of English Usage

60s - Randolph Quirk et al. > launched the Survey of English Usage (SEU)
• "with the aim of collecting a large and stylistically varied corpus as the basis for a systematic description of spoken and written English"
• Brown, Lancaster-Oslo/Bergen (LOB) and London-Lund Corpus of Spoken English
• See ICAME - International Computer Archive of Modern and Medieval English at the Norwegian Computing Centre for the Humanities at http://gandalf.aksis.uib.no/icame.html

Today at University of London at http://www.ucl.ac.uk/english-usage/

ICE - the International Corpus of English

Download the sampler of this corpus, fully tagged and analysed, from http://www.ucl.ac.uk/english-usage/ice-gb/sampler/form.htm

Quality versus quantity

A small but fully analyzed and tagged corpus - e.g. early corpora and ICE (1 million words)

British National Corpus – 100 million words

Other corpora
• Bank of English - 450 million words

The Internet

Corpora, lexicography & terminology

Lexicography BEFORE corpora
• Emphasis on etymology
• Complex definitions
• Usage based on intuitions of lexicographers

Terminology BEFORE corpora
• Standardization > one word = one concept, rigid definitions
• Paper dictionaries/glossaries

Lexicography & terminology AFTER corpora
• Emphasis on modern usage in context
• Simple definitions
• Usage based on evidence in texts
• Emphasis on establishing REAL rather than IDEAL usage

COBUILD project

Begun in 1969
Collins, the well-known dictionary publisher, and the University of Birmingham – led by John Sinclair
A pioneering project
Objective > to collect texts for a corpus of contemporary texts from which to extract information on modern English usage
Work proceeded during the 70s and 80s - see Sinclair (Ed.) 1987

COBUILD > Bank of English

Present site for COBUILD > Bank of English: http://www.titania.bham.ac.uk/docs/about.htm

British National Corpus (BNC) - original

Oxford University Computing Service at http://www.natcorp.ox.ac.uk/

This is completely free – but you only get up to 50 results

Brigham Young University (BYU)

http://corpus.byu.edu/
Note:
• Corpus of American English
• BNC
• TIME corpus
• Corpus de Português
• Corpus de Español

PLEASE NOTE: You will need to create a username and password to use this – but it costs nothing

BNC – CQP version

Lancaster University http://bncweb.lancs.ac.uk/bncwebSignup/

PLEASE NOTE: You will need to create a username and password to use this – but it costs nothing

Other large monolingual corpora

Portuguese > CETEMPUBLICO http://www.linguateca.pt/cetempublico/
Spanish > Real Academia
German > Mannheimer corpus

Using corpora to study syntax

For example:
• whether certain nouns occur more often in the singular than plural
• how pronouns are used in different languages
• which verbs favour certain forms of tense, aspect or mood
• how adjectives combine with nouns
• where adjuncts occur in sentences
• etc.
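
The first question in the list (singular versus plural) can be approximated with a simple frequency count. The sketch below is only an illustration on raw word forms: the function name `number_counts` and the sample text are hypothetical, and a real study would normally rely on a part-of-speech tagged corpus rather than on surface strings.

```python
import re
from collections import Counter

def number_counts(text, singular, plural=None):
    """Count singular and plural occurrences of one noun in `text`."""
    if plural is None:
        plural = singular + "s"  # naive default; pass irregular plurals explicitly
    # Crude tokenizer (an assumption): lower-cased runs of letters only
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return {"singular": tokens[singular], "plural": tokens[plural]}

# Hypothetical sample text, invented for this example
sample = "One corpus is small. Two corpora are larger. Large corpora cost more."
print(number_counts(sample, "corpus", "corpora"))
```

Because the count works on word forms alone, it cannot separate a noun from an identically spelled verb; that limitation is one reason tagged corpora matter for syntactic questions.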

Monolingual corpora

General language corpora useful for studying:
• Words in context
• Problems of COLLOCATION
• Relative usage of synonyms
• Syntactic structures
• Sentence structure
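
Collocation and the relative usage of synonyms can both be explored by counting which words appear near a node word. The following is a minimal sketch under stated assumptions: the function name `collocates`, the `span` parameter and the sample sentence are invented for illustration, and raw co-occurrence counts stand in for the statistical association measures used in practice.

```python
import re
from collections import Counter

def collocates(text, node, span=2):
    """Count words occurring within `span` tokens to the left or right of `node`."""
    # Crude tokenizer (an assumption): lower-cased letters and apostrophes
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Hypothetical sample sentence contrasting two near-synonyms
sample = "strong tea and strong coffee but powerful engine and strong tea"
print(collocates(sample, "strong", span=1).most_common(3))
print(collocates(sample, "powerful", span=1).most_common(3))
```

Comparing the two result lists side by side is a rough way to see that near-synonyms such as "strong" and "powerful" keep different company.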

Parallel Corpora - multilingual

European Commission - Multilingual http://ec.europa.eu/
EUROPARL - Multilingual http://www.statmt.org/europarl/
ELDA http://www.elda.org/sommaire.php

Parallel Corpora

COMPARA EN/PT http://www.linguateca.pt/compara
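
A parallel concordance returns each hit in the original together with its aligned translation. The sketch below shows the idea on a toy example; it does not use COMPARA's data or interface, and the aligned sentence pairs and the function name `parallel_concordance` are invented for illustration.

```python
import re

# Hypothetical aligned segments, invented for illustration (not taken from COMPARA)
ALIGNED = [
    ("The house is big.", "A casa é grande."),
    ("She reads a book.", "Ela lê um livro."),
    ("I bought a house.", "Comprei uma casa."),
]

def parallel_concordance(pairs, keyword):
    """Return (source, translation) pairs whose source contains `keyword` as a whole word."""
    keyword = keyword.lower()
    return [
        (source, target)
        for source, target in pairs
        if keyword in re.findall(r"\w+", source.lower())
    ]

for source, target in parallel_concordance(ALIGNED, "house"):
    print(f"EN: {source}\nPT: {target}\n")
```

The search runs on the source side only; the alignment is what brings the matching translation along, so the user can see how each occurrence was rendered.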

Corpógrafo - LINGUATECA

An on-line suite of tools we have developed for:
• Construction of corpora
• Semi-automatic extraction of terminology
• Construction of terminology databases
• Terminology & corpora research
• Research into information retrieval and knowledge engineering
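
As a rough illustration of what semi-automatic term extraction can involve, the sketch below proposes frequent word sequences as term candidates for a human to review. This is not Corpógrafo's method; the stopword list, the function name `candidate_terms` and the sample text are assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, far from complete)
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "are", "to", "for", "on", "with"}

def candidate_terms(text, n=2, min_freq=2):
    """Return n-grams that occur at least `min_freq` times and do not start or end with a stopword."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # zip over n shifted copies of the token list yields consecutive n-grams
    ngrams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(
        " ".join(gram)
        for gram in ngrams
        if gram[0] not in STOPWORDS and gram[-1] not in STOPWORDS
    )
    return [(term, freq) for term, freq in counts.most_common() if freq >= min_freq]

# Hypothetical sample text, invented for this example
sample = (
    "Parallel corpora align texts. Parallel corpora help translators. "
    "Comparable corpora differ from parallel corpora."
)
print(candidate_terms(sample))
```

The output is only a candidate list: n-grams here can cross sentence boundaries, and deciding which candidates are genuine terms remains the human half of "semi-automatic".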

CORPÓGRAFO

http://www.linguateca.pt/corpografo
FREE!
On-line!
For individual research

Bibliography

ICAME site at http://helmer.aksis.uib.no/icame.html

BIBER, D., CONRAD, S. & REPPEN, R. 1998. Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press.

BIBER, Douglas, Stig Johansson, Geoffrey Leech, Susan Conrad & Edward Finegan. 1999. Longman Grammar of Spoken and Written English. Harlow: Pearson Education Ltd.

HOEY, Michael. 1991. Patterns of Lexis in Text. Oxford: Oxford University Press. ISBN 0 19 437142 5.

MCENERY, Tony & WILSON, Andrew. 2001. Corpus Linguistics. 2nd Edition. Edinburgh: Edinburgh University Press.

OAKES, Michael P. 1998. Statistics for Corpus Linguistics. Edinburgh: Edinburgh University Press. ISBN 0 7486 0817 6.

SINCLAIR, John (ed). 1987. Looking Up - An account of the COBUILD project in lexical computing. Collins COBUILD. Collins ELT: London and Glasgow.

STUBBS, Michael. 1996. Text and Corpus Analysis: Computer-assisted Studies of Language and Culture. Oxford: Blackwell Publications Ltd. ISBN 0-631-19512-2 (pbk).
