The COMET Project: Comparable and Parallel Corpora for the English- Portuguese Pair Stella E. O. Tagnin University of São Paulo UCCTS – Ormskirk 27-29 July 2010

The COMET Project: Comparable and Parallel Corpora for the English-Portuguese Pair

Stella E. O. Tagnin
University of São Paulo
UCCTS – Ormskirk
27-29 July 2010


Brief history

1998 – The CoMET Project is conceived:
Technical Corpus – CorTec
Translation Corpus – CorTrad
Learner Corpus – CoMAprend

Originally: 5 languages
English, French, German, Italian, and Spanish


CorTec

2001 – Technical Translation subject in the Specialization in Translation Course
11 different glossaries: http://www.fflch.usp.br/citrat/citrat.htm
11 bilingual comparable corpora

Subsequent years: more corpora and more glossaries (not published)

Plus: corpora from graduate students


2005 – 1st launch

CorTec (Technical Corpus)
http://www.fflch.usp.br/dlm/comet/consulta_cortec.html

CoMAprend (Learner Corpus)
http://www.fflch.usp.br/dlm/comet/comaprend.html


CorTec 2005

5 comparable corpora:
Cooking recipes
Ecotourism – environment
Computer Science
Cardiology – Hypertension
Law – agreements

English – Portuguese original texts
approximately 200,000 words each


CorTec 2005

Online Tools
Frequency List (also alphabetical)
Concordancer
  equal to (exact word)
  starting with (prefixes)
  finishing with (suffixes)
  containing (root of word)
n-grams
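
The three tools listed above can be illustrated with a minimal Python sketch. This is not the CorTec implementation, which the slides do not describe; the function names, the tokenizer, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"\w+", text.lower())


def frequency_list(tokens, alphabetical=False):
    """Return (word, count) pairs, alphabetically or by descending frequency."""
    counts = Counter(tokens)
    if alphabetical:
        return sorted(counts.items())
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# The four concordancer match modes named on the slide, as predicates.
MATCH_MODES = {
    "equal to": lambda tok, q: tok == q,
    "starting with": lambda tok, q: tok.startswith(q),
    "finishing with": lambda tok, q: tok.endswith(q),
    "containing": lambda tok, q: q in tok,
}


def concordance(tokens, query, mode="equal to", width=3):
    """Return keyword-in-context lines with `width` tokens on each side of a hit."""
    matches = MATCH_MODES[mode]
    lines = []
    for i, tok in enumerate(tokens):
        if matches(tok, query):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


def ngrams(tokens, n):
    """Count contiguous sequences of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Invented sample text, not taken from any CorTec corpus.
sample = "Preheat the oven. Heat the butter, then reheat the sauce in the oven."
tokens = tokenize(sample)

print(frequency_list(tokens)[:3])
print(concordance(tokens, "heat", mode="containing"))
print(ngrams(tokens, 2).most_common(2))
```

Keeping the match modes in a dictionary of predicates means a further mode could be added without touching the concordance loop.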


CoMAprend - 2005

Writings by students
  undergraduate courses
  extracurricular courses

Languages
  English, French, German, Italian, and Spanish

Only corpora for download
2008: inclusion of investigation tools


CorTec 2008 – 2nd launch

14 corpora

Ecotourism
Hypertension
Legal agreements
Astronomy
Renal failure
Linguistics
Magnetic flowmeters
Nutritional Supplements
Computer Science
Football
Coffee
Cultural Tourism
Cooking recipes 1 & 2


CorTrad 2009 – new!

Cooperation began May 2008:
CoMET: collection and preparation of texts
Linguateca: computational implementation – DISPARA (Santos, 2002); alignment, POS tagging and semantic annotation

Parallel Corpus
English → Portuguese
Portuguese → English

Interface: only in Portuguese (being translated into English)

http://www.fflch.usp.br/dlm/comet/consulta_cortrad.html


CorTrad – 3 parallel subcorpora

Science Journalism: Ptg → Eng, 1,076 texts

Technical-Scientific (Cookbook): Ptg → Eng, 130,000 words

Literary (Short Stories): Eng → Ptg, 28 Australian; Canadian (coming soon)


CorTrad 2009

Population: availability

Special features:
Multiversion – comparison of various stages of translation
Elaborate search queries – specific for each subcorpus


Copyright - Disclaimer


FAQ


Science Journalism: Revista Fapesp


Examples of Search Queries


Help

Referee
These slides showing COMPARA inside the COMET frame need to be removed. Go to the COMPARA site and repeat the searches!

Search possibilities for Science Journalism


Comparing results

How are the verbs “acreditar” (= believe) and “achar” (= think) used in different text types in the journalistic corpus?

[lema="acreditar"] + Distribuição por gênero de texto (distribution by text genre)

[lema="achar"] + Distribuição por gênero de texto (distribution by text genre)
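
The idea behind combining a lemma query with a distribution by text genre can be sketched in Python as follows. This is not how DISPARA or CWB evaluate queries: the parser below handles only a single `[attribute="value"]` pattern with exact matching, and the token records and genre labels are invented for illustration.

```python
import re
from collections import Counter

# Accepts only the simplest query form, e.g. [lema="acreditar"] (an assumption).
QUERY_RE = re.compile(r'^\[(\w+)="([^"]*)"\]$')


def parse_query(query):
    """Split a one-token query into its attribute name and required value."""
    match = QUERY_RE.match(query.strip())
    if match is None:
        raise ValueError(f"unsupported query: {query!r}")
    return match.group(1), match.group(2)


def distribution(tokens, query, by):
    """Count tokens matching the query, grouped by the metadata field `by`."""
    attribute, value = parse_query(query)
    return Counter(tok[by] for tok in tokens if tok.get(attribute) == value)


# Invented annotated tokens; "genre" values are hypothetical labels.
tokens = [
    {"word": "acredita", "lema": "acreditar", "genre": "news"},
    {"word": "acho", "lema": "achar", "genre": "interview"},
    {"word": "acreditam", "lema": "acreditar", "genre": "news"},
    {"word": "acreditamos", "lema": "acreditar", "genre": "interview"},
    {"word": "acha", "lema": "achar", "genre": "interview"},
]

for query in ('[lema="acreditar"]', '[lema="achar"]'):
    print(query, dict(distribution(tokens, query, by="genre")))
```

Running both queries against the same grouping field is what makes the two verbs directly comparable across text types.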


= believe


= think


= believe


= think


Technical-Scientific: Cookbook


How are adverbs distributed among the 3 parts of the Cooking corpus: filling – introduction – conclusion?

[pos="ADV"]

distribuição por parte da obra (distribution by part of file)


When is “natural” in Portuguese NOT translated as “natural” in English?

natural vs !natural
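
The logic of this search over an aligned corpus can be sketched in Python: keep the pairs whose source sentence contains the word while the aligned target sentence does not. The sentence pairs and function names below are invented for illustration and are not CorTrad data or its query syntax.

```python
import re


def contains_word(sentence, word):
    """True if `word` occurs as a whole word in `sentence`, ignoring case."""
    pattern = rf"\b{re.escape(word)}\b"
    return re.search(pattern, sentence, flags=re.IGNORECASE) is not None


def source_without_target(pairs, source_word, target_word):
    """Select aligned pairs with `source_word` in the source but no `target_word` in the target."""
    return [
        (source, target)
        for source, target in pairs
        if contains_word(source, source_word) and not contains_word(target, target_word)
    ]


# Hypothetical aligned Portuguese-English sentence pairs.
pairs = [
    ("Adicione o iogurte natural.", "Add the plain yogurt."),
    ("Use corante natural.", "Use natural coloring."),
    ("Sirva com suco natural de laranja.", "Serve with fresh orange juice."),
    ("Deixe esfriar.", "Let it cool."),
]

for source, target in source_without_target(pairs, "natural", "natural"):
    print(source, "=>", target)
```

Matching on whole words avoids counting forms such as "naturally" as a translation by "natural"; a real query would more likely match on lemmas than on surface forms.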


Result: “natural” ≠ “natural”


Literary: Australian short stories


Search word: house


Semantic Tagging

Clothes

Color


Clothes


Color


Syntactic Function


Journalistic by document


pos = part-of-speech


Semantic field: color


CorTrad - specifics

Improvements over other English-Portuguese parallel corpora

Multiversion – comparison of various stages of translation process

Elaborate Search queries: specific for each corpus


Computational background

DISPARA (Santos, 2002) – system to make parallel corpora available online

Corpus processing: IMS-CWB (Christ et al., 1999), now Open CWB

PoS tagging
  Portuguese: PALAVRAS (Bick, 2000) http://visl.hum.sdu.dk/visl/pt/
  English: CLAWS (Rayson & Garside, 1998) http://www.comp.lancs.ac.uk/computing/research/ucrel/claws/

Interface conceived by team and implemented by Patricia Tagnin


Thanks to

Eckhard Bick and Paul Rayson for permission to use PALAVRAS and CLAWS, respectively, for CorTrad.

Sandra Aluísio and Arnaldo Candido Júnior from NILC for hosting the CoMET Project

Diana Santos - Linguateca, co-financed by the Portuguese Government, by the EU (FEDER and FSE), under agreement POSC/339/1.3/C/NAC, by UMIC and by FCCN.

CNPq, for grants to develop COMET (2005) and COMET (2008).


Obrigada

Stella

([email protected])

Thank you