Using Corpora and Evaluation Tools Diana Maynard Kalina Bontcheva http://gate.ac.uk/ http://nlp.shef.ac.uk/ March 2004


Using Corpora and Evaluation Tools

Diana Maynard

Kalina Bontcheva

http://gate.ac.uk/ http://nlp.shef.ac.uk/

March 2004


Corpus structure

• Located in gatecorpora in CVS
• Each directory under gatecorpora has a corpus, e.g. gatecorpora/ace
• Each corpus can have sub-parts, e.g. ace/bnews
• Each (sub-)corpus has a clean and a marked directory; these are important
• Clean holds the unannotated version, while marked holds the human-marked ones
• There may also be a processed subdirectory; this is a datastore (unlike the other two)
• Corresponding files in each subdirectory must have the same name
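
The naming rule in the last bullet can be checked mechanically. The following is a minimal Python sketch, not part of the gatecorpora utilities: the function name check_corpus_layout is hypothetical, and it assumes only that a (sub-)corpus directory contains plain clean and marked subdirectories as described above.

```python
from pathlib import Path


def check_corpus_layout(corpus_dir):
    """Compare file names in clean/ and marked/ under a (sub-)corpus directory.

    Returns two sorted lists: names found only in clean/, and names found
    only in marked/. Both lists are empty when the layout is consistent.
    """
    corpus = Path(corpus_dir)
    clean = {p.name for p in (corpus / "clean").iterdir() if p.is_file()}
    marked = {p.name for p in (corpus / "marked").iterdir() if p.is_file()}
    return sorted(clean - marked), sorted(marked - clean)


# Example call (the path is taken from the slide and must exist locally):
# only_clean, only_marked = check_corpus_layout("gatecorpora/ace/bnews")
```

The processed subdirectory is deliberately left out of the check, since the slide describes it as a datastore rather than a plain directory of files.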


Tools for corpus manipulation

• There are lots of tools available in gatecorpora/utilities and in subdirectories of each corpus

• Many of the corpora, e.g. MUC and ACE, come in different formats (e.g. inline vs standoff markup) and have been converted to GATE-style annotations

• There are also tools for e.g. counting things, changing annotation names, etc. (mostly JAPE grammars)


Corpora available

• MUC7 (newswires)
• MUSE (news texts from the web)
• ACE
• ACE Chinese
• ACE Arabic
• Romanian (news texts; 1984)
• CMU seminars
• Jobs
• CONLL’03 – part of Reuters with NEs
• Bulgarian - news


MUC 7 corpus

• Newswires used in the official MUC 7 evaluation
• Data available in MUC format and GATE format
• Annotation types: Person, Location, Organization, Money, Percent, Date, Time
• Division into training and test sets


MUSE corpus

• News texts from various websites (BBC, Guardian, etc.)

• Annotation types: Person, Organisation, Location, Date, Time, Money, Percent, Address

• Slight differences in annotation guidelines from MUC, e.g. people’s titles are included in names

• Available from gatecorpora/news in various subdirectories


ACE corpus

• 3 types of text: newswire, broadcast news and newspaper

• Broadcast news and newspaper available as ground truth and original (degraded) texts

• Annotation types: Person, Organisation, Location, GPE, Facility

• Some annotations have roles to indicate metonymous usage

• Guidelines are different from MUC and MUSE

• Available from gatecorpora/ace in various subdirectories


Multilingual ACE

• As for ACE, but in Chinese and Arabic

• Texts are in UTF-8

• No degraded versions of these texts

• Available from gatecorpora/ace/ace03/Chinese/ and gatecorpora/ace/ace03/Arabic/


CMU Seminars & Jobs

• Corpora frequently used to evaluate relation extraction and wrapper induction systems

• gatecorpora/jobs-corpus and gatecorpora/cmu-seminars

• Converted into GATE XML, ready for use


CONLL’03 shared task

• Corpus used in the CONLL’03 shared task for evaluating NE recognition

• In English, part of the Reuters corpus

• Markup is e.g. <I-LOC>, not converted to MUSE tags

• Use reuterstogate.jape to convert to MUSE tags

• gatecorpora/ReutersWithNamedEntities
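
To illustrate the kind of renaming involved, here is a small Python sketch. It is not a description of what reuterstogate.jape actually does: the mapping table is an assumption based on the CoNLL'03 entity classes (PER, LOC, ORG, MISC) and the MUSE annotation type names, and the function name conll_to_muse is hypothetical.

```python
import re

# Assumed mapping from CoNLL'03 entity classes to MUSE-style type names.
# MISC is left out because MUSE has no obvious counterpart for it.
CONLL_TO_MUSE = {
    "PER": "Person",
    "LOC": "Location",
    "ORG": "Organisation",
}


def conll_to_muse(tag):
    """Map a tag such as 'I-LOC' or 'B-PER' to a MUSE-style type name.

    Returns None for tags without an assumed equivalent (e.g. 'I-MISC')
    and for anything that is not an entity tag (e.g. 'O').
    """
    match = re.fullmatch(r"[IB]-([A-Z]+)", tag)
    if match is None:
        return None
    return CONLL_TO_MUSE.get(match.group(1))
```

For example, conll_to_muse("I-LOC") returns "Location". How unmapped classes should be handled is a decision for the real conversion grammar.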


Annotation Diff: per-document evaluation


Regression Test

At corpus level – corpus benchmark tool – tracking system’s performance over time


How it works

• Clean, marked, and processed
• Corpus_tool.properties – must be in the directory from where gate is executed
• Specifies configuration information about
  – What annotations types are to be evaluated
  – Threshold below which to print out debug info
  – Input set name and key set name
• Modes
  – Default – regression testing
  – Human marked against already stored, processed
  – Human marked against current processing results
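
The evaluation loop described above can be pictured with a short Python sketch. This is an illustration only, not the corpus benchmark tool's implementation: it assumes strict matching (same start offset, end offset and type), assumes the threshold is applied to the per-document F-measure, and uses hypothetical function names (prf, benchmark). The real tool may score partial matches and apply its threshold differently.

```python
def prf(key, response):
    """Strict precision, recall and F1 over (start, end, type) annotations."""
    key, response = set(key), set(response)
    correct = len(key & response)
    precision = correct / len(response) if response else 0.0
    recall = correct / len(key) if key else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def benchmark(documents, annotation_types, threshold):
    """Score each document and flag those below the threshold.

    documents maps a document name to a pair
    (human-marked key annotations, system response annotations).
    Only annotations whose type is in annotation_types are evaluated.
    Returns ({name: f1}, [names with f1 below threshold]).
    """
    scores, flagged = {}, []
    for name, (key, response) in documents.items():
        key = [a for a in key if a[2] in annotation_types]
        response = [a for a in response if a[2] in annotation_types]
        _, _, f1 = prf(key, response)
        scores[name] = f1
        if f1 < threshold:
            flagged.append(name)  # candidates for debug output
    return scores, flagged
```

In this picture, the key annotations would come from the marked directory, and the responses either from the stored processed datastore or from a fresh run over clean, depending on the mode.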


Conclusion

This talk: http://gate.ac.uk/sale/talks/corpora-tutorial.ppt

More information: http://gate.ac.uk/