27-31 May 2008 LREC 2008 (Marrakech, Morocco)
The ACL ARC Anthology Reference Corpus:
A Reference Dataset for Bibliographic Research in Computational Linguistics
Steven Bird, Robert Dale, Bonnie Dorr, Bryan Gibson, Mark T. Joseph, Min-Yen Kan,
Dongwon Lee, Brett Powley, Dragomir R. Radev, Yee Fan Tan
• Graphical Methods for NLP
  – Social network analysis
• Text categorization
  – Sentence / Citation function
• Sequence Labeling
  – Reference string parsing
• Bayesian Models
  – Topic Models
• Summarization
  – Survey Paper Generation
The Anthology as Corpus
Why use newswire?
– Because our funding agencies want it
Let's build a corpus from our own publications!
– Test domain adaptation techniques
– Characterize what’s special about scientific discourse
– Help ourselves and others understand our research better
Start with the largest freely available NLP research archive
The Anthology Reference Corpus
Scholars have already been using scientific articles as input
But datasets largely disparate
Results not comparable
Goal: unify such work by agreeing to work on a central dataset (à la TREC)
The ACL ARC
Consists of most articles available as of February 2006 that have extractable text
Papers                                    10,921
Total references                          152,546
References to articles inside ACL ARC     38,767 (25.4%)
References to articles outside ACL ARC    113,779 (74.6%)
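The percentages above follow directly from the raw counts. A minimal Python sketch of that arithmetic is given here; the variable names are illustrative, and the references-per-paper figure is a derived average that does not appear on the slide.

```python
# Counts as reported on the slide.
papers = 10_921
total_refs = 152_546
refs_inside = 38_767    # references to articles inside ACL ARC
refs_outside = 113_779  # references to articles outside ACL ARC

# The two categories should partition the total.
assert refs_inside + refs_outside == total_refs

share_inside = 100 * refs_inside / total_refs
share_outside = 100 * refs_outside / total_refs

# Derived figure (not on the slide): mean number of references per paper.
refs_per_paper = total_refs / papers

print(f"inside ACL ARC:  {share_inside:.1f}%")   # 25.4%
print(f"outside ACL ARC: {share_outside:.1f}%")  # 74.6%
print(f"references per paper (derived): {refs_per_paper:.1f}")
```

Roughly one reference in four points back into the corpus itself, which is the part of the citation graph that can be resolved without external data.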
What’s included now
Version 2008 03 25:
– PDFs for all 10,921 articles
– <Title, Author, Booktitle> metadata tuples
– Noisy text output extracted from the PDFs
  • Using a non-OCR-based extractor (pdfbox)
The road ahead
1. Improve data quality
2. Establish subsets for smaller experiments
3. Build and release open-source tools
4. Enlarge coverage of newer materials
5. Release major revisions (infrequently)
Achieving the goals of the Linked Anthology Proposal
Future data (near-term)
Inter-document
– Manually cleaned citation graph from the ACL Anthology Network
Intra-document
– Citation to reference string matching
Document
– Automatic keyphrase generation
– Text output from OCR-based extraction
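To make "citation to reference string matching" concrete, here is a rough Python sketch of one naive way to link in-text author-year citations to entries in a paper's reference list. It is not the method used for the ACL ARC: the regular expression, the function name `match_citations`, and the toy body text and reference strings are all assumptions made for illustration, and real papers use many more citation styles than this pattern covers.

```python
import re

# Hypothetical pattern for author-year citations such as
# "(Author et al., 2001)" or "Writer and Scribe (2003)".
CITATION = re.compile(
    r"([A-Z][A-Za-z-]+)"                   # first author's surname
    r"(?: et al\.| and [A-Z][A-Za-z-]+)?"  # optional "et al." or second author
    r",? \(?((?:19|20)\d{2})\)?"           # year, optionally in parentheses
)

def match_citations(body, references):
    """Map each (surname, year) found in body to the indices of the
    reference strings that mention both the surname and the year."""
    matches = {}
    for surname, year in CITATION.findall(body):
        matches[(surname, year)] = [
            i for i, ref in enumerate(references)
            if surname in ref and year in ref
        ]
    return matches

# Toy data, invented for this example.
body = "Prior work (Author et al., 2001) differs from Writer and Scribe (2003)."
references = [
    "A. Author, B. Writer, and C. Scribe. 2001. Toy title one.",
    "B. Writer and C. Scribe. 2003. Toy title two.",
]

result = match_citations(body, references)
print(result)
```

Plain substring matching like this breaks down on noisy extracted text, on numeric citation styles, and when one author has several papers in the same year, which is part of why the task is listed as future work rather than a solved step.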