PASCAL
PASCAL CHALLENGE ON INFORMATION EXTRACTION
& MACHINE LEARNING
Designing Knowledge Management using Adaptive Information Extraction from Text
PASCAL Network of Excellence on Pattern Analysis, Statistical Modelling and Computational Learning
Call for participation:
Evaluating Machine Learning for Information Extraction
July 2004 - November 2004
The Dot.Kom European project and the Pascal Network of Excellence invite you to participate in the Challenge on Evaluation of Machine Learning for Information Extraction from Documents. The goal of the challenge is to assess the current state of Machine Learning (ML) algorithms for Information Extraction (IE), to identify future challenges, and to foster additional research in the field. Given a corpus of annotated documents, the participants will be expected to perform a number of tasks, each examining a different aspect of the learning process.
Corpus
A standardised corpus of 1100 Workshop Calls for Papers (CFPs) will be provided. 600 of these documents will be annotated with 12 tags that relate to pertinent information (names, locations, dates, etc.). Of the annotated documents, 400 will be provided to the participants as a training set; the remaining 200 will form the unseen test set used in the final evaluation. All the documents will be pre-processed to include tokenisation, part-of-speech and named-entity information.
Tasks
Full scenario: The only mandatory task for participants is learning to annotate implicit information: given the 400 training documents, learn the textual patterns necessary to extract the annotated information. Each participant provides the results of a four-fold cross-validation experiment, using the same document partitions, for pre-competitive tests. A final test will be performed on the 200 unseen documents.
Active learning: Learning to select documents. The 400 training documents will be divided into fixed subsets of increasing size (e.g. 10, 20, 30, 50, 75, 100, 150, and 200). Training on these subsets will show the effect of limited resources on the learning process. In addition, given each subset, the participants can select the documents to add in order to reach the next size (10 to 20, 20 to 30, etc.), thus showing the ability to select the most suitable set of documents to annotate.
Enriched scenario: The same procedure as the full scenario (task 1), except that the participants will be able to use the unannotated part of the corpus (500 documents). This will show how the use of unsupervised or semi-supervised methods can improve the results of supervised approaches. An interesting variant of this task could concern the use of unlimited resources, e.g. the Web.
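The partitioning described in the tasks can be sketched as follows. This is a minimal Python illustration only, not the organisers' actual procedure: the document identifiers, the seeded shuffle, the function names and the choice to make each subset a prefix of the next are all assumptions. The call only states that the folds and subsets are fixed and shared by all participants.

```python
import random

def four_fold_partitions(doc_ids, seed=0):
    """Split doc_ids into four fixed, disjoint folds (seeded shuffle is an assumption)."""
    ids = sorted(doc_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::4] for i in range(4)]

def cross_validation_splits(folds):
    """Yield (train, test) pairs, holding out each fold exactly once."""
    for k, held_out in enumerate(folds):
        train = [d for j, fold in enumerate(folds) if j != k for d in fold]
        yield train, held_out

def nested_subsets(train_ids, sizes=(10, 20, 30, 50, 75, 100, 150, 200)):
    """Return training subsets of increasing size; nesting is an assumption."""
    return {n: train_ids[:n] for n in sizes if n <= len(train_ids)}

# Hypothetical identifiers for the 400 annotated training documents.
doc_ids = [f"cfp_{i:04d}" for i in range(400)]

folds = four_fold_partitions(doc_ids)
splits = list(cross_validation_splits(folds))
subsets = nested_subsets(doc_ids)

for train, test in splits:
    print(len(train), len(test))
```

Fixing the seed means every participant who runs the same split trains and tests on identical documents, which is what makes the pre-competitive results comparable.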
Participation
Participants from different fields such as machine learning, text mining, natural language processing, etc. are welcome. Participation in the challenge is free. After registration, participants will receive the corpus of documents to train on and precise instructions on the tasks to be performed. At an established date, participants will be required to submit their systems’ answers via a Web portal. An automatic scorer will compute the accuracy of extraction. Participants will have to produce a paper describing their system and the results obtained. Results of the challenge will be discussed in a dedicated workshop.
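The call does not say how the automatic scorer measures accuracy. As one plausible reading, the sketch below computes exact-match precision, recall and F1 over extracted annotations. The `Annotation` fields, the tag names and the exact-match criterion are assumptions made for illustration and are not taken from the challenge specification.

```python
from collections import namedtuple

# Hypothetical annotation record: document, tag, and character offsets.
Annotation = namedtuple("Annotation", "doc_id tag start end")

def score(gold, predicted):
    """Exact-match precision, recall and F1 between two annotation collections."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Tag names below are invented for the example.
gold = [
    Annotation("cfp_0001", "workshop_name", 10, 42),
    Annotation("cfp_0001", "workshop_date", 60, 75),
    Annotation("cfp_0001", "workshop_location", 80, 95),
]
predicted = [
    Annotation("cfp_0001", "workshop_name", 10, 42),
    Annotation("cfp_0001", "workshop_date", 60, 70),  # wrong end offset
]

precision, recall, f1 = score(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Under exact matching, the second prediction counts as an error because its end offset differs from the gold annotation; a more lenient scorer could instead credit overlapping spans.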
Timetable
5th July 2004: Formal definition of the tasks, annotated corpus and evaluation server
15th October 2004: Formal evaluation
November 2004: Presentation of evaluation at Pascal workshop
Organizers
Fabio Ciravegna, University of Sheffield, UK (coordinator)
Mary Elaine Califf, Illinois State University, USA
Neil Ireson
Local Challenge Coordinator
Web Intelligent Group
Department of Computer Science
University of Sheffield
Organisers
• Sheffield – Fabio Ciravegna
• UCD Dublin – Nicholas Kushmerick
• ITC-IRST – Alberto Lavelli
• University of Illinois – Mary-Elaine Califf
• FairIsaac – Dayne Freitag
Website
• http://tyne.shef.ac.uk/Pascal
Outline
• Challenge Goals
• Data
• Tasks
• Participants
• Results on Each Task
• Conclusion
Goal: Provide a testbed for comparative evaluation of ML-based IE
• Standardised data
• Partitioning
• Same set of features
  – Corpus preprocessed using GATE
  – No features allowed other than the ones provided
• Explicit tasks
• Standard evaluation
• Provided independently by a server
• For future use
• Available for further tests with the same or new systems
• Possible to publish new corpora or tasks