Information Extraction Referatsthemen CIS, LMU München Winter Semester 2013-2014 Dr. Alexander Fraser, CIS
Feb 24, 2016

Page 1: Information Extraction Referatsthemen

Information Extraction Referatsthemen (presentation topics)

CIS, LMU München
Winter Semester 2013-2014

Dr. Alexander Fraser, CIS

Page 2: Information Extraction Referatsthemen

Information Extraction – Reminder

• Lecture (Vorlesung)
  • Learn the basics of Information Extraction (IE)
  • The exam (Klausur) covers only the lecture!
• Seminar
  • Deeper understanding of IE topics
  • Each student who wants course credit (Schein) must give a presentation (Referat) on IE
    • 25 minutes (PowerPoint, LaTeX, Mac)
    • If two students work together, 40 minutes (each student speaks for 20 minutes)
  • Term paper (Hausarbeit)
    • 6 pages, written up, due 3 weeks after the presentation
    • Separate for each student!

Page 3: Information Extraction Referatsthemen

Topics

• Topics will be presented in roughly the same order as the related topics are discussed in the lecture (Vorlesung)
• Most of the topics require you to do a literature search
  • There will usually be one article (or maybe two) that you find to be the key source
  • If appropriate, please turn in PDF files of the key article and a few other important articles
• There are a few projects involving programming
  • These are particularly suitable to be done by two students

Page 4: Information Extraction Referatsthemen

Referat (presentation)

• 25 minutes for one student
• 40 minutes for two
• Start with what the problem is, and why it is interesting to solve it (motivation!)
  • It is often useful to present an example and refer to it several times
• Then go into the details
• If appropriate for your topic, do an analysis
  • Don't forget to address the disadvantages of the approach as well as the advantages (advantages tend to be what the authors focus on)
• List references and recommend further reading
• Have a conclusion slide!

Page 5: Information Extraction Referatsthemen

Information Extraction

[Figure: the IE pipeline. Stages: Source Selection, Tokenization & Normalization, Named Entity Recognition, Instance Extraction, Fact Extraction, Ontological Information Extraction, and beyond. Examples shown: 05/01/67 → 1967-05-01; "...married Elvis on 1967-05-01"; Elvis Presley: Singer, Angela Merkel: Politician.]

Information Extraction (IE) is the process of extracting structured information from unstructured machine-readable documents

Slide from Suchanek

Page 6: Information Extraction Referatsthemen

History of IE

• TOPIC: IE at the Message Understanding Conferences (MUC)
  • These conferences focused on tasks like extracting merger events from unstructured text
  • Discuss problems solved, motivations and techniques
  • Survey the literature

Page 7: Information Extraction Referatsthemen

Source Selection

• TOPIC: Focused web crawling
  • Why use focused web crawling?
  • How do focused web crawlers work?
  • What are the benefits and disadvantages of focused web crawling?
  • Python: scrapy
  • Perl: WWW::Mechanize

Page 8: Information Extraction Referatsthemen

Source Selection

• TOPIC: Language Identification
  • Compare the approaches used
    • Such as the compression, dictionary and character histogram approaches
  • Discuss the issue of 8-bit encodings and dealing with code pages
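As a starting point for comparing approaches, here is a minimal sketch of the character histogram idea in Python. The two training sentences, the trigram size and the cosine scoring are illustrative assumptions, not taken from the slides; a real system would build profiles from far more text and would first have to decode the input bytes with the right code page.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count character n-grams, padding with spaces to capture word edges."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Toy language profiles built from invented example sentences (assumption).
PROFILES = {
    "en": char_ngrams("the quick brown fox jumps over the lazy dog and the cat is on the table with the others"),
    "de": char_ngrams("der schnelle braune fuchs springt über den faulen hund und die katze ist auf dem tisch mit den anderen"),
}


def identify(text):
    """Return the language whose trigram profile is most similar to the text."""
    grams = char_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(grams, PROFILES[lang]))


print(identify("the dog is on the table"))
print(identify("der hund ist auf dem tisch"))
```

Very short inputs and closely related languages are where a histogram like this tends to struggle, which is one angle for the comparison with the compression and dictionary approaches.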

Page 9: Information Extraction Referatsthemen

Source Selection

• TOPIC: Wrappers
  • Wrappers are used to extract tuples (database entries) from structured web sites
  • Discuss the different ways to create wrappers
    • Advantages and disadvantages
    • How do wrappers deal with changing websites?
  • Give some examples of different wrapper creation software packages and discuss their pros and cons

Page 10: Information Extraction Referatsthemen

Rule-based Named Entity Recognition: Regular Sets

• TOPIC: Regular Sets in the Proceedings of the European Parliament (Europarl corpus)
  • Annotate regular classes in the Europarl corpora in both English and German
  • Interesting regular classes: report-IDs and dates. Others could include nationalities/countries or monetary amounts.
  • Programming intensive: this will involve writing a lot of regular expressions, and analysis
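To give a feel for the programming involved, here is a small Python sketch that annotates two regular classes. The report-ID shape (for example "A5-0087/1999") and the date formats are assumptions made for illustration; the real patterns have to be derived from the corpus itself, and the tag names are hypothetical.

```python
import re

# Assumed report-ID shape such as "A5-0087/1999"; verify against the corpus.
REPORT_ID = re.compile(r"\b[A-C]\d-\d{4}/\d{2,4}\b")

MONTHS_EN = "January|February|March|April|May|June|July|August|September|October|November|December"
MONTHS_DE = "Januar|Februar|März|April|Mai|Juni|Juli|August|September|Oktober|November|Dezember"

# Assumed date formats: "17 December 1999" (English), "17. Dezember 1999" (German).
DATE_EN = re.compile(rf"\b\d{{1,2}} (?:{MONTHS_EN}) \d{{4}}\b")
DATE_DE = re.compile(rf"\b\d{{1,2}}\. (?:{MONTHS_DE}) \d{{4}}\b")


def annotate(text, pattern, label):
    """Wrap every match of pattern in <label>...</label> tags."""
    return pattern.sub(lambda m: f"<{label}>{m.group(0)}</{label}>", text)


english = "The report (A5-0087/1999) was adopted on 17 December 1999."
german = "Der Bericht (A5-0087/1999) wurde am 17. Dezember 1999 angenommen."

print(annotate(annotate(english, REPORT_ID, "REPORT"), DATE_EN, "DATE"))
print(annotate(annotate(german, REPORT_ID, "REPORT"), DATE_DE, "DATE"))
```

The analysis part of the topic would then look at what such patterns miss (abbreviated months, numeric dates, date ranges) and what they match by mistake.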

Page 11: Information Extraction Referatsthemen

Named Entity Recognition – Entity Classes

• TOPIC: fine-grained open classes of named entities
  • Survey the proposed schemes of fine-grained open classes, such as BBN's classes used for question answering
  • Discuss the advantages and disadvantages of the schemes
  • Discuss also the difficulty of human annotation: can humans annotate these classes reliably?

Page 12: Information Extraction Referatsthemen

Named Entity Recognition – Training Data

• TOPIC: Crowd-sourcing with Amazon Mechanical Turk (AMT)
  • AMT's motto: artificial artificial intelligence
  • Using human annotators to get quick (but low quality) annotations
  • What are the pros and cons of this approach? How well do NER systems perform when trained on this data?

Page 13: Information Extraction Referatsthemen

Rule-Based Extraction

• TOPIC: Using Local Grammars for Citation Parsing
  • Discuss how to define regular expressions for the different fields in a citation
  • Discuss how these regular expressions are combined in the local grammar approach
  • Present the advantages and disadvantages of this approach to citation parsing

Page 14: Information Extraction Referatsthemen

Learning Rules for Named Entity Recognition

• TOPIC: Learning Rules for Named Entity Recognition
  • Discuss how rules can be learned given annotated corpora
  • Present the basic algorithms
  • Discuss the advantages and disadvantages of these approaches

Page 15: Information Extraction Referatsthemen

Named Entity Recognition – Statistical Model

• TOPIC: Hidden Markov Models for Named Entity Recognition
  • Discuss how to formulate named entity recognition in the HMM framework
  • Discuss the transition model and the emission model: which features are useful?
  • One possible emphasis: what we want the models to be able to do, what problems they have versus other statistical approaches
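One way to make the HMM formulation concrete is a toy tagger with two hidden states, O (outside) and PER (person), decoded with the Viterbi algorithm. The sketch below is only illustrative: all probabilities are invented by hand rather than estimated from annotated data, and the unseen-word constant `UNK` is a crude stand-in for proper smoothing.

```python
import math

STATES = ["O", "PER"]
# Hand-set toy parameters (assumptions, not estimated from a corpus).
START = {"O": 0.8, "PER": 0.2}
TRANS = {"O": {"O": 0.8, "PER": 0.2}, "PER": {"O": 0.6, "PER": 0.4}}
EMIT = {
    "O": {"married": 0.3, "on": 0.3, "in": 0.3},
    "PER": {"Elvis": 0.5, "Presley": 0.4},
}
UNK = 1e-6  # probability assigned to words a state has never emitted


def viterbi(tokens):
    """Return the most probable state sequence for a non-empty token list."""
    # Each column maps state -> (log probability of best path, previous state).
    best = [{s: (math.log(START[s]) + math.log(EMIT[s].get(tokens[0], UNK)), None)
             for s in STATES}]
    for tok in tokens[1:]:
        column = {}
        for s in STATES:
            emit = math.log(EMIT[s].get(tok, UNK))
            prev = max(STATES, key=lambda p: best[-1][p][0] + math.log(TRANS[p][s]))
            column[s] = (best[-1][prev][0] + math.log(TRANS[prev][s]) + emit, prev)
        best.append(column)
    # Follow the backpointers from the best final state.
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


print(viterbi(["married", "Elvis", "Presley", "in"]))
```

Note that the emission model here sees only the word identity. Features such as capitalization or suffixes cannot be added without changing the emission distribution, which is a natural point of comparison with other statistical approaches.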

Page 16: Information Extraction Referatsthemen

Named Entity Recognition – Statistical Model

• TOPIC: Structured Perceptron for Named Entity Recognition
  • Discuss how to formulate named entity recognition in the structured perceptron framework
    • This requires previous exposure to machine learning!
  • What does the structured perceptron framework allow you to do versus the HMM?
  • Presentation of important features

Page 17: Information Extraction Referatsthemen

Named Entity Recognition – Supervision

• TOPIC: Lightly Supervised Named Entity Recognition
  • Starting from a few examples ("seed examples"), how do you automatically build a named entity classifier?
    • This is sometimes referred to as "bootstrapping"
  • What are the problems with this approach? How do you block the process from generalizing too much?
  • Analyze the pros and cons of this approach
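The bootstrapping loop can be sketched in a few lines of Python. This is a deliberately simplified illustration, not a published algorithm: the "pattern" is just the `width` tokens to the left of a known entity, the corpus is four invented sentences, and `min_count` is a hypothetical threshold for trusting a pattern.

```python
from collections import Counter


def left_context(sent, i, width):
    """The width tokens immediately to the left of position i."""
    return tuple(sent[i - width:i])


def bootstrap(sentences, seeds, width=2, min_count=2, rounds=3):
    """Grow an entity set from seeds by alternating pattern and entity discovery."""
    entities = set(seeds)
    for _ in range(rounds):
        # 1. Count the contexts in which known entities occur.
        counts = Counter(
            left_context(sent, i, width)
            for sent in sentences
            for i in range(width, len(sent))
            if sent[i] in entities
        )
        # 2. Trust only contexts seen often enough.
        trusted = {ctx for ctx, n in counts.items() if n >= min_count}
        # 3. Every token appearing in a trusted context becomes an entity.
        found = {
            sent[i]
            for sent in sentences
            for i in range(width, len(sent))
            if left_context(sent, i, width) in trusted
        }
        if found <= entities:
            break  # nothing new was learned
        entities |= found
    return entities


corpus = [s.split() for s in [
    "the city of Munich is large",
    "the city of Berlin is old",
    "the city of Hamburg is rainy",
    "the dog of Peter is old",
]]

print(sorted(bootstrap(corpus, {"Munich", "Berlin"}, width=2)))
print(sorted(bootstrap(corpus, {"Munich", "Berlin"}, width=1)))
```

With a two-token context ("city of") the loop adds only Hamburg. With a one-token context ("of") it also accepts Peter, a small instance of the over-generalization problem the topic asks about.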

Page 18: Information Extraction Referatsthemen

Named Entity Recognition – Supervision

• TOPIC: Distant supervision for NER
  • Related to the bootstrapping idea, but here we are using information annotated for a different purpose
  • How can distant supervision solve the knowledge bottleneck for NER?
  • What are the advantages and disadvantages of this approach?

Page 19: Information Extraction Referatsthemen

NER – Toolkit

• TOPIC: Stanford NER Toolkit applied to Europarl
  • Apply the Stanford NER Toolkit to the Europarl corpus, and compare the output on English and German
  • How does the model work?
  • What are the differences between the English and German annotations of parallel sentences? Where do the models fail?

Page 20: Information Extraction Referatsthemen

NER – Domain Adaptation

• TOPIC: Domain adaptation and failure to adapt
  • What is the problem of domain adaptation?
  • How is it addressed in statistical classification approaches to NER?
  • How well does it work?

Page 21: Information Extraction Referatsthemen

NER – Twitter

• TOPIC: Named Entity Recognition of Entities in Twitter
  • There has recently been a lot of interest in annotating Twitter
  • Which set of classes is annotated? What is used as supervised training material, and how is it adapted from non-Twitter training sets?
  • What are the peculiarities of working on 140-character tweets rather than longer articles?

Page 22: Information Extraction Referatsthemen

NER – BIO Domain

• TOPIC: Named Entity Recognition of Biological Entities
  • Present a specific named entity recognition problem from the biology domain
  • Which set of classes is annotated? What is used as supervised training material?
  • What are the difficulties of this domain vs. problems like extraction of company mergers, which have been studied longer?

Page 23: Information Extraction Referatsthemen

Instance Extraction

• TOPIC: Applying the Stanford Coreference Pipeline to Europarl
  • Apply the Stanford Coreference Pipeline to English Europarl data
  • How does it work?
  • What entities does it annotate well, and less well?
  • Can this information be used to translate English "it" to German?

Page 24: Information Extraction Referatsthemen

IE for multilingual applications

• TOPIC: Transliteration Mining
  • Transliteration mining is the mining of names which are transliterated from one language to another from a list of word pairs
  • Present the task of transliteration mining as done in the NEWS conferences (here is the 2010 URL)
    • http://translit.i2r.a-star.edu.sg/news2010/
  • What approaches work well for transliteration mining?
    • (Some basic statistical modelling background will be necessary here)

Page 25: Information Extraction Referatsthemen

IE for multilingual applications

• TOPIC: Bilingual Terminology Mining
  • The problem of bilingual terminology mining from comparable corpora is the task of finding terms which are translations of each other given their context
  • How is this done using the vector space model vs. the pattern-based approach?
    • Some familiarity with basic information retrieval will be helpful here
  • What are the critical sources of knowledge for this approach?
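To illustrate the vector space idea, the following Python sketch builds a context vector for a source-language term, maps it into the target language through a small seed dictionary, and ranks target candidates by cosine similarity. The sentences, the seed dictionary, the window size and all function names are invented for this example; real comparable corpora are much larger and noisier.

```python
import math
from collections import Counter


def context_vector(term, sentences, window=2):
    """Count the words within `window` tokens of each occurrence of term."""
    vec = Counter()
    for sent in sentences:
        for i, tok in enumerate(sent):
            if tok == term:
                vec.update(sent[max(0, i - window):i] + sent[i + 1:i + 1 + window])
    return vec


def translate_vector(vec, seed_dict):
    """Map a source-language vector into the target language, dropping unknown words."""
    out = Counter()
    for word, count in vec.items():
        if word in seed_dict:
            out[seed_dict[word]] += count
    return out


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_translation(term, src_sents, tgt_sents, candidates, seed_dict):
    """Pick the target candidate whose context is most similar to the term's."""
    src = translate_vector(context_vector(term, src_sents), seed_dict)
    return max(candidates, key=lambda c: cosine(src, context_vector(c, tgt_sents)))


# Toy comparable corpora and seed dictionary (assumptions).
german = [s.split() for s in [
    "der patient bekommt ein antibiotikum gegen die infektion",
    "das antibiotikum hilft gegen bakterien",
]]
english = [s.split() for s in [
    "the patient receives an antibiotic against the infection",
    "the antibiotic helps against bacteria",
    "the patient has a fever",
]]
seed = {"bekommt": "receives", "gegen": "against", "hilft": "helps",
        "infektion": "infection", "bakterien": "bacteria", "patient": "patient"}

print(best_translation("antibiotikum", german, english, ["fever", "antibiotic"], seed))
```

The seed dictionary is the critical knowledge source in this sketch: context words it does not cover contribute nothing to the comparison.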

Page 26: Information Extraction Referatsthemen

Choosing a topic

• Any questions?
• I will put these slides on the seminar page later today
• Please email me with your choice of topic
  • I'd also be open to using the Wiki, but I don't see how to make that work
  • Check the seminar page first to see if it is already taken!