Information Extraction: Presentation Topics

CIS, LMU München
Winter Semester 2013-2014

Dr. Alexander Fraser, CIS

Information Extraction – Reminder

• Lecture
  • Learn the basics of Information Extraction (IE)
  • Exam: covers only the lecture!
• Seminar
  • Deeper understanding of IE topics
  • Each student who wants course credit will have to give a presentation on IE
    • 25 minutes (PowerPoint, LaTeX, Mac)
    • If two students work together, 40 minutes (each student speaks for 20 minutes)
• Term paper
  • 6 pages written up, due 3 weeks after the presentation
  • Separate for each student!

Topics

• Topics will be presented in roughly the same order as the related topics are discussed in the lecture
• Most of the topics require you to do a literature search
  • There will usually be one article (or maybe two) which you find is the key source
  • If appropriate, please turn in PDF files of the key article and a few other important articles
• There are a few projects involving programming
  • These are particularly suitable to be done by two students

Presentation

• 25 minutes for one student
• 40 minutes for two
• Start with what the problem is, and why it is interesting to solve it (motivation!)
• It is often useful to present an example and refer to it several times
• Then go into the details
• If appropriate for your topic, do an analysis
  • Don't forget to address the disadvantages of the approach as well as the advantages (advantages tend to be what the authors focus on)
• List references and recommend further reading
• Have a conclusion slide!

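Information Extraction

[Figure: the IE pipeline, in order: Source Selection, Tokenization & Normalization, Named Entity Recognition, Instance Extraction, Fact Extraction, Ontological Information Extraction, and beyond. Examples shown in the figure: the date "05/01/67" normalized to "1967-05-01"; the text "...married Elvis on 1967-05-01"; the instances "Elvis Presley: Singer" and "Angela Merkel: Politician".]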

Information Extraction (IE) is the process of extracting structured information from unstructured machine-readable documents

Slide from Suchanek

History of IE

• TOPIC: IE at the Message Understanding Conferences (MUC)
  • These conferences focused on tasks like extracting merger events from unstructured text
  • Discuss problems solved, motivations and techniques
  • Survey the literature

Source Selection

• TOPIC: Focused web crawling
  • Why use focused web crawling?
  • How do focused web crawlers work?
  • What are the benefits and disadvantages of focused web crawling?
  • Python: scrapy
  • Perl: WWW::Mechanize
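
As a starting point for "how do focused web crawlers work?", here is a minimal sketch of one common design, a best-first crawl that only follows links from pages judged on-topic. Everything in it is an assumption for illustration: the in-memory TOY_WEB stands in for real fetching (which scrapy or WWW::Mechanize would do), and the keyword-based relevance score and the threshold are hypothetical choices, not a recommended classifier.

```python
import heapq

# Hypothetical toy "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a": ("information extraction seminar", ["b", "c"]),
    "b": ("named entity recognition tutorial", ["d"]),
    "c": ("football results and weather", ["e"]),
    "d": ("entity extraction from text", []),
    "e": ("more football news", []),
}

# Hypothetical topic description: a plain keyword set.
TOPIC_TERMS = {"extraction", "entity", "information", "recognition"}


def relevance(text):
    """Fraction of the page's tokens that are topic terms."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in TOPIC_TERMS for t in tokens) / len(tokens)


def focused_crawl(seed, web, max_pages=10, threshold=0.3):
    """Best-first crawl that expands links only from relevant pages."""
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so priorities are negated
    seen = {seed}
    relevant = []
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = web[url]  # stands in for an HTTP fetch
        fetched += 1
        score = relevance(text)
        if score < threshold:
            continue  # off-topic page: do not follow its links
        relevant.append(url)
        for link in links:
            if link not in seen and link in web:
                seen.add(link)
                # A link inherits the score of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return relevant


print(focused_crawl("a", TOY_WEB))
```

In this toy run the page "e" is never fetched, because it is only reachable through the off-topic page "c". That is both the benefit (fewer wasted downloads) and the risk (relevant pages behind irrelevant ones are missed).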

Source Selection

• TOPIC: Language Identification
  • Compare the approaches used
    • Such as the compression, dictionary and character histogram approaches
  • Discuss the issue of 8-bit encodings and dealing with code pages
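
To make the character histogram approach concrete, the following is a rough sketch, assuming character trigram counts as the histogram and cosine similarity as the comparison. The two one-sentence training samples and the function names are hypothetical and far too small for real use; a Referat on this topic would compare against the compression and dictionary approaches on real data.

```python
from collections import Counter
from math import sqrt

# Tiny illustrative training samples (assumption: real profiles need far more text).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze",
}


def char_ngrams(text, n=3):
    """Histogram of overlapping character n-grams."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(p, q):
    """Cosine similarity between two n-gram histograms."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm = sqrt(sum(c * c for c in p.values())) * sqrt(sum(c * c for c in q.values()))
    return dot / norm if norm else 0.0


PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}


def identify(text):
    """Return the language whose profile is most similar to the text."""
    profile = char_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))


print(identify("the dog and the fox"))
print(identify("der hund und die katze"))
```

Note that this sketch works on already-decoded Unicode strings, so it sidesteps the 8-bit encoding and code page issue entirely; with raw bytes the encoding would have to be guessed first.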

Source Selection

• TOPIC: Wrappers
  • Wrappers are used to extract tuples (database entries) from structured web sites
  • Discuss the different ways to create wrappers
    • Advantages and disadvantages
    • How do wrappers deal with changing websites?
  • Give some examples of different wrapper creation software packages and discuss their pros and cons
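
For illustration, a hand-written wrapper might look like the sketch below, which turns the rows of an HTML table into tuples using Python's standard html.parser module. The PAGE content and the TableWrapper class are hypothetical; this shows only the manual way of creating a wrapper, not wrapper induction.

```python
from html.parser import HTMLParser

# Hypothetical page from a "structured web site".
PAGE = """
<table>
  <tr><td>Elvis Presley</td><td>Singer</td></tr>
  <tr><td>Angela Merkel</td><td>Politician</td></tr>
</table>
"""


class TableWrapper(HTMLParser):
    """Collects one tuple per <tr>, with one field per <td>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(tuple(cell.strip() for cell in self._row))
            self._row = None


def extract_tuples(html):
    wrapper = TableWrapper()
    wrapper.feed(html)
    return wrapper.rows


print(extract_tuples(PAGE))
```

The wrapper is tied to the exact tag layout, so a site redesign (say, switching from a table to nested div elements) would silently produce no tuples. This is the "changing websites" problem in miniature.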

Rule-based Named Entity Recognition: Regular Sets

• TOPIC: Regular Sets in the Proceedings of the European Parliament (Europarl corpus)
  • Annotate regular classes in the Europarl corpora in both English and German
  • Interesting regular classes: report-IDs and dates. Others could include nationalities/countries or monetary amounts.
  • Programming intensive: this will involve writing a lot of regular expressions, and analysis
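
As a hint of what such regular expressions could look like, here is a small sketch for one regular class, dates written as day, month name and year in English or German, normalized to ISO format. The assumed surface forms ("17 December 1999", "17. Dezember 1999") and all names in the code are illustrative assumptions; real Europarl text will contain more variants, and the pattern does not check that the day is valid for the month.

```python
import re

# English and German month names, paired by month number.
MONTH_NAMES = [
    ("January", "Januar"), ("February", "Februar"), ("March", "März"),
    ("April", "April"), ("May", "Mai"), ("June", "Juni"),
    ("July", "Juli"), ("August", "August"), ("September", "September"),
    ("October", "Oktober"), ("November", "November"), ("December", "Dezember"),
]
MONTH_NUMBER = {
    name.lower(): number
    for number, names in enumerate(MONTH_NAMES, start=1)
    for name in names
}

# Day (optionally followed by "." as in German), month name, four-digit year.
# Longer names come first so "january" is tried before "januar".
DATE_RE = re.compile(
    r"\b(\d{1,2})\.?\s+("
    + "|".join(sorted(MONTH_NUMBER, key=len, reverse=True))
    + r")\s+(\d{4})\b",
    re.IGNORECASE,
)


def find_dates(text):
    """Return all matched dates as ISO strings (YYYY-MM-DD)."""
    return [
        f"{int(year):04d}-{MONTH_NUMBER[month.lower()]:02d}-{int(day):02d}"
        for day, month, year in DATE_RE.findall(text)
    ]


print(find_dates("Resumed on 17 December 1999."))
print(find_dates("am 17. Dezember 1999 und am 3. März 2000"))
```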

Named Entity Recognition – Entity Classes

• TOPIC: Fine-grained open classes of named entities
  • Survey the proposed schemes of fine-grained open classes, such as BBN's classes used for question answering
  • Discuss the advantages and disadvantages of the schemes
  • Discuss also the difficulty of human annotation: can humans annotate these classes reliably?

Named Entity Recognition – Training Data

• TOPIC: Crowd-sourcing with Amazon Mechanical Turk (AMT)
  • AMT's motto: artificial artificial intelligence
  • Using human annotators to get quick (but low quality) annotations
  • What are the pros and cons of this approach? How well do NER systems perform when trained on this data?

Rule-Based Extraction

• TOPIC: Using Local Grammars for Citation Parsing
  • Discuss how to define regular expressions for the different fields in a citation
  • Discuss how these regular expressions are combined in the local grammar approach
  • Present the advantages and disadvantages of this approach to citation parsing

Learning Rules for Named Entity Recognition

• TOPIC: Learning Rules for Named Entity Recognition
  • Discuss how rules can be learned given annotated corpora
  • Present the basic algorithms
  • Discuss the advantages and disadvantages of these approaches

Named Entity Recognition – Statistical Model

• TOPIC: Hidden Markov Models for Named Entity Recognition
  • Discuss how to formulate named entity recognition in the HMM framework
  • Discuss the transition model and the emission model: which features are useful?
  • One possible emphasis: what we want the models to be able to do, what problems they have versus other statistical approaches
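
One way to picture the HMM formulation: the hidden states are entity labels, the observations are words, and decoding finds the most probable label sequence with the Viterbi algorithm. The sketch below assumes only two labels (O and PER) and hand-set, unnormalized toy probabilities; all numbers and names are hypothetical and would normally be estimated from annotated data.

```python
from math import log

STATES = ["O", "PER"]

# Hand-set toy parameters (assumptions for illustration, not estimated).
START = {"O": 0.8, "PER": 0.2}
TRANS = {"O": {"O": 0.8, "PER": 0.2}, "PER": {"O": 0.4, "PER": 0.6}}
EMIT = {"O": {"sang": 0.3, "the": 0.3}, "PER": {"Elvis": 0.4, "Presley": 0.4}}
UNSEEN = 1e-3  # crude stand-in for smoothing of unseen words


def emission(state, word):
    return EMIT[state].get(word, UNSEEN)


def viterbi(words):
    """Most probable label sequence under the toy HMM (log-space Viterbi)."""
    if not words:
        return []
    scores = [{s: log(START[s]) + log(emission(s, words[0])) for s in STATES}]
    back = []
    for word in words[1:]:
        prev = scores[-1]
        current, pointers = {}, {}
        for s in STATES:
            # Best previous label for arriving in label s.
            best_prev = max(STATES, key=lambda p: prev[p] + log(TRANS[p][s]))
            current[s] = prev[best_prev] + log(TRANS[best_prev][s]) + log(emission(s, word))
            pointers[s] = best_prev
        scores.append(current)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi(["Elvis", "Presley", "sang"]))
```

The emission model here sees only the word identity. That limitation (no easy way to add overlapping features such as capitalization or suffixes) is one of the problems worth contrasting with other statistical approaches.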

Named Entity Recognition – Statistical Model

• TOPIC: Structured Perceptron for Named Entity Recognition
  • Discuss how to formulate named entity recognition in the structured perceptron framework
    • This requires previous exposure to machine learning!
  • What does the structured perceptron framework allow you to do versus the HMM?
  • Presentation of important features

Named Entity Recognition - Supervision

• TOPIC: Lightly Supervised Named Entity Recognition
  • Starting from a few examples ("seed examples"), how do you automatically build a named entity classifier?
    • This is sometimes referred to as "bootstrapping"
  • What are the problems with this approach? How do you block the process from generalizing too much?
  • Analyze the pros and cons of this approach
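
The bootstrapping loop can be sketched roughly as follows, under strong simplifying assumptions: a "pattern" is just the word to the left of a candidate, and a pattern is trusted only if a large enough share of the words it precedes are already known entities. The corpus, the seeds and the min_precision threshold are hypothetical; published bootstrapping systems use richer patterns and scoring.

```python
from collections import defaultdict

# Hypothetical toy corpus; the target class is city names.
CORPUS = [
    "she lives in Munich",
    "he works in Berlin",
    "she is interested in music",
    "in general he agrees",
    "the mayor of Munich spoke",
    "the mayor of Berlin spoke",
    "the mayor of Hamburg spoke",
]


def bootstrap(seeds, corpus, min_precision=0.6, max_rounds=5):
    """Grow an entity set from seeds using left-context patterns."""
    # Map each left-context word (the "pattern") to the words it precedes.
    fillers = defaultdict(set)
    for sentence in corpus:
        tokens = sentence.split()
        for left, word in zip(tokens, tokens[1:]):
            fillers[left].add(word)

    entities = set(seeds)
    for _ in range(max_rounds):
        new = set()
        for pattern, words in fillers.items():
            known = words & entities
            # Trust a pattern only if enough of its fillers are known entities.
            if known and len(known) / len(words) >= min_precision:
                new |= words - entities
        if not new:
            break
        entities |= new
    return entities


print(sorted(bootstrap({"Munich", "Berlin"}, CORPUS)))
print(sorted(bootstrap({"Munich", "Berlin"}, CORPUS, min_precision=0.5)))
```

With the stricter threshold only the pattern "of" is trusted and "Hamburg" is added. Lowering the threshold also admits the pattern "in", which pulls in "music" and "general", a small instance of the over-generalization problem.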

Named Entity Recognition - Supervision

• TOPIC: Distant supervision for NER
  • Related to the bootstrapping idea, but here we are using information annotated for a different purpose
  • How can distant supervision solve the knowledge bottleneck for NER?
  • What are the advantages and disadvantages of this approach?

NER – Toolkit

• TOPIC: Stanford NER Toolkit applied to Europarl
  • Apply the Stanford NER Toolkit to the Europarl corpus, and compare the output on English and German
  • How does the model work?
  • What are the differences between the English and German annotations of parallel sentences, and where do the models fail?

NER – Domain Adaptation

• TOPIC: Domain adaptation and failure to adapt
  • What is the problem of domain adaptation?
  • How is it addressed in statistical classification approaches to NER?
  • How well does it work?

NER – Twitter

• TOPIC: Named Entity Recognition of Entities in Twitter
  • There has recently been a lot of interest in annotating Twitter
  • Which set of classes is annotated? What is used as supervised training material, and how is it adapted from non-Twitter training sets?
  • What are the peculiarities of working on 140 character tweets rather than longer articles?

NER – BIO Domain

• TOPIC: Named Entity Recognition of Biological Entities
  • Present a specific named entity recognition problem from the biology domain
  • Which set of classes is annotated? What is used as supervised training material?
  • What are the difficulties of this domain vs. problems like extraction of company mergers which have been studied longer?

Instance Extraction

• TOPIC: Applying the Stanford Coreference Pipeline to Europarl
  • Apply the Stanford Coreference Pipeline to English Europarl data
  • How does it work?
  • What entities does it annotate well, and less well?
  • Can this information be used to translate English "it" to German?

IE for multilingual applications

• TOPIC: Transliteration Mining
  • Transliteration mining is the mining of names which are transliterated from one language to another from a list of word pairs
  • Present the task of transliteration mining as done in the NEWS conferences (here is the 2010 URL)
    • http://translit.i2r.a-star.edu.sg/news2010/
  • What approaches work well for transliteration mining?
    • (Some basic statistical modelling background will be necessary here)

IE for multilingual applications

• TOPIC: Bilingual Terminology Mining
  • The problem of bilingual terminology mining from comparable corpora is the task of finding terms which are translations of each other given their context
  • How is this done using the vector space model vs. the pattern-based approach?
    • Some familiarity with basic information retrieval will be helpful here
  • What are the critical sources of knowledge for this approach?
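
For the vector space side of the comparison, the basic idea can be sketched like this: represent each term by counts of the words it co-occurs with, map the source-language context words into the target language with a seed dictionary, and rank target terms by cosine similarity. The seed dictionary, the co-occurrence counts and the candidate terms below are all invented for illustration.

```python
from math import sqrt

# Hypothetical seed dictionary (English context word -> German context word).
SEED_DICT = {
    "patient": "patient", "blood": "blut", "sugar": "zucker",
    "engine": "motor", "fuel": "kraftstoff",
}

# Hypothetical co-occurrence counts from comparable corpora.
EN_CONTEXTS = {"diabetes": {"patient": 4, "blood": 3, "sugar": 5}}
DE_CONTEXTS = {
    "zuckerkrankheit": {"patient": 3, "blut": 2, "zucker": 4},
    "vergaser": {"motor": 5, "kraftstoff": 4},
    "arzt": {"patient": 5, "motor": 1},
}


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(w * v.get(k, 0) for k, w in u.items())
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def translate_vector(vec):
    """Map context dimensions into the target language; drop unknown words."""
    return {SEED_DICT[k]: w for k, w in vec.items() if k in SEED_DICT}


def rank_translations(term):
    """Rank German candidate terms for an English term by context similarity."""
    projected = translate_vector(EN_CONTEXTS[term])
    scored = [(cand, cosine(projected, ctx)) for cand, ctx in DE_CONTEXTS.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for candidate, score in rank_translations("diabetes"):
    print(candidate, round(score, 3))
```

The sketch makes the dependence on the seed dictionary visible: any context word missing from SEED_DICT is simply dropped, so dictionary coverage is one of the critical knowledge sources.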

Choosing a topic

• Any questions?

• I will put these slides on the seminar page later today

• Please email me with your choice of topic
  • I'd also be open to using the Wiki, but I don't see how to make that work
  • Check the seminar page first to see if it is already taken!
