Information Extraction Lecture 9 – Multilingual Extraction Dr. Alexander Fraser, U. Munich September 11th, 2014 ISSALE: University of Colombo School of Computing
Information Extraction
Lecture 9 – Multilingual Extraction

Dr. Alexander Fraser, U. Munich

September 11th, 2014
ISSALE: University of Colombo School of Computing

Outline

• Up until today: basics of information extraction
  • Primarily based on named entities and relation extraction

• However, there are some other tasks associated with information extraction
  • Two important tasks are terminology extraction and bilingual dictionary extraction
  • I will talk very briefly about terminology extraction (one slide) and then focus on bilingual dictionary extraction

Terminology Extraction

• Terminology extraction tries to find words or sequences of words which have a domain-specific meaning
  • For instance "rotator blade" refers to a specialized concept in helicopters or wind turbines

• To do terminology extraction, we need domain-specific corpora

• Terminology extraction is often broken down into two phases:
  1. First, a very large list of candidate types is made by extracting the tokens in the domain-specific corpus that match a linguistic pattern (such as noun phrase types)
  2. Then statistical tests are used to determine whether the presence of a candidate in the domain-specific corpus implies that it is domain-specific terminology

• The challenge here is to separate terminology from general language
  • A "blue helicopter" is not a technical term, it is a helicopter which is blue
  • "rotator blade" is a technical term

• Stefan Evert may cover this to a certain extent
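
The two phases of terminology extraction can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not a method from the lecture: stopword-free bigrams stand in for a real linguistic pattern (no part-of-speech tagger is used), and a smoothed log ratio of relative frequencies stands in for a proper statistical test. The function name `rank_terms`, the stopword list and the toy corpora are all hypothetical.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "and", "is", "."}  # assumed, tiny stopword list

def rank_terms(domain_tokens, general_tokens, min_count=2):
    """Rank bigram candidates by how much more frequent they are in the domain corpus."""
    dom = Counter(zip(domain_tokens, domain_tokens[1:]))
    gen = Counter(zip(general_tokens, general_tokens[1:]))
    dom_total, gen_total = sum(dom.values()), sum(gen.values())
    scores = {}
    for bigram, count in dom.items():
        # Phase 1: keep only candidates matching the (very crude) pattern.
        if count < min_count or any(w in STOPWORDS for w in bigram):
            continue
        # Phase 2: compare relative frequencies (add-one smoothing for unseen bigrams).
        p_dom = count / dom_total
        p_gen = (gen[bigram] + 1) / (gen_total + 1)
        scores[" ".join(bigram)] = math.log(p_dom / p_gen)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

domain = ("the rotator blade spins . the rotator blade turns . "
          "the blue helicopter lands . a blue helicopter flies").split()
general = "a blue helicopter and a blue car . the blue helicopter is blue".split()
for term, score in rank_terms(domain, general):
    print(term, round(score, 2))
```

A candidate that is also common in general language ("blue helicopter") gets a lower score than one that appears only in the domain corpus ("rotator blade").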

Bilingual Dictionaries

• Extracting bilingual information
  • Easiest to extract if we have a parallel corpus
  • This consists of text in one language and the translation of the text in another language

• Given such a resource, we can extract bilingual dictionaries

• Mostly used for machine translation, cross-lingual retrieval and other natural language processing applications

• But also useful for human lexicographers and linguists

Alex Fraser, IMS Stuttgart

Parallel corpus

• Example from DE-News (8/1/1996)

English: Diverging opinions about planned tax reform
German: Unterschiedliche Meinungen zur geplanten Steuerreform

English: The discussion around the envisaged major tax reform continues .
German: Die Diskussion um die vorgesehene grosse Steuerreform dauert an .

English: The FDP economics expert , Graf Lambsdorff , today came out in favor of advancing the enactment of significant parts of the overhaul , currently planned for 1999 .
German: Der FDP - Wirtschaftsexperte Graf Lambsdorff sprach sich heute dafuer aus , wesentliche Teile der fuer 1999 geplanten Reform vorzuziehen .

Modified from Dorr, Monz

Availability of parallel corpora

• European Documents
  • Languages of the EU
  • For two European languages (e.g., English and German), European documents such as the proceedings of the European Parliament are often used

• United Nations Documents
  • Official UN languages: Arabic, Chinese, English, French, Russian, Spanish
  • For any two languages out of the 6 United Nations languages we can obtain large amounts of parallel UN documents

• For other language pairs (e.g., German and Russian), it can be problematic to get parallel data

Most statistical machine translation research has focused on a few high-resource languages (European, Chinese, Japanese, Arabic).

[Figure: approximate parallel text available (with English), from most to least: French, Arabic, Chinese (~200M words); various Western European languages such as German, Spanish, Finnish: parliamentary proceedings, govt documents (~30M words); Bible/Koran/Book of Mormon/Dianetics (~1M words); nothing or the Univ. Decl. of Human Rights (~1K words). Other languages shown lower on the scale: Uzbek, Serbian, Kasem, Chechen, Tamil, Pwo.]

Modified from Schafer & Smith (AMTA 2006 Overview of Statistical MT)

Document alignment

• In the collections we have mentioned, the document alignment is given
  • We know which documents contain the proceedings of the UN General Assembly from Monday June 1st at 9am in all 6 languages

• It is also possible to find parallel web documents using cross-lingual information retrieval techniques

• Once we have the document alignment, we first need to "sentence align" the parallel documents

Sentence alignment

• If document D_e is a translation of document D_f, how do we find the translation for each sentence?
  • The n-th sentence in D_e is not necessarily the translation of the n-th sentence in document D_f

• In addition to 1:1 alignments, there are also 1:0, 0:1, 1:n, and n:1 alignments

• In European Parliament proceedings, approximately 90% of the sentence alignments are 1:1

Modified from Dorr, Monz

Sentence alignment

• There are several sentence alignment algorithms:
  – Align (Gale & Church): Aligns sentences based on their character length (shorter sentences tend to have shorter translations than longer sentences). Works well
  – Char-align (Church): Aligns based on shared character sequences. Works fine for similar languages or technical domains
  – K-Vec (Fung & Church): Induces a translation lexicon from the parallel texts based on the distribution of foreign-English word pairs
  – Cognates (Melamed): Use positions of cognates (including punctuation)
  – Length + Lexicon (Moore; Braune and Fraser): Two passes, high accuracy, freely available

Modified from Dorr, Monz
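
To make the length-based idea concrete, here is a toy dynamic-programming aligner. It is a simplified sketch and not the actual Gale & Church model: the real algorithm scores length differences with a probabilistic model, whereas this version uses an ad hoc relative length mismatch plus fixed penalties for non-1:1 beads. The function name, the penalty values and the factor 10.0 are assumptions chosen for illustration.

```python
def align_by_length(src, tgt, skip_penalty=3.0, merge_penalty=1.0):
    """Toy length-based sentence aligner (dynamic programming over beads)."""
    # Bead types: (source sentences, target sentences) -> fixed penalty.
    beads = {(1, 1): 0.0, (1, 0): skip_penalty, (0, 1): skip_penalty,
             (2, 1): merge_penalty, (1, 2): merge_penalty}
    inf = float("inf")
    n, m = len(src), len(tgt)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in beads.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                # Compare total character lengths of the two sides of the bead.
                ls = sum(len(s) for s in src[i:ni])
                lt = sum(len(t) for t in tgt[j:nj])
                mismatch = 10.0 * abs(ls - lt) / max(ls + lt, 1)
                new_cost = cost[i][j] + penalty + mismatch
                if new_cost < cost[ni][nj]:
                    cost[ni][nj] = new_cost
                    back[ni][nj] = (i, j)
    # Follow back-pointers from the end to recover the bead sequence.
    alignment, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        alignment.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return alignment[::-1]
```

The result is a list of beads, each a pair (source sentence indices, target sentence indices), so a 1:2 alignment shows up as one source index paired with two target indices.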

Word alignments

• Given a parallel sentence pair we can link (align) words or phrases that are translations of each other:

Modified from Dorr, Monz

• Word alignment is annotation of minimal translational correspondences

• Annotated in the context in which they occur

• Not idealized translations!

(solid blue lines annotated by a bilingual expert)

• Automatic word alignments are typically generated using a model called IBM Model 4

• No linguistic knowledge

• No correct alignments are supplied to the system

• Unsupervised learning

(red dashed line = automatically generated hypothesis)
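
To illustrate how word alignments can be learned without any supervision, here is a small sketch of EM training for IBM Model 1. Note the substitution: Model 1 is the simplest model in the IBM family and is used here only for illustration; Model 4 adds fertility and distortion modeling on top of the same idea. The sketch has no NULL word, and the three-sentence corpus and function names are assumptions.

```python
from collections import defaultdict

def train_ibm1(pairs, iterations=10):
    """EM training of IBM Model 1 lexical probabilities t[(f, e)] = P(f | e)."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: uniform)  # uniform start: no linguistic knowledge
    for _ in range(iterations):
        count = defaultdict(float)  # expected number of f-e links
        total = defaultdict(float)  # expected number of links per English word
        # E-step: distribute each foreign word over the English words in its sentence.
        for fs, es in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / norm
                    count[(f, e)] += c
                    total[e] += c
        # M-step: re-estimate probabilities from the expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

def align(fs, es, t):
    """Link each foreign word to its most probable English word."""
    return [(f, max(es, key=lambda e: t[(f, e)])) for f in fs]

corpus = [("das Haus".split(), "the house".split()),
          ("das Buch".split(), "the book".split()),
          ("ein Buch".split(), "a book".split())]
t = train_ibm1(corpus)
print(align("das Haus".split(), "the house".split(), t))
```

No correct alignments are given to the training function; the links emerge only from co-occurrence across sentence pairs.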

Uses of Word Alignment

• Multilingual
  – Machine Translation
  – Cross-Lingual Information Retrieval
  – Translingual Coding (Annotation Projection)
  – Document/Sentence Alignment
  – Extraction of Parallel Sentences from Comparable Corpora

• Monolingual
  – Paraphrasing
  – Query Expansion for Monolingual Information Retrieval
  – Summarization
  – Grammar Induction

Extracting Word-to-Word Dictionaries

• Given a word aligned corpus, we can extract word-to-word dictionaries

• We do this by looking at all links to "das".
  • If there are 1000 links to "das", and 700 of them are from "the", then we get a score of 70%

Example from Koehn 2008
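
The counting step can be sketched in a few lines. This is a minimal illustration, assuming the word-aligned corpus has already been flattened into a list of (source word, target word) links; the function name and the counts for "that" and "this" are made up for the example (only the 700 out of 1000 figure comes from the slide).

```python
from collections import Counter, defaultdict

def extract_dictionary(links):
    """Turn alignment links into relative-frequency translation scores."""
    counts = defaultdict(Counter)
    for src, tgt in links:
        counts[src][tgt] += 1
    return {src: {tgt: c / sum(tgt_counts.values())
                  for tgt, c in tgt_counts.items()}
            for src, tgt_counts in counts.items()}

# 1000 links to "das": 700 from "the", the rest split over hypothetical alternatives.
links = [("das", "the")] * 700 + [("das", "that")] * 200 + [("das", "this")] * 100
dictionary = extract_dictionary(links)
print(dictionary["das"])
```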

• Word-to-word dictionaries are useful
  – For example, they are used to translate queries in cross-lingual retrieval
    • Given the query "das Haus", the two query words are translated independently (we use all translations and the scores)

• However, they are too simple to capture larger units of meaning: they link exactly one token to one token
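
As a rough sketch of independent query translation, the function below looks up each query word on its own and keeps all translations with their scores as weights. The dictionary entries and scores are invented for illustration, and passing unknown words through unchanged with weight 1.0 is an assumption, not something the lecture specifies.

```python
def translate_query(query, dictionary):
    """Translate each query word independently, keeping all scored translations."""
    weighted = {}
    for word in query.split():
        # Unknown words are kept as-is (assumed fallback).
        for translation, score in dictionary.get(word, {word: 1.0}).items():
            weighted[translation] = weighted.get(translation, 0.0) + score
    return weighted

# Hypothetical dictionary scores.
toy_dictionary = {"das": {"the": 0.7, "that": 0.3},
                  "Haus": {"house": 0.8, "home": 0.2}}
print(translate_query("das Haus", toy_dictionary))
```

Because each word is handled separately, nothing in this scheme can capture that two words together form a larger unit of meaning.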

"Phrase" dictionaries

• Consider the links of two words that are next to each other in the source language

• The links to these two words are often next to each other in the target language too

• If this is true, we can extract a larger unit, relating two words in the source language to two words in the target language

• We call these "phrases"
  – WARNING: we may extract linguistic phrases, but much of what we extract is not a linguistic phrase!
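
The extraction idea can be sketched with the usual consistency check: a source span and a target span form a phrase pair if no word inside either span is linked to a word outside the other. This is a simplified sketch (it does not extend phrases over unaligned words, and the function name and `max_len` limit are assumptions); the example sentence pair and its alignment are made up.

```python
def extract_phrases(src, tgt, links, max_len=3):
    """Extract phrase pairs that are consistent with the word alignment."""
    phrases = set()
    for s1 in range(len(src)):
        for s2 in range(s1, min(s1 + max_len, len(src))):
            # Target positions linked to the source span [s1, s2].
            linked = [t for s, t in links if s1 <= s <= s2]
            if not linked:
                continue
            t1, t2 = min(linked), max(linked)
            if t2 - t1 >= max_len:
                continue
            # Consistency: no target word in [t1, t2] may link outside [s1, s2].
            if any(t1 <= t <= t2 and not s1 <= s <= s2 for s, t in links):
                continue
            phrases.add((" ".join(src[s1:s2 + 1]), " ".join(tgt[t1:t2 + 1])))
    return phrases

src = ["hat", "es", "gesehen"]
tgt = ["saw", "it"]
links = {(0, 0), (1, 1), (2, 0)}  # "hat" and "gesehen" both link to "saw"
print(extract_phrases(src, tgt, links))
```

In this example "hat" alone cannot be paired with "saw", because "saw" is also linked to "gesehen"; only the whole span is consistent.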

Slide from Koehn 2008

Using phrase dictionaries

• The dictionaries we extract like this are the key technology behind statistical machine translation systems

• Google Translate, for instance, uses phrase dictionaries for many language pairs

• There are further generalizations of this idea
  – We can introduce gaps in the phrases
    • Like: "hat GAP gemacht | did GAP"
    • The gaps are processed recursively
  – We can label the rules (and gaps) with syntactic constituents to try to control what goes inside the gap
    • Like: S/S -> "NP hat es gesehen | NP saw it"

Decoding

• Goal: find the best target translation of a source sentence

• Involves search
  – Find maximum probability path in a dynamically generated search graph

• Generate English string, from left to right, by covering parts of Foreign string
  – Generating English string left to right allows scoring with the n-gram language model

• Here is an example of one path
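
A heavily simplified sketch of the search is given below. It makes strong assumptions that a real decoder does not: the source is covered monotonically from left to right (no reordering), there is no n-gram language model, and the path score is just the product of phrase probabilities. The phrase table and its probabilities are invented for illustration.

```python
import math

def decode_monotone(words, phrase_table, max_len=3):
    """Find the highest-probability monotone segmentation and translation."""
    # best[i] = (log-probability, output phrases) of the best path covering words[:i]
    best = [(-math.inf, [])] * (len(words) + 1)
    best[0] = (0.0, [])
    for i in range(len(words)):
        score, output = best[i]
        if score == -math.inf:
            continue  # no path reaches this position
        for j in range(i + 1, min(i + max_len, len(words)) + 1):
            source_phrase = " ".join(words[i:j])
            for target_phrase, prob in phrase_table.get(source_phrase, {}).items():
                candidate = score + math.log(prob)
                if candidate > best[j][0]:
                    best[j] = (candidate, output + [target_phrase])
    return " ".join(best[-1][1])

# Hypothetical phrase dictionary with made-up probabilities.
phrase_table = {"das": {"the": 0.7, "that": 0.3},
                "Haus": {"house": 0.8},
                "das Haus": {"the house": 0.9},
                "ist": {"is": 1.0},
                "klein": {"small": 0.6, "little": 0.4}}
print(decode_monotone("das Haus ist klein".split(), phrase_table))
```

Each entry of `best` corresponds to a node in the search graph; extending a path by one phrase pair is one edge.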

• Slides will be on the course web page

• Other resources: Philipp Koehn’s book

• Kevin Knight’s tutorial on word alignment is long, but it is good!

Morphologically Rich Languages

• Statistical Machine Translation is often studied using English as the target language
  – English is not morphologically rich
  – Original work used French as source language
  – French only slightly richer than English (gender, verbs)

• When working with morphologically rich languages, must deal with morphology (and syntax too)!

• This is a major research focus of my group
  – Main talk tomorrow, a few basic issues today

Morphology

• We will use the term morphology loosely here
  – We will discuss two main phenomena: Inflection, Compounding
  – There is less work in SMT on modeling of these phenomena than there is on syntactic modeling
    • A lot of work on morphological reduction (e.g., make it like English if the target language is English)
    • Not much work on generating (necessary to translate to, for instance, Slavic languages or Finnish)

Inflection

Goldwater and McClosky 2005

Inflection

• Inflection
  – The best ideas here are to strip redundant morphology
    • For instance case markings that are not used in target language
  – Can also add pseudo-words
    • One interesting paper looks at translating Czech to English (Goldwater and McClosky)
    • Inflection which should be translated to a pronoun is simply replaced by a pseudo-word to match the pronoun in preprocessing

Compounds

– Find the best split by using word frequencies of components (Koehn 2003)

– Aktionsplan -> Akt Ion Plan or Aktion Plan?
  • Since Ion (English: ion) is not frequent, do not pick such a splitting!

– Work until 2010
  • Heuristic non-linguistic approaches based only on corpus statistics better than using hand-crafted morphological knowledge

– In 2010 we showed that using SMOR (Stuttgart Morphological Analyzer) together with corpus statistics is better (Fritzinger and Fraser WMT 2010)
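
The frequency-based idea can be sketched as follows: enumerate all ways of cutting the word into known parts and keep the split whose parts have the highest geometric mean frequency. This is a rough sketch of the corpus-statistics heuristic only (it does not use SMOR). The frequency table is invented, and dropping an optional linking "s" between parts is a simplifying assumption.

```python
from math import prod

def candidate_splits(word, freq, min_len=3):
    """Yield every way to cut `word` into known parts (optional linking 's' dropped)."""
    if word in freq:
        yield [word]
    for i in range(min_len, len(word) - min_len + 1):
        head = word[:i]
        if head not in freq:
            continue
        rests = [word[i:]]
        if word[i] == "s":  # allow a linking 's' as in Aktion-s-plan
            rests.append(word[i + 1:])
        for rest in rests:
            for tail in candidate_splits(rest, freq, min_len):
                yield [head] + tail

def split_compound(word, freq):
    """Pick the split whose parts have the highest geometric mean frequency."""
    candidates = list(candidate_splits(word.lower(), freq))
    if not candidates:
        return [word]  # unknown word: leave it unsplit
    return max(candidates,
               key=lambda parts: prod(freq[p] for p in parts) ** (1 / len(parts)))

# Hypothetical corpus frequencies (lowercased).
freq = {"aktionsplan": 5, "aktion": 900, "plan": 700, "akt": 500, "ion": 2}
print(split_compound("Aktionsplan", freq))
```

With these made-up counts, the rare part "ion" pulls down the score of "akt ion plan", so "aktion plan" wins.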

Syntax

• I'll talk a little about syntax tomorrow

• There are interesting models here, most require a constituency parse
  – Interestingly, some approaches parse the source language and some parse the target language

• Also some models that don't use parses (such as Hiero, "hierarchical phrase model")

Extracting Multilingual Information

• Word-aligned parallel corpora are one valuable source of bilingual information

• Other interesting multilingual extraction tasks include:
  – Translating words such as names between scripts ("transliteration")
  – Extracting the translations of technical terminology from comparable corpora
  – Extracting parallel sentences (or smaller units) from comparable corpora
  – Projecting linguistic annotation (such as syntactic treebank annotation) from one language to another

• Slide sources
  • The slides today are mostly from Philipp Koehn's course Statistical Machine Translation and from me (but see also attributions on individual slides)

• Thank you for your attention!