Page 1:

Distributional Semantics

Lecture "Computerlinguistische Techniken"

Alexander Koller

12 January 2016

Page 2:

World and Word Knowledge

• Semantic inferences require formalized knowledge about the world and about word meanings.

Which genetically caused connective tissue disorder has severe symptoms and complications regarding the aorta and skeletal features, and, very characteristically, ophthalmologic subluxation?

Marfan syndrome is caused by a defect of the gene that determines the structure of fibrillin-1. One of the symptoms is displacement of one or both of the eyes' lenses. The most serious complications affect the cardiovascular system, especially heart valves and the aorta.

Page 3:

The Knowledge Bottleneck

• The importance of formalized knowledge for CL applications has been accepted for decades. ‣ e.g. Bar-Hillel 1960: how to translate "the box is in the pen"?

• Broad-coverage formalization is impractical. ‣ even Cyc, e.g.: several million facts

‣ world knowledge is very extensive

‣ is predicate logic even a suitable formalism?

• Current perspective: formalize lexical knowledge by hand or automatically.

Page 4:

Query Expansion

searched for this

found that

Page 5:

Lexical Semantics

He's not pining! He's passed on! This parrot is no more! He has ceased to be! He's expired and gone to meet his maker! He's a stiff! Bereft of life, he rests in peace! His metabolic processes are now history! He's off the twig! He's kicked the bucket, he's shuffled off his mortal coil, run down the curtain and joined the bleedin' choir invisible!! THIS IS AN EX-PARROT!!

Relations between the meanings of words: e.g. synonymy

Page 6:

Semantic Relations

• Lexical semantics describes possible semantic relations between words: ‣ Synonymy: words mean the same thing.

Apfelsine/Orange; Bildschirm/Monitor ("screen"/"monitor"); etc.

‣ Hyponymy: one word is a superordinate term of the other. Auto/Fahrzeug (car/vehicle); Blume/Pflanze (flower/plant); etc.

‣ Antonymy: words describe opposites. gewinnen/verlieren (win/lose); heiß/kalt (hot/cold); etc.

Page 7:

WordNet

[Figure: excerpt from the WordNet hyponymy hierarchy. One path: entity → physical object → artifact → structure → building complex → {plant#1, works, industrial plant}. Another path: entity → physical object → living thing → organism → {plant#2, flora, plant life}.]

edge = hyponymy; same node = synonymy. http://wordnet.princeton.edu/

Page 8:

Lexical Ambiguities

• Polysemy: a word has two different senses that are related to each other. ‣ Schule #1: institution in which pupils learn

‣ Schule #2: building in which Schule #1 operates

• Homonymy: a word has two different senses that are not related. ‣ Bank #1: financial institution

‣ Bank #2: bench (seating)

Page 9:

Word Sense Disambiguation

• Word sense disambiguation (WSD) is the problem of tagging each word token with its word sense.

• The accuracy of WSD depends on the sense inventory. State of the art: 90% on coarse-grained senses.

• Typical techniques do supervised training on smaller amounts of data and extend the model with unsupervised methods (a minimal supervised sketch follows below).
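The slides name no specific algorithm; purely as an illustration of the supervised part, here is a minimal bag-of-context-words Naive Bayes sense tagger. The function names and the toy sense-annotated examples for German "Bank" are hypothetical, not from the lecture.

```python
import math
from collections import Counter, defaultdict

def train_wsd(examples):
    """Train a Naive Bayes sense classifier.
    examples: list of (context_words, sense) pairs."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for context, sense in examples:
        sense_counts[sense] += 1
        for w in context:
            word_counts[sense][w] += 1
            vocab.add(w)
    return sense_counts, word_counts, vocab

def disambiguate(context, sense_counts, word_counts, vocab):
    """Pick the sense maximizing log P(sense) + sum log P(word | sense),
    with add-one smoothing."""
    total = sum(sense_counts.values())
    best, best_score = None, float("-inf")
    for sense, n in sense_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in context:
            score += math.log((word_counts[sense][w] + 1) / denom)
        if score > best_score:
            best, best_score = sense, score
    return best

# Toy sense-annotated data (hypothetical):
train = [
    (["Geld", "Konto", "eröffnen"], "Bank#1"),
    (["Kredit", "Zinsen", "Geld"], "Bank#1"),
    (["Park", "sitzen", "Holz"], "Bank#2"),
    (["sitzen", "Garten"], "Bank#2"),
]
model = train_wsd(train)
print(disambiguate(["Konto", "Geld"], *model))   # Bank#1
print(disambiguate(["Park", "sitzen"], *model))  # Bank#2
```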

Page 10:

Problem

• Hand-built thesauri are far too small. ‣ English WordNet: 117,000 synsets

‣ GermaNet: 85,000 synsets

• The number of English words in the Google n-gram corpus is > 1 million.

• So this does not solve the query-expansion problem.

• Learn semantic relations automatically?

Page 11:

Experiment 1

(after slides by Katrin Erk)

Page 19:

Experiment 1

Doc 1: Guantanamo USA Verschlüsselung Yahoo Enthüllungen rechtsstaatliche → spiegel.de on PRISM

Doc 2: Raumschiff Macht Imperator Todesstern Vater → Wikipedia on "Star Wars"

Doc 3: kontext-freie Algorithmus dynamische Tabelle Chomsky-Normalform → Wikipedia on the CKY parser

Doc 4: Erntebemühungen Anbaufläche Sie Gurken Pflänzchen Zentimeter → www.gartenbau.org

(after slides by Katrin Erk)

Page 20:

Experiment 2

(Stefan Evert, tutorial at NAACL 2010)

Page 26:

Experiment 2

• What is "bardiwac"? In the corpus you find: ‣ He handed her a glass of bardiwac.

‣ Nigel staggered to his feet, face flushed from too much bardiwac.

‣ Malbec, one of the lesser-known bardiwac grapes, responds well to Australia's sunshine.

‣ The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish.

(Stefan Evert, tutorial at NAACL 2010)

→ Bardiwac is a red wine.

Page 27:

Distributional Semantics

• An approach for learning the semantic similarity of words from unannotated data. ‣ similarity as an approximation of synonymy

‣ the lexicon can automatically grow arbitrarily large

• Meaning of a word ≈ distribution of the other words that occur together with it.

• The basic idea dates from the 1950s (Harris 1951): "You shall know a word by the company it keeps." (the quote is from Firth)

Page 28:

Co-occurrence

• What does it mean for two words to "occur together"?

• Simplest approach: count in a corpus how often word w1 occurs in a k-word window around word w2 (a sketch of this counting step follows below).

see who can grow the biggest flower. Can we buy some fibre, please
Abu Dhabi grow like a hot-house flower, but decided themselves to follow the
as a physical level. The Bach Flower Remedies are prepared from non-poisonous wild
a seed from which a strong tree will grow. This is the finest

(k = 6, British National Corpus)
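A minimal Python sketch of this counting step, not code from the lecture; whitespace tokenization and the fixed target/context vocabularies are simplifying assumptions:

```python
from collections import defaultdict, Counter

def cooccurrence_counts(tokens, targets, context_words, k=6):
    """Count how often each context word occurs within a k-word
    window (k tokens to the left and right) of each target word."""
    counts = defaultdict(Counter)
    for i, w in enumerate(tokens):
        if w in targets:
            window = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
            for c in window:
                if c in context_words:
                    counts[w][c] += 1
    return counts

tokens = "see who can grow the biggest flower can we buy some fibre".split()
counts = cooccurrence_counts(tokens, targets={"flower"},
                             context_words={"grow", "garden"}, k=6)
print(counts["flower"]["grow"])  # 1: "grow" is 3 tokens to the left of "flower"
```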

Page 29:

Co-occurrence

              factory  flower  tree  plant  water  fork
grow               15     147   330    517    106     3
garden              5     200   198    316    118    17
worker            279       0     5     84     18     0
production        102       6     9    130     28     0
wild                3     216    35     96     30     0

Figure 108.4: Some co-occurrence vectors from the British National Corpus.

[Figure 108.5: Graphical illustration of co-occurrence vectors (factory, flower, tree, plant).]

through counts of context words occurring in the neighborhood of target word instances. Take, as in the WSD example above, the n (e.g., 2000) most frequent content words in a corpus as the set of relevant context words; then count, for each word w, how often each of these context words occurred in a context window of n before or after each occurrence of w. Fig. 108.4 shows the co-occurrence counts for a number of target words (columns), and a selection of context words (rows) obtained from a 10% portion of the British National Corpus (Clear 1993).

The resulting frequency pattern encodes information about the meaning of w. According to the Distributional Hypothesis, we can model the semantic similarity between two words by computing the similarity between their co-occurrences with the context words. In the example of Fig. 108.4, the target flower co-occurs frequently with the context words grow and garden, and infrequently with production and worker. The target word tree has a similar distribution, but the target factory shows the opposite co-occurrence pattern with these four context words. This is evidence that trees and flowers are more similar to each other than to factories.

Technically, we represent each word w as a vector in a high-dimensional vector space.

Co-occurrence matrix for the BNC, from Koller & Pinkal 12

Page 30:

Vector Space Model

(co-occurrence matrix and Figure 108.5 as on the previous page)

Vectors in a high-dimensional vector space

1 dimension per context word (here: 6 dimensions)

Picture simplified to 2 dimensions; purely schematic.

Page 31:

Similarity

• From the vector space model we can now derive the similarity between words.

• 1st attempt: similar = Euclidean distance is small


dist(\vec{v}, \vec{w}) = \sqrt{\sum_{i=1}^{n} (v_i - w_i)^2}

not particularly sensible (dominated by absolute word frequencies)

Page 32:

Cosine Similarity

• 2nd attempt: similar = angle is small. ‣ ignores the length of the vectors = absolute word frequencies (that is good)

‣ context words occur with proportionally similar frequencies

• The cosine of the angle is easy to compute: ‣ cos = 1 means angle = 0°, i.e. very similar

‣ cos = 0 means angle = 90°, i.e. very dissimilar


cos(\vec{v}, \vec{w}) = \frac{\sum_{i=1}^{n} v_i \cdot w_i}{\sqrt{\sum_{i=1}^{n} v_i^2} \cdot \sqrt{\sum_{i=1}^{n} w_i^2}}

cos(tree, flower) = 0.75, i.e. 40°; cos(tree, factory) = 0.05, i.e. 85°
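As a quick check (not from the slides), here is a sketch that computes cosine similarity on the vectors of Figure 108.4; using only the five context-word dimensions printed there, the values come out close to the slide's numbers:

```python
import math

# Columns of Figure 108.4: counts of the context words
# grow, garden, worker, production, wild for each target word.
vectors = {
    "factory": [15, 5, 279, 102, 3],
    "flower":  [147, 200, 0, 6, 216],
    "tree":    [330, 198, 5, 9, 35],
}

def cosine(v, w):
    """cos(v, w) = (v . w) / (|v| |w|)"""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm = math.sqrt(sum(vi**2 for vi in v)) * math.sqrt(sum(wi**2 for wi in w))
    return dot / norm

for a, b in [("tree", "flower"), ("tree", "factory")]:
    c = cosine(vectors[a], vectors[b])
    print(a, b, round(c, 2), round(math.degrees(math.acos(c))))
# tree flower 0.75 41   -- close to the slide's 0.75 / 40 degrees
# tree factory 0.07 86  -- close to the slide's 0.05 / 85 degrees
```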

Page 33:

What Have We Achieved?

• A measure of semantic similarity ‣ compute the co-occurrence matrix for all word pairs from unannotated text

‣ on this basis, a similarity measure, e.g. cosine

‣ easy to compute for arbitrarily large amounts of text

• Possible extensions: ‣ more complex features and feature weights

‣ dimensionality reduction

‣ compositionality

Page 34:

Uninformative Dimensions

• Not all context words are equally informative. ‣ co-occurrence with "grow" vs. with "the"

• Simplest approach: specify certain frequent words by hand and ignore them when computing similarity. ‣ in information retrieval, such words are called "stop words"

• More generally: learn the weighting of dimensions automatically (one standard weighting scheme is sketched below).
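The slide does not commit to a particular scheme; one standard choice in the literature (an assumption here, not named on the slide) is positive pointwise mutual information (PPMI), which downweights dimensions for words like "the" that are frequent everywhere. A minimal sketch:

```python
import math

def ppmi_matrix(counts):
    """Reweight raw co-occurrence counts with positive PMI:
    PPMI(t, c) = max(0, log2( P(t, c) / (P(t) * P(c)) )).
    counts: dict mapping (target, context) -> raw count."""
    total = sum(counts.values())
    t_tot, c_tot = {}, {}
    for (t, c), n in counts.items():
        t_tot[t] = t_tot.get(t, 0) + n
        c_tot[c] = c_tot.get(c, 0) + n
    ppmi = {}
    for (t, c), n in counts.items():
        pmi = math.log2((n / total) / ((t_tot[t] / total) * (c_tot[c] / total)))
        ppmi[(t, c)] = max(0.0, pmi)
    return ppmi

# Toy counts: the frequent, uninformative context word "the" gets
# weight 0 here, the informative "grow" a positive weight.
counts = {("flower", "grow"): 147, ("flower", "the"): 1000,
          ("factory", "grow"): 15, ("factory", "the"): 1000}
weights = ppmi_matrix(counts)
print(weights[("flower", "grow")], weights[("flower", "the")])  # ~0.77  0.0
```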

Page 35:

More Complex Features

• Plain co-occurrence counts overestimate how much words really "occur together".

• Proposed solution: more complex features that also capture syntactic relations between words (Lin 98); see the sketch after the figure below. ‣ no longer count: does "flower" occur in a window of 7 words around "Abu Dhabi"?

‣ but rather: does "flower" occur as the subject of "grow"?

the Qataris had watched Abu Dhabi grow like a hot-house flower, but decided

[Figure (© Evert/Baroni/Lenci, CC-by-sa, DSM Tutorial, wordspace.collocations.de): geometric interpretation. The row vector x_dog describes the usage of the word dog in the corpus and can be read as the coordinates of a point in n-dimensional Euclidean space; illustrated for the two dimensions (get, obj) and (use, obj) of an English V-Obj DSM, with x_dog = (115, 10), alongside cat, knife, and boat.]
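To make the change of feature space concrete (a sketch, not Lin's actual system): given head-relation-dependent triples from some dependency parser, count (head, relation) pairs as dimensions instead of window neighbors. The triples below are hand-written stand-ins for parser output.

```python
from collections import defaultdict, Counter

def syntactic_cooccurrences(triples):
    """Turn dependency triples (head, relation, dependent) into
    co-occurrence counts whose dimensions are (head, relation) pairs,
    in the spirit of Lin (1998)."""
    counts = defaultdict(Counter)
    for head, rel, dep in triples:
        counts[dep][(head, rel)] += 1
    return counts

# Hand-written stand-ins for the output of a dependency parser:
triples = [
    ("grow", "subj", "flower"),
    ("grow", "subj", "tree"),
    ("water", "obj", "flower"),
]
counts = syntactic_cooccurrences(triples)
print(counts["flower"])  # Counter({('grow', 'subj'): 1, ('water', 'obj'): 1})
```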

Page 36:

Result

Semantic distances: the main result of distributional analysis is "semantic" distances between words. Typical applications: nearest neighbours, clustering of related words, constructing a semantic map.

[Figures (© Evert/Baroni/Lenci, CC-by-sa, DSM Tutorial, wordspace.collocations.de): "Word space clustering of concrete nouns (V-Obj from BNC)", a dendrogram over nouns such as dog, eagle, banana, hammer, and truck; and "Semantic map (V-Obj from BNC)", in which the same nouns group into the categories bird, ground animal, fruit, tree, green, tool, and vehicle.]

(Evert, NAACL Tutorial 2010)

Page 37:

Results

(results from Lin 98, from J&M)

hope (N): optimism 0.141, chance 0.137, expectation 0.136, prospect 0.126, dream 0.119, desire 0.118, fear 0.116, effort 0.111, confidence 0.109, promise 0.108

hope (V): would like 0.158, wish 0.140, plan 0.139, say 0.137, believe 0.135, think 0.133, agree 0.130, wonder 0.130, try 0.127, decide 0.125

brief (N): legal brief 0.139, affidavit 0.103, filing 0.098, petition 0.086, document 0.083, argument 0.083, letter 0.079, rebuttal 0.078, memo 0.077, article 0.076

brief (A): lengthy 0.256, hour-long 0.191, short 0.173, extended 0.163, frequent 0.162, recent 0.158, short-lived 0.155, prolonged 0.149, week-long 0.149, occasional 0.146

Page 38:

Problems

• Similarity = synonymy? ‣ Synonyms are distributionally very similar.

‣ But antonyms and (to a lesser degree) hyponyms are also distributionally very similar.

• Distributional similarity is not referential similarity. Recognizing antonyms is a notoriously hard problem.

brief (A): lengthy 0.256, hour-long 0.191, short 0.173, extended 0.163, frequent 0.162, recent 0.158, short-lived 0.155, prolonged 0.149, week-long 0.149, occasional 0.146

Page 39:

Compositional Distributional Semantics

• Current trend: compositional computation of distributional representations for larger phrases from those of words.

• E.g. Mitchell & Lapata 08: compute the co-occurrence vector of a phrase by adding up the word vectors (sketched below).

• Seems linguistically dubious, but correlates with human similarity judgments.
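Mitchell & Lapata's additive model is simple enough to state in a few lines; the vector values below are toy numbers, not their data:

```python
# Additive composition: the vector of a phrase is the
# element-wise sum of its word vectors (toy values).
def add_vectors(v, w):
    return [vi + wi for vi, wi in zip(v, w)]

red  = [2.0, 0.5, 0.1]
wine = [1.0, 3.0, 0.4]
red_wine = add_vectors(red, wine)
print(red_wine)  # [3.0, 3.5, 0.5]
```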

Page 40:

Compositional Distributional Semantics

• Baroni & Zamparelli (2010): "Nouns are vectors, adjectives are matrices" (= functions). ‣ learns a matrix for each adjective A such that A * N approximates the co-occurrence vector of "A N" (for all N); a least-squares sketch follows below

• Cf. the application of adjectives to nouns in Montague grammar.
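A sketch of the training idea under simplifying assumptions: Baroni & Zamparelli fit the matrices with partial least squares regression, whereas this sketch uses ordinary least squares (numpy), and the vectors are random stand-ins for real co-occurrence vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5          # toy dimensionality of the semantic space
num_nouns = 50

# Toy corpus-observed vectors: nouns N and the phrases "A N"
# for one adjective A (random stand-ins for real data).
N = rng.normal(size=(num_nouns, d))            # noun vectors, one per row
AN = N @ rng.normal(size=(d, d)) + 0.01 * rng.normal(size=(num_nouns, d))

# Learn the adjective matrix A so that N @ A approximates AN.
A, *_ = np.linalg.lstsq(N, AN, rcond=None)

# Apply the adjective to a held-out noun vector:
new_noun = rng.normal(size=d)
predicted_AN = new_noun @ A
print(predicted_AN.shape)  # (5,)
```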

related to the definition of the adjective (mental activity, historical event, green colour, quick and little cost for easy N), and so on.

American N:    Am. representative, Am. territory, Am. source
black N:       black face, black hand, black (n)
easy N:        easy start, quick, little cost
green N:       green (n), red road, green colour
historical N:  historical, hist. event, hist. content
mental N:      mental activity, mental experience, mental energy
necessary N:   necessary, necessary degree, sufficient
nice N:        nice, good bit, nice break
young N:       youthful, young doctor, young staff

Table 1: Nearest 3 neighbors of centroids of ANs that share the same adjective.

How about the neighbors of specific ANs? Table 2 reports the nearest 3 neighbors of 9 randomly selected ANs involving different adjectives (we inspected a larger random set, coming to similar conclusions to the ones emerging from this table).

bad luck:                  bad, bad weekend, good spirit
electronic communication:  elec. storage, elec. transmission, purpose
historical map:            topographical, atlas, hist. material
important route:           important transport, important road, major road
nice girl:                 good girl, big girl, guy
little war:                great war, major war, small war
red cover:                 black cover, hardback, red label
special collection:        general collection, small collection, archives
young husband:             small son, small daughter, mistress

Table 2: Nearest 3 neighbors of specific ANs.

The nearest neighbors of the corpus-based AN vectors in Table 2 make in general intuitive sense. Importantly, the neighbors pick up the composite meaning rather than that of the adjective or noun alone. For example, cover is an ambiguous word, but the hardback neighbor relates to its "front of a book" meaning that is the most natural one in combination with red. Similarly, it makes more sense that a young husband (rather than an old one) would have small sons and daughters (not to mention the mistress!).

We realize that the evidence presented here is of a very preliminary and intuitive nature. Indeed, we will argue in the next section that there are cases in which the corpus-derived AN vector might not be a good approximation to our semantic intuitions about the AN, and a model-composed AN vector is a better semantic surrogate. One of the most important avenues for further work will be to come to a better characterization of the behaviour of corpus-observed ANs, where they work and where they don't. Still, the neighbors of average and AN-specific vectors of Tables 1 and 2 suggest that, for the bulk of ANs, such corpus-based co-occurrence vectors are semantically reasonable.

6 Study 2: Predicting AN vectors

Having tentatively established that the sort of vectors we can harvest for ANs by directly collecting their corpus co-occurrences are reasonable representations of their composite meaning, we move on to the core question of whether it is possible to reconstruct the vector for an unobserved AN from information about its components. We use nearness to the corpus-observed vectors of held-out ANs as a very direct way to evaluate the quality of model-generated ANs, since we just saw that the observed ANs look reasonable (but see the caveats at the end of this section). We leave it to further work to assess the quality of the generated ANs in an applied setting, for example adapting Mitchell and Lapata's paraphrasing task to ANs. Since the observed vectors look like plausible representations of composite meaning, we expect that the closer the model-generated vectors are to the observed ones, the better they should also perform in any task that requires access to the composite meaning, and thus that the results of the current evaluation should correlate with applied performance.

More in detail, we evaluate here the composition methods (and the adjective and noun baselines) by computing, for each of them, the cosine of the test set AN vectors they generate (the "predicted" ANs) with the 41K vectors representing our extended vocabulary in semantic space, and looking at the position of the corresponding observed ANs (that were not used for training, in the supervised approaches)

Page 41:

Summary

• The "knowledge bottleneck" is a very serious problem in semantic processing.

• An important topic in current research: distributional methods for the semantic similarity of words.

• Current trend: combination with compositional methods.