I’m a huge metal fan! Mariana Romanyshyn Computational Linguist at Grammarly, Inc.

I’m a huge metal fan!...Word and Relational Similarity Camacho-Collados et al. (2016), Nasari: Integrating explicit knowledge and corpus statistics for a multilingual representation

Jan 21, 2020


dariahiddleston

I’m a huge metal fan!

Mariana Romanyshyn
Computational Linguist at Grammarly, Inc.


1. The Matter of Meaning


Words have meanings

Image by Tetiana Turchyn


Homonymous “bank”

● a financial institution
● an area of land along the side of a river

Polysemous “man”

● humanity
● the male part of humanity
● the adult male part of humanity
● a person

Homonymy vs. Polysemy


● ~40% of English words are polysemous
● verbs are the most polysemous (~55% in WordNet)
● resources disagree
  ○ “head”, noun:
    ■ 11 meanings - Macmillan Dictionary
    ■ 16 meanings - Longman Dictionary
    ■ 33 meanings - WordNet
    ■ 34 meanings - Oxford Dictionary
● meanings overlap
  ○ John works for the newspaper that you are reading.

Is it serious?


Triangle inequality in word embeddings.

What does it mean for NLP?

Example from Neelakantan et al. (2014)


Word embeddings => sense embeddings

What does it mean for NLP?

Example from Neelakantan et al. (2014)


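... зробити так, щоби впала стіна? ("... make the wall come down?")

Senses of “стіна” (“wall”):

● стіна будинку (the wall of a building)
● стіни айсбергів (walls of icebergs)
● мур (a fortified wall)
● те, що відокремлює, роз'єднує (something that separates or divides)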

Is it just English?


Can’t deep learning just figure it out?


US sells arms to countries well-known for violating human rights.

Using recycled prostheses, a hospital in Tanzania sells arms for around $500 each. There is also high demand for legs.

Text classification/mining


Machine translation

Example from Google Translate


You: I need to buy a big plant for my mom. She likes gardening!

Siri: Hmm...

Personal assistants


Interest rates are very high.

These socks are a little high. (= smelly)

This area is rich in natural resources.

These comments are a bit rich coming from someone with no money worries.

Sentiment analysis


Abstract or concrete?

Man is rapidly destroying the earth.

Do you recognize {man=>the man} in the grey suit?

Error correction


Countable or uncountable?

This is a minor but moving work of literature.

Employees may take {a work=>work} home if they wish.

Error correction


Standard vs. non-standard

I believe women should be paid the same as men.

All {men=>people} are equal in the sight of the law.

Error correction


Animate or inanimate?

The software learns models from large quantities of data.

How to {learn=>teach} a model to flip her hair.

The chair was placed in the museum. {He=>It}'s part of the exhibit now.

The chair was awarded for a poem. He’s famous now.

Error correction


● senses = domains?

● senses = sentiments?

● senses = animate/inanimate?

● senses = jargon/standard?

● senses = countable/uncountable?

● senses = senses?

What is a “sense”, then?


2. Resources


Dictionaries

Example from en.wiktionary.org


Dictionaries

Example from www.ldoceonline.com


Ontologies

Example of relations in WordNet


Knowledge Graph


Wikipedia, Wikidata, DBpedia


BabelNet

Example from babelnet.org


<wf>The</wf>

<wf lemma="model" wnsn="3">model</wf>

<wf lemma="quite" wnsn="1">quite</wf>

<wf lemma="plainly" wnsn="1">plainly</wf>

<wf lemma="think" wnsn="1">thought</wf>

<wf lemma="person" wnsn="1">Michelangelo</wf>

<wf lemma="crazy" wnsn="1">crazy</wf>

<wf>;</wf>

Corpora: SemCor

http://web.eecs.umich.edu/~mihalcea/downloads.html
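
Sense-tagged tokens in this format can be read with a standard XML parser. The sketch below is only an illustration: it wraps the tokens shown on the slide in a hypothetical `<s>` root element (an assumption, since real SemCor files carry more structure and attributes) and keeps the `wnsn` value as a string.

```python
import xml.etree.ElementTree as ET

# Tokens from the slide, wrapped in a hypothetical <s> root element (assumption).
SAMPLE = """<s>
<wf>The</wf>
<wf lemma="model" wnsn="3">model</wf>
<wf lemma="quite" wnsn="1">quite</wf>
<wf lemma="plainly" wnsn="1">plainly</wf>
<wf lemma="think" wnsn="1">thought</wf>
<wf lemma="person" wnsn="1">Michelangelo</wf>
<wf lemma="crazy" wnsn="1">crazy</wf>
<wf>;</wf>
</s>"""


def read_sense_tags(xml_text):
    """Return (token, lemma, sense_number) triples in document order.

    Untagged tokens (articles, punctuation) get None for lemma and sense.
    """
    triples = []
    for wf in ET.fromstring(xml_text).iter("wf"):
        triples.append((wf.text, wf.get("lemma"), wf.get("wnsn")))
    return triples


tagged = read_sense_tags(SAMPLE)
for token, lemma, sense in tagged:
    print(token, lemma, sense)
```

Each tagged token pairs a lemma with a WordNet sense number, which is the supervision signal the supervised methods later in the talk rely on.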


Beverly Johnson (born October 13, 1952) is an [American|"United States"] [model|"Model (person)"], [actress|"Actress"], [singer|"Singer"], and [businesswoman|"Businesswoman"].

Corpora: Wikipedia

https://en.wikipedia.org/wiki/Beverly_Johnson
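
Wikipedia links act as free sense annotations: the anchor text is the ambiguous word and the link target is its sense. A minimal sketch, assuming the `[surface|"Title"]` notation used on this slide (this is the slide's shorthand, not actual wiki markup), and with hypothetical function names:

```python
import re

# Matches the slide's notation: [surface text|"Page title"]
LINK = re.compile(r'\[([^|\]]+)\|"([^"]+)"\]')

TEXT = ('Beverly Johnson (born October 13, 1952) is an '
        '[American|"United States"] [model|"Model (person)"], '
        '[actress|"Actress"], [singer|"Singer"], and '
        '[businesswoman|"Businesswoman"].')


def extract_links(text):
    """Return (surface form, linked page title) pairs."""
    return LINK.findall(text)


def strip_links(text):
    """Return the plain sentence with only the surface forms left."""
    return LINK.sub(r"\1", text)


links = extract_links(TEXT)
plain = strip_links(TEXT)
print(links)
print(plain)
```

Here "model" is tied to the page "Model (person)", which separates it from, say, a statistical model.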


3. Supervised word-sense disambiguation


Features:

● collocations
● bag of words

Containing:

● word
● lemma
● part of speech
● dependencies

If you have a corpus...


Simon works at an industrial plant.n.1 as an engineer.

Ngrams: [industrial plant, plant as, an industrial plant,...]

Syngrams: [works:prep_at:plant, work:prep:as, plant:amod:industrial,...]

Collocations

Parse tree by nlp.stanford.edu:8080/parser/
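
The n-gram half of the collocation features can be sketched in a few lines. The function name and the 2..3 window are assumptions made for illustration. Syngrams are not shown because they need a dependency parser.

```python
def ngrams_around(tokens, index, max_n=3):
    """Collect every n-gram (n = 2..max_n) that contains tokens[index]."""
    grams = []
    for n in range(2, max_n + 1):
        # Slide a window of size n over all positions that cover the target.
        for start in range(index - n + 1, index + 1):
            end = start + n
            if start >= 0 and end <= len(tokens):
                grams.append(" ".join(tokens[start:end]))
    return grams


tokens = "Simon works at an industrial plant as an engineer".split()
grams = ngrams_around(tokens, tokens.index("plant"))
print(grams)
```

The result includes "industrial plant", "plant as" and "an industrial plant", matching the n-grams listed on the slide.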


Simon works at an industrial plant as an engineer.

plant: [soil, assembly, root, industrial, contraband, agent, work...] [0, 0, 0, 1, 0, 0, 1...]

Idea

● use a predefined set of context words for each word
● useful for homonyms, to detect the general topic

Bag of words
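
A minimal sketch of the bag-of-words feature, using the context words shown on the slide. The lemmas are written by hand here (an assumption; a real pipeline would use a lemmatizer), and the function name is hypothetical.

```python
# Predefined context words for "plant", taken from the slide.
CONTEXT_WORDS = ["soil", "assembly", "root", "industrial",
                 "contraband", "agent", "work"]


def bag_of_words(lemmas, vocabulary=CONTEXT_WORDS):
    """Binary vector: 1 if the context word occurs among the lemmas, else 0."""
    present = set(lemmas)
    return [1 if word in present else 0 for word in vocabulary]


# Hand-lemmatized version of the example sentence (assumption).
lemmas = ["simon", "work", "at", "an", "industrial", "plant",
          "as", "an", "engineer"]
vector = bag_of_words(lemmas)
print(vector)
```

Only "industrial" and "work" fire, which reproduces the vector on the slide.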


1. Annotate corpora

I need to buy a big plant.n.1 for my mom. She likes gardening!
Simon works at an industrial plant.n.2 as an engineer.

2. Build sense embeddings

Results


SensEmbed vectors

Example from Iacobacci et al. (2015)


Nasari vectors

Example from Camacho-Collados et al. (2016)


1. Where do I get annotated data...
2. Where do I get these bags of words...

...for each word and each sense that I need in my task?

A couple of questions...


4. Linguistically-motivated word-sense disambiguation


With which sense signature does your context overlap the most?

Lesk


Simon works at an industrial plant as an engineer.

Lesk

Example from WordNet


How to find context words?

● filter out function words
● take lemmas
● for the signature of each sense, use
  ○ examples
  ○ definitions
  ○ related terms
  ○ synonyms, hyponyms, hypernyms, holonyms, meronyms...
  ○ sentences from corpora, etc.

Lesk
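
The steps above can be sketched as a simplified Lesk: filter the context, then pick the sense whose signature shares the most lemmas with it. The signatures below are hand-written toy sets (an assumption, not taken from any dictionary), the input is already lemmatized, and all names are hypothetical.

```python
STOPWORDS = {"a", "an", "the", "at", "as", "for", "of", "in", "to"}

# Toy sense signatures for "plant", written by hand for illustration only.
SIGNATURES = {
    "plant (factory)": {"building", "factory", "industrial",
                        "labor", "work", "machine"},
    "plant (organism)": {"living", "organism", "grow",
                         "soil", "root", "leaf", "garden"},
}


def simplified_lesk(context_lemmas, target, signatures):
    """Return (best sense, overlap counts) for the target word."""
    # Drop function words and the target itself from the context.
    context = {w for w in context_lemmas
               if w not in STOPWORDS and w != target}
    scores = {sense: len(context & signature)
              for sense, signature in signatures.items()}
    return max(scores, key=scores.get), scores


lemmas = ["simon", "work", "at", "an", "industrial", "plant",
          "as", "an", "engineer"]
best, scores = simplified_lesk(lemmas, "plant", SIGNATURES)
print(best, scores)
```

With richer signatures (definitions, examples, related terms) the same loop applies unchanged; only the sets grow.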


How to compute overlap?

● number of overlapping words

● weighted by the number of occurrences

● weighted by −log(P(w))

● weighted by IDF score: log(N / n_w), where N is the total number of documents and n_w is the number of documents that contain the word w

● weighted by ontological distance

Lesk
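
As an illustration of the IDF option, the sketch below weights each overlapping word by log(N / n_w) instead of counting it as 1. The four-document corpus and the function names are made up for the example, and words never seen in the corpus get weight 0 here, which is a simplifying assumption.

```python
import math


def idf_weights(documents):
    """IDF(w) = log(N / n_w), with n_w = number of documents containing w."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for word in set(doc):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / count)
            for word, count in doc_freq.items()}


def weighted_overlap(context, signature, idf):
    """Sum the IDF weights of the words shared by context and signature."""
    return sum(idf.get(word, 0.0) for word in set(context) & set(signature))


# Toy corpus, invented for illustration only.
documents = [
    ["work", "industrial", "plant", "engineer"],
    ["work", "garden", "plant", "soil"],
    ["work", "office", "engineer"],
    ["garden", "soil", "root", "plant"],
]
idf = idf_weights(documents)
score = weighted_overlap({"work", "industrial", "engineer"},
                         {"industrial", "work", "building"}, idf)
print(round(score, 3))
```

The rare word "industrial" contributes far more to the score than the frequent word "work", which is the point of the weighting.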


Which sense is the closest to context words?

Graph-Based

Example from Navigli and Lapata (2010)


Which sense of the context word to choose?

Graph-Based

Example from Navigli and Lapata (2010)
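
One simple way to read "closest" is shortest-path distance in a semantic graph. The sketch below uses a tiny hypothetical graph and scores each candidate sense by the sum of inverse distances to the context nodes. It treats the context words as already disambiguated, which is a simplification of the question raised on this slide, and it is not the connectivity measure studied by Navigli and Lapata (2010).

```python
from collections import deque

# Hypothetical toy graph: nodes are senses, edges are semantic relations.
EDGES = [
    ("plant.factory", "factory"),
    ("factory", "industry"),
    ("industry", "industrial"),
    ("factory", "engineer"),
    ("engineer", "work"),
    ("plant.organism", "organism"),
    ("organism", "garden"),
    ("garden", "soil"),
]


def build_graph(edges):
    """Build an undirected adjacency map from an edge list."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def distance(graph, source, target):
    """Shortest path length via breadth-first search, or None if unreachable."""
    if source not in graph or target not in graph:
        return None
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def closest_sense(graph, candidates, context_nodes):
    """Pick the candidate with the highest sum of 1/distance to the context."""
    def score(sense):
        total = 0.0
        for node in context_nodes:
            d = distance(graph, sense, node)
            if d:  # skip unreachable nodes and zero distance
                total += 1.0 / d
        return total
    return max(candidates, key=score)


graph = build_graph(EDGES)
best = closest_sense(graph, ["plant.factory", "plant.organism"],
                     ["industrial", "engineer", "work"])
print(best)
```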


Graph-Based

Simon works at an industrial plant as an engineer.


Graph-Based

Demo: http://lcl.uniroma1.it/adw/


Pros:

● good for partially annotating corpora
  ○ can be continued in a semi-supervised fashion
● good for building bag-of-words feature sets
● unreasonably effective: ~0.7 precision and ~0.7 recall

Cons:

● some senses are poorly covered
● mapping e.g. WordNet and Wikipedia is a tricky task

Impact


One sense per discourse!

I bought a plant yesterday and put it in my small tank with some inch long baby cichlids. Lost 3 fish over night i never lose fish. i dont see any nibbles on the plant though.. any advice?

Important linguistic hypothesis
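
One way to apply this hypothesis is as a post-processing step: within a single document, relabel every occurrence of a word with its majority sense. A minimal sketch with hypothetical names and made-up sense labels:

```python
from collections import Counter


def one_sense_per_discourse(predictions):
    """Relabel each word in one document with its most frequent predicted sense.

    predictions: list of (word, sense) pairs in document order.
    """
    votes = {}
    for word, sense in predictions:
        votes.setdefault(word, Counter())[sense] += 1
    majority = {word: counts.most_common(1)[0][0]
                for word, counts in votes.items()}
    return [(word, majority[word]) for word, _ in predictions]


# Made-up per-occurrence predictions for the forum post above.
predicted = [
    ("plant", "organism"),
    ("fish", "animal"),
    ("plant", "factory"),   # an isolated mistake
    ("plant", "organism"),
]
smoothed = one_sense_per_discourse(predicted)
print(smoothed)
```

The isolated "factory" label is outvoted by the two "organism" labels in the same discourse.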


5. Unsupervised word-sense disambiguation


Idea:

● for each word occurrence, compute a context vector
● cluster these context vectors
● compute the sense vector in each cluster
● map sense vectors to senses

The number of clusters should be predefined. Or not.

Word sense induction
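
The first three steps can be sketched with binary context vectors and a small k-means written in plain Python. The contexts are invented, the number of clusters is fixed at 2, and the final step of mapping clusters to dictionary senses is left out. Treat this as an illustration of the idea, not a reference implementation.

```python
def context_vector(tokens, vocabulary):
    """Binary bag-of-words vector for one occurrence of the target word."""
    present = set(tokens)
    return [1.0 if word in present else 0.0 for word in vocabulary]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans(vectors, k, iterations=10):
    """Tiny k-means with a deterministic farthest-point initialisation."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(max(
            vectors, key=lambda v: min(sq_dist(v, c) for c in centroids)))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each context vector to its nearest centroid.
        labels = [min(range(k), key=lambda i: sq_dist(v, centroids[i]))
                  for v in vectors]
        # Recompute each centroid: this is the "sense vector" of the cluster.
        for i in range(k):
            members = [v for v, label in zip(vectors, labels) if label == i]
            if members:
                centroids[i] = mean(members)
    return labels, centroids


# Invented contexts of "plant" (target word removed, lemmas only).
contexts = [
    ["engineer", "work", "industrial"],
    ["soil", "water", "garden"],
    ["industrial", "worker", "factory"],
    ["garden", "leaf", "soil"],
]
vocabulary = sorted({word for tokens in contexts for word in tokens})
vectors = [context_vector(tokens, vocabulary) for tokens in contexts]
labels, sense_vectors = kmeans(vectors, k=2)
print(labels)
```

The factory-like contexts land in one cluster and the garden-like contexts in the other. Choosing k automatically ("Or not.") would need a non-parametric clustering method instead.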


6. To conclude


Quality

Example from Iacobacci et al. (2015)


Babelfy

Example from babelfy.org


Thank.v.01 you!

Any questions.n.01?


● Neelakantan et al. (2014), Efficient Non-parametric Estimation of Multiple Embeddings per Word in Vector Space

● Iacobacci et al. (2015), SENSEMBED: Learning Sense Embeddings for Word and Relational Similarity

● Camacho-Collados et al. (2016), Nasari: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities

● Navigli and Lapata (2010), An Experimental Study of Graph Connectivity for Unsupervised Word Sense Disambiguation

● Athiwaratkun and Wilson (2017), Multimodal Word Distributions

● Abigail See (2017), Four deep learning trends from ACL 2017

References