Hacking Human Language Hendrik Heuer London

Hacking Human Language (PyData London)

Aug 03, 2015


Education

jan doering
Transcript
Page 1: Hacking Human Language (PyData London)

Hacking Human Language
Hendrik Heuer

London

Page 2: Hacking Human Language (PyData London)

– Hacker Ethics

“Access to computers — and anything which might teach you something about the way the world works — should be unlimited and total. Always yield to the Hands-On Imperative!”

Levy, Steven (2001). Hackers: Heroes of the Computer Revolution (updated ed.). New York: Penguin Books. ISBN 0141000511. OCLC 47216793.

Page 3: Hacking Human Language (PyData London)

Agenda

• Computational Social Science

• Natural Language Processing

• Word Vector Representations

• Comparing different Wikipedia revisions

• Random Indexing

• word2vec patent

Page 4: Hacking Human Language (PyData London)

About me

Hej, I’m Hendrik.

Page 5: Hacking Human Language (PyData London)

Computational Social Science

Page 6: Hacking Human Language (PyData London)

Computational Social Science Digital Humanities

• combines computer science & social sciences

• makes new research possible, e.g. the analysis of massive social networks and content of millions of books

immersion.media.mit.edu

Page 7: Hacking Human Language (PyData London)

D. Crandall and N. Snavely, ‘Modeling People and Places with Internet Photo Collections’, Commun. ACM, vol. 55, no. 6, pp. 52–60, Jun. 2012. DOI: 10.1145/2184319.2184336

Page 8: Hacking Human Language (PyData London)

Massive-scale automated analysis of news-content

• 2.5 million articles from 498 different English-language news outlets (Reuters & New York Times Corpus)

• automatically annotated into 15 topic areas

• the topics were compared with regard to readability, linguistic subjectivity and gender imbalances

I. Flaounas, O. Ali, T. Lansdall-Welfare, T. De Bie, N. Mosdell, J. Lewis, and N. Cristianini, ‘Research Methods in the Age of Digital Journalism: Massive-scale automated analysis of news-content: topics, style and gender’, Digital Journalism, vol. 1, no. 1, 2013. DOI: 10.1080/21670811.2012.714928

Page 9: Hacking Human Language (PyData London)

Linguistic Subjectivity
Adjectives (Part-of-Speech Tagging) & SentiWordNet

Page 10: Hacking Human Language (PyData London)

Linguistic Subjectivity
Adjectives (Part-of-Speech Tagging) & SentiWordNet

“Low level of political interest and engagement could be connected to the lack of subjectivity (adjectival excess)”

Page 11: Hacking Human Language (PyData London)

Male-to-Female Ratio
Named Entity Recognition

Page 12: Hacking Human Language (PyData London)

Male-to-Female Ratio
Named Entity Recognition

“Gender bias in sports coverage (...) females only account for between only 7 and 25 per cent of coverage”

Page 13: Hacking Human Language (PyData London)

Machine Learning: scikit-learn

Text Processing and Topic Modeling: gensim, Natural Language Toolkit, spaCy, word2vec

Visualization: d3.js, Google Chart API, Highcharts

Page 14: Hacking Human Language (PyData London)

Part-of-Speech Tagging
Identifying nouns, verbs, adjectives…

>>> import nltk
>>> text = "In the middle ages Sweden had the same king as Denmark and Norway."
>>> words = nltk.word_tokenize(text)
>>> nltk.pos_tag(words)
[('In', 'IN'), ('the', 'DT'), ('middle', 'NN'), ('ages', 'NNS'), ('Sweden', 'NNP'), ('had', 'VBD'), ('the', 'DT'), ('same', 'JJ'), ('king', 'NN'), ('as', 'IN'), ('Denmark', 'NNP'), ('and', 'CC'), ('Norway', 'NNP'), ('.', '.')]

NN*  Noun
VB*  Verb
JJ*  Adjective
RB*  Adverb
DT   Determiner
IN   Preposition
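
The tagger output above is enough to compute a crude adjective share, which is one ingredient of the linguistic-subjectivity measure mentioned earlier. The following is a minimal sketch, not the method of the cited study (which also drew on SentiWordNet); the function name `adjective_share` is made up for this illustration, and the tagged tokens are copied from the slide so that no NLTK models are needed to run it.

```python
# Tagged tokens copied from the nltk.pos_tag output shown on the slide.
tagged = [('In', 'IN'), ('the', 'DT'), ('middle', 'NN'), ('ages', 'NNS'),
          ('Sweden', 'NNP'), ('had', 'VBD'), ('the', 'DT'), ('same', 'JJ'),
          ('king', 'NN'), ('as', 'IN'), ('Denmark', 'NNP'), ('and', 'CC'),
          ('Norway', 'NNP'), ('.', '.')]


def adjective_share(tagged_tokens):
    """Fraction of word tokens whose tag starts with 'JJ' (hypothetical helper)."""
    # Skip punctuation: keep only tokens whose tag starts with a letter.
    words = [(tok, tag) for tok, tag in tagged_tokens if tag[:1].isalpha()]
    if not words:
        return 0.0
    adjectives = [tok for tok, tag in words if tag.startswith('JJ')]
    return len(adjectives) / len(words)


share = adjective_share(tagged)
print(f"{share:.3f}")
```

In this sentence only "same" is tagged as an adjective, so the share is one out of thirteen word tokens.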

Page 15: Hacking Human Language (PyData London)

Named Entity Recognition
Identifying people, organizations, locations…

>>> import nltk
>>> text = "New York City is the largest city in the United States."
>>> words = nltk.word_tokenize(text)
>>> nltk.ne_chunk(nltk.pos_tag(words))
Tree('S', [Tree('GPE', [('New', 'NNP'), ('York', 'NNP'), ('City', 'NNP')]), ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), ('city', 'NN'), ('in', 'IN'), ('the', 'DT'), Tree('GPE', [('United', 'NNP'), ('States', 'NNPS')]), ('.', '.')])

ORGANIZATION                 Georgia-Pacific Corp., WHO
PERSON                       Eddy Bonte, President Obama
LOCATION                     Murray River, Mount Everest
DATE                         June, 2008-06-29
TIME                         two fifty a m, 1:30 p.m.
MONEY                        GBP 10.40
PERCENT                      twenty pct, 18.75 %
FACILITY                     Washington Monument, Stonehenge
GPE (geo-political entity)   South East Asia, Midlothian

Page 16: Hacking Human Language (PyData London)

Sentiment Analysis
Tell if a sentence is positive or negative

Page 17: Hacking Human Language (PyData London)

Stanford Core NLP Tools

Page 18: Hacking Human Language (PyData London)

Word Vector Representations

Page 19: Hacking Human Language (PyData London)

–J. R. Firth 1957

“You shall know a word by the company it keeps”

Firth, John R. 1957. A synopsis of linguistic theory 1930–1955. In Studies in linguistic analysis, 1–32. Oxford: Blackwell.

Page 20: Hacking Human Language (PyData London)

–J. R. Firth 1957

“You shall know a word by the company it keeps”

Quoted after Socher

Page 21: Hacking Human Language (PyData London)
Page 22: Hacking Human Language (PyData London)

Vectors are directions in space

Page 23: Hacking Human Language (PyData London)

Vectors are directions in space

Quoted after Socher

word2vec
Representing a word with a vector
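
If word vectors are read as directions in space, a common way to compare two of them is the cosine of the angle between them. The sketch below uses made-up 3-dimensional vectors purely for illustration; real word vectors are learned from text and have far more dimensions. Returning 0.0 for a zero vector is a convention chosen for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Made-up vectors, for illustration only.
east = [1.0, 0.0, 0.0]
north_east = [1.0, 1.0, 0.0]
west = [-1.0, 0.0, 0.0]

sim_close = cosine_similarity(east, north_east)   # similar direction
sim_opposite = cosine_similarity(east, west)      # opposite direction
print(sim_close, sim_opposite)
```

Vectors pointing the same way score close to 1, unrelated (orthogonal) ones score 0, and opposite ones score -1.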

Page 24: Hacking Human Language (PyData London)

word2vec
Vectors can encode relationships

Word pairs in the figure: MAN and WOMAN, UNCLE and AUNT, KING and QUEEN

T. Mikolov, K. Chen, G. Corrado, and J. Dean, ‘Efficient Estimation of Word Representations in Vector Space’, CoRR, vol. abs/1301.3781, 2013 [Online]. Available: http://arxiv.org/abs/1301.3781

Page 25: Hacking Human Language (PyData London)

word2vec
Vectors can encode relationships

T. Mikolov, K. Chen, G. Corrado, and J. Dean, ‘Efficient Estimation of Word Representations in Vector Space’, CoRR, vol. abs/1301.3781, 2013 [Online]. Available: http://arxiv.org/abs/1301.3781

KINGS

KING

QUEEN

QUEENS

Page 26: Hacking Human Language (PyData London)

word2vec
Vectors can encode relationships

T. Mikolov, K. Chen, G. Corrado, and J. Dean, ‘Efficient Estimation of Word Representations in Vector Space’, CoRR, vol. abs/1301.3781, 2013 [Online]. Available: http://arxiv.org/abs/1301.3781

Page 27: Hacking Human Language (PyData London)

word2vec
Analogy puzzles

England is to London as Germany is to ?

598.7ms [["Berlin",0.563393235206604],["Dusseldorf",0.5625754594802856],["Munich",0.5460122227668762],["Budapest",0.5285829901695251],["Düsseldorf",0.5266501903533936]]

England is to Cameron as Germany is to ?

556.8ms [["Merkel",0.5016422867774963],["Schroeder",0.49941977858543396],["Klaus",0.4981233477592468],["Schröder",0.4947296977043152],["Peer_Nils",0.492642343044281]]

Link: http://radimrehurek.com/2014/02/word2vec-tutorial/
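
An analogy puzzle of the form "a is to b as c is to ?" can be answered by vector arithmetic: take vec(b) - vec(a) + vec(c) and rank the remaining words by cosine similarity to that point. The sketch below is a simplified version of this vector-offset idea on invented 3-dimensional toy vectors; the function names are made up for this illustration, and library implementations differ in details such as normalisation, so the scores will not match those on the slide.

```python
import math

# Invented toy vectors: [England-ness, Germany-ness, capital-ness].
vectors = {
    "England": [1.0, 0.0, 0.0],
    "London":  [1.0, 0.0, 1.0],
    "Germany": [0.0, 1.0, 0.0],
    "Berlin":  [0.0, 1.0, 1.0],
    "Munich":  [0.0, 1.0, 0.5],
    "Paris":   [0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(a, b, c, vectors, topn=3):
    """Rank words by similarity to vec(b) - vec(a) + vec(c), skipping the inputs."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [(word, cosine(target, vec))
                  for word, vec in vectors.items() if word not in (a, b, c)]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:topn]


# "England is to London as Germany is to ?"
result = analogy("England", "London", "Germany", vectors)
for word, score in result:
    print(word, round(score, 4))
```

With these toy vectors "Berlin" ranks first, followed by "Munich" and "Paris". The query words themselves are excluded from the candidates, since they are often close to the target point.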

Page 28: Hacking Human Language (PyData London)

word2vec
Analogy puzzles

fast is to fastest as slow is to ?

806.2ms [["slowest",0.7025301456451416],["slower",0.6236234307289124],["slowed",0.5842559337615967],["slowing",0.5462259650230408],["quickest",0.5290436744689941]]

wake is to woken as be is to ?

929.9ms [["been",0.41698968410491943],["tobe",0.40402814745903015],["are",0.3866569399833679],["being",0.3746173679828644],["notbe",0.36837878823280334]]

Link: http://radimrehurek.com/2014/02/word2vec-tutorial/

Page 29: Hacking Human Language (PyData London)

word2vec
Analogy puzzles

Scotland is to haggis as Germany is to ?

793.5ms [["Currywurst",0.5284685492515564],["schnitzel",0.5208959579467773],["wursts",0.5166285037994385],["sauerkraut",0.512742817401886],["stollen",0.5095855593681335]]

Link: http://radimrehurek.com/2014/02/word2vec-tutorial/

Page 30: Hacking Human Language (PyData London)

word2vec
Analogy puzzles

communism is to Karl_Marx as capitalism is to ?

544.7ms [["Capitalism",0.5884973406791687],["capitalist",0.5700926184654236],["Friedrich_Hayek",0.5352163314819336],["Milton_Friedman",0.5348755121231079],["John_Maynard_Keynes",0.5335651636123657]]

Link: http://radimrehurek.com/2014/02/word2vec-tutorial/

Page 31: Hacking Human Language (PyData London)

Sweden
Most similar words

Page 32: Hacking Human Language (PyData London)

Sweden
Most similar words

Page 33: Hacking Human Language (PyData London)

Harvard
Most similar words

Page 34: Hacking Human Language (PyData London)

Word vector representations

in Python

Page 35: Hacking Human Language (PyData London)

Link: https://radimrehurek.com/gensim/models/word2vec.html

Page 36: Hacking Human Language (PyData London)

Link: https://radimrehurek.com/gensim/models/word2vec.html

Page 37: Hacking Human Language (PyData London)

Link: https://honnibal.github.io/spaCy/

Page 38: Hacking Human Language (PyData London)

Link: https://honnibal.github.io/spaCy/

Page 39: Hacking Human Language (PyData London)

spaCy: dependency-based word representations by Levy and Goldberg

Gensim: word2vec by Mikolov et al.

Page 40: Hacking Human Language (PyData London)

spaCy: dependency-based word representations by Levy and Goldberg

Gensim: word2vec by Mikolov et al.

2-word context window

Page 41: Hacking Human Language (PyData London)

spaCy: dependency-based word representations by Levy and Goldberg

Gensim: word2vec by Mikolov et al.

5-word context window

2-word context window

Page 42: Hacking Human Language (PyData London)

spaCy: dependency-based word representations by Levy and Goldberg

Gensim: word2vec by Mikolov et al.

Page 43: Hacking Human Language (PyData London)

spaCy: dependency-based word representations by Levy and Goldberg

Gensim: word2vec by Mikolov et al.

Page 44: Hacking Human Language (PyData London)

spaCy: dependency-based word representations by Levy and Goldberg

Gensim: word2vec by Mikolov et al.

Page 45: Hacking Human Language (PyData London)

word2vec
Training word vectors with a generator
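
The code shown on this slide is not part of the transcript. As a hedged sketch of the general idea: gensim's Word2Vec accepts an iterable of token lists, and because training iterates over the corpus more than once, a restartable iterable (a class with `__iter__`) is safer than a one-shot generator. The class name `SentenceStream`, the lower-casing, and the whitespace tokenisation below are assumptions made for this example.

```python
import os
import tempfile


class SentenceStream:
    """Restartable iterable yielding one whitespace-tokenised sentence per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is re-opened on every iteration, so the corpus can be
        # streamed several times without holding it in memory.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens


# Demo: write a tiny corpus to a temporary file and stream it twice.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("In the middle ages Sweden had the same king\n\nas Denmark and Norway\n")
    corpus_path = tmp.name

sentences = SentenceStream(corpus_path)
first_pass = list(sentences)
second_pass = list(sentences)  # a second pass works because __iter__ restarts
os.remove(corpus_path)
print(first_pass)
```

An object like `sentences` could then be handed to a word2vec trainer in place of an in-memory list.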

Page 46: Hacking Human Language (PyData London)

Link: https://code.google.com/p/word2vec/#Pre-trained_entity_vectors_with_Freebase_naming

Page 47: Hacking Human Language (PyData London)

Applications

Page 48: Hacking Human Language (PyData London)

Machine Translation

T. Mikolov, Q. V. Le, and I. Sutskever, ‘Exploiting Similarities among Languages for Machine Translation’, CoRR, vol. abs/1309.4168, 2013 [Online]. Available: http://arxiv.org/abs/1309.4168

Page 49: Hacking Human Language (PyData London)

Comparing Wikipedia revisions

Page 50: Hacking Human Language (PyData London)

1. Find word representations: word2vec

2. Dimensionality reduction: t-SNE

3. Visualisation: JSON
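
The last step of this pipeline can be sketched without any third-party library. Steps 1 and 2 would come from tools such as gensim and scikit-learn's t-SNE; here their result is replaced by a hypothetical dictionary of invented 2-D coordinates. The record layout (`word`, `x`, `y`) is an assumption for this example, not necessarily the format used in the talk.

```python
import json

# Hypothetical output of step 2: one 2-D point per word (values invented).
coordinates = {
    "stark": (12.5, -3.25),
    "lannister": (10.0, -1.5),
    "dragon": (-7.75, 4.0),
}


def to_json_records(points):
    """Turn {word: (x, y)} into a JSON array of {"word", "x", "y"} objects."""
    records = [{"word": word, "x": x, "y": y}
               for word, (x, y) in sorted(points.items())]
    return json.dumps(records)


payload = to_json_records(coordinates)
print(payload)
```

A flat array of records like this is easy to load in a browser-side charting library and bind to scatter-plot points.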

Page 51: Hacking Human Language (PyData London)

1. Find word representations: word2vec

2. Dimensionality reduction: t-SNE

3. Visualisation: JSON

Link gensim: https://radimrehurek.com/gensim/
Link word2vec: https://code.google.com/p/word2vec/

Page 52: Hacking Human Language (PyData London)
Page 53: Hacking Human Language (PyData London)
Page 54: Hacking Human Language (PyData London)
Page 55: Hacking Human Language (PyData London)

1. Find word representations: word2vec

2. Dimensionality reduction: t-SNE

3. Visualisation: JSON

linguistics

Page 56: Hacking Human Language (PyData London)

1. Find word representations: word2vec

2. Dimensionality reduction: t-SNE

3. Visualisation: JSON

Link: http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

Page 57: Hacking Human Language (PyData London)

1. Find word representations: word2vec

2. Dimensionality reduction: t-SNE

3. Visualisation: JSON

Link: https://github.com/mbostock/d3/wiki/Gallery

Page 58: Hacking Human Language (PyData London)

Comparing Wikipedia revisions

Page 59: Hacking Human Language (PyData London)

Comparing Wikipedia revisions
Game of Thrones

Page 60: Hacking Human Language (PyData London)

Bird’s eye view
Game of Thrones

Page 61: Hacking Human Language (PyData London)

Bird’s eye view, intersection set
Game of Thrones
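
The transcript does not say how the intersection set was computed. One plausible reading is the set of words that occur in both revisions of the article, which can be sketched with plain Python sets. The token lists below are invented stand-ins for two revisions, not data from the talk.

```python
# Invented token lists standing in for two revisions of one article.
revision_2013 = "tyrion jon daenerys robb winterfell".split()
revision_2015 = "tyrion jon daenerys stannis meereen".split()

vocab_2013 = set(revision_2013)
vocab_2015 = set(revision_2015)

shared = sorted(vocab_2013 & vocab_2015)      # words present in both revisions
only_2013 = sorted(vocab_2013 - vocab_2015)   # words dropped since 2013
only_2015 = sorted(vocab_2015 - vocab_2013)   # words new in 2015

print(shared, only_2013, only_2015)
```

Restricting a plot to the shared words makes it easier to compare where the same word sits in each revision.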

Page 62: Hacking Human Language (PyData London)

Characters in 2013 and 2015
Game of Thrones

Page 63: Hacking Human Language (PyData London)

Bird’s eye view
United States

Page 64: Hacking Human Language (PyData London)

Bird’s eye view, intersection set
United States

Page 65: Hacking Human Language (PyData London)

Bird’s eye view, 2015
United States

Page 66: Hacking Human Language (PyData London)

Bird’s eye view, 2013
United States

Page 67: Hacking Human Language (PyData London)

word2vec patent

Page 68: Hacking Human Language (PyData London)
Page 69: Hacking Human Language (PyData London)

• “Pretty much every time Google has engaged in patent infringement litigation, it has been against someone who has brought an infringement suit against them first. (…) it keeps inventions they are using out of the hands of patent trolls”

• “Idiotic. I'm surprised they didn't just patent matrix algebra”

• “fuck software patents”

https://www.reddit.com/r/MachineLearning/comments/37b1bl/word2vec_has_been_patented_what_does_it_change/

Google’s word2vec patent
Reactions from the community

Page 70: Hacking Human Language (PyData London)

Google’s word2vec patent
Reactions from the community

Omer Levy:

• The novelty claim in this patent is somewhat bogus

• word2vec is doing more or less what the NLP research community has been doing for the past 25 years

• much of the improvement in performance stems from preprocessing "hacks" and hyperparameter settings

• word2vec is a brilliantly efficient implementation of decade-old ideas

https://www.reddit.com/r/MachineLearning/comments/37b1bl/word2vec_has_been_patented_what_does_it_change/

Page 71: Hacking Human Language (PyData London)

Google’s word2vec patent
What does it change for NLP practitioners?

• “Likely nothing. It's probably one of the thousands of overly broad "defensive" patents held by companies”

• “Didn't it have an Apache open source license before-hand?”

Page 72: Hacking Human Language (PyData London)

Google’s word2vec patent
What does it change for NLP practitioners?

• “Likely nothing. It's probably one of the thousands of overly broad "defensive" patents held by companies.”

• “Didn't it have an Apache open source license before-hand?”

Page 73: Hacking Human Language (PyData London)

Map Reduce & Hadoop

Page 74: Hacking Human Language (PyData London)

Random Indexing

Page 75: Hacking Human Language (PyData London)

Random Indexing
Incremental word space model

• all information in vectors

• each word has a hash key

• n-dimensional vector

• most dimensions are 0

• a small number k of dimensions hold randomly distributed -1 or +1 values

• the dimension of the vectors is much smaller than the number of contexts

Figure label: hash key

Page 76: Hacking Human Language (PyData London)

Random Indexing
Incremental word space model

• Every time you see a word w_i, add the hash keys of the words in the context window v_{i-3}, …, v_{i+3} to the word’s context vector v_i

• After a number of occurrences, the context vector holds information about a word’s distribution

• this acts as dimensionality reduction and is computationally less costly than methods like PCA

Figure labels: hash key, context vector
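
The procedure described on the Random Indexing slides can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used by any particular system: the dimensionality (`DIM = 20`), the number of non-zero entries (`NONZERO = 4`), the fixed seed, and all function names are chosen for this example only. The window of three words on each side follows the slide.

```python
import random

DIM = 20      # vector dimensionality (assumed; real systems use far more)
NONZERO = 4   # k: number of non-zero entries per index vector (assumed)
WINDOW = 3    # context window, as on the slide: positions i-3 .. i+3


def make_index_vector(rng):
    """Sparse vector: NONZERO random positions set to -1 or +1, the rest 0."""
    vector = [0] * DIM
    for position in rng.sample(range(DIM), NONZERO):
        vector[position] = rng.choice((-1, 1))
    return vector


def random_indexing(tokens, seed=0):
    """Return (index vectors, context vectors) for a list of tokens."""
    rng = random.Random(seed)
    index = {}    # word -> fixed random index vector (the "hash key")
    context = {}  # word -> accumulated context vector
    for word in tokens:
        if word not in index:
            index[word] = make_index_vector(rng)
            context[word] = [0] * DIM
    for i, word in enumerate(tokens):
        start = max(0, i - WINDOW)
        stop = min(len(tokens), i + WINDOW + 1)
        for j in range(start, stop):
            if j == i:
                continue  # a word is not its own context
            neighbour = index[tokens[j]]
            context[word] = [c + n for c, n in zip(context[word], neighbour)]
    return index, context


tokens = "sweden had the same king as denmark and norway".split()
index_vectors, context_vectors = random_indexing(tokens)
print(context_vectors["sweden"])
```

Each update is a handful of additions, which is why the model can be grown incrementally as new text arrives instead of being recomputed from a full co-occurrence matrix.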

Page 77: Hacking Human Language (PyData London)

Gavagai Living Lexicon

Page 78: Hacking Human Language (PyData London)

Gavagai Living Lexicon

Page 79: Hacking Human Language (PyData London)

Gavagai Living Lexicon

Page 80: Hacking Human Language (PyData London)

Gavagai Living Lexicon

Page 81: Hacking Human Language (PyData London)

https://en.wikipedia.org/wiki/Athens

Gavagai Living Lexicon

Page 82: Hacking Human Language (PyData London)

https://en.wikipedia.org/wiki/Athens

Gavagai Living Lexicon

Page 83: Hacking Human Language (PyData London)

More than words…

Page 84: Hacking Human Language (PyData London)

doc2vec

Page 85: Hacking Human Language (PyData London)

Image Captioning

Fei-Fei Li & Andrej Karpathy, Stanford University, CS231n, http://cs231n.stanford.edu/syllabus.html

Page 86: Hacking Human Language (PyData London)

Image Captioning

Fei-Fei Li & Andrej Karpathy, Stanford University, CS231n, http://cs231n.stanford.edu/syllabus.html

Page 87: Hacking Human Language (PyData London)

T. Mikolov, K. Chen, G. Corrado, and J. Dean, ‘Efficient Estimation of Word Representations in Vector Space’, CoRR, vol. abs/1301.3781, 2013 [Online].

Available: http://arxiv.org/abs/1301.3781

word2vecVectors can encode relationships

KINGS

KING

QUEEN

QUEENS

Page 88: Hacking Human Language (PyData London)

Image Captioning

Fei-Fei Li & Andrej Karpathy, Stanford University, CS231n, http://cs231n.stanford.edu/syllabus.html

Page 89: Hacking Human Language (PyData London)

H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu, ‘Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering’,

CoRR, vol. abs/1505.05612, 2015 [Online]. Available: http://arxiv.org/abs/1505.05612

Image Question Answering

Page 90: Hacking Human Language (PyData London)

Hacking Human Language
Hendrik Heuer

London

[email protected]
http://hen-drik.de
@hen_drik

Thanks to Andrii, Jussi & Roelof

Slides: https://tinyurl.com/pydata-language

Page 91: Hacking Human Language (PyData London)

word2vec
How it is trained

predict the current word
input: w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}
output: w_i

Page 92: Hacking Human Language (PyData London)

word2vec
How it is trained

predict the current word
input: w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}
output: w_i

predict the surrounding words
input: w_i
output: w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}
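
The two training setups on this slide correspond to what the cited paper by Mikolov et al. calls CBOW (predict the current word from its context) and skip-gram (predict the surrounding words from the current word). The sketch below only builds the training examples for a window of two words on each side; it does not train a model, and the function names are made up for this illustration.

```python
def context_positions(i, length, window):
    """Indices within `window` of position i, excluding i itself."""
    start = max(0, i - window)
    stop = min(length, i + window + 1)
    return [j for j in range(start, stop) if j != i]


def cbow_examples(tokens, window=2):
    """(context words, current word): input is the context, output is w_i."""
    return [([tokens[j] for j in context_positions(i, len(tokens), window)],
             tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens, window=2):
    """(current word, surrounding word): input is w_i, output is one neighbour."""
    return [(tokens[i], tokens[j])
            for i in range(len(tokens))
            for j in context_positions(i, len(tokens), window)]


tokens = "sweden had the same king".split()
cbow = cbow_examples(tokens)
skipgram = skipgram_examples(tokens)
print(cbow[2])
print(skipgram[:4])
```

For the middle word "the", CBOW produces a single example with four context words as input, while skip-gram produces four separate (input, output) pairs.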