Efficient Parallel Learning of Word2Vec
Jeroen B. P. Vuurens1, Carsten Eickhoff2, and Arjen P. de Vries3
1The Hague University of Applied Sciences
2ETH Zurich
3Radboud University Nijmegen
June 24, 2016
J. Vuurens et al. Efficient Parallel Learning of Word2Vec June 24, 2016 1 / 14
Word2Vec
Simple method for low-dimensional feature representation of words
Beneficial properties:
- Unsupervised
- Semantics-preserving (up to a point…)
Recently very popular
Figure courtesy of T. Mikolov et al.
More is more…
Figure courtesy of http://deepdist.com/
Parallel Training
Shared model θ
Parallel SGD threads
- Draw a random training example x_i
- Acquire a lock on θ
- Read θ
- Update θ ← θ − α∇L(f_θ(x_i), y_i)
- Release lock
Lots of waiting…
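The locked update loop can be sketched in Python. This is an illustrative toy (a linear model with squared loss; the names `worker`, `alpha`, and the data setup are assumptions, not from the talk), but it follows the five steps above exactly:

```python
import threading
import numpy as np

# Toy setup: linear model f_theta(x) = theta . x with squared loss,
# so grad L = 2 * (theta . x - y) * x.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_theta = np.arange(5, dtype=float)
y = X @ true_theta

theta = np.zeros(5)       # shared model
lock = threading.Lock()
alpha = 0.01

def worker(seed, steps):
    local_rng = np.random.default_rng(seed)
    for _ in range(steps):
        i = local_rng.integers(len(X))              # draw a random example x_i
        with lock:                                  # acquire a lock on theta
            grad = 2 * (theta @ X[i] - y[i]) * X[i] # read theta, compute gradient
            theta[:] -= alpha * grad                # update theta; lock released on exit

threads = [threading.Thread(target=worker, args=(s, 2000)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# theta now approximates true_theta, but the threads spent much of
# their time waiting for the lock, serializing the updates.
```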
Hogwild!
Simply skip the locking:
- Draw a random training example x_i
- Read the current state of θ
- Update θ ← θ − α∇L(f_θ(x_i), y_i)
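The same toy problem without the lock, Hogwild!-style (again an illustrative sketch, not the paper's code): threads read a possibly stale θ and write racy updates, and on this toy the occasional lost update does little harm:

```python
import threading
import numpy as np

# Same toy setup as the locked version: linear model, squared loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_theta = np.arange(5, dtype=float)
y = X @ true_theta

theta = np.zeros(5)   # shared model, updated without any lock
alpha = 0.01

def worker(seed, steps):
    local_rng = np.random.default_rng(seed)
    for _ in range(steps):
        i = local_rng.integers(len(X))               # draw a random example x_i
        grad = 2 * (theta @ X[i] - y[i]) * X[i]      # read current state of theta
        theta[:] -= alpha * grad                     # racy update, no waiting

threads = [threading.Thread(target=worker, args=(s, 2000)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Despite races, theta still converges on this toy problem.
```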
Parallel Word2Vec
Intel Xeon CPU E5-2698 v3, 32 cores
Original C implementation + Gensim
Hierarchical Softmax
Binary Huffman tree
V − 1 internal nodes
Each word w is represented by a sequence of binary decisions along its path from the root
The tree’s top nodes are part of most paths
Figure courtesy of X. Rong
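A binary Huffman tree over word frequencies can be built in a few lines (a generic sketch with made-up frequencies, not the paper's implementation). Merging V leaves takes V − 1 merges, hence V − 1 internal nodes, and frequent words end up with the shortest paths:

```python
import heapq

def huffman_codes(freqs):
    """Build a binary Huffman tree and return word -> binary code."""
    # Heap entries: (frequency, tiebreak, node); a node is a word or (left, right).
    heap = [(f, i, w) for i, (w, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:                 # V - 1 merges -> V - 1 internal nodes
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (a, b)))
        count += 1
    codes = {}
    def walk(node, code):
        if isinstance(node, tuple):
            walk(node[0], code + "0")
            walk(node[1], code + "1")
        else:
            codes[node] = code           # word -> its sequence of binary decisions
    walk(heap[0][2], "")
    return codes

codes = huffman_codes({"the": 100, "of": 60, "cat": 5, "zygote": 1})
# Frequent words get short codes, so their paths run through the tree's top nodes.
```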
Zipf’s Law
Figure courtesy of http://wugology.com/
Cached Huffman Trees
Cache the top c nodes in the tree
Each thread works on its own stale copy of these top nodes
Update cache every u terms
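The caching scheme can be sketched as follows (a minimal sketch; the class name `ThreadCache` and the sizes are illustrative, and the actual Cython implementation differs): each thread trains against a stale private copy of the c top-node vectors and folds its accumulated deltas back into the shared model every u processed terms.

```python
import numpy as np

c, dim, u = 31, 100, 10          # cache size and sync interval (illustrative values)
shared_top = np.zeros((c, dim))  # shared vectors for the c top tree nodes

class ThreadCache:
    """Per-thread stale copy of the hot top nodes, synced every u terms."""
    def __init__(self):
        self.local = shared_top.copy()           # stale private copy
        self.delta = np.zeros_like(shared_top)   # updates not yet pushed
        self.seen = 0

    def update(self, node, grad):
        self.local[node] -= grad   # train against the stale copy: no contention
        self.delta[node] -= grad   # remember the change for the next sync
        self.seen += 1
        if self.seen % u == 0:     # every u terms: push deltas, refresh copy
            shared_top[:] += self.delta          # Hogwild-style, no lock
            self.delta[:] = 0
            self.local[:] = shared_top
```

Updates to nodes below the cached top of the tree would still go straight to the shared model; only the heavily contended top nodes are buffered this way.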
Efficiency
Python/Cython implementation of cached Huffman trees
At c = 0 (no caching), the same scaling problem persists
Significantly better performance at c = 31
Cache Size
Consistent improvements for all c ≤ 31
Best results for 1 ≤ u ≤ 10
Overly large choices of u degrade model quality
Effectiveness
Stable model quality
Slight quality edge for Gensim implementation
Conclusion
Hierarchical Softmax scales badly beyond 4-8 nodes
- Frequent memory accesses to top nodes
- Zipf's Law
Caching a few top nodes
- 4x speed-up
- Constant model quality
Try it yourself: http://cythnn.github.io
Thank You!
[email protected]