Efficient Parallel Learning of Word2Vec
Jeroen B. P. Vuurens1, Carsten Eickhoff2, and Arjen P. de Vries3
1The Hague University of Applied Sciences
2ETH Zurich
3Radboud University Nijmegen
June 24, 2016
J. Vuurens et al. Efficient Parallel Learning of Word2Vec June 24, 2016 1 / 14
Word2Vec
Simple method for low-dimensional feature representation of words
Beneficial properties:
- Unsupervised
- Semantics-preserving (up to a point…)
Recently very popular
Figure courtesy of T. Mikolov et al.
More is more…
Figure courtesy of http://deepdist.com/
Parallel Training
Shared model θ
Parallel SGD threads
- Draw a random training example x_i
- Acquire a lock on θ
- Read θ
- Update θ ← θ − α∇L(f_θ(x_i), y_i)
- Release lock
Lots of waiting…
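The locked update loop can be sketched in Python. This is an illustrative toy (a linear model with squared loss; the names `worker`, `alpha`, and the data setup are assumptions, not from the talk), but it follows the five steps above exactly:

```python
import threading
import numpy as np

# Toy setup: linear model f_theta(x) = theta . x with squared loss,
# so grad L = 2 * (theta . x - y) * x.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_theta = np.arange(5, dtype=float)
y = X @ true_theta

theta = np.zeros(5)       # shared model
lock = threading.Lock()
alpha = 0.01

def worker(seed, steps):
    local_rng = np.random.default_rng(seed)
    for _ in range(steps):
        i = local_rng.integers(len(X))              # draw a random example x_i
        with lock:                                  # acquire a lock on theta
            grad = 2 * (theta @ X[i] - y[i]) * X[i] # read theta, compute gradient
            theta[:] -= alpha * grad                # update theta; lock released on exit

threads = [threading.Thread(target=worker, args=(s, 2000)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# theta now approximates true_theta, but the threads spent much of
# their time waiting for the lock, serializing the updates.
```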
Hogwild!
Simply skip the locking:
- Draw a random training example x_i
- Read the current state of θ
- Update θ ← θ − α∇L(f_θ(x_i), y_i)
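The same toy problem without the lock, Hogwild!-style (again an illustrative sketch, not the paper's code): threads read a possibly stale θ and write racy updates, and on this toy the occasional lost update does little harm:

```python
import threading
import numpy as np

# Same toy setup as the locked version: linear model, squared loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_theta = np.arange(5, dtype=float)
y = X @ true_theta

theta = np.zeros(5)   # shared model, updated without any lock
alpha = 0.01

def worker(seed, steps):
    local_rng = np.random.default_rng(seed)
    for _ in range(steps):
        i = local_rng.integers(len(X))               # draw a random example x_i
        grad = 2 * (theta @ X[i] - y[i]) * X[i]      # read current state of theta
        theta[:] -= alpha * grad                     # racy update, no waiting

threads = [threading.Thread(target=worker, args=(s, 2000)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Despite races, theta still converges on this toy problem.
```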
Parallel Word2Vec
Intel Xeon CPU E5-2698 v3, 32 cores
Original C implementation + Gensim
Hierarchical Softmax
Binary Huffman tree
V − 1 internal nodes
Each word w is represented by a sequence of binary decisions along its path from the root
The tree’s top nodes are part of most paths
Figure courtesy of X. Rong
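A binary Huffman tree over word frequencies can be built in a few lines (a generic sketch with made-up frequencies, not the paper's implementation). Merging V leaves takes V − 1 merges, hence V − 1 internal nodes, and frequent words end up with the shortest paths:

```python
import heapq

def huffman_codes(freqs):
    """Build a binary Huffman tree and return word -> binary code."""
    # Heap entries: (frequency, tiebreak, node); a node is a word or (left, right).
    heap = [(f, i, w) for i, (w, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:                 # V - 1 merges -> V - 1 internal nodes
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (a, b)))
        count += 1
    codes = {}
    def walk(node, code):
        if isinstance(node, tuple):
            walk(node[0], code + "0")
            walk(node[1], code + "1")
        else:
            codes[node] = code           # word -> its sequence of binary decisions
    walk(heap[0][2], "")
    return codes

codes = huffman_codes({"the": 100, "of": 60, "cat": 5, "zygote": 1})
# Frequent words get short codes, so their paths run through the tree's top nodes.
```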
Zipf’s Law
Figure courtesy of http://wugology.com/
Cached Huffman Trees
Cache the top c nodes in the tree
Each thread works on its own stale copy of these top nodes
Update cache every u terms
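The caching scheme can be sketched as follows (a minimal sketch; the class name `ThreadCache` and the sizes are illustrative, and the actual Cython implementation differs): each thread trains against a stale private copy of the c top-node vectors and folds its accumulated deltas back into the shared model every u processed terms.

```python
import numpy as np

c, dim, u = 31, 100, 10          # cache size and sync interval (illustrative values)
shared_top = np.zeros((c, dim))  # shared vectors for the c top tree nodes

class ThreadCache:
    """Per-thread stale copy of the hot top nodes, synced every u terms."""
    def __init__(self):
        self.local = shared_top.copy()           # stale private copy
        self.delta = np.zeros_like(shared_top)   # updates not yet pushed
        self.seen = 0

    def update(self, node, grad):
        self.local[node] -= grad   # train against the stale copy: no contention
        self.delta[node] -= grad   # remember the change for the next sync
        self.seen += 1
        if self.seen % u == 0:     # every u terms: push deltas, refresh copy
            shared_top[:] += self.delta          # Hogwild-style, no lock
            self.delta[:] = 0
            self.local[:] = shared_top
```

Updates to nodes below the cached top of the tree would still go straight to the shared model; only the heavily contended top nodes are buffered this way.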
Efficiency
Python/Cython implementation of cached Huffman trees
At c = 0 (no caching), the same scaling problem persists
Significantly better performance at c = 31
Cache Size
Consistent improvements for all c ≤ 31
Best results for 1 ≤ u ≤ 10
Overly large choices of u degrade model quality
Effectiveness
Stable model quality
Slight quality edge for Gensim implementation
Conclusion
Hierarchical Softmax scales badly beyond 4-8 nodes
- Frequent memory accesses to top nodes
- Zipf's Law
Caching a few top nodes
- 4x speed-up
- Constant model quality
Try it yourself: http://cythnn.github.io
Thank You!
[email protected]