Billion-scale Approximate Nearest Neighbor Search
ACM Multimedia 2020 Tutorial on Effective and Efficient: Toward Open-world Instance Re-identification
Yusuke Matsui, The University of Tokyo
Transcript
Page 1: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

1

Billion-scale Approximate Nearest Neighbor Search

ACM Multimedia 2020 Tutorial on Effective and Efficient: Toward Open-world Instance Re-identification

Yusuke Matsui, The University of Tokyo

Page 2: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Search: x_1, x_2, …, x_N,  where x_n ∈ ℝ^D

➢ N D-dim database vectors: {x_n}_{n=1}^N

Nearest Neighbor Search; NN

Page 3: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Search: a query q ∈ ℝ^D, e.g., q = (0.23, 3.15, 0.65, 1.43)

argmin_{n ∈ {1, 2, …, N}} ||q − x_n||_2^2

Result: x_74, e.g., x_74 = (0.20, 3.25, 0.72, 1.68)

Database: x_1, x_2, …, x_N,  x_n ∈ ℝ^D

➢ N D-dim database vectors: {x_n}_{n=1}^N
➢ Given a query q, find the closest vector from the database
➢ One of the fundamental problems in computer science
➢ Solution: linear scan, O(ND), slow

Nearest Neighbor Search; NN

Page 4: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

(Same search illustration as the previous slide: q ∈ ℝ^D, argmin_{n} ||q − x_n||_2^2 → x_74.)

Approximate Nearest Neighbor Search; ANN

➢ Faster search
➢ Don't necessarily have to be exact neighbors
➢ Trade-off: runtime, accuracy, and memory consumption

Page 5: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Approximate Nearest Neighbor Search; ANN

(Same search illustration as the previous slides.)

➢ Faster search
➢ Don't necessarily have to be exact neighbors
➢ Trade-off: runtime, accuracy, and memory consumption
➢ A sense of scale: billion-scale data on memory
   ✓ D ≈ 100, N = 10^6 to 10^9, on 32 GB of RAM, answering a query in about 10 ms

Page 6: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

NN/ANN for CV

Image retrieval

https://about.mercari.com/press/news/article/20190318_image_search/

https://jp.mathworks.com/help/vision/ug/image-classification-with-bag-of-visual-words.html

Clustering

kNN recognition
➢ Originally: fast construction of bag-of-features
➢ One of the benchmarks is still SIFT

https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm

6

Person Re-identification

Page 7: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

7

Start

Have GPU(s)?

faiss-gpu: linear scan (GpuIndexFlatL2)

faiss-cpu: linear scan (IndexFlatL2)

nmslib (hnsw)

falconn

annoy

faiss-cpu: hnsw + ivfpq(IndexHNSWFlat + IndexIVFPQ)

Adjust the PQ parameters: make M smaller

Exact nearest neighbor search

Alternative: faiss.IndexHNSWFlat in faiss-cpu
➢ Same algorithm in different libraries

Note: Assuming D ≅ 100. The size of the problem is determined by DN. If 100 ≪ D, run PCA to reduce D to 100

Yes

No

If topk > 2048

If slow, or out of memory

Require fast data addition

Would like to run from several processes

If slowโ€ฆ

Would like to adjust the performance

rii — would like to run subset-search

If out of memory

Adjust the IVF parameters: make nprobe larger → higher accuracy but slower

Would like to adjust the performance

Cheat-sheet for ANN in Python (as of 2020; can be installed by conda or pip)

faiss-gpu: ivfpq (GpuIndexIVFPQ)

(1) If still out of GPU memory, or (2) need more accurate results

If out of GPU memory, make M smaller

About: 10^3 < N < 10^6

About: 10^6 < N < 10^9

About: 10^9 < N

Page 8: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Part 1:Nearest Neighbor Search

Part 2:Approximate Nearest Neighbor Search

8

Page 9: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Part 1:Nearest Neighbor Search

Part 2:Approximate Nearest Neighbor Search

9

Page 10: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

(Same search illustration: q ∈ ℝ^D, argmin_{n} ||q − x_n||_2^2 → x_74; database x_1, …, x_N, x_n ∈ ℝ^D.)

Nearest Neighbor Search

➢ Should try this first of all
➢ Introduce a naïve implementation
➢ Introduce a fast implementation
   ✓ Faiss library from FAIR (you'll see it many times today; CPU & GPU)
➢ Experience the drastic difference between the two implementations

Page 11: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Task: Given q ∈ Q and x ∈ X, compute ||q − x||_2^2

M D-dim query vectors Q = {q_1, q_2, …, q_M}; N D-dim database vectors X = {x_1, x_2, …, x_N}; M ≪ N

Page 12: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

M D-dim query vectors Q = {q_1, q_2, …, q_M}; N D-dim database vectors X = {x_1, x_2, …, x_N}; M ≪ N

Task: Given q ∈ Q and x ∈ X, compute ||q − x||_2^2

Naïve impl. (parallelize the query side; the min is selected by a heap, omitted here):

parfor q in Q:
    for x in X:
        l2sqr(q, x)

def l2sqr(q, x):
    diff = 0.0
    for d in range(D):
        diff += (q[d] - x[d]) ** 2
    return diff

A runnable version of this naïve scan is sketched below.

12
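For reference, a runnable version of the naïve scan above (a minimal sketch in NumPy; X, Q, and topk are placeholders you would supply, and the heap-based top-k selection of the slide is replaced by a simple argsort):

import numpy as np

def l2sqr(q, x):
    # Squared Euclidean distance, element by element (slow reference version)
    diff = 0.0
    for d in range(len(q)):
        diff += (q[d] - x[d]) ** 2
    return diff

def brute_force_search(Q, X, topk=1):
    # For each query, scan all database vectors and keep the topk smallest distances
    results = []
    for q in Q:
        dists = np.array([l2sqr(q, x) for x in X])
        ids = np.argsort(dists)[:topk]
        results.append((ids, dists[ids]))
    return results

# Toy usage: N = 1000 database vectors, D = 16, M = 3 queries
rng = np.random.default_rng(0)
X = rng.random((1000, 16)).astype(np.float32)
Q = rng.random((3, 16)).astype(np.float32)
print(brute_force_search(Q, X, topk=5)[0])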

Page 13: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Task: Given q ∈ Q and x ∈ X, compute ||q − x||_2^2

Naïve impl. (parallelize the query side; the min is selected by a heap, omitted here):

parfor q in Q:
    for x in X:
        l2sqr(q, x)

def l2sqr(q, x):
    diff = 0.0
    for d in range(D):
        diff += (q[d] - x[d]) ** 2
    return diff

faiss impl.:

if M < 20:
    compute ||q − x||_2^2 directly by SIMD
else:
    compute ||q − x||_2^2 = ||q||_2^2 − 2 q^T x + ||x||_2^2 by BLAS

M D-dim query vectors Q = {q_1, q_2, …, q_M}; N D-dim database vectors X = {x_1, x_2, …, x_N}; M ≪ N

Page 14: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

(Same slide as page 13, repeated.)

Page 15: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

||x − y||_2^2 by SIMD (variables renamed for the sake of explanation)

float fvec_L2sqr (const float * x, const float * y, size_t d)
{
    __m256 msum1 = _mm256_setzero_ps();

    while (d >= 8) {                      // process eight floats at once
        __m256 mx = _mm256_loadu_ps (x); x += 8;
        __m256 my = _mm256_loadu_ps (y); y += 8;
        const __m256 a_m_b1 = mx - my;
        msum1 += a_m_b1 * a_m_b1;
        d -= 8;
    }

    __m128 msum2 = _mm256_extractf128_ps(msum1, 1);   // fold the 256-bit sum into 128 bits
    msum2 += _mm256_extractf128_ps(msum1, 0);

    if (d >= 4) {                         // process four floats at once
        __m128 mx = _mm_loadu_ps (x); x += 4;
        __m128 my = _mm_loadu_ps (y); y += 4;
        const __m128 a_m_b1 = mx - my;
        msum2 += a_m_b1 * a_m_b1;
        d -= 4;
    }

    if (d > 0) {                          // the rest (fewer than four floats), zero-padded
        __m128 mx = masked_read (d, x);
        __m128 my = masked_read (d, y);
        __m128 a_m_b1 = mx - my;
        msum2 += a_m_b1 * a_m_b1;
    }

    msum2 = _mm_hadd_ps (msum2, msum2);   // horizontal adds reduce to a single float
    msum2 = _mm_hadd_ps (msum2, msum2);
    return _mm_cvtss_f32 (msum2);
}

(Figure: x and y are float32 arrays with D = 31; a 256-bit SIMD register holds eight floats.)

Reference (scalar version):

def l2sqr(x, y):
    diff = 0.0
    for d in range(D):
        diff += (x[d] - y[d]) ** 2
    return diff

Page 16: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

(Same fvec_L2sqr code and figure as page 15; this frame highlights loading eight floats each into the 256-bit registers mx and my.)

Page 17: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

float: 32bit

float fvec_L2sqr (const float * x,const float * y,size_t d)

{__m256 msum1 = _mm256_setzero_ps();

while (d >= 8) {__m256 mx = _mm256_loadu_ps (x); x += 8;__m256 my = _mm256_loadu_ps (y); y += 8;const __m256 a_m_b1 = mx - my;msum1 += a_m_b1 * a_m_b1;d -= 8;

}

__m128 msum2 = _mm256_extractf128_ps(msum1, 1);msum2 += _mm256_extractf128_ps(msum1, 0);

if (d >= 4) {__m128 mx = _mm_loadu_ps (x); x += 4;__m128 my = _mm_loadu_ps (y); y += 4;const __m128 a_m_b1 = mx - my;msum2 += a_m_b1 * a_m_b1;d -= 4;

}

if (d > 0) {__m128 mx = masked_read (d, x);__m128 my = masked_read (d, y);__m128 a_m_b1 = mx - my;msum2 += a_m_b1 * a_m_b1;

}

msum2 = _mm_hadd_ps (msum2, msum2);msum2 = _mm_hadd_ps (msum2, msum2);return _mm_cvtss_f32 (msum2);

}

x

y

mx my

17

def l2sqr(x, y):diff = 0.0for (d = 0; d < D; ++d):

diff += (x[d] โ€“ y[d])**2return diff

๐’™ โˆ’ ๐’š 22 by SIMD Rename variables for the

sake of explanationRef.

D=31

โžข 256bit SIMD Registerโžข Process eight floats at once

Page 18: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

(Animation frame: the element-wise difference a_m_b1 = mx - my is computed.)

Page 19: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

(Animation frame: the squared differences a_m_b1 * a_m_b1 are accumulated into msum1.)

Page 20: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

(Animation frame: the while-loop repeats, accumulating into msum1 eight floats at a time.)

Page 21: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

(Animation frame: next iteration; a_m_b1 = mx - my again.)

Page 22: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

(Animation frame: msum1 += a_m_b1 * a_m_b1 again.)

Page 23: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

(Animation frame: the 256-bit accumulator msum1 is folded into the 128-bit register msum2.)

Page 24: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

(Animation frame: the four-float tail is handled with 128-bit registers.)

Page 25: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

(Animation frame: the remaining d < 4 floats are loaded with masked_read, padded with zeros.)

Page 26: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

(Animation frame: two horizontal adds (_mm_hadd_ps) reduce msum2 to the final scalar result.)

Page 27: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

(Final animation frame: the scalar result of fvec_L2sqr is returned.)

➢ The SIMD code in faiss is simple and easy to read
➢ Being able to read SIMD code comes in handy sometimes, e.g., to see why this implementation is so fast
➢ Another example of a SIMD L2sqr, from HNSW: https://github.com/nmslib/hnswlib/blob/master/hnswlib/space_l2.h

Page 28: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

(Recap) M D-dim query vectors Q = {q_1, q_2, …, q_M}; N D-dim database vectors X = {x_1, x_2, …, x_N}; M ≪ N
Task: Given q ∈ Q and x ∈ X, compute ||q − x||_2^2

Naïve impl.: parallelize the query side; select the min by a heap (omitted here)

faiss impl.:

if M < 20:
    compute ||q − x||_2^2 directly by SIMD
else:
    compute ||q − x||_2^2 = ||q||_2^2 − 2 q^T x + ||x||_2^2 by BLAS

Page 29: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Compute ||q − x||_2^2 = ||q||_2^2 − 2 q^T x + ||x||_2^2 with BLAS

Stack the M D-dim query vectors into a D × M matrix: Q = [q_1, q_2, …, q_M] ∈ ℝ^(D×M)
Stack the N D-dim database vectors into a D × N matrix: X = [x_1, x_2, …, x_N] ∈ ℝ^(D×N)

# Compute tables
q_norms = norms(Q)     # ||q_1||_2^2, ||q_2||_2^2, …, ||q_M||_2^2   (SIMD-accelerated function)
x_norms = norms(X)     # ||x_1||_2^2, ||x_2||_2^2, …, ||x_N||_2^2
ip = sgemm_(Q, X, …)   # Q^T X

# Scan and sum
parfor m in range(M):
    for n in range(N):
        dist = q_norms[m] + x_norms[n] - 2 * ip[m][n]   # ||q_m||^2 + ||x_n||^2 − 2 (Q^T X)_mn = ||q_m − x_n||^2

➢ Matrix multiplication by BLAS
➢ Dominant if Q and X are large
➢ The choice of BLAS backend matters:
   ✓ Intel MKL is 30% faster than OpenBLAS
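The same expansion in NumPy, as a sketch (not faiss's actual code): here Q and X are stored row-wise, with shapes (M, D) and (N, D), and the matrix product Q @ X.T is what the BLAS sgemm call computes.

import numpy as np

def l2sqr_all(Q, X):
    # (M, N) matrix of squared L2 distances via ||q||^2 - 2 q^T x + ||x||^2
    q_norms = (Q ** 2).sum(axis=1)           # (M,)  ||q_m||^2
    x_norms = (X ** 2).sum(axis=1)           # (N,)  ||x_n||^2
    ip = Q @ X.T                             # (M, N) inner products, BLAS-backed
    return q_norms[:, None] + x_norms[None, :] - 2.0 * ip

# Toy usage: the argmin per row gives the nearest database id for each query
rng = np.random.default_rng(0)
Q = rng.random((5, 64)).astype(np.float32)
X = rng.random((1000, 64)).astype(np.float32)
print(l2sqr_all(Q, X).argmin(axis=1))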

Page 30: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

NN on the GPU (faiss-gpu) is 10x faster than NN on the CPU (faiss-cpu)

➢ NN-GPU always computes ||q||_2^2 − 2 q^T x + ||x||_2^2
➢ k-means for 1M vectors (D = 256, K = 20000)
   ✓ 11 min on CPU
   ✓ 55 sec on 1 Pascal-class P100 GPU (float32 math)
   ✓ 34 sec on 1 Pascal-class P100 GPU (float16 math)
   ✓ 21 sec on 4 Pascal-class P100 GPUs (float32 math)
   ✓ 16 sec on 4 Pascal-class P100 GPUs (float16 math)
➢ If a GPU is available and its memory is enough, try GPU-NN (see the sketch below)
➢ The behavior is a little bit different (e.g., a restriction on top-k)

Benchmark: https://github.com/facebookresearch/faiss/wiki/Low-level-benchmarks

x10 faster

30
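If a GPU is available, exact search with faiss-gpu looks roughly like the sketch below (assuming the faiss-gpu package is installed; the sizes are illustrative and arrays must be float32):

import numpy as np
import faiss

D = 128
rng = np.random.default_rng(0)
X = rng.random((100_000, D)).astype(np.float32)   # database vectors
Q = rng.random((16, D)).astype(np.float32)        # query vectors

res = faiss.StandardGpuResources()                # GPU resources
index = faiss.GpuIndexFlatL2(res, D)              # exact L2 search on the GPU
index.add(X)                                      # copy the database to GPU memory
dist, ids = index.search(Q, 10)                   # top-10 neighbors per query
print(ids[0], dist[0])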

Page 31: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Reference
➢ Switch implementation of L2sqr in faiss:

[https://github.com/facebookresearch/faiss/wiki/Implementation-notes#matrix-multiplication-to-do-many-l2-distance-computations]

โžข Introduction to SIMD: a lecture by Markus Pรผschel (ETH) [How to Write Fast Numerical Code - Spring 2019], especially [SIMD vector instructions]โœ“ https://acl.inf.ethz.ch/teaching/fastcode/2019/โœ“ https://acl.inf.ethz.ch/teaching/fastcode/2019/slides/07-simd.pdf

โžข SIMD codes for faiss [https://github.com/facebookresearch/faiss/blob/master/utils/distances_simd.cpp]

โžข L2sqr benchmark including AVX512 for faiss-L2sqr [https://gist.github.com/matsui528/583925f88fcb08240319030202588c74]

31

Page 32: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Part 1:Nearest Neighbor Search

Part 2:Approximate Nearest Neighbor Search

32

Page 33: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

[Overview figure: ANN approaches arranged by database size N, from million-scale (N ≈ 10^6) to billion-scale (N ≈ 10^9)]

➢ Million-scale: keep the raw data on memory
   ✓ Locality Sensitive Hashing (LSH)
   ✓ Tree / Space Partitioning
   ✓ Graph traversal
   ✓ For raw data: accuracy ☺, memory ☹

➢ Billion-scale: compress each vector, e.g., (0.34, 0.22, 0.68, 0.71) → a binary code (0, 1, 0, 0) or a PQ code (ID: 2, ID: 123)
   ✓ Hamming-based codes → linear scan by Hamming distance
   ✓ Look-up-based codes → linear scan by asymmetric distance
   ✓ Inverted index + data compression, combining a space partition (k-means, PQ/OPQ, graph traversal, etc.) with data compression (raw data, scalar quantization, PQ/OPQ, etc.)
   ✓ For compressed data: accuracy ☹, memory ☺

Page 34: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

(The overview figure from page 33 is shown again.)

Page 35: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Locality Sensitive Hashing (LSH)
➢ LSH = hash functions + hash tables
➢ Map similar items to the same symbol with a high probability

Record: x_13 → Hash 1, Hash 2, … → the ID 13 is stored in the corresponding bucket of each table
Search: q → Hash 1, Hash 2, … → collect the IDs in the hit buckets (e.g., x_4, x_5, x_21, …)
→ Compare q with x_4, x_5, x_21, … by the Euclidean distance

Page 36: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Locality Sensitive Hashing (LSH)
➢ LSH = hash functions + hash tables
➢ Map similar items to the same symbol with a high probability

(Same Record/Search illustration as the previous slide.)

E.g., random projection [Datar+, SCG 04]:
H(x) = [h_1(x), …, h_M(x)]^T,  h_m(x) = ⌊(a^T x + b) / W⌋

Page 37: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Locality Sensitive Hashing (LSH)

(Same illustration and hash function as the previous slides; a NumPy sketch of the random-projection hash follows below.)

☺:
➢ Math-friendly
➢ Popular in the theory area (FOCS, STOC, …)
☹:
➢ Large memory cost
   ✓ Need several tables to boost the accuracy
   ✓ Need to store the original data {x_n}_{n=1}^N on memory
➢ Data-dependent methods such as PQ are better for real-world data
➢ Thus, in recent CV papers, LSH has been treated as a classic method
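A sketch of the random-projection hash above in NumPy (the M hash functions are stacked into one matrix; a, b, and W follow [Datar+, SCG 04]; the class name and parameter values are only illustrative):

import numpy as np

class RandomProjectionLSH:
    def __init__(self, D, M=4, W=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((M, D))   # one Gaussian projection a per hash function
        self.b = rng.uniform(0.0, W, size=M)   # offsets b ~ U[0, W)
        self.W = W

    def hash(self, x):
        # h_m(x) = floor((a_m^T x + b_m) / W); the M integers form the bucket key
        return tuple(np.floor((self.A @ x + self.b) / self.W).astype(int))

# Toy usage: similar vectors tend to fall into the same bucket
lsh = RandomProjectionLSH(D=16)
x = np.random.rand(16)
print(lsh.hash(x), lsh.hash(x + 0.01))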

Page 38: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

In fact:
➢ Consider the next candidate bucket → practical memory consumption (Multi-Probe [Lv+, VLDB 07])
➢ A library based on this idea: FALCONN

(Figure: the query q probes several nearby buckets of each hash table, then q is compared with x_4, x_5, x_21, … by the Euclidean distance.)

Page 39: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Falconn
★852  https://github.com/falconn-lib/falconn

$> pip install FALCONN

table = falconn.LSHIndex(params_cp)
table.setup(X - center)
query_object = table.construct_query_object()
# query parameter config here
query_object.find_nearest_neighbor(Q - center, topk)

☺ Faster data addition (than annoy, nmslib, ivfpq)
☺ Useful for on-the-fly addition
☹ Parameter configuration seems a bit non-intuitive

Page 40: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

40

Reference
➢ Good summaries on this field: CVPR 2014 Tutorial on Large-Scale Visual Recognition, Part I: Efficient matching, H. Jégou [https://sites.google.com/site/lsvrtutorialcvpr14/home/efficient-matching]

โžข Practical Q&A: FAQ in Wiki of FALCONN [https://github.com/FALCONN-LIB/FALCONN/wiki/FAQ]

โžข Hash functions: M. Datar et al., โ€œLocality-sensitive hashing scheme based on p-stable distributions,โ€ SCG 2004.

โžข Multi-Probe: Q. Lv et al., โ€œMulti-Probe LSH: Efficient Indexing for High-Dimensional Similarity Searchโ€, VLDB 2007

โžข Survey: A. Andoni and P. Indyk, โ€œNear-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions,โ€ Comm. ACM 2008

Page 41: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

(The overview figure from page 33 is shown again.)

Page 42: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

FLANN: Fast Library for Approximate Nearest Neighbors
https://github.com/mariusmuja/flann

➢ Automatically selects "Randomized KD Tree" or "k-means Tree"

☺ Good code base. Implemented in OpenCV and PCL
☺ Very popular in the late 00's and early 10's
☹ Large memory consumption. The original data needs to be stored
☹ Not actively maintained now

Images are from [Muja and Lowe, TPAMI 2014]
(Figure: Randomized KD Tree | k-means Tree)

Page 43: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Annoy
"2-means tree" + "multiple trees" + "shared priority queue"

Record: select two points randomly → divide up the space → repeat hierarchically
Search:
➢ Focus on the cell that the query lives in
➢ Compare the distances
☺ Can traverse the tree with a logarithmic number of comparisons

All images are cited from the author's blog post (https://erikbern.com/2015/10/01/nearest-neighbors-and-vector-models-part-2-how-to-search-in-high-dimensional-spaces.html)

Page 44: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

44

Annoy
"2-means tree" + "multiple trees" + "shared priority queue"

All images are cited from the author's blog post (https://erikbern.com/2015/10/01/nearest-neighbors-and-vector-models-part-2-how-to-search-in-high-dimensional-spaces.html)

Feature 1: If we need more data points, use a priority queue
Feature 2: Boost the accuracy by multi-tree with a shared priority queue

Page 45: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

45

Annoy
https://github.com/erikbern/annoy
$> pip install annoy

t = AnnoyIndex(D)
for n, x in enumerate(X):
    t.add_item(n, x)
t.build(n_trees)

t.get_nns_by_vector(q, topk)

(A complete toy example follows below.)

☺ Developed at Spotify. Well-maintained. Stable
☺ Simple interface with only a few parameters
☺ Baseline for million-scale data
☺ Supports mmap, i.e., can be accessed from several processes
☹ Large memory consumption
☹ Runtime itself is slower than HNSW

โ˜…7.1K
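A complete toy example along the lines of the snippet above (note: recent Annoy versions require the metric argument; 'euclidean' and the parameter values here are assumptions, not recommendations):

import numpy as np
from annoy import AnnoyIndex

D, N = 64, 10_000
X = np.random.rand(N, D).astype(np.float32)

t = AnnoyIndex(D, 'euclidean')       # the metric must be given in recent versions
for n, x in enumerate(X):
    t.add_item(n, x)
t.build(10)                          # n_trees = 10; more trees -> higher accuracy, more memory
t.save('index.ann')                  # the saved index is mmap-ed, so other processes can load it

ids = t.get_nns_by_vector(X[0], 10)  # top-10 neighbor ids for a query vector
print(ids)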

Page 46: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

(The overview figure from page 33 is shown again.)

Page 47: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

47

Graph traversal

➢ Very popular in recent years
➢ Around 2017, it turned out that graph-traversal-based methods work well for million-scale data
➢ Pioneers:
   ✓ Navigable Small World Graphs (NSW)
   ✓ Hierarchical NSW (HNSW)
➢ Implementations: nmslib, hnswlib, faiss

Page 48: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

48

Record Images are from [Malkov+, Information Systems, 2013]

โžขEach node is a database vector

๐’™13

Graph of ๐’™1, โ€ฆ , ๐’™90

Page 49: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

49

Record (images are from [Malkov+, Information Systems, 2013])

➢ Each node is a database vector
➢ Given a new database vector, create new edges to neighbors

(Figure: the graph of x_1, …, x_90, with the new node x_91.)

Page 50: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

50

Record Images are from [Malkov+, Information Systems, 2013]

โžขEach node is a database vectorโžขGiven a new database vector, create new edges to neighbors

๐’™13

๐’™91

Graph of ๐’™1, โ€ฆ , ๐’™90

Page 51: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

51

Record Images are from [Malkov+, Information Systems, 2013]

โžขEach node is a database vectorโžขGiven a new database vector, create new edges to neighbors

๐’™13

๐’™91

Graph of ๐’™1, โ€ฆ , ๐’™90

Page 52: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

52

Record (images are from [Malkov+, Information Systems, 2013])

➢ Each node is a database vector
➢ Given a new database vector, create new edges to neighbors
➢ Early links can be long
➢ Such long links encourage large hops, making the search converge quickly

(Figure: the graph of x_1, …, x_90, with the new node x_91 linked to its neighbors.)

Page 53: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

53

Search Images are from [Malkov+, Information Systems, 2013]

Page 54: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

54

Search Images are from [Malkov+, Information Systems, 2013]

โžข Given a query vector

Page 55: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

55

Search Images are from [Malkov+, Information Systems, 2013]

โžข Given a query vectorโžข Start from a random point

Page 56: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

56

Search Images are from [Malkov+, Information Systems, 2013]

โžข Given a query vectorโžข Start from a random pointโžข From the connected nodes, find the closest one to the query

Page 57: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

57

Search Images are from [Malkov+, Information Systems, 2013]

โžข Given a query vectorโžข Start from a random pointโžข From the connected nodes, find the closest one to the query

Page 58: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

58

โžข Given a query vectorโžข Start from a random pointโžข From the connected nodes, find the closest one to the queryโžข Traverse in a greedy manner

Search Images are from [Malkov+, Information Systems, 2013]

Page 59: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Search (images are from [Malkov+, Information Systems, 2013])

➢ Given a query vector
➢ Start from a random point
➢ From the connected nodes, find the closest one to the query
➢ Traverse in a greedy manner

Page 60: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

60

Extension: Hierarchical NSW; HNSW

[Malkov and Yashunin, TPAMI, 2019]

โžข Construct the graph hierarchically [Malkov and Yashunin, TPAMI, 2019]

โžข This structure works pretty well for real-world data

Search on a coarse graph

Move to the same node on a finer graph

Repeat

Page 61: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

61

NMSLIB (Non-Metric Space Library)
https://github.com/nmslib/nmslib
$> pip install nmslib

index = nmslib.init(method='hnsw')
index.addDataPointBatch(X)
index.createIndex(params1)
index.setQueryTimeParams(params2)
index.knnQuery(q, k=topk)

(A complete example with typical parameters follows below.)

☺ The "hnsw" is the best method as of 2020 for million-scale data
☺ Simple interface
☺ If memory consumption is not a problem, try this
☹ Large memory consumption
☹ Data addition is not fast

โ˜…2k
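A complete toy example along the lines of the snippet above; the concrete parameter dictionaries (M, efConstruction, efSearch) are illustrative values, not recommendations:

import numpy as np
import nmslib

D, N = 64, 100_000
X = np.random.rand(N, D).astype(np.float32)

index = nmslib.init(method='hnsw', space='l2')        # HNSW over the L2 distance
index.addDataPointBatch(X)
index.createIndex({'M': 16, 'efConstruction': 200})   # graph-construction parameters
index.setQueryTimeParams({'efSearch': 100})           # larger -> more accurate but slower

ids, dists = index.knnQuery(X[0], k=10)               # single query
batch = index.knnQueryBatch(X[:5], k=10, num_threads=4)  # batched queries
print(ids, dists)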

Page 62: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

62

Other implementations of HNSW

Hnswlib: https://github.com/nmslib/hnswlib
➢ Spin-off library from nmslib
➢ Includes only hnsw
➢ Simpler; may be useful if you want to extend hnsw

Faiss: https://github.com/facebookresearch/faiss
➢ Library for PQ-based methods; will be introduced later
➢ This lib also includes hnsw

Page 63: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Other graph-based approaches
➢ From Alibaba: C. Fu et al., "Fast Approximate Nearest Neighbor Search with the Navigating Spreading-out Graph", VLDB 19
   https://github.com/ZJULearning/nsg
➢ From Microsoft Research Asia. Used inside Bing: J. Wang and S. Lin, "Query-Driven Iterated Neighborhood Graph Search for Large Scale Indexing", ACMMM 12 (this seems to be the backbone paper)
   https://github.com/microsoft/SPTAG
➢ From Yahoo Japan. Competing with NMSLIB for the 1st place of the benchmark: M. Iwasaki and D. Miyazaki, "Optimization of Indexing Based on k-Nearest Neighbor Graph for Proximity Search in High-dimensional Data", arXiv 18
   https://github.com/yahoojapan/NGT

63

Page 64: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

64

Reference
➢ The original paper of the Navigable Small World Graph: Y. Malkov et al., "Approximate Nearest Neighbor Algorithm based on Navigable Small World Graphs," Information Systems 2013

โžข The original paper of Hierarchical Navigable Small World Graph: Y. Malkov and D. Yashunin, โ€œEfficient and Robust Approximate Nearest Neighbor search using Hierarchical Navigable Small World Graphs,โ€ IEEE TPAMI 2019

Page 65: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

(The overview figure from page 33 is shown again.)

Page 66: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

66

Basic idea

[Figure: the database vectors 1, 2, …, N, e.g., (0.54, 2.35, 0.82, 0.42), (0.62, 0.31, 0.34, 1.63), …, (3.34, 0.83, 0.62, 1.45), each D-dim, are converted to N short codes]

➢ Need 4ND bytes to represent N real-valued vectors using floats
➢ If N or D is too large, we cannot keep the data on memory
   ✓ E.g., 512 GB for D = 128, N = 10^9
➢ Convert each vector to a short code
➢ The short code is designed to be memory-efficient
   ✓ E.g., 4 GB for the above example, with 32-bit codes
➢ Run the search over the short codes

Page 67: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

67

Basic idea

(Same figure and bullets as the previous slide.)

What kind of conversion is preferred?

1. The "distance" between two codes can be calculated (e.g., Hamming distance)

2. The distance can be computed quickly

3. That distance approximates the distance between the original vectors (e.g., L2)

4. A sufficiently short code can achieve the above three criteria

Page 68: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

(The overview figure from page 33 is shown again; this slide covers the Hamming-based branch.)

➢ Convert x to a B-bit binary vector: f(x) = b ∈ {0, 1}^B
➢ Hamming distance: d_H(b_1, b_2) = |b_1 ⊕ b_2| ∼ d(x_1, x_2)
➢ A lot of methods:
   ✓ J. Wang et al., "Learning to Hash for Indexing Big Data - A Survey", Proc. IEEE 2015
   ✓ J. Wang et al., "A Survey on Learning to Hash", TPAMI 2018
➢ Not the main scope of this tutorial; PQ is usually more accurate (a Hamming-scan sketch follows below)
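For reference, a Hamming-distance linear scan over packed binary codes reduces to XOR plus popcount; a NumPy sketch (codes are stored as uint8 arrays of B/8 bytes; the 256-entry popcount table is an implementation choice):

import numpy as np

# Popcount of every byte value 0..255
POPCOUNT = np.array([bin(v).count("1") for v in range(256)], dtype=np.uint8)

def hamming_scan(q_code, codes):
    # q_code: (B/8,) uint8, codes: (N, B/8) uint8 -> (N,) Hamming distances
    xor = np.bitwise_xor(codes, q_code)     # differing bits, byte by byte
    return POPCOUNT[xor].sum(axis=1)        # popcount each byte and sum

# Toy usage with B = 64-bit codes
rng = np.random.default_rng(0)
codes = rng.integers(0, 256, size=(1000, 8), dtype=np.uint8)
print(hamming_scan(codes[0], codes).argsort()[:5])   # 5 nearest codes by Hamming distance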

Page 69: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

(The overview figure from page 33 is shown again; the rest of this section covers the look-up-based branch.)

Page 70: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Product Quantization; PQ [Jégou, TPAMI 2011]

➢ Split a vector into sub-vectors, and quantize each sub-vector

[Figure: a D-dim vector x = (0.34, 0.22, 0.68, 1.02, 0.03, 0.71) is split into M sub-vectors; each sub-vector is quantized with its own codebook of 256 centroids (ID: 1, ID: 2, …, ID: 256), e.g., (0.13, 0.98), (0.32, 0.27), (1.03, 0.08), …; the resulting M IDs form the PQ code x̄]

➢ The codebooks are trained beforehand by k-means on training data

Page 71: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

(Same PQ illustration as the previous slide.)

Page 72: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

(Animation frame: the first sub-vector is quantized to ID: 2.)

Page 73: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

(Animation frame: the second sub-vector is quantized to ID: 123.)

Page 74: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

(Animation frame: the third sub-vector is quantized to ID: 87; the PQ code is x̄ = (2, 123, 87).)

Page 75: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Product Quantization; PQ [Jégou, TPAMI 2011]

(Same PQ illustration: x is split into M sub-vectors, each quantized to a centroid ID, giving the PQ code x̄ = (2, 123, 87).)

➢ Simple
➢ Memory efficient
➢ The distance can be estimated

Bar notation for PQ codes in this tutorial: x ∈ ℝ^D ↦ x̄ ∈ {1, …, 256}^M

Page 76: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Product Quantization: Memory efficient

(Same PQ illustration as before.)

Page 77: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Product Quantization: Memory efficient

(Same figure.) E.g., D = 128 float32 values: 128 × 32 = 4096 [bit] per vector.

Page 78: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Product Quantization: Memory efficient

(Same figure.) With M = 8 uchar (8-bit) codes: 8 × 8 = 64 [bit] per vector.

Page 79: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Product Quantization: Memory efficient

(Same figure.) 4096 bits → 64 bits: 1/64 of the original memory.

Page 80: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Product Quantization: Distance estimation

Query: q ∈ ℝ^D, e.g., (0.34, 0.22, 0.68, 1.02, 0.03, 0.71)
Database vectors: x_1, x_2, …, x_N, e.g., (0.54, 2.35, 0.82, 0.42, 0.14, 0.32), (0.62, 0.31, 0.34, 1.63, 1.43, 0.74), …, (3.34, 0.83, 0.62, 1.45, 0.12, 2.32)

Page 81: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

(Same figure; the database vectors are product-quantized.)

Page 82: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Product Quantization: Distance estimation

(Same figure; each database vector is now stored only as its PQ code, e.g., x̄_1 = (42, 67, 92), x̄_2 = (221, 143, 34), …, x̄_N = (99, 234, 3), where x̄_n ∈ {1, …, 256}^M.)

Page 83: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Product Quantization: Distance estimation

Query: q ∈ ℝ^D

➢ d(q, x)^2 can be efficiently approximated by the asymmetric distance d_A(q, x̄)^2
➢ Lookup trick: look up pre-computed distance tables
➢ Linear scan by d_A over the PQ codes x̄_1, x̄_2, …, x̄_N, where x̄_n ∈ {1, …, 256}^M (a NumPy sketch follows below)
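The lookup trick in NumPy: build one 256-entry table of squared distances per sub-space, then the asymmetric distance of every PQ code is just M table lookups and a sum. A sketch (codebooks has shape (M, 256, D/M) and codes has shape (N, M); both names are placeholders):

import numpy as np

def asymmetric_distances(q, codebooks, codes):
    # q: (D,), codebooks: (M, Ks, Ds), codes: (N, M) -> (N,) approximated squared distances
    M, Ks, Ds = codebooks.shape
    q_subs = q.reshape(M, Ds)                                      # split the query into M sub-vectors
    # dtable[m][k] = || q_m - codebooks[m][k] ||^2
    dtable = ((codebooks - q_subs[:, None, :]) ** 2).sum(axis=2)   # (M, Ks)
    # For each database code, sum the M table entries it points to
    return dtable[np.arange(M)[None, :], codes].sum(axis=1)        # (N,)

# Toy usage: D = 8, M = 4 sub-spaces of Ds = 2 dims, N = 1000 codes
rng = np.random.default_rng(0)
codebooks = rng.random((4, 256, 2)).astype(np.float32)
codes = rng.integers(0, 256, size=(1000, 4))
q = rng.random(8).astype(np.float32)
print(asymmetric_distances(q, codebooks, codes)[:5])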

Page 84: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

➢ Only tens of lines in Python
➢ Pure Python library: nanopq  https://github.com/matsui528/nanopq
➢ pip install nanopq

(The slide shows the actual nanopq code, not pseudo code; a stand-in sketch follows below.)
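The code shown on this slide did not survive the transcript; as a stand-in, a typical nanopq session looks roughly like this (method names follow the nanopq README; treat the exact signatures and sizes as assumptions):

import numpy as np
import nanopq

D = 128
Xt = np.random.rand(10_000, D).astype(np.float32)    # training vectors
X  = np.random.rand(100_000, D).astype(np.float32)   # database vectors
q  = np.random.rand(D).astype(np.float32)            # query

pq = nanopq.PQ(M=8, Ks=256)     # 8 sub-spaces, 256 centroids each -> 8-byte codes
pq.fit(Xt)                      # train the codebooks by k-means
X_code = pq.encode(X)           # (N, 8) uint8 PQ codes
dt = pq.dtable(q)               # per-sub-space distance table for this query
dists = dt.adist(X_code)        # asymmetric distances to all codes
print(dists.argsort()[:10])     # approximate top-10 neighbors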

Page 85: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

85

Deep PQ

โžข T. Yu et al., โ€œProduct Quantization Network for Fast Image Retrievalโ€, ECCV 18, IJCV20

โžข L. Yu et al., โ€œGenerative Adversarial Product Quantisationโ€, ACMMM 18

โžข B. Klein et al., โ€œEnd-to-End Supervised Product Quantization for Image Search and Retrievalโ€, CVPR 19

From T. Yu et al., "Product Quantization Network for Fast Image Retrieval", ECCV 18

➢ Supervised search (unlike the original PQ)
➢ Base CNN + PQ-like layer + some loss
➢ Needs class information

Page 86: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

86

More extensive surveys for PQ

➢ https://github.com/facebookresearch/faiss/wiki#research-foundations-of-faiss
➢ http://yusukematsui.me/project/survey_pq/survey_pq_jp.html
➢ Y. Matsui, Y. Uchida, H. Jégou, S. Satoh, "A Survey of Product Quantization", ITE 2018.

Page 87: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

87

Hamming-based vs Look-up-based

(Figure: a vector is converted either to a binary code, e.g., (0, 1, 0, 1, 0, 0), or to a PQ code, e.g., (ID: 2, ID: 123, ID: 87).)

                 Hamming-based                            Look-up-based
Representation   Binary code: {0, 1}^B                    PQ code: {1, …, 256}^M
Distance         Hamming distance                         Asymmetric distance
Approximation    ☺                                        ☺☺
Runtime          ☺☺                                       ☺
Pros             No auxiliary structure                   Can reconstruct the original vector
Cons             Cannot reconstruct the original vector   Requires an auxiliary structure (codebook)

Page 88: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

(The overview figure from page 33 is shown again; next: inverted index + data compression.)

Page 89: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

89

Inverted index + PQ: Recap the notation

[Figure: x ∈ ℝ^D (D-dim) → product quantization → x̄ ∈ {1, …, 256}^M (M IDs, e.g., ID: 2, ID: 123, ID: 87)]

➢ Suppose q, x ∈ ℝ^D, where x is quantized to x̄
➢ d(q, x)^2 can be efficiently approximated using only x̄ (the PQ code, not the original vector x):
   d(q, x)^2 ∼ d_A(q, x̄)^2
➢ Bar notation = PQ code

Page 90: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

90

Inverted index + PQ: Record

Prepare a coarse quantizer:
✓ Split the space into K sub-spaces, with centroids c_1, …, c_K (k = 1, 2, …, K)
✓ {c_k}_{k=1}^K are created by running k-means on training data

Page 91: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

91

Inverted index + PQ: Record

(Same coarse-quantizer figure.) Record x_1, e.g., x_1 = (1.02, 0.73, 0.56, 1.37, 1.37, 0.72).

Page 92: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

92

(Same frame as the previous slide: x_1 is placed in the coarse-quantizer space.)

Page 93: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

93

Inverted index + PQ: Record

(Same figure.)
➢ c_2 is closest to x_1
➢ Compute the residual r_1 between x_1 and c_2:  r_1 = x_1 − c_2

Page 94: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

94

Inverted index + PQ: Record

(Same figure; the posting list of cell k = 2 now holds entry 1 together with the PQ code (42, 37, 9) of its residual.)
➢ Quantize r_1 to r̄_1 by PQ
➢ Record it with the ID "1"
➢ I.e., record the pair (i, r̄_i)

Page 95: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

95

Inverted index + PQ: Record

➢ For all database vectors, record [ID + PQ(residual)] as posting lists

[Figure: the coarse quantizer c_1, …, c_7 and the inverted index; each cell k = 1, …, K stores a posting list of (ID, PQ code) entries, e.g., cell k = 2 holds (1, (42, 37, 9)), (8621, (24, 54, 23)), (145, (77, 21, 5)), …]

Page 96: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Inverted index + PQ: Search

(Same coarse quantizer + inverted index figure as the previous slide.)

Page 97: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Inverted index + PQ: Search

(Same figure.) Find the nearest vector to q, e.g., q = (0.54, 2.35, 0.82, 0.42, 0.14, 0.32).

Page 98: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

(Same frame as the previous slide.)

Page 99: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Inverted index + PQ: Search

(Same figure.) Find the nearest vector to q.
➢ c_2 is the closest to q
➢ Compute the residual: r_q = q − c_2

Page 100: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

Inverted index + PQ: Search

(Same figure.) Find the nearest vector to q.
➢ c_2 is the closest to q
➢ Compute the residual: r_q = q − c_2
➢ For all (i, r̄_i) in cell k = 2, compare r̄_i with r_q:
   d(q, x_i)^2 = d(q − c_2, x_i − c_2)^2 = d(r_q, r_i)^2 ∼ d_A(r_q, r̄_i)^2
➢ Find the smallest one (several strategies; an outline in code follows below)

Page 101: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

101

Faiss
https://github.com/facebookresearch/faiss

$> conda install faiss-cpu -c pytorch
$> conda install faiss-gpu -c pytorch

➢ From the original authors of PQ and a GPU expert, at FAIR
➢ CPU version: all PQ-based methods
➢ GPU version: some PQ-based methods
➢ Bonus:
   ✓ NN (not ANN) is also implemented, and quite fast
   ✓ k-means (CPU/GPU). Fast.

★ 10K GitHub stars

Benchmark of k-means: https://github.com/DwangoMediaVillage/pqkmeans/blob/master/tutorial/4_comparison_to_faiss.ipynb
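As a side note on the k-means bonus above, a small usage sketch of Faiss's k-means wrapper might look like the following; the data shape, k, and niter are illustrative assumptions.

import faiss
import numpy as np

X = np.random.random((10000, 64)).astype(np.float32)

kmeans = faiss.Kmeans(d=64, k=100, niter=20, verbose=False)  # gpu=True is also available in faiss-gpu builds
kmeans.train(X)

centroids = kmeans.centroids                     # (100, 64) cluster centers
dists, assignments = kmeans.index.search(X, 1)   # nearest centroid for each vector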

Page 102: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

102

quantizer = faiss.IndexFlatL2(D)
index = faiss.IndexIVFPQ(quantizer, D, nlist, M, nbits)

index.train(Xt)                    # Train
index.add(X)                       # Record data
index.nprobe = nprobe              # Search parameter
dist, ids = index.search(Q, topk)  # Search
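For reference, a self-contained version of this snippet might look like the following; the random data and the parameter values (D, nlist, nprobe, topk) are illustrative assumptions, not recommendations.

import faiss
import numpy as np

D, nlist, M, nbits = 128, 256, 16, 8                    # nbits = 8 -> 256 centroids per sub-space
Xt = np.random.random((50000, D)).astype(np.float32)    # training vectors
X = np.random.random((200000, D)).astype(np.float32)    # database vectors
Q = np.random.random((10, D)).astype(np.float32)        # query vectors

quantizer = faiss.IndexFlatL2(D)                        # coarse quantizer (linear scan over centroids)
index = faiss.IndexIVFPQ(quantizer, D, nlist, M, nbits)

index.train(Xt)                    # Train
index.add(X)                       # Record data
index.nprobe = 8                   # Search parameter: number of posting lists to visit
dist, ids = index.search(Q, 10)    # Search: top-10 IDs and distances per query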

[Figure: the code mapped onto the IVFPQ structure — IndexFlatL2 is selected as the coarse quantizer (a simple linear scan picks the nearest centroid among 𝒄1, …, 𝒄7 for 𝒒); each of the K posting lists stores entries of M sub-codes, where nbits is usually 8 bit per sub-code]

Page 103: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

๐‘

109

106

bill

ion

-sca

lem

illio

n-s

cale Locality Sensitive Hashing (LSH)

Tree / Space Partitioning

Graph traversal

0.340.220.680.71

0

1

0

0

ID: 2

ID: 123

0.340.220.680.71

Space partition Data compression

โžข k-meansโžข PQ/OPQโžข Graph traversalโžข etcโ€ฆ

โžข Raw dataโžข Scalar quantizationโžข PQ/OPQโžข etcโ€ฆ

Look-up-based

Hamming-based

Linear-scan by Asymmetric Distance

โ€ฆ

Linear-scan by Hamming distance

103

Inverted index + data compression

For raw data: Acc. ☺, Memory: ☹
For compressed data: Acc. ☹, Memory: ☺
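As a contrast to the look-up-based route used in the rest of this part, the "Hamming-based" linear scan above can be sketched in a few lines of NumPy; this is an illustrative sketch, not code from the tutorial, and the code length (64 bit) is an arbitrary choice.

import numpy as np

rng = np.random.default_rng(0)
N, nbytes = 1000, 8                                   # 64-bit binary codes
codes = rng.integers(0, 256, size=(N, nbytes), dtype=np.uint8)
q_code = rng.integers(0, 256, size=nbytes, dtype=np.uint8)

# Hamming distance = number of differing bits after XOR
hamming = np.unpackbits(codes ^ q_code, axis=1).sum(axis=1)
nearest = int(np.argmin(hamming))
print(nearest, hamming[nearest])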

Page 104: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

104

[Figure: inverted index + data compression — the coarse quantizer assigns 𝒒 to its nearest centroids (e.g., 𝒄6, 𝒄3, 𝒄13); each of the K posting lists stores compressed entries of M sub-codes]

Page 105: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

105

quantizer = faiss.IndexHNSWFlat(D, hnsw_m)
index = faiss.IndexIVFPQ(quantizer, D, nlist, M, nbits)

[Figure: the same IVFPQ structure, but the coarse quantizer is now an HNSW graph over the centroids; 𝒒 is assigned to its nearest centroids (e.g., 𝒄6, 𝒄3, 𝒄13) and the corresponding posting lists (K lists, M sub-codes per entry, nbits usually 8 bit) are scanned]

➢ Switch the coarse quantizer from linear scan to HNSW (a complete snippet is sketched below)
➢ The best approach for billion-scale data as of 2020
➢ The backbone of [Douze+, CVPR 2018] [Baranchuk+, ECCV 2018]


Page 106: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

106

☺ From the original authors of PQ. Extremely efficient (theory & implementation)
☺ Used in real-world products (Mercari, etc.)
☺ For billion-scale data, Faiss is the best option
☺ Especially, large-batch search is fast when the number of queries is large

☹ Lack of documentation (especially for the Python binding)
☹ Hard for a novice user to select a suitable algorithm
☹ As of 2020, anaconda is required; pip is not supported officially

Page 107: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

107

Reference
➢ Faiss wiki: [https://github.com/facebookresearch/faiss/wiki]

➢ Faiss tips: [https://github.com/matsui528/faiss_tips]

➢ Julia implementation of lookup-based methods: [https://github.com/una-dinosauria/Rayuela.jl]

➢ PQ paper: H. Jégou et al., “Product quantization for nearest neighbor search,” TPAMI 2011

➢ IVFADC + HNSW (1): M. Douze et al., “Link and code: Fast indexing with graphs and compact regression codes,” CVPR 2018

➢ IVFADC + HNSW (2): D. Baranchuk et al., “Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors,” ECCV 2018

Page 108: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

108

cheat-sheet for ANN in Python (as of 2020; everything below can be installed by conda or pip)
Note: Assuming 𝐷 ≅ 100. The size of the problem is determined by 𝐷𝑁. If 100 ≪ 𝐷, run PCA to reduce 𝐷 to 100.

[Flowchart, summarized: Start → Have GPU(s)? (Yes / No), then branch by the task and by 𝑁]
➢ Exact nearest neighbor search → faiss-gpu: linear scan (GpuIndexFlatL2) with a GPU; faiss-cpu: linear scan (IndexFlatL2) without
➢ About 10³ < 𝑁 < 10⁶ → nmslib (hnsw). Alternative: faiss.IndexHNSWFlat in faiss-cpu (same algorithm in different libraries). annoy and falconn cover the side cases labeled "If topk > 2048", "If slow, or out of memory", "Require fast data addition", "Would like to run from several processes", and "Would like to adjust the performance"
➢ About 10⁶ < 𝑁 < 10⁹ → with GPU(s): faiss-gpu: ivfpq (GpuIndexIVFPQ); if out of GPU memory, make 𝑀 smaller; (1) if still out of GPU memory, or (2) more accurate results are needed → faiss-cpu: hnsw + ivfpq (IndexHNSWFlat + IndexIVFPQ)
➢ About 10⁹ < 𝑁 → faiss-cpu: hnsw + ivfpq (IndexHNSWFlat + IndexIVFPQ). Would like to adjust the performance → adjust the IVF parameters: make nprobe larger ➡ higher accuracy but slower. If out of memory → adjust the PQ parameters: make 𝑀 smaller. Would like to run subset-search → rii
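As a concrete example of the 10³ < 𝑁 < 10⁶ route in the cheat-sheet above, a minimal nmslib (hnsw) usage could look like this; the data and the parameter values (M, efConstruction, efSearch) are illustrative assumptions.

import nmslib
import numpy as np

X = np.random.random((100000, 100)).astype(np.float32)   # database vectors
q = np.random.random(100).astype(np.float32)             # a query

index = nmslib.init(method='hnsw', space='l2')
index.addDataPointBatch(X)
index.createIndex({'M': 16, 'efConstruction': 100}, print_progress=False)
index.setQueryTimeParams({'efSearch': 64})

ids, dists = index.knnQuery(q, k=10)                      # top-10 neighbors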

Page 109: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

109

Benchmark 1: ann-benchmarks
➢ https://github.com/erikbern/ann-benchmarks
➢ Comprehensive and thorough benchmarks for various libraries. Docker-based
➢ The closer to the top right, the better
➢ As of June 2020, NMSLIB and NGT are competing with each other for the first place

Page 110: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

110

Benchmark 2: annbench
➢ https://github.com/matsui528/annbench
➢ Lightweight, easy-to-use

# Install libraries
pip install -r requirements.txt

# Download dataset on ./dataset
python download.py dataset=siftsmall

# Evaluate algos. Results are on ./output
python run.py dataset=siftsmall algo=annoy

# Visualize
python plot.py

# Multi-run by Hydra
python run.py --multirun dataset=siftsmall,sift1m algo=linear,annoy,ivfpq,hnsw

Page 111: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

ID    Img      Tag
1     (image)  "cat"
2     (image)  "bird"
⋮
125   (image)  "zebra"
126   (image)  "elephant"
⋮

111

Search for a "subset"
(1) Tag-based search: tag == "zebra" → target IDs: [125, 223, 365, …]
(2) Image search with a query 𝒒 over the database vectors 𝒙𝑛 (n = 1, …, 𝑁), restricted to the target IDs

Subset-search: Y. Matsui+, "Reconfigurable Inverted Index", ACMMM 18
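Conceptually, subset search is "filter first, then search only the survivors". The sketch below shows the idea with a brute-force scan over the subset; this is illustration only and not the Rii API — Rii performs this over PQ codes with an inverted index so that subset search remains efficient at scale.

import numpy as np

rng = np.random.default_rng(0)
N, D = 10000, 128
X = rng.standard_normal((N, D)).astype(np.float32)
tags = rng.choice(["cat", "bird", "zebra", "elephant"], size=N)
q = rng.standard_normal(D).astype(np.float32)

target_ids = np.where(tags == "zebra")[0]        # (1) tag-based search -> target IDs
d = ((X[target_ids] - q) ** 2).sum(axis=1)       # (2) image search restricted to the subset
best = target_ids[int(np.argmin(d))]
print(best, tags[best])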

Page 112: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

112

Trillion-scale search: 𝑁 = 10¹² (1T)

Sense of scale
➢ K (= 10³): just a second on a local machine
➢ M (= 10⁶): all data can be on memory. Try several approaches
➢ G (= 10⁹): need to compress data by PQ. Only two datasets are available (SIFT1B, Deep1B)
➢ T (= 10¹²): cannot even imagine

https://github.com/facebookresearch/faiss/wiki/Indexing-1T-vectors
➢ Only in the Faiss wiki
➢ Distributed, mmap, etc.

A sparse matrix of 15 Exa elements?

Page 113: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

113

Nearest neighbor search engine: something like ANN + SQL

➢ Vearch: https://github.com/vearch/vearch
➢ Milvus: https://github.com/milvus-io/milvus
➢ Elasticsearch KNN (Open Distro): https://github.com/opendistro-for-elasticsearch/k-NN
➢ Vald: https://github.com/vdaas/vald
➢ The algorithm inside is faiss, nmslib, or NGT

Page 114: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

114

Problems of ANN
➢ No mathematical background.
  ✓ Only actual measurements matter: recall and runtime
  ✓ The ANN problem was mathematically defined 10+ years ago (LSH), but recently no one cares about the definition.

➢ Thus, when the score is high, the reason is not clear:
  ✓ Is the method good?
  ✓ Is the implementation good?
  ✓ Does it just happen to work well for the target dataset?
  ✓ E.g., the difference of math library (OpenBLAS vs Intel MKL) matters.

➢ If one can explain "why this approach works well for this dataset", it would be a great contribution to the field.

➢ Not enough datasets. Currently, only two datasets are available for billion-scale data: SIFT1B and Deep1B

Page 115: ACM Multimedia 2020 Tutorial on Effective and Efficient ...

115

[The "cheat-sheet for ANN in Python" flowchart from Page 108 is repeated here as the closing slide.]