Billion-scale Approximate Nearest Neighbor Search
ACM Multimedia 2020 Tutorial on Effective and Efficient: Toward Open-world Instance Re-identification
Yusuke Matsui, The University of Tokyo
Nearest Neighbor Search; NN

• N D-dim database vectors: {x_n}_{n=1}^{N}, x_n ∈ R^D
• Given a query q ∈ R^D, find the closest vector from the database:
  argmin_{n ∈ {1, 2, …, N}} ||q − x_n||_2^2
• One of the fundamental problems in computer science
• Solution: linear scan, O(ND), slow

[Figure: a query q (e.g., [0.23, 3.15, 0.65, 1.43]) is compared against the database vectors x_1, …, x_N (e.g., [0.20, 3.25, 0.72, 1.68]); the search returns the nearest one, here x_74]
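The linear scan above is easy to write down. A minimal NumPy sketch (the function name and the toy data are illustrative, not from the tutorial):

```python
import numpy as np

def nn_linear_scan(q, X):
    """Exact NN by linear scan: argmin_n ||q - x_n||_2^2, cost O(ND)."""
    dists = np.sum((X - q) ** 2, axis=1)  # squared L2 to every x_n
    n = int(np.argmin(dists))
    return n, float(dists[n])

# Toy example with 4-dim vectors, as in the figure
X = np.array([[0.23, 3.15, 0.65, 1.43],
              [5.00, 5.00, 5.00, 5.00]])
q = np.array([0.20, 3.25, 0.72, 1.68])
n, d = nn_linear_scan(q, X)  # the first vector is the closest
```

This is exactly the O(ND) baseline the slide names: no index, just one pass over the data.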
Approximate Nearest Neighbor Search; ANN

• Faster search
• Don't necessarily have to be exact neighbors
• Trade-off: runtime, accuracy, and memory consumption
• A sense of scale: billion-scale data on memory
  – e.g., 10^6 to 10^9 vectors of about 100 dims on 32 GB RAM, roughly 10 ms per query
NN/ANN for CV

• Image retrieval
  https://about.mercari.com/press/news/article/20190318_image_search/
• Clustering
  https://jp.mathworks.com/help/vision/ug/image-classification-with-bag-of-visual-words.html
• kNN recognition
  https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
• Originally: fast construction of bag-of-features
• One of the benchmarks is still SIFT
Person Re-identification
Cheat-sheet for ANN in Python (as of 2020; everything can be installed by conda or pip)

Note: assuming D ≈ 100. The size of the problem is determined by N·D. If 100 ≪ D, run PCA to reduce D to 100.

• About 10^3 < N < 10^6: exact nearest neighbor search
  – Have GPU(s)? Yes → faiss-gpu: linear scan (GpuIndexFlatL2); if topk > 2048, use the CPU version
  – No → faiss-cpu: linear scan (IndexFlatL2)
• About 10^6 < N < 10^9: nmslib (hnsw)
  – Alternative: faiss.IndexHNSWFlat in faiss-cpu; same algorithm in different libraries
  – If slow, or out of memory → annoy
  – Require fast data addition → falconn
  – Would like to run from several processes → annoy
  – Would like to run subset-search → rii
• About 10^9 < N:
  – Have GPU(s)? Yes → faiss-gpu: ivfpq (GpuIndexIVFPQ)
    If out of GPU-memory, make M smaller
    (1) If still out of GPU-memory, or (2) need more accurate results → switch to faiss-cpu
  – No → faiss-cpu: hnsw + ivfpq (IndexHNSWFlat + IndexIVFPQ)
    If out of memory → adjust the PQ parameters: make M smaller
    Would like to adjust the performance → adjust the IVF parameters: make nprobe larger ⇒ higher accuracy but slower
Part 1: Nearest Neighbor Search
Part 2: Approximate Nearest Neighbor Search
Part 1: Nearest Neighbor Search
• Should try this first of all
• Introduce a naïve implementation
• Introduce a fast implementation
  – Faiss library from FAIR (you'll see it many times today; CPU & GPU)
• Experience the drastic difference between the two implementations
Task: Given q ∈ Q and x ∈ X, compute ||q − x||_2^2
• M D-dim query vectors: Q = {q_1, q_2, …, q_M}
• N D-dim database vectors: X = {x_1, x_2, …, x_N}
• M ≪ N

Naïve impl. (parallelize the query side; select the min by a heap, omitted here):

parfor q in Q:
    for x in X:
        l2sqr(q, x)

def l2sqr(q, x):
    diff = 0.0
    for d in range(D):
        diff += (q[d] - x[d]) ** 2
    return diff

faiss impl.:

if N < 20:
    compute ||q − x||_2^2 by SIMD
else:
    compute ||q − x||_2^2 = ||q||_2^2 − 2 q^T x + ||x||_2^2 by BLAS
||x − y||_2^2 by SIMD (faiss; variables renamed for the sake of explanation). Running example: D = 31, float: 32 bit.

Reference (naïve) implementation:

def l2sqr(x, y):
    diff = 0.0
    for d in range(D):
        diff += (x[d] - y[d]) ** 2
    return diff

float fvec_L2sqr (const float *x, const float *y, size_t d)
{
    __m256 msum1 = _mm256_setzero_ps();

    while (d >= 8) {
        __m256 mx = _mm256_loadu_ps (x); x += 8;
        __m256 my = _mm256_loadu_ps (y); y += 8;
        const __m256 a_m_b1 = mx - my;
        msum1 += a_m_b1 * a_m_b1;
        d -= 8;
    }

    __m128 msum2 = _mm256_extractf128_ps(msum1, 1);
    msum2 += _mm256_extractf128_ps(msum1, 0);

    if (d >= 4) {
        __m128 mx = _mm_loadu_ps (x); x += 4;
        __m128 my = _mm_loadu_ps (y); y += 4;
        const __m128 a_m_b1 = mx - my;
        msum2 += a_m_b1 * a_m_b1;
        d -= 4;
    }

    if (d > 0) {
        __m128 mx = masked_read (d, x);
        __m128 my = masked_read (d, y);
        __m128 a_m_b1 = mx - my;
        msum2 += a_m_b1 * a_m_b1;
    }

    msum2 = _mm_hadd_ps (msum2, msum2);
    msum2 = _mm_hadd_ps (msum2, msum2);
    return _mm_cvtss_f32 (msum2);
}
Walkthrough of the SIMD code (D = 31, float: 32 bit):
• msum1 is a 256-bit SIMD register, so it processes eight floats at once
• Main loop (d >= 8): load eight floats of x and y into mx and my, compute the difference a_m_b1 = mx − my, square it, and accumulate it into msum1
• After the loop, the two 128-bit halves of msum1 are extracted and summed into msum2, a 128-bit SIMD register
• If at least four floats remain (d >= 4), they are processed the same way with 128-bit loads
• The rest (d < 4) is loaded with masked_read, which zero-pads the register, and accumulated as well
• Finally, two horizontal adds (_mm_hadd_ps) reduce msum2 to a single float: the result

• The SIMD codes of faiss are simple and easy to read
• Being able to read SIMD code comes in handy sometimes, e.g., to see why this implementation is super fast
• Another example of a SIMD L2sqr, from HNSW:
  https://github.com/nmslib/hnswlib/blob/master/hnswlib/space_l2.h
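The control flow of fvec_L2sqr (an 8-wide main loop, a 4-wide step, then a zero-padded tail) can be mimicked in plain NumPy. This is only a didactic sketch of the chunking, not the faiss code; the function name is made up here:

```python
import numpy as np

def l2sqr_chunked(x, y):
    """Mimic the SIMD loop structure: eight floats at a time, then four, then the rest."""
    x, y = np.asarray(x, dtype=np.float32), np.asarray(y, dtype=np.float32)
    d, i = len(x), 0
    acc8 = np.zeros(8, dtype=np.float32)   # plays the role of msum1 (256-bit register)
    while d - i >= 8:                      # main loop: eight floats at once
        diff = x[i:i+8] - y[i:i+8]
        acc8 += diff * diff
        i += 8
    acc4 = acc8[:4] + acc8[4:]             # msum2: fold the two 128-bit halves
    if d - i >= 4:                         # one 4-wide step if possible
        diff = x[i:i+4] - y[i:i+4]
        acc4 += diff * diff
        i += 4
    if d - i > 0:                          # tail: zero-padded load, like masked_read
        diff = np.zeros(4, dtype=np.float32)
        diff[:d-i] = x[i:] - y[i:]
        acc4 += diff * diff
    return float(acc4.sum())               # the final horizontal add

x = np.arange(31, dtype=np.float32)        # D = 31, as in the walkthrough: 8+8+8 + 4 + 3
y = np.ones(31, dtype=np.float32)
assert abs(l2sqr_chunked(x, y) - float(((x - y) ** 2).sum())) < 1e-3
```

D = 31 exercises all three branches: three 8-wide iterations, one 4-wide step, and a 3-element tail.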
Compute ||q − x||_2^2 = ||q||_2^2 − 2 q^T x + ||x||_2^2 with BLAS

Stack the M D-dim query vectors into a D×M matrix: Q = [q_1, q_2, …, q_M] ∈ R^{D×M}
Stack the N D-dim database vectors into a D×N matrix: X = [x_1, x_2, …, x_N] ∈ R^{D×N}

# Compute tables
q_norms = norms(Q)    # ||q_1||_2^2, …, ||q_M||_2^2 (SIMD-accelerated function)
x_norms = norms(X)    # ||x_1||_2^2, …, ||x_N||_2^2
ip = sgemm_(Q, X, …)  # Q^T X

# Scan and sum
parfor m in range(M):
    for n in range(N):
        dist = q_norms[m] + x_norms[n] - 2 * ip[m][n]

• Matrix multiplication by BLAS; dominant if M and N are large
• The BLAS backend matters: Intel MKL is 30% faster than OpenBLAS
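The same expansion can be checked with NumPy, where the matrix product Q^T X stands in for the sgemm_ call. A sketch under the slide's D×M / D×N layout, not the faiss code:

```python
import numpy as np

D, M, N = 16, 3, 100
rng = np.random.default_rng(42)
Q = rng.random((D, M)).astype(np.float32)  # M query vectors as columns
X = rng.random((D, N)).astype(np.float32)  # N database vectors as columns

q_norms = (Q ** 2).sum(axis=0)             # ||q_m||^2 for each m
x_norms = (X ** 2).sum(axis=0)             # ||x_n||^2 for each n
ip = Q.T @ X                               # Q^T X: the BLAS-dominated part

# M x N matrix of squared distances via the norm expansion
dist = q_norms[:, None] + x_norms[None, :] - 2.0 * ip

# Check against the direct pairwise computation
direct = ((Q.T[:, None, :] - X.T[None, :, :]) ** 2).sum(axis=2)
assert np.allclose(dist, direct, atol=1e-3)
```

All the per-pair work collapses into one matrix multiplication plus two precomputed norm tables, which is why this path wins for large M and N.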
NN on GPU (faiss-gpu) is 10x faster than NN on CPU (faiss-cpu)
• NN-GPU always computes ||q||_2^2 − 2 q^T x + ||x||_2^2
• k-means for 1M vectors (D=256, K=20000):
  – 11 min on CPU
  – 55 sec on 1 Pascal-class P100 GPU (float32 math)
  – 34 sec on 1 Pascal-class P100 GPU (float16 math)
  – 21 sec on 4 Pascal-class P100 GPUs (float32 math)
  – 16 sec on 4 Pascal-class P100 GPUs (float16 math)
• If a GPU is available and its memory is large enough, try GPU-NN
• The behavior is a little bit different (e.g., a restriction on top-k)
Benchmark: https://github.com/facebookresearch/faiss/wiki/Low-level-benchmarks
Reference
• Switching the implementation of L2sqr in faiss:
  https://github.com/facebookresearch/faiss/wiki/Implementation-notes#matrix-multiplication-to-do-many-l2-distance-computations
• Introduction to SIMD: a lecture by Markus Püschel (ETH), "How to Write Fast Numerical Code" (Spring 2019), especially the "SIMD vector instructions" slides
  – https://acl.inf.ethz.ch/teaching/fastcode/2019/
  – https://acl.inf.ethz.ch/teaching/fastcode/2019/slides/07-simd.pdf
• SIMD code of faiss:
  https://github.com/facebookresearch/faiss/blob/master/utils/distances_simd.cpp
• L2sqr benchmark including AVX512 for faiss-L2sqr:
  https://gist.github.com/matsui528/583925f88fcb08240319030202588c74
Part 2: Approximate Nearest Neighbor Search
Overview of ANN methods, by database size N:
• About N = 10^6 (million-scale): Locality Sensitive Hashing (LSH); Tree / Space Partitioning; Graph traversal
• About N = 10^9 (billion-scale): Inverted index + data compression

Two building blocks:
• Space partition: k-means, PQ/OPQ, graph traversal, etc.
• Data compression: raw data, scalar quantization, PQ/OPQ, etc.
  – Hamming-based codes (e.g., [0, 1, 0, 0]) → linear scan by Hamming distance
  – Look-up-based codes (e.g., ID: 2, ID: 123) → linear scan by asymmetric distance
• For raw data (e.g., [0.34, 0.22, 0.68, 0.71]): accuracy ☺, memory ☹
• For compressed data: accuracy ☹, memory ☺
Locality Sensitive Hashing (LSH)
• LSH = hash functions + hash tables
• Map similar items to the same symbol with a high probability

Record: hash each database vector (e.g., x_13) with Hash 1, Hash 2, …, and store its ID in the matched bucket of each table
Search: hash the query q in the same way, collect the stored candidates, and compare q with x_4, x_5, x_21, … by the Euclidean distance

E.g., random projection [Datar+, SCG 04]:
  H(x) = [h_1(x), …, h_B(x)]^T
  h_i(x) = ⌊(a^T x + b) / W⌋
☺
• Math-friendly
• Popular in the theory area (FOCS, STOC, …)
☹
• Large memory cost
  – Need several tables to boost the accuracy
  – Need to store the original data {x_n}_{n=1}^{N} on memory
• Data-dependent methods such as PQ are better for real-world data
• Thus, in recent CV papers, LSH has been treated as a classic method
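The random-projection hash above, h(x) = ⌊(aᵀx + b)/W⌋, takes only a few lines of NumPy. A minimal sketch: W and B are the slide's symbols; the class name, seed, and toy data are illustrative:

```python
import numpy as np

class RandomProjectionLSH:
    """H(x) = [h_1(x), ..., h_B(x)], with h_i(x) = floor((a_i^T x + b_i) / W)."""
    def __init__(self, D, B=8, W=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(size=(B, D))       # p-stable (Gaussian) projection vectors
        self.b = rng.uniform(0.0, W, size=B)   # random offsets in [0, W)
        self.W = W

    def hash(self, x):
        # One bucket key per table entry; a tuple so it can index a Python dict
        return tuple(np.floor((self.A @ x + self.b) / self.W).astype(int))

lsh = RandomProjectionLSH(D=4)
x1 = np.array([0.34, 0.22, 0.68, 0.71])
x2 = x1 + 0.01                 # a near-duplicate tends to land in the same bucket
h1, h2 = lsh.hash(x1), lsh.hash(x2)
```

In a real LSH index, each key would go into a hash table, and the search would union the buckets of several tables before the final Euclidean re-ranking, as in the Record/Search diagram above.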
In fact:
• Consider the next candidate bucket ⇒ practical memory consumption (Multi-Probe [Lv+, VLDB 07])
• A library based on this idea: FALCONN
• As before, compare q with the collected candidates x_4, x_5, x_21, … by the Euclidean distance
FALCONN
https://github.com/falconn-lib/falconn (★852)
$> pip install FALCONN

table = falconn.LSHIndex(params_cp)
table.setup(X - center)
query_object = table.construct_query_object()
# query parameter config here
query_object.find_nearest_neighbor(q - center, topk)

☺ Faster data addition (than annoy, nmslib, ivfpq)
☺ Useful for on-the-fly addition
☹ Parameter configuration seems a bit non-intuitive
Reference
• Good summaries on this field: CVPR 2014 Tutorial on Large-Scale Visual Recognition, Part I: Efficient Matching, H. Jégou
  https://sites.google.com/site/lsvrtutorialcvpr14/home/efficient-matching
• Practical Q&A: FAQ in the FALCONN wiki
  https://github.com/FALCONN-LIB/FALCONN/wiki/FAQ
• Hash functions: M. Datar et al., "Locality-Sensitive Hashing Scheme Based on p-Stable Distributions," SCG 2004
• Multi-Probe: Q. Lv et al., "Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search," VLDB 2007
• Survey: A. Andoni and P. Indyk, "Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions," Comm. ACM 2008
FLANN: Fast Library for Approximate Nearest Neighbors
https://github.com/mariusmuja/flann
• Automatically selects "Randomized KD Tree" or "k-means Tree"
☺ Good code base; implemented in OpenCV and PCL
☺ Very popular in the late 00's and early 10's
☹ Large memory consumption; the original data need to be stored
☹ Not actively maintained now
[Figure: Randomized KD Tree / k-means Tree. Images are from [Muja and Lowe, TPAMI 2014]]
Annoy
"2-means tree" + "multiple trees" + "shared priority queue"
• Record: select two points randomly, divide up the space, and repeat hierarchically
• Search: focus on the cell the query lives in and compare the distances
  ⇒ can traverse the tree with a logarithmic number of comparisons
• Feature 1: if we need more data points, use a priority queue
• Feature 2: boost the accuracy by multiple trees with a shared priority queue
All images are cited from the author's blog post (https://erikbern.com/2015/10/01/nearest-neighbors-and-vector-models-part-2-how-to-search-in-high-dimensional-spaces.html)
Annoy
https://github.com/erikbern/annoy (★7.1K)
$> pip install annoy

t = AnnoyIndex(D)
for n, x in enumerate(X):
    t.add_item(n, x)
t.build(n_trees)
t.get_nns_by_vector(q, topk)

☺ Developed at Spotify. Well-maintained, stable
☺ Simple interface with only a few parameters
☺ Baseline for million-scale data
☺ Supports mmap, i.e., the index can be accessed from several processes
☹ Large memory consumption
☹ Runtime itself is slower than HNSW
Graph traversal
• Very popular in recent years
• Around 2017, it turned out that graph-traversal-based methods work well for million-scale data
• Pioneers:
  – Navigable Small World Graphs (NSW)
  – Hierarchical NSW (HNSW)
• Implementations: nmslib, hnswlib, faiss
Record (images are from [Malkov+, Information Systems, 2013])
• Each node is a database vector, e.g., x_13 in the graph of x_1, …, x_90
• Given a new database vector (e.g., x_91), create new edges to its neighbors
• Early links can be long
• Such long links encourage a large hop, making the search converge fast
Search (images are from [Malkov+, Information Systems, 2013])
• Given a query vector
• Start from a random point
• From the connected nodes, find the closest one to the query
• Traverse in a greedy manner
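The greedy traversal just described can be sketched on a plain adjacency-list graph. This is a toy version, not the NSW/HNSW implementation; it assumes a prebuilt neighbor graph and stops at the first local minimum:

```python
import numpy as np

def greedy_search(q, X, graph, start):
    """Greedy graph traversal: hop to the neighbor closest to q until none improves."""
    cur = start
    cur_d = float(np.sum((X[cur] - q) ** 2))
    while True:
        # From the connected nodes, find the closest one to the query
        best, best_d = cur, cur_d
        for nb in graph[cur]:
            d = float(np.sum((X[nb] - q) ** 2))
            if d < best_d:
                best, best_d = nb, d
        if best == cur:          # no neighbor is closer: a local minimum, stop
            return cur, cur_d
        cur, cur_d = best, best_d

# Toy data: five 1-D points on a line, chained 0-1-2-3-4
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
node, dist = greedy_search(np.array([3.2]), X, graph, start=0)
```

Getting stuck in a local minimum is exactly what NSW's long links and HNSW's hierarchy are designed to mitigate.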
Extension: Hierarchical NSW (HNSW) [Malkov and Yashunin, TPAMI, 2019]
• Construct the graph hierarchically
• This structure works pretty well for real-world data
• Search on a coarse graph → move to the same node on a finer graph → repeat
NMSLIB (Non-Metric Space Library)
https://github.com/nmslib/nmslib (★2k)
$> pip install nmslib

index = nmslib.init(method="hnsw")
index.addDataPointBatch(X)
index.createIndex(params1)
index.setQueryTimeParams(params2)
index.knnQuery(q, topk)

☺ The "hnsw" is the best method as of 2020 for million-scale data
☺ Simple interface
☺ If memory consumption is not a problem, try this
☹ Large memory consumption
☹ Data addition is not fast
Other implementations of HNSW
• hnswlib: https://github.com/nmslib/hnswlib
  – Spin-off library from nmslib
  – Includes only hnsw
  – Simpler; may be useful if you want to extend hnsw
• Faiss: https://github.com/facebookresearch/faiss
  – Library for PQ-based methods (introduced later); also includes hnsw

Other graph-based approaches
• From Alibaba: C. Fu et al., "Fast Approximate Nearest Neighbor Search with the Navigating Spreading-out Graph," VLDB 19
  https://github.com/ZJULearning/nsg
• From Microsoft Research Asia; used inside Bing: J. Wang and S. Lin, "Query-Driven Iterated Neighborhood Graph Search for Large Scale Indexing," ACMMM 12 (this seems to be the backbone paper)
  https://github.com/microsoft/SPTAG
• From Yahoo Japan; competing with NMSLIB for 1st place on the benchmark: M. Iwasaki and D. Miyazaki, "Optimization of Indexing Based on k-Nearest Neighbor Graph for Proximity Search in High-dimensional Data," arXiv 18
  https://github.com/yahoojapan/NGT
Reference
• The original paper of Navigable Small World Graphs: Y. Malkov et al., "Approximate Nearest Neighbor Algorithm Based on Navigable Small World Graphs," Information Systems 2013
• The original paper of Hierarchical Navigable Small World Graphs: Y. Malkov and D. Yashunin, "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs," IEEE TPAMI 2019
Basic idea
• Need 4ND bytes to represent N real-valued D-dim vectors with floats
• If N or D is too large, the data cannot be kept in memory
  ➡ E.g., 512 GB for D = 128, N = 10⁹
• Convert each vector to a short code
• Short codes are designed to be memory-efficient
  ➡ E.g., 4 GB for the above example, with 32-bit codes
• Run the search over the short codes
[Figure: N D-dim vectors are converted to N short codes]

What kind of conversion is preferred?
1. The "distance" between two codes can be calculated (e.g., Hamming distance)
2. The distance can be computed quickly
3. The distance approximates the distance between the original vectors (e.g., L2)
4. A sufficiently short code length can achieve the above three criteria
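The memory figures above are easy to sanity-check (plain Python; the numbers are taken from the slide):

```python
# Memory for N D-dim float32 vectors vs. N short codes of `bits` bits each.
def float_bytes(N, D):
    return 4 * N * D          # 4 bytes per float32 value

def code_bytes(N, bits):
    return N * bits // 8      # bits -> bytes

# D = 128, N = 10^9: 512 GB as floats, 4 GB as 32-bit codes
print(float_bytes(10**9, 128))  # 512000000000 (512 GB)
print(code_bytes(10**9, 32))    # 4000000000 (4 GB)
```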
Hamming-based
• Convert x to a B-bit binary vector: f(x) = b ∈ {0, 1}^B
• Hamming distance d_H(b1, b2) = popcount(b1 ⊕ b2) ∼ d(x1, x2)
• A lot of methods:
  ➡ J. Wang et al., "Learning to Hash for Indexing Big Data — A Survey", Proc. IEEE 2015
  ➡ J. Wang et al., "A Survey on Learning to Hash", TPAMI 2018
• Not the main scope of this tutorial; PQ is usually more accurate
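Concretely, the Hamming distance between two codes is one XOR plus a popcount (plain Python; the example codes are made up):

```python
# Hamming distance between two B-bit binary codes stored as Python ints:
# XOR leaves a 1 exactly where the codes differ; then count the set bits.
def hamming(b1: int, b2: int) -> int:
    return bin(b1 ^ b2).count("1")

b1 = 0b10110010  # hypothetical 8-bit codes
b2 = 0b00110110
print(hamming(b1, b2))  # 2
```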
Product Quantization; PQ [Jégou, TPAMI 2011]
• Split a D-dim vector x into M sub-vectors, and quantize each sub-vector
• Codebook: 256 sub-codewords per sub-space (ID: 1, …, 256), trained beforehand by k-means on training data
• The vector becomes a PQ code x̄, e.g., (2, 123, 87): the IDs of the nearest sub-codewords
• Simple
• Memory efficient
• Distance can be estimated
Bar notation for PQ codes in this tutorial: x ∈ ℝ^D ↦ x̄ ∈ {1, …, 256}^M
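The encoding step can be sketched in a few lines of NumPy (illustration only — the codebooks here are random stand-ins for k-means output, and IDs are 0-based rather than the 1-based IDs on the slides; see nanopq or faiss for proper implementations):

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, Ks = 8, 4, 256        # dim, number of sub-spaces, sub-codewords per sub-space
Ds = D // M                 # dimension of each sub-vector
codebooks = rng.random((M, Ks, Ds)).astype(np.float32)  # stand-in for k-means output

def pq_encode(x):
    """Assign each sub-vector of x to the ID of its nearest sub-codeword."""
    code = np.empty(M, dtype=np.uint8)
    for m in range(M):
        sub = x[m * Ds:(m + 1) * Ds]
        code[m] = np.argmin(((codebooks[m] - sub) ** 2).sum(axis=1))
    return code

x = rng.random(D).astype(np.float32)
print(pq_encode(x))  # a 4-byte code representing an 8-dim float vector
```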
Product Quantization: Memory efficient
• Original vector: 32-bit floats — e.g., D = 128 ➡ 128 × 32 = 4096 [bit]
• PQ code: 8-bit uchars — e.g., M = 8 ➡ 8 × 8 = 64 [bit]
➡ 1/64 of the original memory cost
Product Quantization: Distance estimation
• Query q ∈ ℝ^D; database vectors x1, x2, …, xN are product-quantized to PQ codes x̄1, x̄2, …, x̄N ∈ {1, …, 256}^M
• d(q, x)² can be efficiently approximated by the asymmetric distance d_A(q, x̄)²
• Lookup trick: look up pre-computed distance tables
• Linear scan by d_A over all PQ codes
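The lookup trick can be sketched as follows (NumPy; random codebooks and codes stand in for trained ones, as in the toy setting above):

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, Ks, N = 8, 4, 256, 1000
Ds = D // M
codebooks = rng.random((M, Ks, Ds)).astype(np.float32)  # stand-in for trained codebooks
codes = rng.integers(0, Ks, size=(N, M))                # stand-in for database PQ codes
q = rng.random(D).astype(np.float32)

# Distance table: dtable[m, k] = squared distance from the m-th sub-vector
# of q to the k-th sub-codeword. Computed once per query: M * Ks entries.
dtable = np.stack([
    ((codebooks[m] - q[m * Ds:(m + 1) * Ds]) ** 2).sum(axis=1)
    for m in range(M)
])

# Asymmetric distance of every code: M table look-ups + a sum per vector,
# instead of D floating-point subtractions/multiplications.
dists = dtable[np.arange(M), codes].sum(axis=1)
print(int(np.argmin(dists)))  # ID of the approximate nearest neighbor
```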
• PQ takes only tens of lines of Python (actual code, not pseudocode)
• Pure Python library: nanopq — https://github.com/matsui528/nanopq
• pip install nanopq
Deep PQ
• T. Yu et al., "Product Quantization Network for Fast Image Retrieval", ECCV 2018 / IJCV 2020
• L. Yu et al., "Generative Adversarial Product Quantisation", ACMMM 2018
• B. Klein et al., "End-to-End Supervised Product Quantization for Image Search and Retrieval", CVPR 2019
• Supervised search (unlike the original PQ): base CNN + PQ-like layer + some loss
• Needs class information
(Figure from T. Yu et al., "Product Quantization Network for Fast Image Retrieval", ECCV 2018)
More extensive surveys of PQ
• https://github.com/facebookresearch/faiss/wiki#research-foundations-of-faiss
• http://yusukematsui.me/project/survey_pq/survey_pq_jp.html
• Y. Matsui, Y. Uchida, H. Jégou, S. Satoh, "A Survey of Product Quantization", ITE 2018
Hamming-based vs Look-up-based

                 | Hamming-based                          | Look-up-based
Representation   | binary code: {0, 1}^B                  | PQ code: {1, …, 256}^M
Distance         | Hamming distance                       | asymmetric distance
Approximation    | good                                   | better
Runtime          | faster                                 | fast
Pros             | no auxiliary structure                 | can reconstruct the original vector
Cons             | cannot reconstruct the original vector | requires an auxiliary structure (the codebook)
Inverted index + PQ: Recap the notation
• Product quantization maps a vector x ∈ ℝ^D to a PQ code x̄ ∈ {1, …, 256}^M (bar notation = PQ code)
• Suppose q, x ∈ ℝ^D, where x is quantized to x̄
• d(q, x)² can be efficiently approximated using only x̄ (not the original vector x): d(q, x)² ∼ d_A(q, x̄)²
Inverted index + PQ: Record
• Prepare a coarse quantizer ➡ split the space into K sub-spaces
  ➡ {c_k} (k = 1, …, K) are created by running k-means on training data
[Figure: Voronoi partition with coarse centroids c1, …, c7 and posting lists k = 1, 2, …, K]
• c2 is the closest centroid to x1
• Compute the residual r1 between x1 and c2: r1 = x1 − c2
• Quantize r1 to r̄1 by PQ
• Record it in the posting list for k = 2 together with its ID, "1"; i.e., record the pair (n, r̄n)
Inverted index + PQ: Record
• For all database vectors, record [ID + PQ(residual)] in the posting lists
[Figure: the coarse quantizer c1, …, c7 with its posting lists k = 1, …, K, each holding (ID, PQ code) pairs]
Inverted index + PQ: Search
Find the nearest vector to the query q:
• c2 is the closest centroid to q
• Compute the residual: r_q = q − c2
• For all (n, r̄n) in the posting list for k = 2, compare r̄n with r_q:
  d(q, xn)² = d(q − c2, xn − c2)² = d(r_q, rn)² ∼ d_A(r_q, r̄n)²
• Find the smallest one (several strategies)
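Putting record and search together, a toy end-to-end IVF+PQ can be sketched in NumPy (illustration only — the centroids and codebooks are random stand-ins for k-means output, and only the query's own cell is probed; a real system would use faiss's IndexIVFPQ):

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, M, Ks, N = 8, 4, 4, 16, 200
Ds = D // M
X = rng.random((N, D)).astype(np.float32)
coarse = rng.random((K, D)).astype(np.float32)          # coarse centroids c_1..c_K
codebooks = rng.random((M, Ks, Ds)).astype(np.float32)  # PQ sub-codebooks

def encode(r):
    """PQ-encode a residual vector."""
    return np.array([np.argmin(((codebooks[m] - r[m*Ds:(m+1)*Ds]) ** 2).sum(1))
                     for m in range(M)])

# Record: store (ID, PQ code of the residual) in the list of the nearest centroid
posting_lists = [[] for _ in range(K)]
for n, x in enumerate(X):
    k = int(np.argmin(((coarse - x) ** 2).sum(1)))
    posting_lists[k].append((n, encode(x - coarse[k])))

def search(q):
    """Probe the query's cell and scan it by asymmetric distance on residuals."""
    k = int(np.argmin(((coarse - q) ** 2).sum(1)))
    rq = q - coarse[k]
    dtable = np.stack([((codebooks[m] - rq[m*Ds:(m+1)*Ds]) ** 2).sum(1)
                       for m in range(M)])
    ids, codes = zip(*posting_lists[k])
    dists = dtable[np.arange(M), np.vstack(codes)].sum(1)
    return ids[int(np.argmin(dists))]

print(search(X[3]))  # the query's own cell contains vector 3, so 3 is a likely answer
```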
Faiss — https://github.com/facebookresearch/faiss
$> conda install faiss-cpu -c pytorch
$> conda install faiss-gpu -c pytorch
• From the original authors of PQ and a GPU expert (FAIR)
• CPU version: all PQ-based methods
• GPU version: some PQ-based methods
• Bonus: exact NN (not ANN) is also implemented, and is quite fast; k-means (CPU/GPU) is fast as well
• Benchmark of k-means: https://github.com/DwangoMediaVillage/pqkmeans/blob/master/tutorial/4_comparison_to_faiss.ipynb
quantizer = faiss.IndexFlatL2(D)  # coarse quantizer: simple linear scan
index = faiss.IndexIVFPQ(quantizer, D, nlist, M, nbits)  # nbits is usually 8
index.train(Xt)                    # train
index.add(X)                       # record data
index.nprobe = nprobe              # search parameter
dist, ids = index.search(Q, topk)  # search
[Figure: the coarse quantizer assigns q to one of the K posting lists, which is then scanned]
Coarse quantizer
[Figure: IVF+PQ with a linear-scan coarse quantizer; the query q is compared against all K coarse centroids (its nearest ones here are c6, c3, c13)]
quantizer = faiss.IndexHNSWFlat(D, hnsw_m)  # coarse quantizer: HNSW instead of linear scan
index = faiss.IndexIVFPQ(quantizer, D, nlist, M, nbits)  # nbits is usually 8
• Switch the coarse quantizer from linear scan to HNSW
• The best approach for billion-scale data as of 2020
• The backbone of [Douze+, CVPR 2018] and [Baranchuk+, ECCV 2018]
Pros:
• From the original authors of PQ; extremely efficient (theory & implementation)
• Used in real-world products (Mercari, etc.)
• For billion-scale data, Faiss is the best option
• Especially fast for large-batch search, i.e., when the number of queries is large
Cons:
• Lack of documentation (especially for the Python bindings)
• Hard for a novice user to select a suitable algorithm
• As of 2020, anaconda is required; pip is not officially supported
Reference
• Faiss wiki: https://github.com/facebookresearch/faiss/wiki
• Faiss tips: https://github.com/matsui528/faiss_tips
• Julia implementation of look-up-based methods: https://github.com/una-dinosauria/Rayuela.jl
• PQ paper: H. Jégou et al., "Product Quantization for Nearest Neighbor Search," TPAMI 2011
• IVFADC + HNSW (1): M. Douze et al., "Link and Code: Fast Indexing with Graphs and Compact Regression Codes," CVPR 2018
• IVFADC + HNSW (2): D. Baranchuk et al., "Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors," ECCV 2018
[Flowchart] Cheat sheet for ANN in Python (as of 2020; everything can be installed by conda or pip)
Note: assuming D ≈ 100. The size of the problem is determined by DN; if 100 ≪ D, run PCA to reduce D to 100.
• Exact nearest neighbor search: faiss linear scan — with GPU(s), faiss-gpu (GpuIndexFlatL2); without GPU(s), or if topk > 2048, faiss-cpu (IndexFlatL2)
• About 10³ < N < 10⁶: nmslib (hnsw); the same algorithm is available as faiss.IndexHNSWFlat in faiss-cpu. If it is slow or runs out of memory, or you need fast data addition, want to run from several processes, or would like to adjust the performance: annoy or falconn
• About 10⁶ < N < 10⁹: with GPU(s), faiss-gpu ivfpq (GpuIndexIVFPQ); if out of GPU memory, make M smaller. If still out of GPU memory, if more accurate results are needed, or without GPU(s): faiss-cpu hnsw + ivfpq (IndexHNSWFlat + IndexIVFPQ). To adjust the performance: make nprobe larger ➡ higher accuracy but slower; if out of memory, make M smaller. To run subset-search: rii
• About 10⁹ < N: faiss-cpu hnsw + ivfpq
Benchmark 1: ann-benchmarks
• https://github.com/erikbern/ann-benchmarks
• Comprehensive and thorough benchmarks of various libraries; Docker-based
• Top-right is better
• As of June 2020, NMSLIB and NGT are competing for first place
Benchmark 2: annbench
• https://github.com/matsui528/annbench
• Lightweight, easy to use

# Install libraries
pip install -r requirements.txt
# Download a dataset onto ./dataset
python download.py dataset=siftsmall
# Evaluate algorithms; results go to ./output
python run.py dataset=siftsmall algo=annoy
# Visualize
python plot.py
# Multi-run by Hydra
python run.py --multirun dataset=siftsmall,sift1m algo=linear,annoy,ivfpq,hnsw
Subset-search — Y. Matsui+, "Reconfigurable Inverted Index", ACMMM 2018

ID  | Img | Tag
1   | …   | "cat"
2   | …   | "bird"
⋮
125 | …   | "zebra"
126 | …   | "elephant"
⋮

Search for a "subset" of the database {x_n} (n = 1, …, N):
(1) Tag-based search, e.g., tag == "zebra" ➡ target IDs: [125, 223, 365, …]
(2) Image search with a query q, restricted to the target IDs
Trillion-scale search: N = 10¹² (1T)
Sense of scale:
• K (= 10³): finishes in a second on a local machine
• M (= 10⁶): all data fits in memory; try several approaches
• G (= 10⁹): need to compress the data by PQ; only two public datasets (SIFT1B, Deep1B)
• T (= 10¹²): cannot even imagine
• Discussed only in the Faiss wiki: https://github.com/facebookresearch/faiss/wiki/Indexing-1T-vectors — distributed search, mmap, etc.
• A sparse matrix of 15 exa elements?
Nearest neighbor search engines: something like ANN + SQL
• Vearch: https://github.com/vearch/vearch
• Milvus: https://github.com/milvus-io/milvus
• Elasticsearch KNN: https://github.com/opendistro-for-elasticsearch/k-NN
• Vald: https://github.com/vdaas/vald
• The algorithm inside is faiss, nmslib, or NGT
Problems of ANN
• No mathematical background
  ➡ Only actual measurements matter: recall and runtime
  ➡ The ANN problem was mathematically defined 10+ years ago (LSH), but recently no one cares about the definition
• Thus, when a score is high, the reason is unclear:
  ➡ The method is good? The implementation is good? It just happens to work well on the target dataset?
  ➡ E.g., the choice of math library (OpenBLAS vs Intel MKL) matters
• Explaining "why this approach works well on this dataset" would be a great contribution to the field
• Not enough datasets: currently only two are available at billion scale, SIFT1B and Deep1B