Our Better Binary Quantization (BBQ) indices are now even better(er), with recall improvements across the board (over 20% in extreme cases) and a path to quantizing vectors to any bit size. As of Elasticsearch 8.18, BBQ indices are backed by our state-of-the-art optimized scalar quantization algorithm.
A Brief History of Scalar Quantization
Introduced in Elasticsearch 8.12, scalar quantization was initially a simple min/max quantization scheme. Per Lucene segment, we find the global quantile values for a given confidence interval and use them as the minimum and maximum when quantizing every vector. While this naive quantization is powerful, it only really works for whole-byte quantization.
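As a rough illustration of the min/max scheme described above, here is a minimal sketch in plain Python. The function names, the symmetric trimming of the confidence interval, and the default of 0.999 are assumptions made for this example; they are not taken from the Lucene implementation.

```python
def global_quantiles(vectors, confidence_interval=0.999):
    """Lower/upper quantiles over every component of every vector in a segment."""
    values = sorted(x for vec in vectors for x in vec)
    tail = (1.0 - confidence_interval) / 2.0  # trim the same fraction from each end
    lo_idx = int(tail * (len(values) - 1))
    hi_idx = int(round((1.0 - tail) * (len(values) - 1)))
    return values[lo_idx], values[hi_idx]


def quantize(vec, lower, upper, bits=8):
    """Clamp each component to [lower, upper] and snap it to one of 2**bits levels."""
    levels = (1 << bits) - 1
    step = (upper - lower) / levels
    return [round((min(max(x, lower), upper) - lower) / step) for x in vec]


def dequantize(codes, lower, upper, bits=8):
    """Map integer codes back to approximate float components."""
    levels = (1 << bits) - 1
    step = (upper - lower) / levels
    return [lower + c * step for c in codes]


# A toy "segment" of three vectors; every vector shares the same two quantiles.
segment = [[0.1, -0.4, 0.9], [0.3, 0.2, -0.8], [-0.5, 0.7, 0.0]]
lower, upper = global_quantiles(segment, confidence_interval=1.0)
codes = quantize(segment[0], lower, upper)
approx = dequantize(codes, lower, upper)
```

Because a single pair of quantiles has to serve every vector in the segment, the grid gets coarse quickly as the number of bits drops, which is the limitation the later steps address.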
In Elasticsearch 8.15, we added half-byte, or int4, quantization. To achieve this with high recall, we added an optimization step that calculates the best quantiles dynamically, so there are no more static confidence intervals. Lucene calculates the best global upper and lower quantiles for each segment, achieving an 8x reduction in memory utilization over float32 vectors.
Finally, now in 8.18, we have added locally optimized scalar quantization. It optimizes the quantiles for each individual vector, allowing for exceptional recall at any bit size, even single-bit quantization.
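The memory figures follow from simple arithmetic on bits per component. The sketch below counts only the packed components and ignores the small per-vector overhead for quantiles and correction terms; the helper name and the 768-dimension example are illustrative choices.

```python
def bytes_per_vector(dims, bits):
    """Bytes needed for the packed components alone, rounded up to whole bytes."""
    return (dims * bits + 7) // 8


dims = 768
float32_bytes = bytes_per_vector(dims, 32)
for bits in (8, 4, 1):
    packed = bytes_per_vector(dims, bits)
    print(f"{bits}-bit: {packed} bytes per vector, "
          f"{float32_bytes // packed}x smaller than float32")
```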
What is Optimized Scalar Quantization?
For an in-depth explanation of the math and intuition behind optimized scalar quantization, check out our blog post on Optimized Scalar Quantization. There are three main takeaways from this work:
- Each vector is centered on the Apache Lucene segment's centroid. This allows us to make better use of the possible quantized vectors to represent the dataset as a whole.
- Every vector is individually quantized with a unique set of optimized quantiles.
- Asymmetric quantization is used, allowing for higher recall with the same memory footprint.
In short, when quantizing each vector:
- We center the vector on the centroid
- Run a limited number of iterations to find the optimal quantiles, stopping early if the quantiles are unchanged or the error (loss) increases
- Pack the resulting quantized vectors
- Store the packed vector, its quantiles, the sum of its components, and an extra error correction term
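The four steps above can be sketched roughly as follows. This is a simplified illustration, not the Lucene implementation: it starts from the vector's own min/max, uses plain squared reconstruction error as the loss, and refits the interval by least squares, all of which are assumptions made for this example. It also assumes the stored "sum of components" refers to the quantized components and uses the Euclidean correction term. All function and key names are hypothetical.

```python
def snap_to_grid(values, lower, upper, bits):
    """Clamp to [lower, upper] and round each value to the nearest of 2**bits levels."""
    levels = (1 << bits) - 1
    if upper <= lower:
        return [0] * len(values)
    step = (upper - lower) / levels
    return [round((min(max(x, lower), upper) - lower) / step) for x in values]


def reconstruction_loss(values, codes, lower, upper, bits):
    """Squared error between the values and their dequantized codes."""
    levels = (1 << bits) - 1
    return sum((x - (lower + (upper - lower) * c / levels)) ** 2
               for x, c in zip(values, codes))


def refit_interval(values, codes, bits):
    """Least-squares (lower, upper) for fixed codes, or None if the fit is degenerate."""
    levels = (1 << bits) - 1
    s = [c / levels for c in codes]
    daa = sum((1 - si) ** 2 for si in s)
    dab = sum((1 - si) * si for si in s)
    dbb = sum(si * si for si in s)
    ra = sum(x * (1 - si) for x, si in zip(values, s))
    rb = sum(x * si for x, si in zip(values, s))
    det = daa * dbb - dab * dab
    if abs(det) < 1e-12:
        return None
    return (ra * dbb - rb * dab) / det, (rb * daa - ra * dab) / det


def pack(codes, bits):
    """Pack integer codes into bytes, `bits` bits per code, most significant first."""
    acc = 0
    for c in codes:
        acc = (acc << bits) | c
    return acc.to_bytes((len(codes) * bits + 7) // 8, "big")


def quantize_vector(vec, centroid, bits=1, max_iters=5):
    # Step 1: center the vector on the segment centroid.
    centered = [x - c for x, c in zip(vec, centroid)]
    # Step 2: a few refinement iterations over the per-vector quantiles.
    lower, upper = min(centered), max(centered)
    codes = snap_to_grid(centered, lower, upper, bits)
    loss = reconstruction_loss(centered, codes, lower, upper, bits)
    for _ in range(max_iters):
        fitted = refit_interval(centered, codes, bits)
        if fitted is None:
            break
        new_lower, new_upper = fitted
        if abs(new_lower - lower) < 1e-9 and abs(new_upper - upper) < 1e-9:
            break  # quantiles unchanged
        new_codes = snap_to_grid(centered, new_lower, new_upper, bits)
        new_loss = reconstruction_loss(centered, new_codes, new_lower, new_upper, bits)
        if new_loss > loss:
            break  # loss increased, keep the previous quantiles
        lower, upper, codes, loss = new_lower, new_upper, new_codes, new_loss
    # Steps 3 and 4: pack the codes and keep the per-vector metadata.
    return {
        "codes": codes,
        "packed": pack(codes, bits),
        "quantiles": (lower, upper),
        "component_sum": sum(codes),
        "correction": sum(x * x for x in centered),  # Euclidean case
    }


stored = quantize_vector([0.9, -0.2, 0.4, -0.7], centroid=[0.1, 0.1, 0.1, 0.1], bits=1)
```

The same loop works for any `bits` value, which is the property that makes per-vector quantiles attractive for very low bit counts.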
Storage and Retrieval
The storage and retrieval of optimized scalar quantization vectors are similar to BBQ. The main difference is the particular values we store.
One piece of nuance is the correction term. For Euclidean distance, we store the squared norm of the centered vector. For dot product, we store the dot product between the centroid and the uncentered vector.
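One way to see why these particular values are useful: both similarities can be rebuilt from the dot product of the centered vectors plus terms that are known per document or per query. The sketch below checks that algebra with exact floats. In practice the centered dot product would be estimated from the quantized vectors, and the way Lucene actually arranges the computation may differ. All names here are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def centered(v, centroid):
    return [a - c for a, c in zip(v, centroid)]


def squared_euclidean(centered_dot, query_sq_norm, stored_sq_norm):
    """||q - v||^2 from centered quantities; the centroid cancels in q - v."""
    return query_sq_norm + stored_sq_norm - 2.0 * centered_dot


def dot_product(centered_dot, query_dot_centroid, stored_centroid_dot):
    """q . v = (q - c) . (v - c) + (q - c) . c + c . v"""
    return centered_dot + query_dot_centroid + stored_centroid_dot


centroid = [0.2, -0.1, 0.4]
doc = [0.9, 0.3, -0.5]
query = [0.1, 0.8, 0.6]

doc_c, query_c = centered(doc, centroid), centered(query, centroid)
centered_dot = dot(query_c, doc_c)  # estimated from quantized vectors in practice

# Per-document correction terms, computed once at index time.
stored_sq_norm = dot(doc_c, doc_c)        # Euclidean: squared norm of the centered vector
stored_centroid_dot = dot(centroid, doc)  # dot product: centroid . uncentered vector

# Per-query terms, computed once per query.
l2 = squared_euclidean(centered_dot, dot(query_c, query_c), stored_sq_norm)
ip = dot_product(centered_dot, dot(query_c, centroid), stored_centroid_dot)
```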
Performance
Enough talk. Here are the results from four datasets.
- Cohere's 768-dimensional multilingual embeddings. This is a well-distributed inner-product dataset.
- Cohere's 1024-dimensional multilingual embeddings. This embedding model is well optimized for quantization.
- E5-Small-v2 quantized over the Quora dataset. This model typically does poorly with binary quantization.
- The GIST-1M dataset. This scientific dataset opens some interesting edge cases for inner-product and quantization.
Here are the results for Recall@10|50:
Dataset | BBQ | BBQ with OSQ | Improvement |
---|---|---|---|
Cohere 768 | 0.933 | 0.938 | 0.5% |
Cohere 1024 | 0.932 | 0.945 | 1.3% |
E5-Small-v2 | 0.972 | 0.975 | 0.3% |
GIST-1M | 0.740 | 0.989 | 24.9% |
Across the board, we see that BBQ backed by our new optimized scalar quantization (OSQ) improves recall, and dramatically so for the GIST-1M dataset.
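For readers unfamiliar with the notation, one common reading of Recall@10|50 is "recall of the true top 10 when 50 candidates are gathered with the quantized scores and then reranked with exact scores". The sketch below computes the metric under that assumption. The function name and the score-list inputs are hypothetical and this is not the benchmark harness used for the numbers above.

```python
def recall_at_k_with_n_candidates(exact_scores, approx_scores, k=10, n=50):
    """Fraction of the true top-k found after reranking the approximate top-n."""
    ids = range(len(exact_scores))
    true_top_k = set(sorted(ids, key=lambda i: exact_scores[i], reverse=True)[:k])
    # Gather n candidates using the approximate (quantized) scores.
    candidates = sorted(ids, key=lambda i: approx_scores[i], reverse=True)[:n]
    # Rerank only those candidates with the exact scores and keep the best k.
    reranked = sorted(candidates, key=lambda i: exact_scores[i], reverse=True)[:k]
    return len(true_top_k.intersection(reranked)) / k


# Toy example: document 0 is the true best match but is badly underestimated.
exact = [10.0, 9.0, 8.0, 7.0, 6.0, 5.0]
approx = [1.0, 6.0, 5.0, 4.0, 3.0, 2.0]
narrow = recall_at_k_with_n_candidates(exact, approx, k=2, n=3)
wide = recall_at_k_with_n_candidates(exact, approx, k=2, n=6)
```

Gathering more candidates can rescue documents whose quantized score was too low, so a better quantizer shows up as higher recall at the same candidate count.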
But what about indexing times? Surely all these per-vector optimizations must add up. The answer is no.
Here are the indexing times for the same datasets.
Dataset | BBQ | BBQ with OSQ | Difference |
---|---|---|---|
Cohere 768 | 368.62s | 372.95s | +1% |
Cohere 1024 | 307.09s | 314.08s | +2% |
E5-Small-v2 | 227.37s | 229.83s | +1% |
GIST-1M | 1300.03s* | 297.13s | -77% |
\* Since the original BBQ quantization methodology works so poorly over GIST-1M when using inner-product, it takes an exceptionally long time to build the HNSW graph, as the vector distances are not well distinguished.
Conclusion
Not only does this new, state-of-the-art quantization methodology improve recall for our BBQ indices, it also unlocks future optimizations. We can now quantize vectors to any bit size, and we want to explore how to provide 2-bit quantization, striking a balance between memory utilization and recall with no reranking.
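If you want to experiment, a BBQ-backed field is selected through the `index_options` of a `dense_vector` mapping. The snippet below only builds and prints a request body for index creation. The field name `embedding`, the dimension count, and the choice of `cosine` similarity are placeholders for illustration, so check the `dense_vector` documentation for your Elasticsearch version before relying on it.

```python
import json

# Request body for creating an index whose vectors use BBQ with HNSW.
# "embedding", 768 and "cosine" are placeholder choices for this example.
index_body = {
    "mappings": {
        "properties": {
            "embedding": {
                "type": "dense_vector",
                "dims": 768,
                "index": True,
                "similarity": "cosine",
                "index_options": {"type": "bbq_hnsw"},
            }
        }
    }
}

print(json.dumps(index_body, indent=2))
```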