Robust Optimized Scalar Quantization

We discuss a sparse preconditioner that can be applied to vectors to give more stable quantization performance across data distributions

Recently, we introduced optimized scalar quantization (OSQ) and contributed it to Lucene. This powers our implementation of Better Binary Quantization (BBQ). As we continue on our journey to make BBQ the default format for all our vector indices, we’ve been exploring its robustness across a wider range of data characteristics.

As we discussed before, many high quality learned embeddings share a similar characteristic: their components are normally distributed. However, classic approximate nearest neighbor (ANN) benchmark datasets often do not share this characteristic.

In this blog, we explore:

  1. The distribution characteristics of a variety of classic ANN benchmark datasets,
  2. How they perform with vanilla OSQ, and
  3. The effect of a preconditioner on retrieval robustness.

We propose a sparse variant of the random orthogonal preconditioner introduced by RaBitQ, which achieves similar quality, but at a fraction of the cost.

Datasets

In this study we use a subset of the datasets included in the popular ann-benchmarks suite:

  1. GIST, which contains 960-dimensional handcrafted features for image recognition,
  2. MNIST and Fashion-MNIST, which are 28×28 images whose grayscale pixel values comprise the 784-dimensional vectors,
  3. GloVe, which comprises a collection of unsupervised word vector embeddings, and
  4. SIFT, which contains 128-dimensional handcrafted image descriptors.

We also include some embedding model based datasets to understand what impact, if any, our changes have on them:

  1. e5-small embeddings of the Quora duplicate questions dataset, and
  2. Cohere v2 embeddings of a portion of the Wikipedia dataset.

These have diverse characteristics. Some of their high-level features are summarized in the following table.

Dataset          Dimension   Document count   Metric
GIST             960         1,000,000        Euclidean
MNIST            784         60,000           Euclidean
Fashion-MNIST    784         60,000           Euclidean
GloVe            100         1,183,514        Cosine
SIFT             128         1,000,000        Euclidean
e5-small         384         522,931          Cosine
Cohere v2        768         934,024          MIP

The figures below show sample component distributions.

Preconditioning

RaBitQ proposes random orthogonal preconditioning as an initial step for constructing quantized vectors. An orthogonal transformation of a vector corresponds to a change of basis. It is convenient in the context of quantization schemes because it doesn’t change the similarity metric. For example, for any orthogonal matrix P the dot product is unchanged since

$(Px)^t(Py) = x^t P^t P y = x^t I y = x^t y$

The same holds for cosine similarity and Euclidean distance. So the transform can simply be applied to each vector before quantizing it, and no further changes are needed to either the index or the query procedure. For our purposes, another important property of this transformation is that it makes the component distributions more normally distributed. This is a consequence of the central limit theorem. Specifically, the $i^{\text{th}}$ component of the transformed vector is the sum $\sum_j P_{ij} x_j$, where the $P_{ij}$ are approximately independent random variables. It is easy to sample such a matrix: first sample $d^2$ elements independently from the standard normal distribution and then apply modified Gram-Schmidt orthogonalization to the resulting matrix (which is almost surely full rank). The figure below shows the impact of this transform on the component distribution when it’s applied to the Fashion-MNIST dataset.
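
To make this concrete, here is a minimal numpy sketch (an illustration, not the Lucene implementation; the helper name random_orthogonal is ours) that samples an orthogonal matrix this way and checks that it leaves the dot product unchanged:

```python
import numpy as np

def random_orthogonal(d, rng):
    # Sample d^2 independent standard normal entries, then orthogonalize the
    # rows with modified Gram-Schmidt; the result is an orthogonal matrix.
    A = rng.standard_normal((d, d))
    Q = np.zeros_like(A)
    for i in range(d):
        q = A[i].copy()
        for j in range(i):
            q -= (Q[j] @ q) * Q[j]   # remove the projection onto earlier rows
        Q[i] = q / np.linalg.norm(q)
    return Q

rng = np.random.default_rng(0)
P = random_orthogonal(16, rng)
x, y = rng.standard_normal(16), rng.standard_normal(16)
# Orthogonality preserves the dot product (and hence cosine and Euclidean distance).
assert np.allclose((P @ x) @ (P @ y), x @ y)
```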

This property suits OSQ because our initialization procedure assumes that the components are normally distributed. The resulting transformation is effective, but it comes with some overheads:

  1. The preconditioning matrix contains $d^2$ floating point values, and
  2. Applying the transformation to the full dataset requires $nd^2$ floating point multiplications.

For example, if the vectors have 1024 dimensions, we’d require 4MB storage per preconditioning matrix and perform more than 1M multiplications per document we add to the index.

We therefore asked, “do we really need a dense orthogonal matrix to achieve our goal?” For high dimensional vectors we might reasonably expect that we can mix fewer components and still obtain component distributions that are fairly normal. One way to achieve this is to use a block diagonal matrix, as illustrated schematically below.

If we use a block diagonal matrix whose blocks are $32 \times 32$, we drop the memory requirement for a $1024$ dimensional preconditioner to $4 \times 32 \times 32 \times (1024 / 32) / (1024 \times 1024) = 0.125\text{MB}$ and we reduce the indexing overhead by a factor of 32. This is pretty appealing.
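
As a sketch of how this might look (the helper apply_block_diagonal and the use of a QR factorization to generate each orthogonal block are our illustrative choices, not the production code), storing the preconditioner as a list of $32 \times 32$ blocks and applying it slice by slice costs $d \times 32$ multiplications per vector instead of $d^2$:

```python
import numpy as np

def apply_block_diagonal(blocks, x):
    # Apply each small orthogonal block to its own slice of the vector.
    out = np.empty_like(x)
    start = 0
    for B in blocks:
        end = start + B.shape[0]
        out[start:end] = B @ x[start:end]
        start = end
    return out

d, b = 1024, 32
rng = np.random.default_rng(0)
# One random orthogonal block per group of 32 components (Q factor of a Gaussian matrix).
blocks = [np.linalg.qr(rng.standard_normal((b, b)))[0] for _ in range(d // b)]

x = rng.standard_normal(d)
tx = apply_block_diagonal(blocks, x)
# Storage: (1024 / 32) blocks x 32 x 32 floats x 4 bytes = 0.125 MB, versus 4 MB dense.
# Work per vector: 32 blocks x 32 x 32 = 32,768 multiplications, versus ~1M dense.
```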

Although the vanilla block diagonal preconditioner performs fairly well, it gives up some of the recall gains we achieve with a dense preconditioner. Furthermore, the choice of a block diagonal matrix is arbitrary: it mixes components that happen to be adjacent to each other in the vector. It seems undesirable that the effectiveness of our scheme might change under an arbitrary permutation of the vector components. Ideally we would use a procedure which yields a deterministic choice of which components to mix.

To do this we drew inspiration from optimized product quantization, which seeks to balance variance between the groups of components used to build each codebook. In a similar vein, we use a greedy algorithm to assign components to blocks with the aim of balancing their variance, as follows.

for $j \in \{1, 2, \ldots, \text{blocks}\}$
    $v_j \leftarrow 0$
    $b_j \leftarrow \emptyset$

for each component $i$ in descending variance order
    $j \leftarrow \arg\min_k v_k$
    Add $i$ to $b_j$
    if $|b_j| < d_j$
        Add the variance of the $i^{\text{th}}$ component to $v_j$
    else
        $v_j \leftarrow \infty$
Here, $d$ is the vector dimension and $d_j$ is the $j^{\text{th}}$ block dimension. Our intuition is that this should maximize the homogeneity of component distributions and that this is beneficial because we use the same quantization interval for each component. This intuition is borne out by the results. We achieve slightly better quality than the dense transformation at a fraction of the cost.
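
For concreteness, here is a small Python version of the pseudocode above (the function name assign_components is ours for illustration). It visits components in descending variance order and assigns each to the block whose accumulated variance is currently smallest, closing a block once it is full:

```python
import numpy as np

def assign_components(variances, block_dims):
    # Greedily assign components (in descending variance order) to the block
    # with the smallest accumulated variance, balancing variance across blocks.
    v = np.zeros(len(block_dims))          # accumulated variance per block
    blocks = [[] for _ in block_dims]
    for i in np.argsort(variances)[::-1]:  # descending variance order
        j = int(np.argmin(v))
        blocks[j].append(int(i))
        if len(blocks[j]) < block_dims[j]:
            v[j] += variances[i]
        else:
            v[j] = np.inf                  # block is full, never pick it again
    return blocks

# e.g. assign 8 components to two blocks of 4 from per-component data variances
variances = np.array([5.0, 0.1, 3.0, 2.5, 0.2, 4.0, 1.0, 0.3])
print(assign_components(variances, [4, 4]))
```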

Results

As is typical, we evaluate recall by reranking a collection of candidate vectors larger than the number of required candidates. The following table breaks down recall@10 at five distinct reranking depths for the most challenging vector distributions in our test set, with and without preconditioning. It also compares dense preconditioning with the optimized block diagonal preconditioning described above, using a block size of 32. Averaged across datasets and reranking depths, we see a 31% increase in recall using block diagonal preconditioning for these datasets compared to the baseline. It’s also clear that the block diagonal preconditioner with balanced variance is on a par with the dense preconditioner.

Dataset          Metric          Baseline   Dense preconditioner   Sparse preconditioner
Fashion-MNIST    recall@10|10    0.444      0.712                  0.710
                 recall@10|20    0.629      0.911                  0.911
                 recall@10|30    0.730      0.964                  0.966
                 recall@10|40    0.792      0.983                  0.984
                 recall@10|50    0.833      0.990                  0.992
MNIST            recall@10|10    0.544      0.795                  0.787
                 recall@10|20    0.719      0.970                  0.967
                 recall@10|30    0.800      0.994                  0.993
                 recall@10|40    0.848      0.998                  0.998
                 recall@10|50    0.880      0.999                  0.999
GIST             recall@10|10    0.314      0.558                  0.545
                 recall@10|20    0.438      0.746                  0.739
                 recall@10|30    0.521      0.837                  0.825
                 recall@10|40    0.581      0.884                  0.878
                 recall@10|50    0.625      0.916                  0.911
SIFT             recall@10|10    0.243      0.272                  0.264
                 recall@10|20    0.351      0.395                  0.382
                 recall@10|30    0.425      0.477                  0.461
                 recall@10|40    0.478      0.537                  0.519
                 recall@10|50    0.522      0.584                  0.565

Recall @10 for various reranking depths using raw OSQ and preconditioned OSQ (higher is better)

For completeness we show the average recall@10 across the five reranking depths for the baseline and block diagonal preconditioning across all the test datasets in the figure below. This shows that preconditioning is essentially a no-op on well behaved datasets.


In summary, the block diagonal preconditioner increases robustness with modest overhead.

Conclusion

Our SIMD accelerated binary quantization scheme has enabled us to achieve fantastic index efficiency and query recall-latency tradeoffs. Our goal is to make this work easily consumable by everyone.

In this post we discussed a lightweight preconditioner we’re planning to introduce to the backing quantization scheme as we move towards enabling BBQ as the default format for all vector indices. This improves its robustness against pathological vector distributions at a small index time overhead.

