Understanding optimized scalar quantization

In this post we explain a new form of scalar quantization we've developed at Elastic that achieves state-of-the-art accuracy for binary quantization

At Elastic we continue to work on innovations that improve the cost and performance of our vector indices. In this post we’re going to discuss in detail the implementation and intuition for a new version of scalar quantization we’ve been working on which we’re calling optimized scalar quantization or OSQ. This provides a further enhancement to our BBQ index format. Indeed, we set new state-of-the-art performance for 32×\times compression.

It also allows us to get very significant improvements in accuracy compared to naive scalar quantization approaches while retaining similar index compression and query performance. We plan to roll out 2, 4 and 7 bit quantization schemes and indeed unify all our index compression options to use this same underlying technique. Furthermore, we anticipate that with 7 bit quantization we should be able to discard the raw floating point vectors and plan to evaluate this thoroughly.

Measuring success

For any compression scheme we need to worry about its impact on the query behavior. For nearest neighbor queries there is a metric that much prior art focuses on. This is the recall at kk. Specifically, one measures the fraction of true nearest neighbor vectors that a query returns.

In fact, we already exploit the relaxation that we don't find the true nearest neighbors, even when we search with uncompressed vectors in Lucene. Linear complexity is unavoidable if one wants to find the true nearest neighbors in a high dimensional vector space. The data structure that Lucene uses to index vector data is an HNSW graph, which allows it to run approximate nearest neighbor queries significantly faster than a linear scan.

In this blog we study a metric that we’ll abbreviate as recall@knk|n, which is the recall at kk retrieving nkn\geq k candidates and reranking them using the uncompressed vector distance calculation. Note that if n=kn=k then it's simply equal to the recall at kk. We also look at the quality of the approximation of the uncompressed vector distances.

We’ll discuss all this in more detail when we discuss our results.

Motivation

Scalar quantization, as we’ve discussed before, is an approach which allows one to represent floating point vectors by integer vectors. There are two reasons to do this:

  1. It allows you to reduce the amount of memory needed to represent a vector, by using fewer than 32 bits per integer, and
  2. It allows you to accelerate the distance calculation between vectors, which is the bottleneck for performing nearest neighbor queries in vector databases.

Recently, a paper, which we’ve discussed before, proposed an approach called RaBitQ that is able to achieve good recall using only 1 bit per component. This is exciting because 32×\times compression is competitive with Product Quantization (PQ), which was the previous state-of-the-art approach when compression was paramount. A key advantage of RaBitQ is that it’s relatively straightforward to accelerate distance calculations with SIMD operations. Certainly, when it is compared to PQ that uses lookups to compute distance, for which it is much harder to exploit hardware parallelism. The authors performed extensive experiments and showed they were able to achieve consistently higher recall as a function of query latency than PQ using the same compression ratio with an IVF index.

Although the RaBitQ approach is conceptually rather different to scalar quantization, we were inspired to re-evaluate whether similar performance could be unlocked for scalar quantization. In our companion piece we will discuss the result of integrating OSQ with HNSW and specifically how it compares to the baseline BBQ quantization scheme, which is inspired by RaBitQ. As an incentive to keep reading this blog we note that we were able to achieve systematically higher recall in this setting, sometimes by as much as 30%. Another advantage of OSQ is it immediately generalizes to using nn bits per component. For example, we show below that we’re able to achieve excellent recall in all cases we tested using minimal or even no reranking with only a few bits per component.

This post is somewhat involved. We will take you through step-by-step the innovations we’ve introduced and explain some of the intuition at each step.

Dot products are enough

In the following we discuss only the dot product, which would translate to a MIP query in Elasticsearch. In fact, this is sufficient because well known reductions exist to convert the other metrics Elasticsearch supports, Euclidean and cosine, to computing a dot product. For the Euclidean distance we note that

yx2=x2+y22xty\|y-x\|^2 = \|x\|^2 + \|y\|^2 - 2x^ty

and also that x\|x\| and y\|y\| can be precomputed and cached, so we need only compute the dot product ytxy^t x in order to compare two vectors. For cosine we simply need to normalize vectors and then

ytxyx=ytx\frac{y^t x}{\|y\|\|x\|} = y^t x

Scalar quantization refresher

Scalar quantization is typically defined by the follow componentwise equation

x^i=round(2n1ba(clamp(xi,a,b)a))\hat{x}_i = \text{round}\left(\frac{2^n-1}{b-a} \left(\text{clamp}(x_i, a, b) - a\right)\right)

Here, clamp(,a,b)=max(min(,b),a)\text{clamp}(\cdot, a, b) = \max(\min(\cdot, b), a). The quantized vector x^\hat{x} is integer valued with values less than or equal to 2n12^n-1, that is we’re using nn bits to represent each component. The interval [a,b][a,b] is called the quantization interval.

We also define the quantized approximation of the original floating point vector, which is our best estimate of the floating point vector after we’ve applied quantization. We never directly work with this quantity, but it is convenient to describe the overall approach. This is defined as follows

xˉ=a1+ba2n1x^\bar{x} = a 1 + \frac{b-a}{2^n-1} \hat{x}

Here, 11 denotes a vector whose components are all equal to one. Note that we abuse notation here and below by using 11 to denote both a vector and scalar and we rely on the context to make the meaning clear.

The need for speed

A key requirement we identified for scalar quantization is that we can perform distance comparisons directly on integer vectors. Integer arithmetic operations are more compute efficient than floating point ones. Furthermore, higher throughput specialized hardware instructions exist for performing them in parallel. We’ve discussed how to achieve this in the context of scalar quantization before.

Below we show that as long as we cache a couple of corrective factors we can use a different quantization interval and number of bits for each vector we quantize and still compute the dot product using integer vectors. This is the initial step towards achieving OSQ.

First of all observe that our best estimate of the dot product between two quantized vectors is

ytx=(ay1+byay2ny1y^)t(ax1+bxax2nx1x^)y^t x = \left(a_y 1 + \frac{b_y-a_y}{2^{n_y} - 1}\hat{y}\right)^t \left(a_x 1 + \frac{b_x-a_x}{2^{n_x} - 1}\hat{x}\right)

Expanding we obtain

y^tx^=ayax1t1+byay2ny1ax1ty^+bxax2nx1ay1tx^+byay2ny1bxax2nx1y^tx^\hat{y}^t \hat{x} = a_y a_x 1^t 1 + \frac{b_y-a_y}{2^{n_y} - 1} a_x 1^t \hat{y} + \frac{b_x-a_x}{2^{n_x} - 1} a_y 1^t \hat{x} + \frac{b_y-a_y}{2^{n_y} - 1} \frac{b_x-a_x}{2^{n_x} - 1} \hat{y}^t \hat{x}

Focusing on the vector dot products, since these dominate the compute cost, we observe that 1t11^t 1 is just equal to the vector dimension, and 1ty^1^t\hat{y} and 1tx^1^t\hat{x} are the sums of the quantized query and document vector components, respectively. For the query this can be computed once upfront and for the documents these can be computed at index time and stored with the quantized vector. Therefore, we need only compute the integer vector dot product y^tx^\hat{y}^t\hat{x} per comparison.

The geometry of scalar quantization

To build a bit more intuition about OSQ we digress to understand more about how scalar quantization represents a vector. Observe that the set of all possible quantized vectors lie inside a cube centered on the point a+b21\frac{a+b}{2}1 with side length bab−a. If only 1 bit is being used then the possible vectors lie at the 2d2^d corners of the cube. Otherwise, the possible vectors lie on a regular grid with 2nd2^{nd} points.

By changing aa and bb, we are only able to expand or contract the cube or slide it along the line spanned by the 11 vector. In particular, suppose the vectors in the corpus are centered around some point mm. We can decompose this into a component parallel to the 11 vector and a component perpendicular to it as follows

m=11tdm+(I11td)mm = \frac{1 1^t}{d} m + \left(I - \frac{1 1^t}{d}\right) m

If we center the cube on the point at 1tmd1\frac{1^t m}{d}1 then it must still expand it to encompass the offset m1tmd1m-\frac{1^t m}{d}1 before it covers even the center of the data distribution. This is illustrated in the figure below for 2 dimensions.

Since the quantization errors will be proportional on average to the side length of the cube, this suggests that we want to minimize m1tmd1\| m-\frac{1^t m}{d}1 \|. An easy way to do this is to center the query and document vectors before quantizing. We show below that if we do this we can still recover the dot product between the original vectors.

Note that

ytx=(ym+m)t(xm+m)=(ym)t(xm)+mty+mtxmtm\begin{align*} y^t x &= (y - m + m)^t (x - m + m) \\ &= (y - m)^t (x - m) + m^t y + m^t x - m^t m \end{align*}

The values mtym^t y and mtxm^t x, which are the inner product between the query and document vector and the centroid of the vectors in the corpus, can be precomputed and in the case of the document vector cached with its quantized representation. The quantity mtmm^t m is a global constant.

This means we only need to estimate (ym)t(xm)(y−m)^t(x−m) when comparing a query and document vector. In other words, we can quantize the centered vectors and recover an estimate of the actual dot product.

The distribution of centered vectors

So far we’ve shown that we can use a different bit count and a different quantization interval per vector. We next show how to exploit this to significantly improve the accuracy of scalar quantization. We propose an effective criterion and procedure for optimizing the choice of the constants aa and bb. However, before discussing this it is useful to see some examples of the component distribution in real centered embedding vectors.

We observe that the values are all fairly normally distributed and will use this observation to choose our initial quantization interval.

Looking at these plots one might guess it would be beneficial to scale components. Specifically, it seems natural to standardize the component distributions of the Cohere v2 and the gte-base-en-v1.5 embeddings. In general, this would amount to applying a scaling matrix to the document vectors before quantization as follows

Diag(σ)1(xm)\text{Diag}(\sigma)^{-1} (x - m)

Here, Diag(σ)\text{Diag}(\sigma) is a diagonal matrix whose diagonal entries are the standard deviations of the components for the corpus as a whole. We can apply this operation and still compute the dot product efficiently because we simply need to apply the inverse transformation to the query vector before quantizing it

(ym)t(xm)=(Diag(σ)(ym))t(Diag(σ)1(xm))(y - m)^t (x - m) = \left(\text{Diag}(\sigma) (y - m)\right)^t \left(\text{Diag}(\sigma)^{-1} (x - m)\right)

The effect is not symmetric because, as we’ll discuss below, we use a higher bit count for the query.

Referring back to our geometric picture this would amount to stretching the different edges of the cube. We tried this but found it didn’t measurably improve effectiveness once we optimize the interval directly for each vector so avoid the extra complexity.

Initializing the quantization interval

Let’s consider a natural criterion for setting a global quantization interval for normally distributed data. If we pick a vector, XX, at random from the corpus then the quantization error is given by

Xx(a,b)2=i=1d(Xi(a+ba2n1x^i))2\| X - x(a,b)\|^2 = \sum_{i=1}^d \left(X_i - \left(a+\frac{b-a}{2^n-1}\hat{x}_i\right)\right)^2

By assumption, and as we showed empirically is often the case, each component of XX is normally distributed or XiN(0,σi)X_i \sim N(0,\sigma_i). In such a scenario it is reasonable to minimize the expected square error. Specifically, we seek

a,b=argmina,b  EX[i=1d(Xi(a+ba2n1x^i))2]a^*,b^* = arg\min_{a,b} \; \mathbb{E}_{X} \left[ \sum_{i=1}^d \left(X_i - \left(a+\frac{b-a}{2^n-1}\hat{x}_i\right)\right)^2 \right]

Since the expectation is a linear operator it distributes over the summation and we can focus on a single term. Without loss of generality we can assume that XiX_i​ is a unit normal since we can always rescale the interval by the data standard deviation.

To compute this expectation we make use of the following quantity

I(x,c)=12πx(tc)2et2/2dt=12(c2+1)erf(x2)+12πex2/2(2cx)+constant\begin{align*} I(x,c) &= \frac{1}{\sqrt{2\pi}} \int^x (t-c)^2 e^{-t^2 / 2} dt \\ &= \frac{1}{2}\left(c^2+1\right) \text{erf}\left(\frac{x}{\sqrt{2}}\right) + \frac{1}{\sqrt{2\pi}} e^{-x^2/2} (2c-x) + \text{constant} \end{align*}

This is the expectation of the square error when rounding a normally distributed value to a fixed point cc expressed as an indefinite integral, alternatively, before we’ve determined the range the value can take.

In order to minimize the expected quantization error we should snap floating point values to their nearest grid point. This means we can express the expected square quantization error as follows

Errore(a,b  |  n)=  I(a+s2,a)I(,a)  +i=12n2I(a+2i+12s,a+is)I(a+2i12s,a+is)+I(,b)I(a+2n+132s,b)\begin{align*} Error_e\left(a,b\; \middle| \; n\right) =& \; I(a+\frac{s}{2}, a) - I(-\infty, a)\;+ \\ & \sum_{i=1}^{2^n-2} I\left(a+\frac{2i+1}{2}s, a+is\right) - I\left(a+\frac{2i-1}{2}s, a+is\right) + \\ & I(\infty, b) - I\left(a+\frac{2^{n+1}-3}{2}s, b\right) \end{align*}

Where we defined s=ba2n1s=\frac{b−a}{2^n−1}​. The integration limits are determined by the condition that we snap to the nearest grid point.

We now have a function, in terms of the interval endpoints aa and bb, for the expected square quantization error using a reasonable assumption about the vector components’ distribution. It is relatively straightforward to show that this quantity is minimized by an interval centered on the origin. This means that we need to search for the value of a single variable zz which minimizes Errore(z,z  |  n)Error_e\left(-z,z\; \middle| \; n\right). The figure below shows the error as a function of zz for various bit counts.

Optimizing this function numerically for various choices of nn gives the following quantization intervals

[a(1),b(1)]=[0.798,0.798][a(2),b(2)]=[1.493,1.493][a(3),b(3)]=[2.051,2.051][a(4),b(4)]=[2.514,2.514][a(7),b(7)]=[3.611,3.611]\begin{align*} \left[a_{(1)}, b_{(1)}\right] &= [-0.798, 0.798] \\ \left[a_{(2)}, b_{(2)}\right] &= [-1.493, 1.493] \\ \left[a_{(3)}, b_{(3)}\right] &= [-2.051, 2.051] \\ \left[a_{(4)}, b_{(4)}\right] &= [-2.514, 2.514] \\ \left[a_{(7)}, b_{(7)}\right] &= [-3.611, 3.611] \\ \end{align*}

We’re denoting the interval for nn bits [a(n),b(n)]\left[a_{(n)}, b_{(n)}\right].

Finally, we need to map these fixed intervals to the specific interval to use for a vector xx. To do this we shift by the mean of its components mxm_x​ and scale by their standard deviation σx\sigma_x​. It’s clear that we should always choose axmina\geq x_{min}​ and bxmaxb\leq x_{max}. Therefore, our initial estimate for the quantization interval for a vector xx using nn bits per component is

[max(mx+a(n)σx,xmin),min(mx+b(n)σx,xmax)]\left[\max\left(m_x+a_{(n)}\sigma_x, x_{min}\right), \min\left(m_x+b_{(n)}\sigma_x, x_{max}\right)\right]

Refining the quantization interval

The initialization scheme actually works surprisingly well. We present the results of quantizing using this approach alone as one of the ablation studies when we examine the performance of OSQ. However, we can do better.

It has been noted in the context of PQ that targeting minimum square quantization error is not actually the best criterion when you care about recall. In particular, you know you’re going to be running nearest neighbor queries on the corpus and the nearest neighbors of a query are very likely to be fairly parallel to the query vector. Let’s consider what this means.

Suppose we have query vector yy_{||} for which the document is relevant. Ignoring the quantization of the query vector, we can decompose the square of the error in the dot product into a component that is parallel to and a component which is perpendicular to the document vector as follows

(yt(xxˉ))2=((xxtx2y)txxtx2(xxˉ)+((Ixxtx2)y)t(Ixxtx2)(xxˉ))2\left(y_{||}^t (x - \bar{x})\right)^2 = \left(\left(\frac{x x^t}{\|x\|^2}y_{||}\right)^t \frac{x x^t}{{\|x\|^2}}(x - \bar{x}) + \left(\left(I - \frac{x x^t}{\|x\|^2}\right) y_{||}\right)^t \left(I - \frac{x x^t}{\|x\|^2}\right) (x - \bar{x}) \right)^2

Now we expect (Ixxtx2)yxxtx2y\left\|\left(I-\frac{x x^t}{\|x\|^2}\right) y_{||}\right\| \ll \left\|\frac{x x^t}{\|x\|^2}y_{||}\right\|. Furthermore, we would like to minimize the error in the dot product in this specific case, since this is when the document is relevant and our query should return it.

Let (Ixxtx2)y=λxxtx2y\left\|\left(I-\frac{x x^t}{\|x\|^2}\right) y_{||}\right\| = \sqrt{\lambda} \left\|\frac{x x^t}{\|x\|^2}y_{||}\right\| for some λ1\lambda \ll 1. We can bound our quantization error in the dot product as follows

(yt(xxˉ))2xxtx2y2xxtx2(xxˉ)+λ(Ixxtx2)(xxˉ)2\left(y_{||}^t (x - \bar{x})\right)^2 \leq \left\|\frac{x x^t}{\|x\|^2}y_{||}\right\|^2 \left\|\frac{x x^t}{\|x\|^2}(x - \bar{x})+\sqrt{\lambda} \left( I - \frac{x x^t}{\|x\|^2}\right)(x - \bar{x}) \right\|^2

Whatever way we quantize we can’t affect the quantity xxtx2y\left\|\frac{x x^t}{\|x\|^2}y_{||}\right\| so all we care about is minimizing the second factor. A little linear algebra shows that this is equal to

Error(a,b  |  λ)=(xˉ(a,b)x)t(xxtx2+λ(Ixxtx2))(xˉ(a,b)x)Error \left(a, b\; \middle| \; \lambda \right) = \left(\bar{x}(a,b) - x\right)^t \left( \frac{x x^t}{\|x\|^2} + \lambda \left(I-\frac{x x^t}{\|x\|^2}\right)\right) \left(\bar{x}(a,b) - x\right)

Here, we’ve made the dependence of this expression on the quantization interval explicit.

So far we’ve proposed a natural quantity to minimize in order to reduce the impact of quantization on MIP recall. We now turn our attention to how to efficiently minimize this quantity with respect to the interval [a,b][a,b].

There is a complication because the assignment of components of the vector to grid points depends on aa and bb while the optimal choice for aa and bb depends on this assignment. We use a coordinate descent approach, which alternates between computing the quantized vector x^\hat{x} while holding aa and bb fixed, and optimizing the quantization interval while holding x^\hat{x} fixed. This is described schematically as follows.

1     \;\; a0,b0max(mx+a(n)σx,xmin),min(mx+b(n)σx,xmax)a_0,b_0 \leftarrow \max\left(m_x+a_{(n)}\sigma_x, x_{min}\right), \min\left(m_x+b_{(n)}\sigma_x, x_{max}\right)
2     \;\; for k{1,2,...,rounds}k \in \{1,2,...,rounds\} do
3         \;\;\;\; compute x^(k)\hat{x}_{(k)} from ak1a_{k-1} and bk1b_{k-1}
4         \;\;\;\; ak,bkargmina,b  Error(a,b  |  λ,x^(k))a_k,b_k \leftarrow arg\min_{a,b}\; Error\left(a,b\; \middle| \; \lambda, \hat{x}_{(k)} \right)
5         \;\;\;\; if Error(ak,bk  |  λ,x^(k))>Error(ak1,bk1  |  λ,x^(k1))Error\left(a_k,b_k\; \middle| \; \lambda, \hat{x}_{(k)} \right) > Error\left(a_{k-1},b_{k-1}\; \middle| \; \lambda, \hat{x}_{(k-1)} \right) then break

First we’ll focus on line 3. The simplest approach to compute x^\hat{x} uses standard scalar quantization

x^(k),i=round(2n1bk1ak1(clamp(xi,ak1,bk1)ak1))\hat{x}_{(k), i} = \text{round}\left(\frac{2^n-1}{b_{k-1}-a_{k-1}} \left(\text{clamp}(x_i, a_{k-1}, b_{k-1}) - a_{k-1}\right)\right)

Specifically, this would amount to snapping each component of xx to the nearest grid point. In practice this does not minimize Error(a,b  |  λ)Error\left(a,b\; \middle| \; \lambda\right) as we illustrate below.

Unfortunately, we can’t just enumerate grid points and find the minimum error one since there are 2nd2^{nd} candidates; therefore, we tried the following heuristic:

  1. Snap to the nearest grid point,
  2. Coordinate-wise check if rounding in the other direction reduces the error.

This isn’t guaranteed to find the global optimum, which in fact isn’t guaranteed to be one of the corners of the grid square containing the floating point vector. However, in practice we found it meant the error almost never increased in an iteration of the loop over kk. By contrast, when snapping to the nearest grid point the loop frequently exits due to this condition.

The heuristic yields a small but systematic improvement in the brute force recall. On average it amounted to +0.3% compared to just snapping to the nearest grid point. Given the impact is so small we decided it wasn’t worth the extra complexity and increased runtime.

We now turn our attention to line 4. This expression decomposes as follows

Error(a,b  |  λ,x^)=1λx2(xt(xˉ(a,b)x))2+λ(xˉ(a,b)x)t(xˉ(a,b)x)Error\left(a,b\; \middle| \; \lambda, \hat{x} \right) = \frac{1-\lambda}{\|x\|^2} \left(x^t(\bar{x}(a,b)-x)\right)^2 + \lambda \left(\bar{x}(a,b)-x\right)^t \left(\bar{x}(a,b)-x\right)

It’s fairly easy to see that this is a convex quadratic form of aa and bb. This means it has a unique minimum where the partial derivatives w.r.t. aa and bb vanish. We won’t show the full calculation but give a flavor. For example, we can use the chain rule to help evaluate the first term

a(xt(xˉ(a,b)x))2=2xt(xˉ(a,b)x)xtxˉ(a,b)a\frac{\partial}{\partial a} \left(x^t(\bar{x}(a,b)-x)\right)^2 = 2 x^t(\bar{x}(a,b)-x) \frac{\partial x^t\bar{x}(a,b)}{\partial a}

then

xtxˉ(a,b)a=ai=1dxi(a+ba2n1x^i)=i=1dxi(112n1x^i)=i=1dxi(1si)\frac{\partial x^t\bar{x}(a,b)}{\partial a} = \frac{\partial}{\partial a} \sum_{i=1}^d x_i \left(a + \frac{b-a}{2^n-1}\hat{x}_i\right) = \sum_{i=1}^d x_i \left(1-\frac{1}{2^n-1}\hat{x}_i\right) = \sum_{i=1}^d x_i (1-s_i)

Where we’ve defined si=12n1x^is_i = \frac{1}{2^n-1}\hat{x}_i. The final result is that the optimal interval satisfies

[1λx2(ixi(1si))2+λi(1si)21λx2(ixi(1si))ixisi+λi(1si)si1λx2(ixi(1si))ixisi+λi(1si)si1λx2(ixisi)2+λisi2][ab]=[ixi(1si)ixisi]\footnotesize \left[ \begin{matrix} \frac{1-\lambda}{\|x\|^2}\left(\sum_i x_i(1-s_i)\right)^2+\lambda \sum_i (1-s_i)^2 & \frac{1-\lambda}{\|x\|^2}\left(\sum_i x_i(1-s_i)\right)\sum_i x_i s_i+\lambda \sum_i(1-s_i)s_i \\ \frac{1-\lambda}{\|x\|^2}\left(\sum_i x_i(1-s_i)\right)\sum_i x_i s_i+\lambda \sum_i(1-s_i)s_i & \frac{1-\lambda}{\|x\|^2}\left(\sum_i x_i s_i\right)^2 + \lambda \sum_i s_i^2 \end{matrix} \right] \left[\begin{matrix}a \\ b\end{matrix}\right]= \left[\begin{matrix}\sum_i x_i(1-s_i) \\ \sum_i x_i s_i \end{matrix}\right]

which is trivial to solve for aa​ and bb​. Taken together with the preprocessing and interval initialization steps this defines OSQ.

The query and document vectors are different

We noted already that each vector could choose its bit count. Whilst there are certain technical disadvantages to using different bit counts for different document vectors, the query and the document vectors are different. In particular, the compression factor the quantization scheme yields only depends on the number of bits used to represent the document vectors.

There are side effects from using more bits to represent the query, principally that it can affect the dot product performance. Also there’s a limit to what one can gain based on the document quantization error. However, we get large recall gains by using asymmetric quantization for high compression factors and this translates to significant net wins in terms of recall as a function of query latency. Therefore, we always quantize the query using at least 4 bits.

How does it perform?

In this section we compare the brute force recall@knk|n and the correlation between the estimated and actual dot products for OSQ and our baseline BBQ quantization scheme, which was an adaptation of RaBitQ to use with an HNSW index. We have previously evaluated this baseline scheme and its accuracy is commensurate with RaBitQ. The authors of RaBitQ did extensive comparisons with alternative methods showing its superiority; we therefore consider it sufficient to simply compare to this very strong baseline.

We also perform a couple of ablation studies: against a global interval and against the per vector intervals calculated by our initialization scheme. To compute a global interval we use the OSQ initialization strategy, but compute the mean and variance of the corpus as a whole. For low bit counts, this is significantly better than the usual fixed interval quantization scheme, which tends to use something like the 99th percentile centered confidence interval. High confidence intervals in such cases often completely fail because the grid points are far from the majority of the data.

We evaluate against a variety of embeddings

and datasets

The embedding dimensions vary from 384 to 960. E5, Arctic and GTE use cosine similarity, and Cohere and GIST using MIP. The datasets vary from around 100k to 1M vectors.

In all our experiments we set λ=0.1\lambda=0.1, which we found to be an effective choice.

First off we study 1 bit quantization. We report brute force recall in these experiments to take any effects of the indexing choice out of the picture. As such we do not compare recall vs latency curves, which are very strongly affected by both the indexing data structure and the dot product implementation. Instead we focus on the recall at 10 reranking the top n hits. In our next blog we’ll study how OSQ behaves when it is integrated with Lucene's HNSW index implementation and turn our attention to query latency. The figures below show our brute force recall@10n10|n for n{10,20,30,40,50}n\in\{10,20,30,40,50\}.

Rolling these up into the average recall@10n10|n for the 8 retrieval experiments we tested we get the table below.

nBaseline average recall@10|n OSQ average recall@10|nGlobal average recall@10|nInitialization average recall@10|n
100.710.740.540.65
200.880.900.690.81
300.930.940.760.86
400.950.960.800.89
500.960.970.820.90

Compared to the baseline we gain 2% on average in recall. As for the ablation study, compared to using a global quantization interval we gain 26% and compared to our initial per vector quantization intervals we gain 10%.

The figure below shows the floating point dot product values versus the corresponding 1 bit quantized dot product values for a sample of 2000 gte-base-en-v1.5 embeddings. Visually the correlation is high.

We can quantify this by computing the R2R^2 between the floating point and quantized dot products. For each dataset and model combination we computed the average R2R^2 for every query against the full corpus. We see small but systematic improvement in R2R^2 comparing OSQ to the baseline. The table below shows the R2R^2 values broken down by the dataset and model combinations we tested.

DatasetModelBaseline R2OSQ R2
FiQAe5-small0.8490.865
FiQAarctic0.8500.863
FiQAgte0.9250.930
Quorae5-small0.8680.881
Quoraarctic0.8170.838
Quoragte0.8870.897
WikiCohere v20.8840.898
GIST1M-0.9530.974

Interestingly, when we integrated OSQ with HNSW we got substantially larger improvements in recall than we see for brute force search, as we’ll show in our next blog. One hypothesis we have is that the improvements we see in correlation with the true floating point dot products are more beneficial for graph search than brute force search. For many queries the tail of high scoring documents are well separated from the bulk of the score distribution and are less prone to being reordered by quantization errors. By contrast we have to navigate through regions of low scoring documents as we traverse the HNSW graph. Here any gains in accuracy can be important.

Finally, the table below compares the average recall for 1 and 2 bit OSQ for the same 8 retrieval experiments. With 2 bits we reach 95% recall reranking between 10 and 20 candidates. The average R2R^2 rises from 0.893 for 1 bit to 0.968 for 2 bit.

n1 bit OSQ average recall@10|n2 bit OSQ average recall@10|n
100.740.84
200.900.97
300.940.99
400.960.995
500.970.997

Conclusion

We’ve presented an improved automatic scalar quantization scheme which allows us to achieve high recall with relatively modest reranking depth. Avoiding deep reranking has significant system advantages.

For 1 bit quantization we compared it to a very strong baseline and showed it was able to achieve systematic improvements in both recall and the accuracy with which it approximates the floating point vector distance calculation. Therefore, we feel comfortable saying that it sets new state-of-the-art performance for at 32×\times compression of the raw vectors.

It also allows one to simply trade compression for retrieval quality using the same underlying approach and achieves significant performance improvements compared to standard scalar quantization techniques.

We are working on integrating this new approach into Elasticsearch. In our next blog we will discuss how it is able to enhance the performance of our existing BBQ scalar quantization index formats.

Ready to try this out on your own? Start a free trial.

Looking to build RAG into your apps? Want to try different LLMs with a vector database?
Check out our sample notebooks for LangChain, Cohere and more on Github, and join Elasticsearch Relevance Engine training now.

Related content

Smokin' fast BBQ with hardware accelerated SIMD instructions

December 4, 2024

Smokin' fast BBQ with hardware accelerated SIMD instructions

How we optimized vector comparisons in BBQ with hardware accelerated SIMD (Single Instruction Multiple Data) instructions.

Better Binary Quantization (BBQ) in Lucene and Elasticsearch

November 11, 2024

Better Binary Quantization (BBQ) in Lucene and Elasticsearch

How Better Binary Quantization works in Lucene and Elasticsearch.

RaBitQ binary quantization 101

October 22, 2024

RaBitQ binary quantization 101

Understand the most critical components of RaBitQ binary quantization, how it works and its benefits. This guide also covers the math behind the quantization and examples.

Better Binary Quantization vs. Product Quantization

November 18, 2024

Better Binary Quantization vs. Product Quantization

Why we chose to spend time working on better binary quantization instead of product quantization in Lucene and Elasticsearch.

Understanding scalar quantization in Lucene

November 11, 2023

Understanding scalar quantization in Lucene

Explore how Elastic introduced scalar quantization into Lucene, including automatic byte quantization, quantization per segment & performance insights.

Ready to build state of the art search experiences?

Sufficiently advanced search isn’t achieved with the efforts of one. Elasticsearch is powered by data scientists, ML ops, engineers, and many more who are just as passionate about search as your are. Let’s connect and work together to build the magical search experience that will get you the results you want.

Try it yourself