Elasticsearch is introducing a new type of vector in 8.6! This vector has 8-bit integer dimensions, where each dimension has a range of [-128, 127]. This is 4x smaller than the current vector with 32-bit float dimensions, which can result in substantial space savings.
You can start indexing these smaller, 8-bit vectors right now by adding the element_type parameter with the byte value to your vector mappings, similar to the example below.
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "element_type": "byte",
        "dims": 3,
        "index": true,
        "similarity": "dot_product"
      }
    }
  }
}

But what if your existing vectors' dimensions don't fit into this smaller type? Then we can use the process of quantization to make them fit, often with only a small loss of precision!
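To make "fit" concrete, here is a minimal sketch of a check for whether a vector can be indexed as a byte vector as-is, assuming that means every dimension is a whole number in [-128, 127]. The helper name `fits_in_byte_vector` is hypothetical and is not part of any Elasticsearch client.

```python
from typing import Sequence

# Range of a signed 8-bit integer, as described at the top of this post.
BYTE_MIN, BYTE_MAX = -128, 127


def fits_in_byte_vector(vector: Sequence[float]) -> bool:
    """Return True if every dimension is a whole number within [-128, 127]."""
    return all(
        float(value).is_integer() and BYTE_MIN <= value <= BYTE_MAX
        for value in vector
    )


# Hypothetical vectors, for illustration only.
print(fits_in_byte_vector([12, -128, 127]))      # whole numbers inside the range
print(fits_in_byte_vector([0.12, -0.98, 0.33]))  # normalized floats
```

Vectors that fail a check like this are the candidates for quantization.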
Let's quantize
Let's start by defining quantization. Quantization is the process of taking a larger set of values and mapping them to a smaller set of values. More specifically, in our case this would be taking the range of a 32-bit float and mapping it to the range of an 8-bit integer for each dimension in a vector. (This should not be confused with dimensionality reduction, which is a different topic. Quantization only reduces the range of the values for the existing dimensions; the number of dimensions stays the same.)
This leads to two further questions. What is the actual range of our 32-bit float vectors? And what function should we use to do the mapping? The answers vary significantly based on use-case.
As an example, one of the simplest forms of quantization is taking the dimensions of normalized 32-bit vectors and linearly mapping them to the full range of the dimensions of 8-bit vectors. Using Python, this would look something like the following:
import numpy as np
import typing as t

def quantize_embeddings(text_and_embeddings: t.List[t.Mapping[str, t.Any]]) -> t.List[t.Mapping[str, t.Any]]:
    # Stack the float embeddings into one array, one row per item.
    quantized_embeddings = np.array([x['embedding'] for x in text_and_embeddings])
    # Scale normalized values from [-1, 1] to [-128, 128].
    quantized_embeddings = quantized_embeddings * 128
    # Clip to the 8-bit range, then truncate toward zero to get integers.
    quantized_embeddings = quantized_embeddings.clip(-128, 127).astype(int).tolist()
    # Return copies of the items with the quantized embedding swapped in.
    return [dict(item, **{'embedding': embedding}) for (item, embedding)
            in zip(text_and_embeddings, quantized_embeddings)]

This is only a single example, though. There are many other useful quantization functions. For your specific use case, it's important to evaluate what method of quantization will give you the best results relative to the trade-off between space reduction, relevance, and recall.
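As one illustration of another quantization function, here is a sketch for vectors that are not normalized: it scales every value by the largest absolute value seen in the data set, so the results land in [-127, 127] without clipping. This is not the method used for the numbers in this post. The function name `quantize_max_abs` and the sample vectors are made up for the example, and it uses only the standard library to stay self-contained.

```python
from typing import List, Sequence


def quantize_max_abs(vectors: Sequence[Sequence[float]]) -> List[List[int]]:
    """Scale all values by 127 / max(|value|) and round to the nearest integer."""
    max_abs = max(abs(value) for vector in vectors for value in vector)
    if max_abs == 0:
        # Every value is zero, so there is nothing to scale.
        return [[0 for _ in vector] for vector in vectors]
    scale = 127.0 / max_abs
    # No offset is applied, so zero stays zero and signs are preserved.
    return [[round(value * scale) for value in vector] for vector in vectors]


# Hypothetical, non-normalized vectors for illustration only.
vectors = [[0.0, 2.0, -4.0], [1.0, 4.0, -1.0]]
print(quantize_max_abs(vectors))
```

Because the scale factor depends on the data, query vectors would need to be quantized with the same factor as the indexed vectors. As with any quantization function, relevance and recall should be measured before adopting it.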
Some real-world numbers
8-bit vectors and quantization are great and all, but do they really reduce space in a real-world use case? The answer is unequivocally YES! And substantially. This is all while they continue to give good results without hurting relevance and recall. Elasticsearch even has all the tools you need to do that evaluation yourself with our rank evaluation API.
Now, let's look at some numbers generated from a real-world example with the following setup:
- All data was gathered using Elasticsearch in Cloud with two gcp.data.highcpu.1 64GB nodes
- Data was collected from the NQ (Natural Questions) dataset, built by Google, as used in BEIR
- The embeddings model was sentence-transformers/all-MiniLM-L6-v2
- Quantization to generate 8-bit integer vectors was applied to the 32-bit float vectors collected from the data using the previous example Python snippet
Then we make some magic happen and collect results based on this setup:
| Category | Median kNN Response Time | Median Exact Response Time | Recall@100 | NDCG@10 | Total Index Size (1p, 1r) |
| --- | --- | --- | --- | --- | --- |
| byte | 32ms | 1072ms | 0.79 | 0.38 | 5.8gb |
| float | 36ms | 1530ms | 0.79 | 0.38 | 16.4gb |
| % Reduction | 11% | 30% | 0% | 0% | 64% |
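
The % Reduction row can be re-derived from the byte and float rows. The sketch below does that arithmetic with the rounded values shown in the table; the function name `percent_reduction` is just for this example.

```python
def percent_reduction(float_value: float, byte_value: float) -> float:
    """Percentage by which the byte measurement is smaller than the float one."""
    return (float_value - byte_value) / float_value * 100


# (float, byte) pairs copied from the results table.
results = {
    "Median kNN Response Time (ms)": (36, 32),
    "Median Exact Response Time (ms)": (1530, 1072),
    "Total Index Size (gb)": (16.4, 5.8),
}

for name, (float_value, byte_value) in results.items():
    print(f"{name}: {percent_reduction(float_value, byte_value):.1f}% reduction")

# Upper bound if only the raw vector values were stored: 8 bits instead of 32.
print(f"Raw vector values only: {percent_reduction(32, 8):.1f}% reduction")
```

From the rounded table values this gives about 11.1%, 29.9%, and 64.6%. The last figure is slightly above the 64% in the table, presumably because the published percentage was computed from unrounded sizes. The 75% on the final line is the theoretical ceiling for a 4x smaller element type.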
And our results look fantastic. Let's break down each one.
- Median kNN Response Time: This response time is collected using approximate kNN search against our example data set. This type of search uses Lucene's HNSW graph as the backing data structure. We see an 11% reduction in response time for byte versus float.
- Median Exact Response Time: This response time is collected using exact kNN search against our example data set. This type of search uses a script to iterate through every vector in the data set and will return the best possible results. We see a large improvement of 30% reduction in response time!
- Recall@100: This shows us if the most relevant results are included in the top 100. This is important to show if our quantization function worked well. We can see that the numbers are identical for byte versus float, which means that our relevance even after quantizing is just as good for byte as it is for float.
- NDCG@10: This shows us how good the quality of our first 10 results is. This is another important metric to evaluate if our quantization function worked well. Once again, the numbers are identical between byte versus float, so we can rest assured that our results are still just as good even after quantization.
- Total Index Size (1p, 1r): This is the total size of our vectors' index with a single primary shard and a single replica. For this metric, we disabled _source, which we recommend for all vector fields where the ingested vector data is unmodified, so it's not stored twice. And we see a massive 64% reduction in total index size! This doesn't quite reach the 4x difference between a byte and a float because of additional overhead for the HNSW data structure, including graph connections, but it's still a quite substantial size reduction.
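
For readers who want to compute the two quality metrics on their own data, here is a minimal sketch of Recall@k and NDCG@k using the common textbook definitions with binary relevance. The exact evaluation code behind the table isn't shown in this post, so treat this as an approximation; the function names and document IDs are hypothetical.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    # The ideal ranking places every relevant document at the top.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Made-up judgments and rankings, for illustration only.
relevant = {"doc-a", "doc-b"}
ranking = ["doc-a", "doc-x", "doc-b", "doc-y"]
print(recall_at_k(ranking, relevant, k=4))
print(round(ndcg_at_k(ranking, relevant, k=4), 3))
```

Running the same functions over the float results and the byte results for each query, then averaging, gives a like-for-like comparison of the two vector types.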
Byte vectors are all ready to go as part of 8.6, and we encourage you to fire up a cluster in Elastic Cloud and give them a try!