Introducing the sparse vector query: Searching sparse vectors with inference or precomputed query vectors

Learn about the Elasticsearch sparse vector query, how it works, and how to effectively use it.

Sparse vector queries take advantage of Elasticsearch’s powerful inference API, allowing easy built-in setup for Elastic-hosted models such as ELSER and E5, as well as the flexibility to host other models.

Introduction

Vector search is evolving, and as our needs for vector search evolve, so does the need for a consistent and forward-thinking vector search API.

When Elastic first launched semantic search, we leveraged existing rank_features fields using the text_expansion query. We then reintroduced the sparse_vector field type for semantic search use cases.

As we think about what sparse vector search is going forward, we’ve introduced a new sparse vector query. As of Elasticsearch 8.15.0, both the text_expansion query and weighted_tokens query have been deprecated in favor of the new sparse vector query.

The sparse vector query supports two modes of querying: using an inference ID and using precomputed query vectors. Both modes of querying require data to be indexed in a sparse_vector mapped field.

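Documents in a sparse_vector field are stored as token-weight pairs produced by a learned sparse encoder model such as ELSER. These token-weight pairs are then matched against the token-weight pairs of the query vector. When querying with an inference ID, the query vector is calculated at query time using the same inference model that was used to create the indexed tokens.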

Let’s look at an example: let’s say we’ve indexed a document detailing when Orion is most visible in the night sky:

Now, assume we’re looking for constellations that are visible in the northern hemisphere, and we run this query through the same learned sparse encoder model. The output might look similar to this:

At query time, these vectors are ORed together, and scoring is effectively a dot product calculation between the stored dimensions and the query dimensions, which would score this example at 10.84:
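
To make the scoring step concrete, here is a minimal Python sketch of a dot product over sparse token-weight maps. The token names and weights are made up for illustration (they are not the weights from the example above), and `sparse_dot_product` is a hypothetical helper, not part of Elasticsearch:

```python
def sparse_dot_product(doc_vector, query_vector):
    """Sum the products of weights for tokens present in both vectors."""
    shared_tokens = doc_vector.keys() & query_vector.keys()
    return sum(doc_vector[t] * query_vector[t] for t in shared_tokens)


# Hypothetical token weights, for illustration only.
doc_vector = {"orion": 2.0, "constellation": 1.5, "winter": 1.0, "sky": 0.5}
query_vector = {"constellation": 2.0, "northern": 1.5, "sky": 1.0}

# Only "constellation" and "sky" appear in both vectors, so only they contribute.
score = sparse_dot_product(doc_vector, query_vector)
print(score)
```

Tokens that appear in only one of the two vectors contribute nothing, which is why a document can match on any subset of the query tokens and still receive a score.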

Sparse vector queries with inference

Sparse vector queries using inference work in a very similar way to the previous text expansion query. Instead of sending in a trained model, we create an inference endpoint associated with the model we want to use.

Here’s an example of how to create an inference endpoint for ELSER:

PUT _inference/sparse_embedding/my-elser-endpoint
{
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  }
}

You should use an inference endpoint to index your sparse vector data, and use the same endpoint as input to your sparse_vector query. For example:

POST my-index/_search
{
  "query": {
    "sparse_vector": {
      "field": "embeddings",
      "inference_id": "my-elser-endpoint",
      "query": "constellations in the northern hemisphere"
    }
  }
}

Sparse vector queries with precomputed query vectors

You may have precomputed vectors that don’t require inference at query time. These can be sent into the sparse_vector query instead of using inference. Here is an example:

POST my-index/_search
{
  "query": {
    "sparse_vector": {
      "field": "embeddings",
      "query_vector": {
        "constellation": 2.5,
        "northern": 1.9,
        "hemisphere": 1.8,
        "orion": 1.5,
        "galaxy": 1.4,
        "astronomy": 0.9,
        "telescope": 0.3,
        "star": 0.01
      }
    }
  }
}
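
If you build search requests in application code, the two modes can be captured in one small helper. The following Python sketch only assembles the request body as a dictionary; `sparse_vector_query` is a hypothetical function name, and the rule that the two modes are not mixed in one query is an assumption based on the two modes described above:

```python
def sparse_vector_query(field, inference_id=None, query=None, query_vector=None):
    """Build a search body for one of the two sparse_vector query modes."""
    uses_inference = inference_id is not None or query is not None
    if uses_inference and query_vector is not None:
        # Assumption: a single query uses either inference or a precomputed vector.
        raise ValueError("use either inference_id/query or query_vector, not both")
    if uses_inference:
        if inference_id is None or query is None:
            raise ValueError("inference mode needs both inference_id and query")
        clause = {"field": field, "inference_id": inference_id, "query": query}
    elif query_vector is not None:
        clause = {"field": field, "query_vector": dict(query_vector)}
    else:
        raise ValueError("provide inference_id and query, or query_vector")
    return {"query": {"sparse_vector": clause}}


# Inference mode, matching the earlier request.
inference_body = sparse_vector_query(
    "embeddings",
    inference_id="my-elser-endpoint",
    query="constellations in the northern hemisphere",
)

# Precomputed mode, with a shortened version of the vector above.
precomputed_body = sparse_vector_query(
    "embeddings", query_vector={"constellation": 2.5, "northern": 1.9}
)
```

Either dictionary can then be sent as the JSON body of a `POST my-index/_search` request with the client of your choice.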

Query optimization with token pruning

Like the text expansion query, the sparse vector query is subject to performance penalties from very large boolean queries. Therefore, the same token pruning strategies available for the text expansion query are available in the sparse vector query. You can see the impact of token pruning in our nightly MS MARCO Passage Ranking benchmarks.

In order to enable pruning with the default pruning configuration (which has been tuned for ELSER V2), simply add prune: true to your request:

POST my-index/_search
{
  "query": {
    "sparse_vector": {
      "field": "embeddings",
      "inference_id": "my-elser-endpoint",
      "query": "constellations in the northern hemisphere",
      "prune": true
    }
  }
}

Alternatively, you can adjust the pruning configuration by sending it directly in the request:

GET my-index/_search
{
   "query":{
      "sparse_vector":{
         "field": "embeddings",
         "inference_id": "my-elser-endpoint",
         "query": "constellations in the northern hemisphere",
         "prune": true,
         "pruning_config": {
           "tokens_freq_ratio_threshold": 5,
           "tokens_weight_threshold": 0.4,
           "only_score_pruned_tokens": false
         }
      }
   }
}
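
To build intuition for what the two thresholds do, here is a simplified Python sketch of the idea. It is not Elasticsearch's implementation. It assumes that a token is pruned only when it is both unusually frequent (its document frequency ratio exceeds `tokens_freq_ratio_threshold` times the average ratio for the field) and low in weight (at most `tokens_weight_threshold` times the highest query token weight). The function name, document frequencies, and averages are all hypothetical:

```python
def split_tokens(query_vector, doc_freq, doc_count, avg_freq_ratio,
                 tokens_freq_ratio_threshold=5, tokens_weight_threshold=0.4):
    """Split query tokens into (kept, pruned) under the assumed pruning rule."""
    best_weight = max(query_vector.values())
    kept, pruned = {}, {}
    for token, weight in query_vector.items():
        freq_ratio = doc_freq.get(token, 0) / doc_count
        # Assumed rule: keep tokens that are rare enough OR heavy enough.
        rare_enough = freq_ratio < tokens_freq_ratio_threshold * avg_freq_ratio
        heavy_enough = weight > tokens_weight_threshold * best_weight
        if rare_enough or heavy_enough:
            kept[token] = weight
        else:
            pruned[token] = weight
    return kept, pruned


# Hypothetical query weights and index statistics, for illustration only.
query_vector = {"constellation": 2.5, "northern": 1.9, "telescope": 0.3, "star": 0.01}
doc_freq = {"constellation": 200, "northern": 40, "telescope": 50, "star": 600}

kept, pruned = split_tokens(query_vector, doc_freq, doc_count=1000, avg_freq_ratio=0.02)
print(sorted(kept))    # tokens that stay in the main query
print(sorted(pruned))  # tokens removed from the main query
```

In this made-up data, "star" is both very common and nearly weightless, so it is the only token pruned. "constellation" is common but carries a high weight, and "telescope" is low in weight but rare, so both are kept.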

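Because token pruning incurs a recall penalty, we recommend adding the pruned tokens back in a rescore. The main query sets only_score_pruned_tokens to false, and the rescore query sets it to true so that only the previously pruned tokens contribute to the rescore: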

GET my-index/_search
{
   "query":{
      "sparse_vector":{
         "field": "embeddings",
         "inference_id": "my-elser-endpoint",
         "query": "constellations in the northern hemisphere",
         "prune": true,
         "pruning_config": {
           "tokens_freq_ratio_threshold": 5,
           "tokens_weight_threshold": 0.4,
           "only_score_pruned_tokens": false
         }
      }
   },
   "rescore": {
      "window_size": 100,
      "query": {
         "rescore_query": {
            "sparse_vector": {
               "field": "embeddings",
               "inference_id": "my-elser-endpoint",
               "query": "constellations in the northern hemisphere",
               "prune": true,
               "pruning_config": {
                   "tokens_freq_ratio_threshold": 5,
                   "tokens_weight_threshold": 0.4,
                   "only_score_pruned_tokens": true
               }
            }
         }
      }
   }
}
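
Because the main query and the rescore query differ only in `only_score_pruned_tokens`, it can help to generate both from a single definition so they cannot drift apart. The Python sketch below does this with plain dictionaries; `search_with_pruned_token_rescore` is a hypothetical helper name, and the field and endpoint names are taken from the requests above:

```python
def search_with_pruned_token_rescore(sparse_vector_clause, pruning_config, window_size=100):
    """Build a pruned main query plus a rescore over only the pruned tokens."""
    main = dict(
        sparse_vector_clause,
        prune=True,
        pruning_config=dict(pruning_config, only_score_pruned_tokens=False),
    )
    rescore = dict(
        sparse_vector_clause,
        prune=True,
        pruning_config=dict(pruning_config, only_score_pruned_tokens=True),
    )
    return {
        "query": {"sparse_vector": main},
        "rescore": {
            "window_size": window_size,
            "query": {"rescore_query": {"sparse_vector": rescore}},
        },
    }


body = search_with_pruned_token_rescore(
    {
        "field": "embeddings",
        "inference_id": "my-elser-endpoint",
        "query": "constellations in the northern hemisphere",
    },
    {"tokens_freq_ratio_threshold": 5, "tokens_weight_threshold": 0.4},
)
```

The resulting `body` has the same shape as the request above and can be serialized to JSON and sent to `my-index/_search`.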

What's next?

While the text_expansion query is GA and will be supported throughout Elasticsearch 8.x, we recommend moving to the sparse_vector query as soon as possible so that you are using the most up-to-date features as we continue to improve the vector search experience in Elasticsearch.

The weighted_tokens query was never GA. If you are using it, note that it will be replaced by the sparse_vector query very soon.

The sparse_vector query will be available starting with 8.15.0 and is already available in Serverless. Try it out today!
