When hybrid search truly shines

In this article, we are going to explore hybrid search by examples, and show when it truly shines against using lexical or semantic search techniques alone.

What is hybrid search?

Hybrid search is a technique that combines different search approaches such as traditional lexical term matching, and semantic search.

Lexical search is good when users know the exact words. This approach will find the relevant documents, and sort them in a way that makes sense by using TF-IDF which means: The more common across the dataset the term you are searching is, the less it contributes to the score; and the more common it is within a certain document the more it contributes to the score.

But, what if the words in the query are not present in the documents? Sometimes the user is not looking for something in concrete, but for a concept. They may not be looking for a specific restaurant, but for "a nice place to eat with family". For this kind of queries, semantic search is useful because it takes into consideration the context of the search query and brings similar documents. You can expect to get more related documents back than with the previous approach, but in return, this approach struggles with precision, especially with numbers.

Hybrid search gives us the best of both worlds by blending the precision of term-matching together with the context-aware matching of semantic search.

You can read a deep dive on hybrid search on this article, and more about lexical and semantic search differences in this one.

Let's create an example using real estate units.

The query will be: quiet home in Pinewood with 2 rooms , with quiet place being the semantic component of the query while Pinewood with 2 rooms will be the textual or lexical portion.

Configuring ELSER

We are going to use ELSER as our model provider.

Start by creating the inference endpoint:

PUT _inference/sparse_embedding/my-elser-model 
{
  "service": "elser", 
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  }
}

If this is your first time using ELSER, you may encounter a 502 Bad Gateway error as the model loads in the background.You can check the status of the model in Machine Learning > Trained Models in Kibana.Once is deployed, you can proceed to the next step.

Configuring index

For the index, we are going to use text fields, and semantic_text for the semantic field. We are going to copy the descriptions, because we want to use them for both match and semantic queries.

PUT properties-hybrid
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english"
      },
      "description": {
        "type": "text",
        "analyzer": "english", 
        "copy_to": "semantic_field"
      },
      "neighborhood": {
        "type": "keyword"
      },
      "bedrooms": {
        "type": "integer"
      },
      "bathrooms": {
        "type": "integer"
      },
      "semantic_field": {
        "type": "semantic_text",
        "inference_id": "my-elser-model"
      }
    }
  }
}

Indexing data


POST _bulk
{ "index" : { "_index" : "properties-hybrid" , "_id": "1"} }
{ "title": "2 Bed 2 Bath in Sunnydale", "description": "Spacious apartment with modern amenities, perfect for urban dwellers.", "neighborhood": "Sunnydale", "bedrooms": 2, "bathrooms": 2 }
{ "index" : { "_index" : "properties-hybrid", "_id": "2" } }
{ "title": "1 Bed 1 Bath in Sunnydale", "description": "Compact apartment with easy access to downtown, ideal for singles or couples.", "neighborhood": "Sunnydale", "bedrooms": 1, "bathrooms": 1 }
{ "index" : { "_index" : "properties-hybrid", "_id": "3" } }
{ "title": "2 Bed 2 Bath in Pinewood", "description": "New apartment with modern bedrooms, located in a restaurant and bar area. Suitable for active people who enjoy nightlife.", "neighborhood": "Pinewood", "bedrooms": 2, "bathrooms": 2 }
{ "index" : { "_index" : "properties-hybrid", "_id": "4" } }
{ "title": "3 Bed 2 Bath in Pinewood", "description": "Secluded and private family unit with a practical layout with three total rooms. Near schools and shops. Perfect for raising kids.", "neighborhood": "Pinewood", "bedrooms": 3, "bathrooms": 2 }
{ "index" : { "_index" : "properties-hybrid", "_id": "5" } }
{ "title": "2 Bed 2 Bath in Pinewood", "description": "Retired apartment in a serene neighborhood, perfect for those seeking a retreat. This well-maintained residence offers two bedrooms with abundant natural light and silence.", "neighborhood": "Pinewood", "bedrooms": 2, "bathrooms": 2 }
{ "index" : { "_index" : "properties-hybrid", "_id": "6" } }
{ "title": "1 Bed 1 Bath in Pinewood", "description": "Apartment with a scenic view, ideal for those seeking an energetic environment.", "neighborhood": "Pinewood", "bedrooms": 1, "bathrooms": 1 }
{ "index" : { "_index" : "properties-hybrid", "_id": "7" } }
{ "title": "2 Bed 2 Bath in Maplewood", "description": "Nice apartment with a large balcony, offering a relaxed and comfortable living experience.", "neighborhood": "Maplewood", "bedrooms": 2, "bathrooms": 2 }
{ "index" : { "_index" : "properties-hybrid", "_id": "8" } }
{ "title": "1 Bed 1 Bath in Maplewood", "description": "Charming apartment with modern interiors, situated in a peaceful neighborhood.", "neighborhood": "Maplewood", "bedrooms": 1, "bathrooms": 1 }

Querying data

Let's start by the classic match query, that will search by the content of the title and description:

GET properties-hybrid/_search
{
  "query": {
    "multi_match": {
      "query": "quiet home 2 bedroom in Pinewood",
      "fields": ["title", "description"]
    }
  }
}

This is the first result:

{
    "description": "New apartment with modern bedrooms, located in a restaurant and bar area. Suitable for active people who enjoy nightlife.",
    "title": "2 Bed 2 Bath in Pinewood"
}

It is not bad. It managed to capture the neighborhood Pinewood, and also the 2 bedroom requirement, however, this is not a quiet place at all.

Now, a pure semantic query:

GET properties-hybrid/_search
{
  "query": {
    "semantic": {
      "field": "semantic_field",
      "query": "quiet home in Pinewood with 2 rooms"
    }
  }
}

This is the first result:

{
    "description": "Secluded and private family unit with a practical layout with three total rooms. Near schools and shops. Perfect for raising kids.",
    "title": "3 Bed 2 Bath in Pinewood"
}

Now the results considered the quiet home piece by relating it to things like "secluded and private", but this one is a 3 bedroom and we are looking for 2.

Let's run a hybrid search now. We will use RRF (Reciprocal rank fusion) to achieve this purpose and combine the two previous queries. The RRF algorithm will blend the scores of both queries for us.

GET properties-hybrid/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "semantic": {
                "field": "semantic_field",
                "query": "quiet home 2 bedroom in Pinewood"
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "multi_match": {
                "query": "quiet home 2 bedroom in Pinewood",
                "fields": ["title", "description"]
              }
            }
          }
        }
      ],
      "rank_window_size": 50,
      "rank_constant": 20
    }
  }
}

This is the first result:

{
    "description": ""Retired apartment in a serene neighborhood, perfect for those seeking a retreat. This well-maintained residence offers two bedrooms with abundant natural light and silence.",
    "title": "2 Bed 2 Bath in Pinewood"
}

Now the results considered both being a quiet place, but also having 2 bedrooms.

Evaluating results

For the evaluation, we are going to use the Ranking Evaluation API which allows us to automate the process of running queries and then checking the position of the relevant results. You can choose between different evaluation metrics. For this example I will pick Mean reciprocal ranking (MRR) which takes into consideration the result position and reduces the score as the position gets lower by 1/position#.

For this scenario, we are going to test our 3 queries (multi_match, semantic, hybrid) against the initial question:

quiet home 2 bedroom in Pinewood

Expecting the following apartment to be in the first position as it meets all the criteria.

Retired apartment in a serene neighborhood, perfect for those seeking a retreat. This well-maintained residence offers two bedrooms with abundant natural light and silence."

We can configure as many queries as we need, and put on ratings the id of the documents we expect to be in the first positions:

GET /properties-hybrid/_rank_eval
{
  "requests": [
    {
      "id": "hybrid",
      "request": {
        "retriever": {
          "rrf": {
            "retrievers": [
              {
                "standard": {
                  "query": {
                    "semantic": {
                      "field": "semantic_field",
                      "query": "quiet home 2 bedroom in Pinewood"
                    }
                  }
                }
              },
              {
                "standard": {
                  "query": {
                    "multi_match": {
                      "query": "quiet home 2 bedroom in Pinewood",
                      "fields": [
                        "title",
                        "description"
                      ]
                    }
                  }
                }
              }
            ],
            "rank_window_size": 50,
            "rank_constant": 20
          }
        }
      },
      "ratings": [
        {
          "_index": "properties-hybrid",
          "_id": "5",
          "rating": 1
        }
      ]
    },
    {
      "id": "lexical",
      "request": {
        "query": {
          "multi_match": {
            "query": "quiet home 2 bedroom in Pinewood",
            "fields": [
              "title",
              "description"
            ]
          }
        }
      },
      "ratings": [
        {
          "_index": "properties-hybrid",
          "_id": "5",
          "rating": 1
        }
      ]
    },
    {
      "id": "semantic",
      "request": {
        "query": {
          "semantic": {
            "field": "semantic_field",
            "query": "quiet place in Pinewood with 2 rooms"
          }
        }
      },
      "ratings": [
        {
          "_index": "properties-hybrid",
          "_id": "5",
          "rating": 1
        }
      ]
    }
  ],
  "metric": {
    "mean_reciprocal_rank": {
      "k": 20,
      "relevant_rating_threshold": 1
    }
  }
}

As you can see on the image, the query got a score of 1 for hybrid search (1st position), and 0.5 on the other ones, meaning the expected result was returned on the second position.

Conclusion

Full-text search techniques–which find terms and sort the results by term frequency–and semantic search–which will search by semantic proximity–are powerful in different scenarios. On the one hand, text search shines when users are specific with what they want to search, for example providing the exact SKU for an article or words present on a technical manual. On the other hand, semantic search is useful when users are looking for concepts or ideas not explicitly defined in the documents. Combining both approaches with hybrid search, gives you both full-text search capabilities as well as adding semantically related documents, which can be useful in specific scenarios that require keyword matching and contextual understanding. This dual approach enhances search accuracy and relevance, making it ideal for complex queries and diverse content types.

Ready to try this out on your own? Start a free trial.

Elasticsearch has integrations for tools from LangChain, Cohere and more. Join our Beyond RAG Basics webinar to build your next GenAI app!