When hybrid search truly shines

Demonstrating when hybrid search is better than lexical or semantic search on their own.

In this article, we are going to explore hybrid search by examples, and show when it truly shines against using lexical or semantic search techniques alone.

What is hybrid search?

Hybrid search is a technique that combines different search approaches such as traditional lexical term matching, and semantic search.

Lexical search is good when users know the exact words. This approach will find the relevant documents, and sort them in a way that makes sense by using TF-IDF which means: The more common across the dataset the term you are searching is, the less it contributes to the score; and the more common it is within a certain document the more it contributes to the score.

But, what if the words in the query are not present in the documents? Sometimes the user is not looking for something in concrete, but for a concept. They may not be looking for a specific restaurant, but for "a nice place to eat with family". For this kind of queries, semantic search is useful because it takes into consideration the context of the search query and brings similar documents. You can expect to get more related documents back than with the previous approach, but in return, this approach struggles with precision, especially with numbers.

Hybrid search gives us the best of both worlds by blending the precision of term-matching together with the context-aware matching of semantic search.

You can read a deep dive on hybrid search on this article, and more about lexical and semantic search differences in this one.

Let's create an example using real estate units.

The query will be: quiet home in Pinewood with 2 rooms , with quiet place being the semantic component of the query while Pinewood with 2 rooms will be the textual or lexical portion.

Configuring ELSER

We are going to use ELSER as our model provider.

Start by creating the inference endpoint:

PUT _inference/sparse_embedding/my-elser-model 
{
  "service": "elser", 
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  }
}

If this is your first time using ELSER, you may encounter a 502 Bad Gateway error as the model loads in the background.You can check the status of the model in Machine Learning > Trained Models in Kibana.Once is deployed, you can proceed to the next step.

Configuring index

For the index, we are going to use text fields, and semantic_text for the semantic field. We are going to copy the descriptions, because we want to use them for both match and semantic queries.

PUT properties-hybrid
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english"
      },
      "description": {
        "type": "text",
        "analyzer": "english", 
        "copy_to": "semantic_field"
      },
      "neighborhood": {
        "type": "keyword"
      },
      "bedrooms": {
        "type": "integer"
      },
      "bathrooms": {
        "type": "integer"
      },
      "semantic_field": {
        "type": "semantic_text",
        "inference_id": "my-elser-model"
      }
    }
  }
}

Indexing data


POST _bulk
{ "index" : { "_index" : "properties-hybrid" , "_id": "1"} }
{ "title": "2 Bed 2 Bath in Sunnydale", "description": "Spacious apartment with modern amenities, perfect for urban dwellers.", "neighborhood": "Sunnydale", "bedrooms": 2, "bathrooms": 2 }
{ "index" : { "_index" : "properties-hybrid", "_id": "2" } }
{ "title": "1 Bed 1 Bath in Sunnydale", "description": "Compact apartment with easy access to downtown, ideal for singles or couples.", "neighborhood": "Sunnydale", "bedrooms": 1, "bathrooms": 1 }
{ "index" : { "_index" : "properties-hybrid", "_id": "3" } }
{ "title": "2 Bed 2 Bath in Pinewood", "description": "New apartment with modern bedrooms, located in a restaurant and bar area. Suitable for active people who enjoy nightlife.", "neighborhood": "Pinewood", "bedrooms": 2, "bathrooms": 2 }
{ "index" : { "_index" : "properties-hybrid", "_id": "4" } }
{ "title": "3 Bed 2 Bath in Pinewood", "description": "Secluded and private family unit with a practical layout with three total rooms. Near schools and shops. Perfect for raising kids.", "neighborhood": "Pinewood", "bedrooms": 3, "bathrooms": 2 }
{ "index" : { "_index" : "properties-hybrid", "_id": "5" } }
{ "title": "2 Bed 2 Bath in Pinewood", "description": "Retired apartment in a serene neighborhood, perfect for those seeking a retreat. This well-maintained residence offers two bedrooms with abundant natural light and silence.", "neighborhood": "Pinewood", "bedrooms": 2, "bathrooms": 2 }
{ "index" : { "_index" : "properties-hybrid", "_id": "6" } }
{ "title": "1 Bed 1 Bath in Pinewood", "description": "Apartment with a scenic view, ideal for those seeking an energetic environment.", "neighborhood": "Pinewood", "bedrooms": 1, "bathrooms": 1 }
{ "index" : { "_index" : "properties-hybrid", "_id": "7" } }
{ "title": "2 Bed 2 Bath in Maplewood", "description": "Nice apartment with a large balcony, offering a relaxed and comfortable living experience.", "neighborhood": "Maplewood", "bedrooms": 2, "bathrooms": 2 }
{ "index" : { "_index" : "properties-hybrid", "_id": "8" } }
{ "title": "1 Bed 1 Bath in Maplewood", "description": "Charming apartment with modern interiors, situated in a peaceful neighborhood.", "neighborhood": "Maplewood", "bedrooms": 1, "bathrooms": 1 }

Querying data

Let's start by the classic match query, that will search by the content of the title and description:

GET properties-hybrid/_search
{
  "query": {
    "multi_match": {
      "query": "quiet home 2 bedroom in Pinewood",
      "fields": ["title", "description"]
    }
  }
}

This is the first result:

{
    "description": "New apartment with modern bedrooms, located in a restaurant and bar area. Suitable for active people who enjoy nightlife.",
    "title": "2 Bed 2 Bath in Pinewood"
}

It is not bad. It managed to capture the neighborhood Pinewood, and also the 2 bedroom requirement, however, this is not a quiet place at all.

Now, a pure semantic query:

GET properties-hybrid/_search
{
  "query": {
    "semantic": {
      "field": "semantic_field",
      "query": "quiet home in Pinewood with 2 rooms"
    }
  }
}

This is the first result:

{
    "description": "Secluded and private family unit with a practical layout with three total rooms. Near schools and shops. Perfect for raising kids.",
    "title": "3 Bed 2 Bath in Pinewood"
}

Now the results considered the quiet home piece by relating it to things like "secluded and private", but this one is a 3 bedroom and we are looking for 2.

Let's run a hybrid search now. We will use RRF (Reciprocal rank fusion) to achieve this purpose and combine the two previous queries. The RRF algorithm will blend the scores of both queries for us.

GET properties-hybrid/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "semantic": {
                "field": "semantic_field",
                "query": "quiet home 2 bedroom in Pinewood"
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "multi_match": {
                "query": "quiet home 2 bedroom in Pinewood",
                "fields": ["title", "description"]
              }
            }
          }
        }
      ],
      "rank_window_size": 50,
      "rank_constant": 20
    }
  }
}

This is the first result:

{
    "description": ""Retired apartment in a serene neighborhood, perfect for those seeking a retreat. This well-maintained residence offers two bedrooms with abundant natural light and silence.",
    "title": "2 Bed 2 Bath in Pinewood"
}

Now the results considered both being a quiet place, but also having 2 bedrooms.

Evaluating results

For the evaluation, we are going to use the Ranking Evaluation API which allows us to automate the process of running queries and then checking the position of the relevant results. You can choose between different evaluation metrics. For this example I will pick Mean reciprocal ranking (MRR) which takes into consideration the result position and reduces the score as the position gets lower by 1/position#.

For this scenario, we are going to test our 3 queries (multi_match, semantic, hybrid) against the initial question:

quiet home 2 bedroom in Pinewood

Expecting the following apartment to be in the first position as it meets all the criteria.

Retired apartment in a serene neighborhood, perfect for those seeking a retreat. This well-maintained residence offers two bedrooms with abundant natural light and silence."

We can configure as many queries as we need, and put on ratings the id of the documents we expect to be in the first positions:

GET /properties-hybrid/_rank_eval
{
  "requests": [
    {
      "id": "hybrid",
      "request": {
        "retriever": {
          "rrf": {
            "retrievers": [
              {
                "standard": {
                  "query": {
                    "semantic": {
                      "field": "semantic_field",
                      "query": "quiet home 2 bedroom in Pinewood"
                    }
                  }
                }
              },
              {
                "standard": {
                  "query": {
                    "multi_match": {
                      "query": "quiet home 2 bedroom in Pinewood",
                      "fields": [
                        "title",
                        "description"
                      ]
                    }
                  }
                }
              }
            ],
            "rank_window_size": 50,
            "rank_constant": 20
          }
        }
      },
      "ratings": [
        {
          "_index": "properties-hybrid",
          "_id": "5",
          "rating": 1
        }
      ]
    },
    {
      "id": "lexical",
      "request": {
        "query": {
          "multi_match": {
            "query": "quiet home 2 bedroom in Pinewood",
            "fields": [
              "title",
              "description"
            ]
          }
        }
      },
      "ratings": [
        {
          "_index": "properties-hybrid",
          "_id": "5",
          "rating": 1
        }
      ]
    },
    {
      "id": "semantic",
      "request": {
        "query": {
          "semantic": {
            "field": "semantic_field",
            "query": "quiet place in Pinewood with 2 rooms"
          }
        }
      },
      "ratings": [
        {
          "_index": "properties-hybrid",
          "_id": "5",
          "rating": 1
        }
      ]
    }
  ],
  "metric": {
    "mean_reciprocal_rank": {
      "k": 20,
      "relevant_rating_threshold": 1
    }
  }
}

As you can see on the image, the query got a score of 1 for hybrid search (1st position), and 0.5 on the other ones, meaning the expected result was returned on the second position.

Conclusion

Full-text search techniques–which find terms and sort the results by term frequency–and semantic search–which will search by semantic proximity–are powerful in different scenarios. On the one hand, text search shines when users are specific with what they want to search, for example providing the exact SKU for an article or words present on a technical manual. On the other hand, semantic search is useful when users are looking for concepts or ideas not explicitly defined in the documents. Combining both approaches with hybrid search, gives you both full-text search capabilities as well as adding semantically related documents, which can be useful in specific scenarios that require keyword matching and contextual understanding. This dual approach enhances search accuracy and relevance, making it ideal for complex queries and diverse content types.

Ready to try this out on your own? Start a free trial.

Elasticsearch has integrations for tools from LangChain, Cohere and more. Join our Beyond RAG Basics webinar to build your next GenAI app!

Related content

Navigating graphs for Retrieval-Augmented Generation using Elasticsearch

January 20, 2025

Navigating graphs for Retrieval-Augmented Generation using Elasticsearch

Discover how to use Knowledge Graphs to enhance RAG results while storing the graph efficiently in Elasticsearch. This guide explores a detailed strategy for dynamically generating knowledge subgraphs tailored to a user’s query.

How to ingest data to Elasticsearch through Apache Airflow

January 17, 2025

How to ingest data to Elasticsearch through Apache Airflow

Learn how to ingest data to Elasticsearch through Apache Airflow.

Jira connector tutorial part II: 6 optimization tips

January 16, 2025

Jira connector tutorial part II: 6 optimization tips

After connecting Jira to Elasticsearch, we'll now review best practices to escalate this deployment.

Jira connector tutorial part I

January 15, 2025

Jira connector tutorial part I

Indexing our Jira content into Elasticsearch to create a unified data source and do search with Document Level Security.

High Quality RAG with Aryn DocPrep, DocParse and Elasticsearch vector database

January 21, 2025

High Quality RAG with Aryn DocPrep, DocParse and Elasticsearch vector database

Learn how to achieve high-quality RAG with effective data preparation using Aryn.ai DocParse, DocPrep, and Elasticsearch vector database.

Ready to build state of the art search experiences?

Sufficiently advanced search isn’t achieved with the efforts of one. Elasticsearch is powered by data scientists, ML ops, engineers, and many more who are just as passionate about search as your are. Let’s connect and work together to build the magical search experience that will get you the results you want.

Try it yourself