Connecting the Dots: ES|QL Joins for Richer Observability Insights
You might have seen our recent announcement about the arrival of SQL-style joins in Elasticsearch with ES|QL's LOOKUP JOIN command (now in Tech Preview!). While that post covered the basics, let's take a closer look at it in the context of Observability. How can this new join capability help engineers and SREs make sense of their logs, metrics, and traces, while also making Elasticsearch more storage-efficient by denormalizing less data?
Note: Before we jump into the details, it’s important to mention again that this type of functionality today relies on a special lookup index. It is not (yet) possible to JOIN any arbitrary index.
Observability isn't just about collecting data; it's about understanding it. Often, the raw telemetry data – a log line, a metric point, a trace span – lacks the full context needed for quick diagnosis or impact assessment. We need to correlate data, enrich it with business or infrastructure context, and ask more advanced questions.
Historically, achieving this in Elasticsearch involved techniques like denormalizing data at ingest time (using ingest pipelines with enrich processors, for example) or performing joins client-side.
By adding the necessary context (like host details or user attributes) as data flowed in, each document arrived fully ready for queries and analytics without extra processing later on. This approach worked well in many cases and still does, particularly when the reference data changes slowly or when the enriched fields are critical for nearly every search.
However, as environments become more dynamic and diverse, the need to frequently update reference data (or avoid storing repetitive fields in every document) highlighted some of the trade-offs.
With the introduction of ES|QL LOOKUP JOIN in Elasticsearch 8.18 and 9.0, you now have an additional, more flexible option for situations where real-time lookups and minimal duplication are desired. Both methods—ingest-time enrichment and on-the-fly LOOKUP JOIN—complement each other and remain valid, depending on use case needs around update frequency, query performance, and storage considerations.
Why Lookup Joins for Observability
Lookup joins keep things flexible. You can decide on the fly if you’d like to look up additional information to assist you in your investigation.
Here are some examples:
- Deployment Information: Which version of the code is generating these errors?
- Infrastructure Mapping: Which Kubernetes cluster or cloud region is experiencing high latency? What hardware does it use?
- Business Context: Are critical customers being affected by this slowdown?
- Team Ownership: Which team owns the service throwing these exceptions?
Keeping this kind of information perfectly denormalized onto every single log line or metric point can be challenging and inefficient. Lookup datasets – like lists of deployments, server inventories, customer tiers, or service ownership mappings – often change independently of the telemetry data itself.
- Lookup Indices are Writable: Update your deployment list, CMDB export, or on-call rotation in the lookup index, and your next ES|QL query immediately uses the fresh data. No need to re-run complex enrich policies or re-index data.
- Flexibility: You decide at query time which context to join. Maybe today you care about deployment versions, tomorrow about cloud regions.
- Simpler Setup: As the original post highlighted, there are no enrich policies to manage. Just create an index with index.mode: lookup and load your data - up to 2 billion documents per lookup index.
Observability Use Cases & Examples with ES|QL
Let’s now look at a few examples to see how Lookup Joins can help.
Enriching Error Logs with Deployment Context
Let's say you're seeing a spike in errors for your opbeans-ruby service, which you might start investigating with a query like this:
FROM logs-*
| WHERE log.level == "error"
| WHERE service.name == "opbeans-ruby"
You need to know if a recent deployment is contributing to these errors. To do this, we can maintain a lookup index, deployments_info_lkp, that tracks the deployment information for each service:
PUT /deployments_info_lkp
{
  "settings": {
    "index.mode": "lookup"
  },
  "mappings": {
    "properties": {
      "service": {
        "properties": {
          "name": {
            "type": "keyword"
          },
          "version": {
            "type": "keyword"
          }
        }
      },
      "deployment_time": {
        "type": "date"
      }
    }
  }
}
# Bulk index the deployment documents
POST /_bulk
{ "index" : { "_index" : "deployments_info_lkp" } }
{ "service.name": "opbeans-ruby", "service.version": "1.0", "deployment_time": "2025-05-22T06:00:00Z" }
{ "index" : { "_index" : "deployments_info_lkp" } }
{ "service.name": "opbeans-go", "service.version": "1.1.0", "deployment_time": "2025-05-22T06:00:00Z" }
Using this information you can now write a query that joins these two sources.
ES|QL Query:
FROM logs-*
| WHERE log.level == "error"
| WHERE service.name == "opbeans-ruby"
| LOOKUP JOIN deployments_info_lkp ON service.name
This alone is a good step towards troubleshooting the problem. You now have the deployment_time column available for each of your error documents. The remaining step is to use it for further filtering.
Any data joined from the lookup index can be handled like any other field in the ES|QL query. This means we can filter on it and check whether there was a recent deployment:
FROM logs-*
| WHERE log.level == "error"
| WHERE service.name == "opbeans-ruby"
| LOOKUP JOIN deployments_info_lkp ON service.name
| KEEP message, service.name, service.version, deployment_time
| WHERE deployment_time > NOW() - 2h
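The joined fields also work in aggregations. For example, a small variation of the query above that counts errors by the deployed version:

FROM logs-*
| WHERE log.level == "error"
| WHERE service.name == "opbeans-ruby"
| LOOKUP JOIN deployments_info_lkp ON service.name
| STATS error_count = COUNT(*) BY service.version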
Saving disk space using JOIN
Denormalizing data by including contextual information like host OS or cloud provider details directly in every log event is convenient for querying but can increase storage consumption, especially with high-volume data streams. Instead of storing this often-redundant information repeatedly, we can leverage joins to retrieve it on demand, potentially saving valuable disk space. While compression often handles repetitive data well, removing these fields entirely can still yield noticeable storage savings.
In this example we’ll use a dataset of 1,000,000 Kubernetes container logs using the default mapping of the Kubernetes integration, with logsdb index mode enabled. The starting size for this index is 35.5mb.
GET _cat/indices/k8s-logs-default?h=index,pri.store.size
###
k8s-logs-default 35.5mb
Using the disk usage API, we observed that fields like host.os and cloud.* contribute roughly 5% to the total index size on disk (35.5mb). These fields can be useful in some cases, but information like the os.name is rarely queried.
// Example host.os structure
"os": {
"codename": "Plow", "family": "redhat", "kernel": "6.6.56+",
"name": "Red Hat Enterprise Linux", "platform": "rhel", "type": "linux", "version": "9.5 (Plow)"
}
// Example cloud structure
"cloud": {
"account": { "id": "elastic-observability" },
"availability_zone": "us-central1-c",
"instance": { "id": "5799032384800802653", "name": "gke-edge-oblt-edge-oblt-pool-46262cd0-w905" },
"machine": { "type": "e2-standard-4" },
"project": { "id": "elastic-observability" },
"provider": "gcp", "region": "us-central1", "service": { "name": "GCE" }
}
Instead of storing this information with every document, let's drop it in an ingest pipeline, keeping only cloud.instance.id so we can join the rest back later:
PUT _ingest/pipeline/drop-host-os-cloud
{
"processors": [
{ "remove": { "field": "host.os" } },
{ "set": { "field": "tmp1", "value": "{{cloud.instance.id}}" } }, // Temporarily store the ID
{ "remove": { "field": "cloud" } }, // Remove the entire cloud object
{ "set": { "field": "cloud.instance.id", "value": "{{tmp1}}" } }, // Restore just the cloud instance ID
{ "remove": { "field": "tmp1", "ignore_missing": true } } // Clean up temporary field
]
}
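To compare sizes, we reindex the data through this pipeline into a new index and force merge it to one segment. A minimal sketch of those two steps, assuming the destination index k8s-logs-drop-cloud-os (the name used in the comparison below) was created beforehand with the same logsdb settings and mappings, minus the dropped fields:

# Reindex through the pipeline into a new index, then force merge for a fair size comparison
POST _reindex
{
  "source": { "index": "k8s-logs-default" },
  "dest": { "index": "k8s-logs-drop-cloud-os", "pipeline": "drop-host-os-cloud" }
}

POST /k8s-logs-drop-cloud-os/_forcemerge?max_num_segments=1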
After reindexing and force merging to one segment, the trimmed index comes in at roughly 5% less space:
GET _cat/indices/k8s-logs-*?h=index,pri.store.size
###
k8s-logs-default 35.5mb
k8s-logs-drop-cloud-os 33.7mb
Now, to regain access to the removed host.os and cloud.* information during analysis without storing it in every log document, we can create a lookup index. This index will store the full host and cloud metadata, keyed by the cloud.instance.id that we preserved in our logs. This instance_metadata_lkp index will be significantly smaller than the space saved across millions or billions of log lines, as it only needs one document per unique instance.
# Create the lookup index for instance metadata
PUT /instance_metadata_lkp
{
  "settings": {
    "index.mode": "lookup"
  },
  "mappings": {
    "properties": {
      "cloud.instance.id": { "type": "keyword" },  # The join key we kept in the logs
      "cloud.region": { "type": "keyword" },       # Filtered on after the join in the query below
      "host.os.name": { "type": "keyword" }        # Remaining host.os.* and cloud.* fields are mapped dynamically when the metadata is indexed
    }
  }
}
# Bulk index sample instance metadata (keyed by cloud.instance.id)
# This data might come from your cloud provider API or CMDB
POST /_bulk
{ "index" : { "_index" : "instance_metadata_lkp", "_id": "5799032384800802653" } }
{ "cloud.instance.id": "5799032384800802653", "host.os": { "codename": "Plow", "family": "redhat", "kernel": "6.6.56+", "name": "Red Hat Enterprise Linux", "platform": "rhel", "type": "linux", "version": "9.5 (Plow)" }, "cloud": { "account": { "id": "elastic-observability" }, "availability_zone": "us-central1-c", "instance": { "id": "5799032384800802653", "name": "gke-edge-oblt-edge-oblt-pool-46262cd0-w905" }, "machine": { "type": "e2-standard-4" }, "project": { "id": "elastic-observability" }, "provider": "gcp", "region": "us-central1", "service": { "name": "GCE" } } }
With this setup, when you need the full host or cloud context for your logs, you can simply use LOOKUP JOIN in your ES|QL query and continue filtering on the data from the lookup index:
FROM logs-*
| LOOKUP JOIN instance_metadata_lkp ON cloud.instance.id
| WHERE cloud.region == "us-central1"
This approach allows us to query the full context when needed (e.g., filtering logs by host.os.name or cloud.region) while significantly reducing the storage footprint of the high-volume log indices by avoiding redundant data denormalization.
It should be noted that low-cardinality metadata fields generally compress well, and a large part of the storage savings in this case comes from the "text" mapping of the host.os.name and cloud.instance.name fields. Make sure to use the disk usage API to evaluate whether this approach is worth it in your specific use case.
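The analyze index disk usage API provides that per-field breakdown. For the index used in this example, the call looks like this (run_expensive_tasks must be enabled explicitly):

POST /k8s-logs-default/_disk_usage?run_expensive_tasks=true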
Getting Started with Lookups for Observability
Creating the necessary lookup indices is straightforward. As detailed in our initial blog post, you can use Kibana's Index Management UI, the Create Index API, or the File Upload utility – the key is setting index.mode: lookup when the index is created.
For Observability, consider automating the population of these lookup indices:
- Export data periodically from your CMDB, CRM, or HR systems.
- Have your CI/CD pipeline update the deployments_info_lkp index upon successful deployment (see the sketch after this list).
- Use tools like Logstash with an elasticsearch output configured to write to your lookup index.
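For example, a deployment job could overwrite a single document per service as its final step. A minimal sketch, using the service name as the document _id (a convention assumed here, not something the earlier example used; the version and timestamp are placeholders):

# Hypothetical CI/CD step: record the latest deployment for a service
PUT /deployments_info_lkp/_doc/opbeans-ruby
{
  "service.name": "opbeans-ruby",
  "service.version": "1.1.0",
  "deployment_time": "2025-05-22T08:30:00Z"
}

Keying documents by service name keeps the lookup index at one row per service, so the join always returns the most recent deployment.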
A Note on Performance and Alternatives
While incredibly powerful, joins aren't free. Each LOOKUP JOIN is resolved at query time, so a query over a large result set does more work than one against documents that were already enriched at ingest. If a piece of context is needed in virtually every query, or you need to filter and aggregate on it at high volume with the best possible performance, ingest-time enrichment remains a sensible choice.
However, for the dynamic, flexible, and targeted enrichment scenarios common in Observability – like mapping to ever-changing deployments, user segments, or team structures – LOOKUP JOIN offers a freshness and flexibility that is hard to match with ingest-time enrichment.
Conclusion
ES|QL's LOOKUP JOIN gives engineers and SREs a flexible way to connect high-volume telemetry with the context that lives around it – deployment records, infrastructure inventories, business data, and team ownership – at query time, and it can also help reduce storage by avoiding heavy denormalization.
This feature is currently in Technical Preview in Elasticsearch 8.18 and Serverless, available now on Elastic Cloud. We encourage you to try it out with your own Observability data and share your feedback using the "Submit feedback" button in the ES|QL editor in Discover. We're excited to see how you use it to connect the dots in your systems!