Connecting the Dots: ES|QL Joins for Richer Observability Insights
You might have seen our recent announcement about the arrival of SQL-style joins in Elasticsearch with ES|QL's LOOKUP JOIN command (now in Tech Preview!). While that post covered the basics, let's take a closer look at it in the context of Observability. How can this new join capability help engineers and SREs make sense of their logs, metrics, and traces, while also making Elasticsearch more storage-efficient by denormalizing less data?
Note: Before we jump into the details, it’s important to mention again that this type of functionality today relies on a special lookup index. It is not (yet) possible to JOIN any arbitrary index.
Observability isn't just about collecting data; it's about understanding it. Often, the raw telemetry data – a log line, a metric point, a trace span – lacks the full context needed for quick diagnosis or impact assessment. We need to correlate data, enrich it with business or infrastructure context, and ask more advanced questions.
Historically, achieving this in Elasticsearch involved techniques like denormalizing data at ingest time (using ingest pipelines with enrich processors, for example) or performing joins client-side.
By adding the necessary context (like host details or user attributes) as data flowed in, each document arrived fully ready for queries and analytics without extra processing later on. This approach worked well in many cases and still does, particularly when the reference data changes slowly or when the enriched fields are critical for nearly every search.
However, as environments become more dynamic and diverse, the need to frequently update reference data (or avoid storing repetitive fields in every document) highlighted some of the trade-offs.
With the introduction of ES|QL LOOKUP JOIN in Elasticsearch 8.18 and 9.0, you now have an additional, more flexible option for situations where real-time lookups and minimal duplication are desired. Both methods—ingest-time enrichment and on-the-fly LOOKUP JOIN—complement each other and remain valid, depending on use case needs around update frequency, query performance, and storage considerations.
Why Lookup Joins for Observability
Lookup joins keep things flexible. You can decide on the fly if you’d like to look up additional information to assist you in your investigation.
Here are some examples:
- Deployment Information: Which version of the code is generating these errors?
- Infrastructure Mapping: Which Kubernetes cluster or cloud region is experiencing high latency? What hardware does it use?
- Business Context: Are critical customers being affected by this slowdown?
- Team Ownership: Which team owns the service throwing these exceptions?
Keeping this kind of information perfectly denormalized onto every single log line or metric point can be challenging and inefficient. Lookup datasets – like lists of deployments, server inventories, customer tiers, or service ownership mappings – often change independently of the telemetry data itself.
- Lookup Indices are Writable: Update your deployment list, CMDB export, or on-call rotation in the lookup index, and your next ES|QL query immediately uses the fresh data. No need to re-run complex enrich policies or re-index data.
- Flexibility: You decide at query time which context to join. Maybe today you care about deployment versions, tomorrow about cloud regions.
- Simpler Setup: As the original post highlighted, there are no enrich policies to manage. Just create an index with index.mode: lookup and load your data - up to 2 billion documents per lookup index.
Observability Use Cases & Examples with ES|QL
Let’s now look at a few examples to see how Lookup Joins can help.
Enriching Error Logs with Deployment Context
Let's say you're seeing a spike in errors for your opbeans-ruby service, which you might start investigating with a query like this:
FROM logs-*
| WHERE log.level == "error"
| WHERE service.name == "opbeans-ruby"
You need to know if a recent deployment is contributing to these errors. To do this, we can maintain a lookup index, deployments_info_lkp, that tracks the deployment information for each service:
PUT /deployments_info_lkp
{
  "settings": {
    "index.mode": "lookup"
  },
  "mappings": {
    "properties": {
      "service": {
        "properties": {
          "name": {
            "type": "keyword"
          },
          "version": {
            "type": "keyword"
          }
        }
      },
      "deployment_time": {
        "type": "date"
      }
    }
  }
}
# Bulk index the deployment documents
POST /_bulk
{ "index" : { "_index" : "deployments_info_lkp" } }
{ "service.name": "opbeans-ruby", "service.version": "1.0", "deployment_time": "2025-05-22T06:00:00Z" }
{ "index" : { "_index" : "deployments_info_lkp" } }
{ "service.name": "opbeans-go", "service.version": "1.1.0", "deployment_time": "2025-05-22T06:00:00Z" }
Using this information you can now write a query that joins these two sources.
ES|QL Query:
FROM logs-*
| WHERE log.level == "error"
| WHERE service.name == "opbeans-ruby"
| LOOKUP JOIN deployments_info_lkp ON service.name
This alone is a good step towards troubleshooting the problem. You now have the deployment_time column available for each of your error documents. The remaining step is to use it for further filtering.
Any data joined from the lookup index can be handled like any other field in the ES|QL query. This means we can filter on it and check whether there was a recent deployment:
FROM logs-*
| WHERE log.level == "error"
| WHERE service.name == "opbeans-ruby"
| LOOKUP JOIN deployments_info_lkp ON service.name
| KEEP message, service.name, service.version, deployment_time
| WHERE deployment_time > NOW() - 2h
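The joined fields also work in aggregations. For example, a small variation of the query above that counts errors by the deployed version:

FROM logs-*
| WHERE log.level == "error"
| WHERE service.name == "opbeans-ruby"
| LOOKUP JOIN deployments_info_lkp ON service.name
| STATS error_count = COUNT(*) BY service.version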
Saving disk space using JOIN
Denormalizing data by including contextual information like host OS or cloud provider details directly in every log event is convenient for querying but can increase storage consumption, especially with high-volume data streams. Instead of storing this often-redundant information repeatedly, we can leverage joins to retrieve it on demand, potentially saving valuable disk space. While compression often handles repetitive data well, removing these fields entirely can still yield noticeable storage savings.
In this example we’ll use a dataset of 1,000,000 Kubernetes container logs using the default mapping of the Kubernetes integration, with logsdb index mode enabled. The starting size for this index is 35.5mb.
GET _cat/indices/k8s-logs-default?h=index,pri.store.size
###
k8s-logs-default 35.5mb
Using the disk usage API, we observed that fields like host.os and cloud.* contribute roughly 5% to the total index size on disk (35.5mb). These fields can be useful in some cases, but information like the os.name is rarely queried.
// Example host.os structure
"os": {
"codename": "Plow", "family": "redhat", "kernel": "6.6.56+",
"name": "Red Hat Enterprise Linux", "platform": "rhel", "type": "linux", "version": "9.5 (Plow)"
}
// Example cloud structure
"cloud": {
"account": { "id": "elastic-observability" },
"availability_zone": "us-central1-c",
"instance": { "id": "5799032384800802653", "name": "gke-edge-oblt-edge-oblt-pool-46262cd0-w905" },
"machine": { "type": "e2-standard-4" },
"project": { "id": "elastic-observability" },
"provider": "gcp", "region": "us-central1", "service": { "name": "GCE" }
}
Instead of storing this information with every document, let's drop it in an ingest pipeline, keeping only cloud.instance.id so we can join the rest back later:
PUT _ingest/pipeline/drop-host-os-cloud
{
"processors": [
{ "remove": { "field": "host.os" } },
{ "set": { "field": "tmp1", "value": "{{cloud.instance.id}}" } }, // Temporarily store the ID
{ "remove": { "field": "cloud" } }, // Remove the entire cloud object
{ "set": { "field": "cloud.instance.id", "value": "{{tmp1}}" } }, // Restore just the cloud instance ID
{ "remove": { "field": "tmp1", "ignore_missing": true } } // Clean up temporary field
]
}
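To compare sizes, we reindex the data through this pipeline into a new index and force merge it to one segment. A minimal sketch of those two steps, assuming the destination index k8s-logs-drop-cloud-os (the name used in the comparison below) was created beforehand with the same logsdb settings and mappings, minus the dropped fields:

# Reindex through the pipeline into a new index, then force merge for a fair size comparison
POST _reindex
{
  "source": { "index": "k8s-logs-default" },
  "dest": { "index": "k8s-logs-drop-cloud-os", "pipeline": "drop-host-os-cloud" }
}

POST /k8s-logs-drop-cloud-os/_forcemerge?max_num_segments=1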
After reindexing and force merging to one segment, the trimmed index comes in at roughly 5% less space:
GET _cat/indices/k8s-logs-*?h=index,pri.store.size
###
k8s-logs-default 35.5mb
k8s-logs-drop-cloud-os 33.7mb
Now, to regain access to the removed host.os and cloud.* information during analysis without storing it in every log document, we can create a lookup index. This index will store the full host and cloud metadata, keyed by the cloud.instance.id that we preserved in our logs. This instance_metadata_lkp index will be significantly smaller than the space saved across millions or billions of log lines, as it only needs one document per unique instance.
# Create the lookup index for instance metadata
PUT /instance_metadata_lkp
{
  "settings": {
    "index.mode": "lookup"
  },
  "mappings": {
    "properties": {
      "cloud.instance.id": { "type": "keyword" },  # The join key we kept in the logs
      "cloud.region": { "type": "keyword" },       # Filtered on after the join in the query below
      "host.os.name": { "type": "keyword" }        # Remaining host.os.* and cloud.* fields are mapped dynamically when the metadata is indexed
    }
  }
}
# Bulk index sample instance metadata (keyed by cloud.instance.id)
# This data might come from your cloud provider API or CMDB
POST /_bulk
{ "index" : { "_index" : "instance_metadata_lkp", "_id": "5799032384800802653" } }
{ "cloud.instance.id": "5799032384800802653", "host.os": { "codename": "Plow", "family": "redhat", "kernel": "6.6.56+", "name": "Red Hat Enterprise Linux", "platform": "rhel", "type": "linux", "version": "9.5 (Plow)" }, "cloud": { "account": { "id": "elastic-observability" }, "availability_zone": "us-central1-c", "instance": { "id": "5799032384800802653", "name": "gke-edge-oblt-edge-oblt-pool-46262cd0-w905" }, "machine": { "type": "e2-standard-4" }, "project": { "id": "elastic-observability" }, "provider": "gcp", "region": "us-central1", "service": { "name": "GCE" } } }
With this setup, when you need the full host or cloud context for your logs, you can simply use LOOKUP JOIN in your ES|QL query and continue filtering on the data from the lookup index:
FROM logs-*
| LOOKUP JOIN instance_metadata_lkp ON cloud.instance.id
| WHERE cloud.region == "us-central1"
This approach allows us to query the full context when needed (e.g., filtering logs by host.os.name or cloud.region) while significantly reducing the storage footprint of the high-volume log indices by avoiding redundant data denormalization.
It should be noted that low-cardinality metadata fields generally compress well, and a large part of the storage savings in this case comes from the "text" mapping of the host.os.name and cloud.instance.name fields. Make sure to use the disk usage API to evaluate whether this approach is worth it in your specific use case.
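The analyze index disk usage API provides that per-field breakdown. For the index used in this example, the call looks like this (run_expensive_tasks must be enabled explicitly):

POST /k8s-logs-default/_disk_usage?run_expensive_tasks=true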
Getting Started with Lookups for Observability
Creating the necessary lookup indices is straightforward. As detailed in our initial blog post, you can use Kibana's Index Management UI, the Create Index API, or the File Upload utility – the key is setting index.mode: lookup when the index is created.
For Observability, consider automating the population of these lookup indices:
- Export data periodically from your CMDB, CRM, or HR systems.
- Have your CI/CD pipeline update the deployments_info_lkp index upon successful deployment (see the sketch after this list).
- Use tools like Logstash with an elasticsearch output configured to write to your lookup index.
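For example, a deployment job could overwrite a single document per service as its final step. A minimal sketch, using the service name as the document _id (a convention assumed here, not something the earlier example used; the version and timestamp are placeholders):

# Hypothetical CI/CD step: record the latest deployment for a service
PUT /deployments_info_lkp/_doc/opbeans-ruby
{
  "service.name": "opbeans-ruby",
  "service.version": "1.1.0",
  "deployment_time": "2025-05-22T08:30:00Z"
}

Keying documents by service name keeps the lookup index at one row per service, so the join always returns the most recent deployment.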
A Note on Performance and Alternatives
While incredibly powerful, joins aren't free. Each LOOKUP JOIN is resolved at query time, so a query over a large result set does more work than one against documents that were already enriched at ingest. If a piece of context is needed in virtually every query, or you need to filter and aggregate on it at high volume with the best possible performance, ingest-time enrichment remains a sensible choice.
However, for the dynamic, flexible, and targeted enrichment scenarios common in Observability – like mapping to ever-changing deployments, user segments, or team structures – LOOKUP JOIN offers a freshness and flexibility that is hard to match with ingest-time enrichment.
Conclusion
ES|QL's LOOKUP JOIN gives engineers and SREs a flexible way to connect high-volume telemetry with the context that lives around it – deployment records, infrastructure inventories, business data, and team ownership – at query time, and it can also help reduce storage by avoiding heavy denormalization.
This feature is currently in Technical Preview in Elasticsearch 8.18 and Serverless, available now on Elastic Cloud. We encourage you to try it out with your own Observability data and share your feedback using the "Submit feedback" button in the ES|QL editor in Discover. We're excited to see how you use it to connect the dots in your systems!