David Hope

AI-driven incident response with logs: A technical deep dive in Elastic Observability

Detect anomalies, verify impact, and remediate faster with Elastic and ES|QL

Modern customer‑facing applications, whether e‑commerce sites, streaming platforms, or API gateways, run on fleets of microservices and cloud resources. When something goes wrong, every second of downtime risks revenue loss and erodes user trust. Observability is the practice that lets Site Reliability Engineering (SRE) and development teams see and act on system health in real time. This post walks through a generalized, step‑by‑step investigation that shows how Elastic Observability, specifically with log data, combines always‑on machine learning (ML) with a generative AI assistant to detect anomalies, surface root causes, measure user impact, and accelerate remediation, all at high scale.

Anomaly Detection

A production environment is ingesting millions of log lines per minute. Elastic’s AIOps jobs continuously profile normal log throughput and content without any manual rules. When log volume or message structure deviates beyond learned baselines, the platform automatically fires a high‑fidelity anomaly alert. Because the models are unsupervised, they adapt to changing traffic patterns and flag both sudden spikes (e.g., 10× error surge) and rare new log categories.

In addition to looking directly for log spikes, Elastic trains seasonal/univariate models to predict expected event counts per bucket and applies statistical tests to classify outliers. Simultaneously, log categorization clusters similar messages with cosine similarity on token embeddings, making it trivial to identify a previously unseen error string.
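
The ML jobs build these baselines automatically, but it helps to see the raw signal they model. Below is a minimal ES|QL sketch that buckets log volume per minute; it assumes an ECS-style data stream such as logs-nginx.access-default, so swap in your own index pattern:

```esql
// Count log events per minute over the last 3 hours.
// A sustained jump in these counts is the kind of deviation
// the log rate anomaly detection job flags automatically.
FROM logs-nginx.access-default
| WHERE @timestamp >= NOW() - 3 hours
| EVAL minute = DATE_TRUNC(1 minute, @timestamp)
| STATS events = COUNT(*) BY minute
| SORT minute
```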

Investigating Alerts: Automated Pattern Analysis

Clicking the alert reveals more than a timestamp. Elastic’s ML job already correlates the spike with the dominant new log pattern `ERROR 1114 (HY000): table "orders" is full` and surfaces example lines. Instead of grep‑driven hunting, engineers get an immediate hypothesis about what subsystem is failing and why.
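
To see how dominant that new pattern is, you can count its occurrences over time yourself. A rough ES|QL sketch, assuming the MySQL error logs land in logs-mysql.error-default (the data stream used later in this post) with the error text in the message field:

```esql
// Count occurrences of the newly seen "table ... is full" pattern per minute.
FROM logs-mysql.error-default
| WHERE @timestamp >= NOW() - 3 hours
| WHERE message LIKE "*is full*"
| EVAL minute = DATE_TRUNC(1 minute, @timestamp)
| STATS occurrences = COUNT(*) BY minute
| SORT minute
```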

If deeper context is needed, the built‑in Elastic AI Assistant can be invoked directly from the alert. Thanks to Retrieval‑Augmented Generation (RAG) over your telemetry, the assistant explains the anomaly in plain language, references the exact log events, and proposes next steps without hallucinating.

AI‑Assisted Root Cause Verification

From within the same chat, you might ask, “Using Lens create a single graph of all http response status codes >=400 from logs-nginx.access-default over the last 3 hours.” The assistant translates that intent into an ES|QL aggregation, retrieves the data, and renders a bar chart with no DSL knowledge required. If there are a significant number of responses with status codes of 400 and above, you’ve validated that end‑users are impacted.
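
Under the hood, the generated aggregation would look something like the sketch below. The exact query the assistant writes may differ; this version assumes the ECS field http.response.status_code holds the status code:

```esql
// Count error responses by status code over the last 3 hours,
// roughly what the assistant generates and charts in Lens.
FROM logs-nginx.access-default
| WHERE @timestamp >= NOW() - 3 hours
| WHERE http.response.status_code >= 400
| STATS errors = COUNT(*) BY http.response.status_code
| SORT errors DESC
```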

Global Impact Analysis with Enriched Logs

Structured log enrichment (e.g., GeoIP, user ID, service tags) lets the assistant answer business questions on the fly. A query like “What are the top 10 source.geo.country_name with http.response.status.code>=400 over the last 3 hours. Use logs-nginx.access-default. Provide counts for each country name.” surfaces whether the incident is regional or global.
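
The ES|QL the assistant might produce for that prompt (again assuming the ECS http.response.status_code field) looks roughly like:

```esql
// Top 10 countries by failed requests over the last 3 hours.
FROM logs-nginx.access-default
| WHERE @timestamp >= NOW() - 3 hours
| WHERE http.response.status_code >= 400
| STATS failures = COUNT(*) BY source.geo.country_name
| SORT failures DESC
| LIMIT 10
```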

Quantifying Business Impact

Technical metrics alone rarely sway executives. Suppose historical data shows the application normally processes $1,000 in transactions per minute. The assistant can combine that baseline with real‑time failure counts to estimate revenue loss. Presenting financial impact alongside error graphs sharpens prioritization and justifies extraordinary remediation steps.
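
One way to sketch that calculation in ES|QL is to compute the per-minute failure rate from the access logs and multiply it by the assumed $1,000-per-minute baseline. Both the baseline and the linear revenue model here are illustrative assumptions, not figures Elastic derives for you:

```esql
// Estimate revenue at risk per minute: failure rate x assumed $1,000/minute baseline.
FROM logs-nginx.access-default
| WHERE @timestamp >= NOW() - 3 hours
| EVAL minute = DATE_TRUNC(1 minute, @timestamp)
| EVAL failed = CASE(http.response.status_code >= 400, 1, 0)
| STATS failure_rate = AVG(failed) BY minute
| EVAL est_revenue_at_risk = ROUND(failure_rate * 1000, 2)
| SORT minute
```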

Pinpointing Infrastructure & Ownership

Every log is automatically enriched with Kubernetes, cloud, and custom metadata. A single question such as “Which pod and cluster emit the ‘table full’ error, and who owns it?” returns the full details of the pod, namespace, and owner, as shown below.

Immediate, accurate routing replaces frantic Slack threads, cutting minutes (or hours) off downtime.

Some of the magic here comes from the instructions we can place in the Elastic AI Assistant’s knowledge base to guide it. For example, this simple entry in the knowledge base is what allows the assistant to populate the response in the previous screenshot:

If asked about Kubernetes pod, namespace, cluster, location, or owner run the "query" tool.
1. Use the index `logs-mysql.error-default` unless another log location is specified.
2. Include the following fields in the query:
   - Pod: `agent.name`
   - Namespace: `data_stream.namespace`
   - Cluster Name: `orchestrator.cluster.name`
   - Cloud Provider: `cloud.provider`
   - Region: `cloud.region`
   - Availability Zone: `cloud.availability_zone`
   - Owner: `cloud.account.id`
3. Use the ES|QL query format:
   ```esql
   FROM logs-mysql.error-default
   | KEEP agent.name, data_stream.namespace, orchestrator.cluster.name, cloud.provider, cloud.region, cloud.availability_zone, cloud.account.id
   ```
4. Ensure the query is executed within the appropriate time range and context. 
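
Putting the entry above into practice, the query the assistant ends up running looks roughly like the sketch below; the filter on the error text and the 3-hour window are assumptions it fills in from the conversation:

```esql
// Identify which pod, namespace, cluster, and owner emit the "table full" error.
FROM logs-mysql.error-default
| WHERE @timestamp >= NOW() - 3 hours
| WHERE message LIKE "*is full*"
| KEEP agent.name, data_stream.namespace, orchestrator.cluster.name, cloud.provider, cloud.region, cloud.availability_zone, cloud.account.id
| LIMIT 10
```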

Leveraging Institutional Knowledge with RAG

Elastic can index runbooks, GitHub issues, and wikis alongside telemetry. Asking “Find documentation on fixing a full orders table” retrieves and summarizes a prior runbook that details archiving old rows and adding a partition. Grounding remediation in proven procedures avoids guesswork and accelerates fixes.

Automated Communication & Documentation

Good incident response includes timely stakeholder updates. A prompt such as “Draft an incident update email with root cause, impact, and next steps” lets the assistant assemble a structured message and send it via the alerting framework’s email or Slack connector, complete with dashboard links and next‑update timelines. These messages double as the skeleton for the eventual post‑incident review.

As before, some of the magic here comes from instructions in the Elastic AI Assistant’s knowledge base. For example, we can teach the assistant how to call the execute_connector API. Because this API can execute all kinds of connectors (not only email), you could also tell the assistant to post to Slack, raise a ServiceNow ticket, or even execute webhooks.

Here are specific instructions to send an email. Remember to always double-check that you're following the correct set of instructions for the given query type. Provide clear, concise, and accurate information in your response.

## Email Instructions

If the user's query requires sending an email:
1. Use the `Elastic-Cloud-SMTP` connector with ID `elastic-cloud-email`.
2. Prepare the email parameters:
   - Recipient email address(es) in the `to` field (array of strings)
   - Subject in the `subject` field (string)
   - Email body in the `message` field (string)
3. Include
   - Details for the alert along with a link to the alert
   - Root cause analysis
   - Revenue impact
   - Remediation recommendations
   - Link to GitHub issue
   - All relevant information from this conversation
   - Link to the Business Health Dashboard
4. Send the email immediately. Do not ask the user for confirmation.
5. Execute the connector using this format:

   ```
   execute_connector(
     id="elastic-cloud-email",
     params={
       "to": ["recipient@example.com"],
       "subject": "Your Email Subject",
       "message": "Your email content here."
     }
   )
   ```
6. Check the response and confirm if the email was sent successfully.

Conclusion & Key Takeaways

Elastic Observability's combination of unsupervised ML, schema-aware data ingestion, and a context-rich, RAG-powered AI assistant enables teams to transform incident response from reactive firefighting into proactive, data-driven operations. By automatically detecting anomalies, correlating patterns, and providing contextual insights, teams can:

  • Preserve revenue by quantifying business impact in real-time and prioritizing accordingly
  • Scale expertise by embedding institutional knowledge into RAG-powered recommendations
  • Improve continuously through automated documentation that feeds back into the knowledge base

The key is to collect logs broadly, maintain a unified observability store, and let ML and AI handle the heavy lifting. The payoff isn't just reduced downtime; it's the transformation of incident response from a source of organizational stress into a competitive advantage.

Try out this exact scenario and get hands-on with this Elastic Logging Workshop: https://play.instruqt.com/elastic/invite/rx4yvknhpfci
