In the ever-evolving landscape of artificial intelligence, Large Language Models (LLMs) stand as beacons of innovation, offering unprecedented capabilities across industries. From generating human-like text and translating languages to providing personalized customer interactions, the possibilities with LLMs are vast and increasingly indispensable. Enterprises are deploying these models for everything from automating customer support systems to enhancing creative writing processes. Imagine a virtual assistant that not only answers questions but also drafts business proposals, or a customer service bot that understands and responds with empathy—all powered by LLMs. However, with great power comes the need for great oversight.
Despite this transformative potential, LLMs are notoriously opaque and introduce complex challenges that demand a new level of observability. Enter LLM observability: a crucial component in the lifecycle management of LLMs. It is vital for Site Reliability Engineers (SREs) and other key stakeholders tasked with ensuring seamless, error-free operations, controlling cost, and minimizing the risks associated with the unpredictable nature of LLM-generated responses. SREs need insights into performance metrics, error frequencies, latency issues, the cost implications of running these sophisticated models, and the prompt and response exchange with the model. Traditional monitoring tools fall short in this high-stakes environment; what’s needed is a nuanced approach that addresses the unique observability demands LLMs introduce.
Elastic's LLM Observability Capabilities Address These Challenges
With Elastic’s end-to-end LLM observability you can cover a wide range of use cases. To achieve this, you can onboard two types of integrations: API-based logs and metrics, and APM instrumentation. Depending on your use case, you can use either type of integration, or both together.
- High level overview: via API-based logs and metrics. Monitoring LLM services from providers by ingesting a curated set of service metrics and logs like latency, invocation frequency, tokens, errors, and prompts and responses. Each LLM integration comes with out-of-the-box dashboards.
- Troubleshooting applications: via APM instrumentation. Fully OTel-native tracing and auto-instrumentation for LLM-based applications through Elastic Distributions of OpenTelemetry (EDOT). Additionally, you can use third-party libraries (Langtrace, OpenLIT, OpenLLMetry) together with Elastic to extend the coverage to additional LLM-related technologies.
For more details on supported LLM observability, refer to the documentation.
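As a rough sketch of what onboarding the APM side looks like, EDOT for Python supports zero-code instrumentation: install the distribution, point the OTLP exporter at your Elastic deployment, and run the application under the instrumentation wrapper. The endpoint, token, and service name below are placeholders; consult the EDOT documentation for the exact setup for your language and deployment.

```shell
# Install the Elastic Distribution of OpenTelemetry (EDOT) for Python
# and the instrumentation packages matching your installed dependencies.
pip install elastic-opentelemetry
opentelemetry-bootstrap --action=install

# Point the OTLP exporter at your Elastic deployment (placeholder values).
export OTEL_EXPORTER_OTLP_ENDPOINT="https://my-deployment.apm.example.com:443"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <your-secret-token>"
export OTEL_RESOURCE_ATTRIBUTES="service.name=llm-chat-app"

# Run the application unchanged; supported LLM client libraries
# (e.g., the OpenAI SDK) are traced automatically.
opentelemetry-instrument python app.py
```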
High level overview: LLM Observability for Leading Providers
Elastic offers tailored API-based integrations for four major LLM hosting providers:
- Azure OpenAI
- OpenAI
- Amazon Bedrock
- Google Vertex AI
These integrations bring a curated set of logs and metrics collection tailored to each provider. What this means for SREs is straightforward access to pre-configured dashboards that highlight the prompts and responses, usage patterns, performance metrics, and cost details across different models and providers.
For instance, SREs who need to identify which LLM generates the most errors, or to compare models on latency, cost, or usage frequency, can leverage these integrations. Imagine being able to instantly visualize which LLM is slowing down processes or incurring high costs, enabling data-driven decisions to optimize operations.
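The kind of per-model comparison these dashboards surface can be illustrated in a few lines of Python. The invocation records and field names below are made up for illustration; in practice the integrations ingest and aggregate this data for you.

```python
from collections import defaultdict

# Hypothetical invocation records, similar to what an LLM integration ingests.
invocations = [
    {"model": "gpt-4o",      "latency_ms": 820, "error": False},
    {"model": "gpt-4o",      "latency_ms": 950, "error": True},
    {"model": "gpt-4o-mini", "latency_ms": 310, "error": False},
    {"model": "gpt-4o-mini", "latency_ms": 290, "error": False},
]

# Aggregate call count, error count, and total latency per model.
stats = defaultdict(lambda: {"calls": 0, "errors": 0, "latency_ms": 0})
for rec in invocations:
    s = stats[rec["model"]]
    s["calls"] += 1
    s["errors"] += rec["error"]
    s["latency_ms"] += rec["latency_ms"]

# Report error rate and average latency per model.
for model, s in stats.items():
    print(model,
          f"error_rate={s['errors'] / s['calls']:.0%}",
          f"avg_latency={s['latency_ms'] / s['calls']:.0f}ms")
```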
Troubleshooting applications: Tracing and Auto-Instrumentation of OpenAI, Amazon Bedrock and Google Vertex AI models
Elastic supports OTLP tracing capabilities in EDOT for applications using OpenAI models and models hosted on Amazon Bedrock and Google Vertex AI. In addition, Elastic also supports LLM tracing from third-party libraries (Langtrace, OpenLIT, OpenLLMetry).
Tracing offers a comprehensive map of an application's request flow, pinpointing granular details about each call within the system. For each transaction and span of a request, tracing shows critical information such as the specific models utilized, request duration, errors encountered, tokens used per request, and the prompts and responses exchanged with the LLM.
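As an illustration, a single LLM client span might carry attributes like the following, using OpenTelemetry GenAI semantic-convention attribute names. The values, and the exact attribute set, are illustrative and depend on the instrumentation in use.

```python
# A simplified picture of the attributes an LLM client span carries.
# Attribute names follow the OpenTelemetry GenAI semantic conventions;
# values are made up for illustration.
llm_span = {
    "name": "chat gpt-4o",
    "duration_ms": 1240,
    "attributes": {
        "gen_ai.system": "openai",
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": "gpt-4o",
        "gen_ai.response.model": "gpt-4o-2024-08-06",
        "gen_ai.usage.input_tokens": 142,
        "gen_ai.usage.output_tokens": 389,
    },
}

# Total token usage for this request, as a dashboard might aggregate it.
attrs = llm_span["attributes"]
total_tokens = (attrs["gen_ai.usage.input_tokens"]
                + attrs["gen_ai.usage.output_tokens"])
print(total_tokens)  # 531
```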
Tracing helps SREs troubleshoot performance issues with applications developed in languages like Python, Node.js, and Java. If an SRE needs to investigate latency or error issues, LLM tracing provides a zoomed-in view of the request lifecycle and reveals whether a delay is application-specific, model-specific, or systemic across deployments.
Use Cases: Bringing Elastic's Observability Features to Life
Let’s explore some practical scenarios where Elastic’s observability tools shine:
1. Understanding LLM Performance and Reliability
An SRE team looking to optimize a customer support system powered by Azure OpenAI can utilize Elastic’s Azure OpenAI integration to quickly ascertain which model variants incur higher latency or error rates. This enhances decision-making regarding model deployment or even switching providers based on performance metrics.
Similarly, SREs can run the Google Vertex AI, Amazon Bedrock, and OpenAI integrations in parallel for other applications using models hosted on these providers.
2. Troubleshooting OpenAI-Powered Applications
Consider an enterprise utilizing an OpenAI model for real-time user interactions. Encountering unexplained delays, an SRE can use OpenAI tracing to dissect the transaction pathway, identifying if one specific API call or model invocation is the bottleneck. The SRE can also check the out-of-the-box OpenAI integration dashboard to verify if the latency is only affecting this application or all model invocations across the organization.
An engineer troubleshooting the LLM-based application can also check to see what were the prompt and response exchanges with the LLM during this request so they can rule out possible impact on performance due to the input.
3. Addressing Cost and Usage Concerns
SREs are generally acutely aware of which LLM configurations are less cost-effective than required. Elastic’s integration dashboards, pre-configured to display model usage patterns, help curb unnecessary spending. Out-of-the-box dashboards are available for Azure OpenAI, OpenAI, Amazon Bedrock, and Google Vertex AI models. These dashboards show key cost and usage information, such as total invocations and tokens, as well as time-series breakdowns by model and endpoint. In addition, some integrations surface more advanced usage information, such as provisioned throughput units (PTUs) and billing cost.
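The arithmetic behind such cost panels is straightforward: token counts multiplied by per-token prices. The model names and prices below are hypothetical; real prices vary by provider, model, and region.

```python
# Hypothetical per-1M-token prices (USD); real prices vary by provider/model.
PRICE_PER_1M = {
    "model-a": {"input": 2.50, "output": 10.00},
    "model-b": {"input": 0.15, "output": 0.60},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate spend for one model from token counts, as cost panels do."""
    p = PRICE_PER_1M[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# The same monthly load (40M input / 5M output tokens) on two models.
print(f"{estimate_cost('model-a', 40_000_000, 5_000_000):.2f}")  # 150.00
print(f"{estimate_cost('model-b', 40_000_000, 5_000_000):.2f}")  # 9.00
```

Seeing this breakdown per model and endpoint over time is what makes it possible to spot a configuration that is running an expensive model on traffic a cheaper one could handle.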
4. Understanding LLM Compliance
With the Elastic Amazon Bedrock integration for Guardrails and the Azure OpenAI integration for content filtering, SREs can swiftly address security concerns, such as verifying whether certain user interactions trigger policy violations. Elastic's observability logs clarify whether guardrails rightly blocked potentially harmful responses, bolstering compliance assurance.
Conclusion
As LLMs continue to revolutionize the capabilities of modern applications, the role of observability becomes increasingly paramount. Elastic’s comprehensive observability framework empowers enterprises to harness the full potential of LLMs while maintaining robust operational insight and control. The integration with prominent LLM hosting providers, together with advanced tracing for OpenAI, Amazon Bedrock and Google Vertex AI models, equips SREs with the necessary arsenal to navigate the complex landscape of LLM-driven applications, ensuring they remain safe, reliable, efficient, and cost-effective.
In this new era of AI, balancing innovation with observability isn't just beneficial—it's essential. Whether optimizing performance, troubleshooting intricacies, or managing costs and compliance, Elastic stands at the forefront, ensuring your LLM journey is as seamless as it is groundbreaking.