As organizations increasingly adopt LLMs for AI-powered applications such as content creation, Retrieval-Augmented Generation (RAG), and data analysis, SREs and developers face new challenges. Tasks like monitoring workflows, analyzing inputs and outputs, managing query latency, and controlling costs become critical. LLM observability helps address these issues by providing clear insights into how these models perform, allowing teams to quickly identify bottlenecks, optimize configurations, and improve reliability. With better observability, SREs can confidently scale LLM applications, especially on platforms like Google Cloud Vertex AI.
Observability needs for AI-powered applications using Vertex AI
Leveraging AI models creates unique needs around the observability and monitoring of AI-powered applications. Common challenges include the high cost of calling LLMs, the quality and safety of their responses, and the performance, reliability, and availability of the models themselves.
Without visibility into LLM observability data, SREs and DevOps teams find it harder to ensure that their AI-powered applications meet their service level objectives for reliability, performance, cost, and quality of AI-generated content, and to gather enough telemetry to troubleshoot related issues. Robust LLM observability, including real-time anomaly detection for models hosted on Google Cloud Vertex AI, is therefore critical to the success of AI-powered applications.
Depending on the needs of their LLM applications, customers can choose from a growing list of models hosted on Vertex AI, such as Gemini 1.5 Pro, Imagen for Image Generation, and PaLM 2 for Text. Each model excels in specific areas and generates content across modalities including language, audio, vision, and code. No two models are the same: each has its own performance characteristics, so it is important that service operators can track the performance, behaviour, and cost of each model individually.
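To see where these per-model signals originate at the application level, here is a minimal sketch that calls Gemini 1.5 Pro through the Vertex AI Python SDK and prints the token counts Vertex AI reports for a single request; tokens are the main cost driver for most LLM workloads. The project ID and region are placeholders, and the snippet assumes the google-cloud-aiplatform package and application default credentials are already set up.

```python
# Minimal sketch: call a Vertex AI model and inspect per-request token usage.
# Assumes `pip install google-cloud-aiplatform` and application default credentials;
# "my-gcp-project" and "us-central1" are placeholders for your own project and region.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-gcp-project", location="us-central1")

model = GenerativeModel("gemini-1.5-pro")
response = model.generate_content(
    "Summarize the benefits of LLM observability in two sentences."
)

print(response.text)

# Token counts reported by Vertex AI for this single request; aggregated across all
# requests, this is the same signal the token usage metrics track.
usage = response.usage_metadata
print("prompt tokens:  ", usage.prompt_token_count)
print("response tokens:", usage.candidates_token_count)
print("total tokens:   ", usage.total_token_count)
```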
New Elastic integration with Google Cloud Vertex AI
At Elastic, we are thrilled to announce that we now support monitoring Large Language Models (LLMs) hosted in Google Cloud through the Google Cloud Vertex AI Integration. This integration bridges the gap between Elastic’s robust search and observability capabilities and Vertex AI’s cutting-edge generative AI models, empowering organizations to unlock deeper insights and elevate customer experiences, all within the Elastic ecosystem.
The Vertex AI Integration enhances LLM observability by providing deep insights into the operational performance of Vertex AI models, including resource consumption, prediction accuracy, and system reliability. By leveraging this data, organizations can optimize resource usage, identify and resolve performance bottlenecks, and improve model efficiency and accuracy.
Unlocking Insights with GCP Vertex AI Metrics
The Elastic GCP Vertex AI Integration collects a wide range of metrics from models hosted on Vertex AI, enabling users to monitor, analyze, and optimize their AI deployments effectively. These metrics can be categorized into the following groups:
1. Prediction Metrics
Prediction metrics provide critical insights into model usage, performance bottlenecks, and reliability. These metrics help ensure smooth operations, optimize response times, and maintain robust, accurate predictions. A query sketch follows this list.
- Prediction Count by Endpoint: Measures the total number of predictions across different endpoints.
- Prediction Latency: Provides insights into the time taken to generate predictions, allowing users to identify bottlenecks in performance.
- Prediction Errors: Monitors the count of failed predictions across endpoints.
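As an illustration of how these prediction metrics can be sliced once the integration ships them to Elasticsearch, the sketch below aggregates the last hour of data by endpoint and sums prediction and error counts. The index pattern and field names are assumptions made for the example; check the integration’s data stream in Discover for the exact names used in your deployment.

```python
# Hypothetical sketch: per-endpoint prediction volume and errors over the last hour.
# The index pattern and field names below are assumptions; verify them against the
# fields the GCP Vertex AI integration actually ships into your cluster.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="<api-key>")

resp = es.search(
    index="metrics-gcp.vertexai-*",  # assumed data stream name
    size=0,
    query={"range": {"@timestamp": {"gte": "now-1h"}}},
    aggs={
        "per_endpoint": {
            "terms": {"field": "gcp.labels.resource.endpoint_id", "size": 20},  # assumed label field
            "aggs": {
                "predictions": {"sum": {"field": "gcp.vertexai.prediction.online.prediction_count"}},  # assumed
                "errors": {"sum": {"field": "gcp.vertexai.prediction.online.error_count"}},            # assumed
            },
        }
    },
)

# Print prediction and error totals per endpoint.
for bucket in resp["aggregations"]["per_endpoint"]["buckets"]:
    print(bucket["key"], bucket["predictions"]["value"], bucket["errors"]["value"])
```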
2. Model Performance Metrics
Model performance metrics provide crucial insights into deployment efficiency and responsiveness. These metrics help optimize model performance and ensure reliable operations. A per-model query sketch follows this list.
- Model Usage: Tracks the usage distribution among different model deployments.
- Token Usage: Tracks the number of tokens consumed by each model deployment, which is critical for understanding model efficiency.
- Invocation Rates: Tracks the frequency of invocations made by each model deployment.
- Model Invocation Latency: Measures the time taken to invoke a model, helping in diagnosing performance issues.
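The same approach answers per-model cost and load questions, for example which model deployment consumed the most tokens over the past week. As before, the index pattern and field names below are illustrative assumptions rather than the integration’s documented schema.

```python
# Hypothetical sketch: token consumption and invocation counts per model over 7 days.
# Field names are assumptions; substitute the actual fields from the integration.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="<api-key>")

resp = es.search(
    index="metrics-gcp.vertexai-*",  # assumed data stream name
    size=0,
    query={"range": {"@timestamp": {"gte": "now-7d"}}},
    aggs={
        "per_model": {
            "terms": {"field": "gcp.labels.resource.model_user_id", "size": 10},  # assumed label field
            "aggs": {
                "tokens": {"sum": {"field": "gcp.vertexai.publisher.online_serving.token_count"}},                  # assumed
                "invocations": {"sum": {"field": "gcp.vertexai.publisher.online_serving.model_invocation_count"}},  # assumed
            },
        }
    },
)

# Rank model deployments by token consumption and invocation volume.
for bucket in resp["aggregations"]["per_model"]["buckets"]:
    print(f'{bucket["key"]}: {bucket["tokens"]["value"]:.0f} tokens, '
          f'{bucket["invocations"]["value"]:.0f} invocations')
```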
3. Resource Utilization Metrics
Resource utilization metrics are vital for monitoring resource efficiency and workload performance. They help optimize infrastructure, prevent bottlenecks, and ensure smooth operation of AI deployments. A sketch for cross-checking these metrics at the source follows this list.
- CPU Utilization: Monitors CPU usage to ensure optimal resource allocation for AI workloads.
- Memory Usage: Tracks the memory consumed across all model deployments.
- Network Usage: Measures bytes sent and received, providing insights into data transfer during model interactions.
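Because these utilization metrics originate in Google Cloud Monitoring, they can also be cross-checked at the source. The sketch below reads the last hour of online prediction CPU utilization with the Cloud Monitoring Python client; the metric type shown is the standard Vertex AI online prediction CPU metric, but confirm the metric names available for your deployment type in Metrics Explorer. The project ID is a placeholder.

```python
# Sketch: read raw Vertex AI CPU utilization from Cloud Monitoring to cross-check
# what the Elastic integration ingests. Assumes `pip install google-cloud-monitoring`
# and application default credentials; "my-gcp-project" is a placeholder.
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

results = client.list_time_series(
    request={
        "name": "projects/my-gcp-project",
        # Metric type for online prediction CPU utilization; verify in Metrics Explorer.
        "filter": 'metric.type = "aiplatform.googleapis.com/prediction/online/cpu/utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

# Print the identifying labels and the most recent data point of each time series.
for series in results:
    labels = {**dict(series.resource.labels), **dict(series.metric.labels)}
    latest = series.points[0].value.double_value if series.points else None
    print(labels, latest)
```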
4. Overview Metrics
These metrics give an overview of the models deployed in GCP Vertex AI. They are essential for tracking overall performance, optimizing efficiency, and identifying potential issues across deployments. An error-rate query sketch follows this list.
- Total Invocations: The overall count of prediction invocations across all models and endpoints, providing a comprehensive view of activity.
- Total Tokens: The total number of tokens processed across all model interactions, offering insights into resource utilization and efficiency.
- Total Errors: The total count of errors encountered across all models and endpoints, helping identify reliability issues.
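These overview numbers also lend themselves to simple health checks, such as deriving an overall error rate that could back an alerting rule. The sketch below sums prediction and error counts across all models and endpoints for the last 24 hours; as with the earlier examples, the index pattern and field names are assumptions to adapt to your own deployment.

```python
# Hypothetical sketch: overall invocation count and error rate over the last 24 hours.
# Index pattern and field names are assumptions; adapt them to your integration's schema.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="<api-key>")

resp = es.search(
    index="metrics-gcp.vertexai-*",  # assumed data stream name
    size=0,
    query={"range": {"@timestamp": {"gte": "now-24h"}}},
    aggs={
        "total_predictions": {"sum": {"field": "gcp.vertexai.prediction.online.prediction_count"}},  # assumed
        "total_errors": {"sum": {"field": "gcp.vertexai.prediction.online.error_count"}},            # assumed
    },
)

total = resp["aggregations"]["total_predictions"]["value"]
errors = resp["aggregations"]["total_errors"]["value"]
error_rate = (errors / total) if total else 0.0
print(f"predictions: {total:.0f}, errors: {errors:.0f}, error rate: {error_rate:.2%}")
```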
All metrics can be filtered by region, offering localized insights for better analysis.
Note: The Vertex AI Integration provides comprehensive visibility into both deployment models: provisioned throughput, where capacity is pre-allocated, and pay-as-you-go, where resources are consumed on demand.
The Vertex AI overview dashboard
Conclusion
The GCP Vertex AI Integration represents a significant step forward in enhancing LLM observability for GCP Vertex AI users. By unlocking a wealth of actionable data, organizations can assess the health, performance, and cost of LLMs, troubleshoot operational issues, and ensure scalability and accuracy in AI-driven applications.
Now that you know how the GCP Vertex AI Integration enhances LLM observability, it’s your turn to try it out. Spin up an Elastic Cloud deployment and start monitoring your LLM applications hosted on GCP Vertex AI.