# LLM Observability in Production: From GPU Metrics to Response Quality

Deploying an LLM to SageMaker is the easy part. The hard part is knowing, in real time, whether it is answering well, using GPU efficiently, and costing what you planned. This article details the observability stack I would build today for financial-grade LLM inference.

- URL: https://fernando.moretes.com/blog/llm-observabilidade-qualidade-custo-sagemaker-bedrock

- Markdown: https://fernando.moretes.com/blog/llm-observabilidade-qualidade-custo-sagemaker-bedrock/article.md?lang=en

- Published: 2026-06-10T13:08:00.000Z

- Category: Data Platforms

- Tags: llmops, observability, sagemaker, bedrock, grafana, opentelemetry, mlops, financial-grade

- Reading time: 9 min

- Source: [Comprehensive observability for SageMaker AI LLM inference](https://aws.amazon.com/blogs/machine-learning/)

---

When an LLM enters production in a financial environment (for regulatory report generation, document triage, or analyst assistance), the question the platform team must answer is not 'is the endpoint up?' but rather: 'is the response trustworthy, did it arrive within the latency SLA, is the cost per token within budget, and is the model free of hallucinations about client data?' These four dimensions (availability, latency, cost, and semantic quality) demand layered instrumentation that goes well beyond basic CloudWatch. This article is a set of field notes on how I build that stack today.

## The Real Problem: Three Layers of Operational Blindness

Most teams deploying LLMs on SageMaker start with the same metric set they'd use for any ML endpoint: `Invocations`, `ModelLatency`, `OverheadLatency`, and `Invocation4XXErrors`. That is necessary but wholly insufficient for language inference.

The **first layer of blindness** is GPU infrastructure. An `ml.g5.12xlarge` endpoint with 4xA10G can show `GPUUtilization` at 40% while `GPUMemoryUtilization` is at 95% — the model is thrashing VRAM, causing P99 latency of 8s while P50 looks healthy at 1.2s. Without separating those two metrics per instance and per loaded model, you'll scale horizontally when the problem is vertical (batch size, quantization, tensor parallelism).
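To make the P50/P99 gap concrete, here is a minimal sketch that computes both percentiles over a window of per-request latencies and flags contention when the ratio exceeds 5x (the threshold used later in this article). The function names `latency_gap` and `looks_contended` are illustrative, not part of any AWS SDK.

```python
from statistics import quantiles


def latency_gap(latencies_ms: list[float]) -> tuple[float, float, float]:
    """Return (p50, p99, p99 / p50) for a window of per-request latencies."""
    # n=100 yields 99 cut points; index 49 is the median, index 98 is P99.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    p50, p99 = cuts[49], cuts[98]
    return p50, p99, p99 / p50


def looks_contended(latencies_ms: list[float], max_ratio: float = 5.0) -> bool:
    """True when the tail is more than `max_ratio` times the median."""
    _, _, ratio = latency_gap(latencies_ms)
    return ratio > max_ratio


# The scenario from the paragraph above: most requests at 1.2s, a tail at 8s.
window = [1200.0] * 97 + [8000.0] * 3
p50, p99, ratio = latency_gap(window)
print(f"P50={p50:.0f}ms P99={p99:.0f}ms ratio={ratio:.1f}x contended={looks_contended(window)}")
```

With the 1.2s median and 8s tail described above, the ratio is roughly 6.7x, which is why an average or a P50 panel alone hides the problem.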

The **second layer** is token semantics. `ModelLatency` measures request to response, but it does not distinguish Time to First Token (TTFT) from Inter-Token Latency (ITL). For a user interacting with a financial chat, a TTFT of 3s is unacceptable even if total throughput is high. SageMaker's LMI (Large Model Inference) container exposes `TimeToFirstByte` via CloudWatch, but only if you configure `OPTION_OUTPUT_FORMATTER=jsonlines` and instrument the client to record the first-chunk timestamp.

The **third layer** — and the most neglected — is semantic quality. The model can be responding fast, with efficient GPU usage, and still produce responses with hallucinations, excessive refusals, or tone drift that violates compliance policies. This is only detectable with asynchronous post-generation evaluation.

## LLM Observability Pipeline: From GPU to Semantic Quality

Observability data flow for LLM inference on SageMaker, covering infrastructure, token, and semantic quality metrics

### 🟧 AWS — Inference Layer

- **API Gateway**: WAF + Auth (security)
- **SageMaker Endpoint**: ml.g5.12xlarge, LMI (ai)
- **GPU Metrics**: Util / VRAM / TTFT (compute)

### 🟦 AWS — Telemetry Pipeline

- **CloudWatch**: Metrics + Logs Insights (data)
- **Kinesis Firehose**: log streaming (messaging)
- **S3**: raw inference logs (storage)
- **OTEL Collector**: sidecar / Lambda (compute)

### 🟩 Quality Evaluation Layer

- **Lambda Evaluator**: async quality scorer (compute)
- **Bedrock Claude**: LLM-as-judge (ai)
- **DynamoDB**: quality scores + trace (storage)

### 📊 Visualization & Alerting

- **Grafana**: unified dashboard (external)
- **SNS + PagerDuty**: SLO breach alerts (messaging)

### Flows

- Client -> API Gateway: HTTPS request
- API Gateway -> SageMaker Endpoint: invoke endpoint
- SageMaker Endpoint -> GPU Metrics: native metrics
- GPU Metrics -> CloudWatch: CloudWatch agent
- SageMaker Endpoint -> OTEL Collector: TTFT / ITL spans
- OTEL Collector -> CloudWatch: custom metrics
- SageMaker Endpoint -> Kinesis Firehose: inference log
- Kinesis Firehose -> S3: raw logs
- S3 -> Lambda Evaluator: S3 event trigger
- Lambda Evaluator -> Bedrock Claude: prompt + response
- Bedrock Claude -> DynamoDB: semantic score
- DynamoDB -> Grafana: quality metrics
- CloudWatch -> Grafana: infra + token metrics
- Grafana -> SNS + PagerDuty: SLO breach

## Instrumenting SageMaker LMI for Token-Level Metrics

SageMaker's LMI container (based on DJL Serving) exposes token metrics via the internal server's `/metrics` endpoint, but they do not flow to CloudWatch automatically. The approach that works in production is an **OTEL Collector sidecar** running in the same task definition (for SageMaker Multi-Container) or as a Lambda invoked asynchronously by the endpoint via invocation destination.

The critical metrics to capture are:
- `ttft_ms` (Time to First Token): P50/P95/P99 per model and per `request_type`
- `itl_ms` (Inter-Token Latency): distribution — high variance here indicates GPU contention
- `tokens_per_second`: real generation throughput, not to be confused with invocations per second
- `prompt_tokens` and `completion_tokens`: essential for cost and for detecting prompt injection that inflates context
- `cache_hit_rate`: if you use KV-cache with prefix caching (available in TGI and vLLM backends of LMI)

To publish these metrics to CloudWatch with correct dimensions, use `put_metric_data` with namespace `LLMInference/SageMaker` and dimensions `ModelName`, `EndpointName`, and `InstanceType`. The cost of custom CloudWatch metrics is $0.30/metric/month — with 6 metrics × 3 dimensions × 2 endpoints, that's ~$10/month, completely justifiable.
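A minimal sketch of that publishing step, assuming one dict of token metrics per request. `build_metric_data` and `publish` are hypothetical helper names, and treating `cache_hit_rate` as a 0-100 percentage is my assumption; only the namespace, the dimension names, and the `put_metric_data` call come from the text above.

```python
from datetime import datetime, timezone

NAMESPACE = "LLMInference/SageMaker"  # namespace named in the article

# CloudWatch unit per metric; cache_hit_rate is assumed to be a 0-100 percentage.
UNITS = {
    "ttft_ms": "Milliseconds",
    "itl_ms": "Milliseconds",
    "tokens_per_second": "Count/Second",
    "prompt_tokens": "Count",
    "completion_tokens": "Count",
    "cache_hit_rate": "Percent",
}


def build_metric_data(sample: dict, model: str, endpoint: str, instance_type: str) -> list[dict]:
    """Turn one request's token metrics into PutMetricData entries."""
    dimensions = [
        {"Name": "ModelName", "Value": model},
        {"Name": "EndpointName", "Value": endpoint},
        {"Name": "InstanceType", "Value": instance_type},
    ]
    now = datetime.now(timezone.utc)
    return [
        {
            "MetricName": name,
            "Dimensions": dimensions,
            "Timestamp": now,
            "Value": float(sample[name]),
            "Unit": unit,
        }
        for name, unit in UNITS.items()
        if name in sample  # skip metrics the backend did not report
    ]


def publish(sample: dict, model: str, endpoint: str, instance_type: str) -> None:
    """Send the entries to CloudWatch (needs AWS credentials at call time)."""
    import boto3  # imported lazily so the builder stays testable offline

    boto3.client("cloudwatch").put_metric_data(
        Namespace=NAMESPACE,
        MetricData=build_metric_data(sample, model, endpoint, instance_type),
    )
```

Keeping the payload builder separate from the network call makes the dimension set easy to unit test, which matters because a typo in a dimension name silently creates a new metric series.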

A critical detail: set `SAGEMAKER_CONTAINER_LOG_LEVEL=20` (INFO) on the endpoint and enable `DataCaptureConfig` with `InitialSamplingPercentage=100` for staging and `InitialSamplingPercentage=10` for high-volume production. Capturing 100% in production with large models can generate significant S3 and Firehose costs.

## Playbook: Building the LLM Observability Stack in 7 Steps

1. **GPU infrastructure baseline**: Enable CloudWatch Container Insights for SageMaker. Set alerts on `GPUMemoryUtilization > 85%` (not `GPUUtilization`) as the primary memory pressure signal. Create a dashboard separating P50 and P99 of `ModelLatency`; if the gap is greater than 5x, there is contention.

2. **Instrument TTFT on the client side**: Use the streaming API (boto3 `invoke_endpoint_with_response_stream`). Record `time.time()` before the invoke and on receiving the first chunk. Publish as custom metric `TTFT_ms` with dimension `model_id`. Recommended SLA for financial chat: P95 < 1500ms.

3. **Inference log pipeline to S3**: Configure `DataCaptureConfig` on the endpoint with `DestinationS3Uri` pointing to a bucket with SSE-KMS (customer-managed key). Add a lifecycle policy: 90 days in S3 Standard, then Glacier Instant Retrieval. This satisfies 7-year audit retention without paying S3 Standard rates for the whole period.

4. **Asynchronous quality evaluator**: Create a Lambda with an S3 trigger (DataCapture prefix). The Lambda reads the prompt/response pair and calls Bedrock Claude 3 Haiku (cheaper, sufficient for scoring) with an evaluation rubric (factuality, undue refusal, tone). Store the score in DynamoDB with a 180-day TTL. Estimated cost: $0.25 per 1000 evaluations with Haiku.

5. **Unified dashboard in Grafana**: Use Amazon Managed Grafana with CloudWatch data sources (infra and token metrics) and DynamoDB (via Lambda datasource or AWS Data API). Organize in three rows: Infrastructure (GPU, latency), Token Economics (tokens/s, cost/request, cache hit rate), and Quality (factuality scores, refusal rate, weekly drift).

6. **SLOs and burn rate alerts**: Define SLOs in CloudWatch with `ServiceLevelObjective`: TTFT P95 < 1500ms, availability > 99.5%, quality score > 0.75 (0-1 scale). Configure burn rate alerts: 14x in 1h (critical) and 6x in 6h (warning). Route via SNS → PagerDuty with a runbook link in the alert.

7. **Cost traceability per use case**: Add a `use_case` tag in each invocation header (via API Gateway context) and propagate it to DataCapture logs. Cost allocation tags in Cost Explorer attach to resources, so tag each endpoint with `use_case` when it serves a single product, and split a shared endpoint's GPU instance cost by each `use_case`'s share of tokens in the logs. This is mandatory in financial environments where each business line needs chargeback.
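Step 2 of the playbook can be sketched as follows. This is an illustration under assumptions, not a reference client: `measure_stream` and `invoke_and_measure` are hypothetical names, the JSON content type is assumed, and I use `time.perf_counter()` rather than `time.time()` because it is monotonic. Gaps between chunks only approximate inter-token latency when the container emits roughly one token per chunk.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(
    chunks: Iterable[bytes],
    started_at: float,
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a chunk stream and return TTFT plus the gaps between chunks."""
    arrivals = [clock() for _ in chunks]  # one timestamp per received chunk
    if not arrivals:
        return {"ttft_ms": None, "itl_ms": []}
    ttft_ms: Optional[float] = (arrivals[0] - started_at) * 1000
    itl_ms = [(later - earlier) * 1000 for earlier, later in zip(arrivals, arrivals[1:])]
    return {"ttft_ms": ttft_ms, "itl_ms": itl_ms}


def invoke_and_measure(endpoint_name: str, payload: bytes) -> dict:
    """Call a streaming endpoint and time it (needs AWS credentials)."""
    import boto3  # imported lazily so measure_stream can be tested offline

    client = boto3.client("sagemaker-runtime")
    started_at = time.perf_counter()
    response = client.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        Body=payload,
        ContentType="application/json",
    )
    # The body is an event stream; payload bytes arrive in PayloadPart events.
    chunks = (event["PayloadPart"]["Bytes"] for event in response["Body"] if "PayloadPart" in event)
    return measure_stream(chunks, started_at)
```

The injectable `clock` exists so the timing logic can be verified with a fake clock before it is pointed at a real endpoint.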

> **KV-Cache and Prefix Caching: The Most Underestimated Efficiency Multiplier:** If you have a fixed or semi-fixed system prompt (e.g., compliance instructions, product context), enable prefix caching in the LMI vLLM backend. In tests with 512-token fixed prefix prompts, TTFT reduction reaches 60-70% and GPU cost drops proportionally. Monitor `cache_hit_rate` as a primary efficiency KPI — if it is below 40% with repetitive prompts, revisit your context construction strategy.

## LLM-as-Judge at Scale: Pitfalls and Calibration

The pattern of using an LLM to evaluate another LLM (LLM-as-judge) is powerful, but it has specific failure modes I have seen cause quality false positives in financial production.

The first problem is **verbosity bias**: models like Claude tend to rate longer responses higher regardless of accuracy. To mitigate, use structured rubrics with per-dimension scoring (factuality: 1-5, completeness: 1-5, tone: 1-5) rather than a single score. Include few-shot examples in the judge prompt with cases of short, correct responses that should score high.

The second problem is **evaluation cost at scale**. Evaluating 100% of inferences with Claude 3 Sonnet is prohibitive — for an endpoint processing 10,000 requests/day with an average 300-token response, the cost would be ~$180/day in evaluation alone. The correct strategy is **stratified sampling**: 100% of samples with low confidence scores (detected by logit entropy, if available), 10% random of the remainder, and 100% of any request that triggered a guardrail.
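The sampling rule above fits in a few lines. This sketch assumes each request carries a stable ID plus two flags computed upstream (`guardrail_triggered`, `low_confidence`); those names and the hash-based bucketing are my illustration, chosen so the same request always gets the same decision on retries.

```python
import hashlib


def should_evaluate(
    request_id: str,
    guardrail_triggered: bool,
    low_confidence: bool,
    random_rate: float = 0.10,
) -> bool:
    """Stratified sampling: always judge risky requests, sample the rest."""
    if guardrail_triggered or low_confidence:
        return True  # both high-risk strata are evaluated at 100%
    # Map the request ID to a stable, roughly uniform value in [0, 1).
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < random_rate


print(should_evaluate("req-123", guardrail_triggered=True, low_confidence=False))  # True
```

Hashing the ID instead of calling `random.random()` keeps the decision reproducible, which helps when an audit asks why a given response was or was not evaluated.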

The third problem is **judge drift**: the judge model can also change with Bedrock version updates. Use explicit model versioning (`anthropic.claude-3-haiku-20240307-v1:0`) and never `claude-3-haiku` without a version in production. Maintain a golden dataset of 200 prompt/response pairs with human scores to recalibrate the judge monthly — this is equivalent to regression testing in a classic ML system.
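One way to run that monthly recalibration is to compare judge scores against the human scores on the golden dataset and watch three numbers. The metric choice here (mean absolute error, signed bias, share within one rubric point) is an assumption for illustration, as is the function name `judge_agreement`.

```python
from statistics import mean


def judge_agreement(human: list[float], judge: list[float], tolerance: float = 1.0) -> dict:
    """Compare judge scores with human scores on the same golden pairs."""
    if not human or len(human) != len(judge):
        raise ValueError("human and judge must be non-empty and the same length")
    diffs = [j - h for h, j in zip(human, judge)]
    return {
        "mean_abs_error": mean(abs(d) for d in diffs),
        "bias": mean(diffs),  # positive means the judge scores higher than humans
        "within_tolerance": sum(abs(d) <= tolerance for d in diffs) / len(diffs),
    }


# Four golden pairs on a 1-5 rubric (made-up numbers for illustration).
report = judge_agreement(human=[5.0, 4.0, 2.0, 1.0], judge=[5.0, 3.0, 4.0, 1.0])
print(report)
```

A shift in `bias` between two monthly runs with the same pinned model ID is the signal to investigate before trusting the quality trend line.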

For environments with data sovereignty requirements (LGPD, Brazilian banking regulation), the evaluation pipeline must ensure that client data does not leave the VPC boundary. Use Bedrock via a VPC endpoint (`com.amazonaws.<region>.bedrock-runtime`) and never route inference logs over public networks.

## Reference Benchmarks for LLM Inference Observability

- **60-70%** — TTFT reduction with prefix caching enabled. For prompts with fixed prefix ≥ 256 tokens on LMI vLLM backend
- **~$0.25** — Cost per 1000 LLM-as-judge evaluations with Claude 3 Haiku. Assuming 300 prompt tokens + 150 judge response tokens
- **5x** — P99/P50 latency ratio as contention alert threshold. Gap greater than 5x indicates GPU thrashing or request queuing
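The ~$0.25 figure can be reproduced with simple arithmetic. The per-million-token prices below ($0.25 input, $1.25 output for Claude 3 Haiku) are my assumption and should be checked against current Bedrock pricing; the token counts are the ones stated in the benchmark above.

```python
def judge_cost_per_1000(
    input_tokens: int,
    output_tokens: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Cost in USD of 1000 judge calls at the given token counts and prices."""
    per_evaluation = (
        input_tokens * usd_per_million_input + output_tokens * usd_per_million_output
    ) / 1_000_000
    return per_evaluation * 1000


# 300 prompt tokens + 150 judge response tokens, assumed Haiku prices.
cost = judge_cost_per_1000(300, 150, usd_per_million_input=0.25, usd_per_million_output=1.25)
print(f"${cost:.4f} per 1000 evaluations")
```

Under these assumptions the result is about $0.26 per 1000 evaluations, and most of it comes from output tokens, so a terse judge response format is the cheapest lever.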

## Security and Governance in the Observability Pipeline

The LLM observability pipeline is, paradoxically, an attack surface. DataCapture logs contain full prompts and responses — in a financial context, this can include client data, account numbers, portfolio analyses. Treating this pipeline with less rigor than the endpoint itself is a serious mistake.

The control layers I implement as mandatory:

**Encryption in transit and at rest**: The DataCapture S3 bucket must have SSE-KMS with a CMK separate from the endpoint. The key policy must allow decrypt only for the evaluator Lambda role and the audit role — not for the endpoint role itself (separation of duties principle).

**PII masking before evaluation**: The evaluator Lambda must run the text through Amazon Comprehend `detect_pii_entities` before sending to the Bedrock judge. Replace PII entities with placeholders (`[ACCOUNT_NUMBER]`, `[CPF]`) in the evaluation payload. This ensures sensitive data does not enter the evaluator model's context.
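A sketch of the masking step, assuming the entity list has the shape returned by Comprehend `detect_pii_entities` (`Type`, `Score`, `BeginOffset`, `EndOffset`). The helper names and the 0.5 score cutoff are illustrative. Placeholders use Comprehend's own type names, so mapping them to labels such as `[CPF]` would need a custom step that is not shown here.

```python
def mask_pii(text: str, entities: list[dict], min_score: float = 0.5) -> str:
    """Replace each detected PII span with a [TYPE] placeholder."""
    masked = text
    # Work from the end of the string so earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        if entity.get("Score", 1.0) < min_score:
            continue  # ignore low-confidence detections
        masked = masked[: entity["BeginOffset"]] + f"[{entity['Type']}]" + masked[entity["EndOffset"] :]
    return masked


def mask_with_comprehend(text: str, language_code: str = "en") -> str:
    """Detect PII with Amazon Comprehend, then mask it (needs AWS credentials)."""
    import boto3  # imported lazily so mask_pii can be tested offline

    result = boto3.client("comprehend").detect_pii_entities(Text=text, LanguageCode=language_code)
    return mask_pii(text, result["Entities"])
```

Check which languages `detect_pii_entities` supports for your prompts before relying on it as the only masking layer.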

**IAM with context conditions**: The evaluator Lambda role must have `bedrock:InvokeModel` with `aws:SourceVpc` condition restricted to the pipeline VPC. Add `aws:RequestedRegion` to ensure invocations only occur in the primary region — relevant for BACEN and LGPD compliance requiring processing on national territory.
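For illustration, the policy described above could be generated like this. The region and VPC ID are placeholders I made up, and the helper name `evaluator_policy` is hypothetical; `aws:SourceVpc` and `aws:RequestedRegion` are standard global condition keys, with `aws:SourceVpc` only populated when the call goes through a VPC endpoint.

```python
import json

REGION = "sa-east-1"  # placeholder primary region
VPC_ID = "vpc-0123456789abcdef0"  # placeholder pipeline VPC
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # pinned judge version


def evaluator_policy(region: str = REGION, vpc_id: str = VPC_ID, model_id: str = MODEL_ID) -> dict:
    """Build an IAM policy document for the evaluator Lambda role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "InvokeJudgeFromPipelineVpcOnly",
                "Effect": "Allow",
                "Action": "bedrock:InvokeModel",
                "Resource": f"arn:aws:bedrock:{region}::foundation-model/{model_id}",
                "Condition": {
                    "StringEquals": {
                        "aws:SourceVpc": vpc_id,
                        "aws:RequestedRegion": region,
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(evaluator_policy(), indent=2))
```

Scoping `Resource` to the pinned model ARN also enforces the judge-versioning rule at the IAM level: an unpinned or different model simply cannot be invoked by this role.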

**Access auditing for logs**: Configure CloudTrail with S3 data events on the DataCapture bucket. Any bucket access must generate an auditable event. In environments with SOC 2 or ISO 27001, this is a control requirement, not optional.

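A detail that is frequently forgotten: the SageMaker endpoint itself should have `KmsKeyId` set on its endpoint configuration (it is a `CreateEndpointConfig` parameter, not a `ProductionVariant` field) to encrypt the storage volume attached to the instance. Without this, model weights and anything else written to that volume are not encrypted under a key you control. One caveat to verify in the SageMaker documentation: Nitro-based instance types with local NVMe storage encrypt that storage in hardware and do not accept a `KmsKeyId`.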

## Common Anti-Patterns in LLM Observability

- **Using only ModelLatency as a health proxy**: ModelLatency measures total inference time, not user experience. TTFT can be 3x above SLA while ModelLatency looks normal at P50.
- **Evaluating quality 100% in production with large models**: Using Claude 3 Sonnet to evaluate every response in high-volume production costs more than the main endpoint. Use Haiku with stratified sampling.
- **Not versioning the judge model**: Updating the evaluation model without a reference golden dataset invalidates historical quality series. Scores from different weeks become incomparable.
- **DataCapture without lifecycle policy**: Inference logs grow linearly. Without lifecycle to Glacier after 90 days, S3 cost can exceed GPU cost within 6 months.
- **Scaling horizontally before optimizing batch**: Adding instances when `GPUMemoryUtilization > 85%` without first testing `MaxConcurrentRequests` and `MaxBatchSize` in LMI. Often, doubling batch size resolves the problem without additional cost.
- **Inference logs without PII masking**: Sending raw prompts to the external evaluator without passing through Comprehend. In a financial context, this is an LGPD violation and potentially a banking regulation violation.

## Frequently Asked Questions

### Should I use SageMaker or Bedrock for LLM inference in financial production?

It depends on the control required. Bedrock offers less native observability (no GPU metric access, no granular TTFT) but simplifies compliance with approved models. SageMaker gives full control over the inference stack, quantization, KV-cache, and metrics — essential if you have aggressive latency SLAs or custom models. For foundation models without fine-tuning, Bedrock with Guardrails is the starting point. For fine-tuned models or P95 < 1s latency requirements, SageMaker with LMI.

### How to detect prompt injection via observability?

Monitor `prompt_tokens` per request. Prompt injection attempting to inject additional context usually causes anomalous spikes in prompt size. Set an alarm on `prompt_tokens > P99 + 3σ` as an anomaly signal. Combine with Bedrock Guardrails (if using Bedrock) or with a lightweight pre-processing classifier in the entry Lambda for SageMaker.
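As a rough sketch of that alarm rule, the threshold can be computed offline from a history of prompt sizes. The function names are hypothetical, and in practice the history would come from the `prompt_tokens` metric over a trailing window rather than an in-memory list.

```python
from statistics import pstdev, quantiles


def prompt_token_threshold(history: list[int]) -> float:
    """P99 of historical prompt sizes plus three standard deviations."""
    p99 = quantiles(history, n=100, method="inclusive")[98]
    return p99 + 3 * pstdev(history)


def is_anomalous(prompt_tokens: int, history: list[int]) -> bool:
    """Flag a request whose prompt is far larger than the recent baseline."""
    return prompt_tokens > prompt_token_threshold(history)


baseline = list(range(400, 600))  # made-up history of typical prompt sizes
print(is_anomalous(4000, baseline), is_anomalous(600, baseline))
```

Size alone is a weak signal, which is why the text pairs it with a guardrail or classifier: short injections will pass this check untouched.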

### What is the difference between LLM observability and traditional MLOps?

Traditional MLOps monitors feature drift and prediction accuracy — deterministic metrics. LLM observability adds three non-deterministic dimensions: semantic quality (which requires evaluation by another model), token latency (TTFT/ITL, not just total latency), and cost per token (which varies with context size). The observability data model is fundamentally different: you need to store prompt/response pairs, not just numerical features.

### How to calculate the real cost per request in SageMaker?

Cost per request = (instance cost/hour) / (requests/hour). For an `ml.g5.12xlarge` at ~$5.67/hour processing 3600 requests/hour (1 req/s), the cost is ~$0.0016/request. But this ignores real utilization. The correct metric is: cost per generated token = (cost/hour) / (tokens_per_second × 3600). Monitor `tokens_per_second` as an efficiency KPI and calculate cost/token in real time on the dashboard.
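Both formulas, written out so the numbers can be checked. The $5.67/hour and 3600 requests/hour figures are the ones from the answer above; the 500 tokens/s throughput in the second call is a made-up value for illustration.

```python
def cost_per_request(instance_usd_per_hour: float, requests_per_hour: float) -> float:
    """Naive cost: hourly instance price spread over the requests served."""
    return instance_usd_per_hour / requests_per_hour


def cost_per_token(instance_usd_per_hour: float, tokens_per_second: float) -> float:
    """Cost per generated token at the measured generation throughput."""
    return instance_usd_per_hour / (tokens_per_second * 3600)


per_request = cost_per_request(5.67, 3600)
per_token = cost_per_token(5.67, tokens_per_second=500)  # hypothetical throughput
print(f"${per_request:.4f}/request, ${per_token * 1_000_000:.2f} per 1M tokens")
```

Because the instance price is fixed, cost per token falls in direct proportion to `tokens_per_second`, which is the reason to treat throughput as the efficiency KPI.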

## Operationalizing Quality SLOs: What Nobody Tells You

Defining a quality SLO for LLMs is conceptually simple — 'quality score > 0.75 in 95% of responses' — but operationalizing this in CloudWatch with burn rate alerts requires a specific data architecture.

The problem is that quality scores are produced asynchronously (by the evaluator Lambda, seconds or minutes after the original response), while latency SLOs are synchronous. To unify in the same SLO framework, you need a **deferred evaluation window**: the quality SLO operates over a 1-hour window with a 5-minute delay (evaluation time), not in real time.
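The deferred window can be sketched in a few lines. This assumes scores arrive as `(response_timestamp, score)` pairs and uses the 0.75 threshold and 95% target quoted above; the function name and the convention that burn rate equals the bad fraction divided by the error budget are my framing of the idea.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def quality_burn_rate(
    scores: list[tuple[datetime, float]],
    now: datetime,
    threshold: float = 0.75,
    target: float = 0.95,
    window: timedelta = timedelta(hours=1),
    delay: timedelta = timedelta(minutes=5),
) -> Optional[float]:
    """Burn rate over a window that ends `delay` before now."""
    end = now - delay  # leave time for the async evaluator to finish
    start = end - window
    in_window = [score for ts, score in scores if start <= ts < end]
    if not in_window:
        return None  # nothing evaluated yet for this window
    bad_fraction = sum(score <= threshold for score in in_window) / len(in_window)
    error_budget = 1 - target  # 5% of responses may miss the threshold
    return bad_fraction / error_budget
```

A value of 1.0 means the budget is being consumed exactly at the sustainable pace; the 14x and 6x alert levels from the playbook are multiples of that pace.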

The practical implementation uses DynamoDB Streams: when the evaluator Lambda writes the score to DynamoDB, the stream triggers another Lambda that publishes the `QualityScore` metric to CloudWatch with a retroactive timestamp (using `put_metric_data` with an explicit `Timestamp`, not the current time). This allows the CloudWatch SLO to correctly calculate the burn rate over the historical window.
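A minimal sketch of that stream-triggered Lambda. The DynamoDB attribute names (`request_type`, `responded_at` as epoch seconds, `score`) are hypothetical, since the article does not define the table schema; the stream record shape and the `put_metric_data` call with an explicit `Timestamp` are standard.

```python
from datetime import datetime, timezone
from typing import Optional


def stream_record_to_metric(record: dict) -> Optional[dict]:
    """Map one DynamoDB Streams record to a QualityScore metric datum."""
    if record.get("eventName") != "INSERT":
        return None  # only newly written scores become metric points
    image = record["dynamodb"]["NewImage"]
    return {
        "MetricName": "QualityScore",
        "Dimensions": [{"Name": "request_type", "Value": image["request_type"]["S"]}],
        # Use the time of the original response, not the evaluation time.
        "Timestamp": datetime.fromtimestamp(int(image["responded_at"]["N"]), tz=timezone.utc),
        "Value": float(image["score"]["N"]),
        "Unit": "None",
    }


def handler(event: dict, context: object) -> None:
    """Lambda entry point for the DynamoDB stream (needs AWS credentials)."""
    import boto3  # imported lazily so the mapper can be tested offline

    data = [m for m in map(stream_record_to_metric, event["Records"]) if m is not None]
    if data:
        boto3.client("cloudwatch").put_metric_data(Namespace="LLMInference/SageMaker", MetricData=data)
```

CloudWatch only accepts timestamps within a limited range in the past, so an evaluator backlog longer than that limit needs its own alert.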

A critical aspect for financial environments: the quality SLO must have **differentiated thresholds by request type**. A question about account balance has a different factuality threshold than a question about investment strategy. Use the `request_type` dimension in CloudWatch to create separate SLOs per category — this is more work to configure but prevents low-risk responses from diluting the quality signal in high-risk requests.

Finally, document the quality SLO as an ADR (Architecture Decision Record) with the justification for the chosen threshold, the judge model used, the calibration golden dataset, and the quarterly review process. In compliance audits, the ability to demonstrate that you have a formal AI quality evaluation process is as important as the numbers themselves.

## Well-Architected Framework Lenses

- **security**: SSE-KMS with separate CMK for DataCapture, IAM with VPC and region conditions, PII masking via Comprehend before any external evaluation, CloudTrail with S3 data events on the log bucket.
- **reliability**: SLOs with burn rate alerts across three dimensions (availability, latency, quality), DataCapture with automatic retry, evaluator Lambda with DLQ to ensure no prompt/response pair is lost in evaluation.
- **performance**: Prefix caching for TTFT reduction, P99/P50 monitoring as contention signal, batch size optimization before horizontal scaling, cache_hit_rate as primary efficiency KPI.

> **Curator's Note:** In practice, the biggest gap I see in teams deploying LLMs in financial production is not a lack of metrics — it is a lack of clear ownership over what 'quality' means for that specific use case. Before building any evaluation pipeline, I force a session with the product owner and compliance team to define the evaluation rubric in plain language, with concrete examples of good and bad responses. Without that, you will build a technically impeccable evaluator that measures the wrong thing. The hardest lesson I have learned: an LLM with a quality score of 0.85 that answers the wrong questions confidently is more dangerous than one with a score of 0.65 that refuses when it does not know.

## Verdict: LLM Observability Is Not Optional in Financial Production

The stack described here — GPU + TTFT/ITL metrics via OTEL, DataCapture to S3 with SSE-KMS, asynchronous evaluation with sampled LLM-as-judge, three-dimension SLOs in CloudWatch, and unified dashboard in Grafana — is the minimum viable setup to operate an LLM in financial production responsibly. The incremental cost of this instrumentation is marginal compared to the GPU instance cost: estimated $50-150/month for a mid-size endpoint. The cost of not having it — an incorrect response about client data that goes unnoticed for weeks — is incalculable. Deploy the playbook in staging first, calibrate the judge model with a human golden dataset, and only then move to production. Do not skip the calibration step.

**Rating:** Essential for financial-grade LLM production

## References

- [SageMaker LMI Container — DJL Serving Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-container-docs.html)
- [SageMaker Data Capture for Model Monitor](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-data-capture.html)
- [Amazon CloudWatch — Service Level Objectives](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-ServiceLevelObjectives.html)
- [Amazon Bedrock Guardrails](https://docs.aws.amazon.com/bedrock/latest/userguide/guardrails.html)
- [AWS Blog: Comprehensive observability for SageMaker AI LLM inference](https://aws.amazon.com/blogs/machine-learning/)
- [OpenTelemetry Collector — AWS CloudWatch Exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/awscloudwatchmetricsexporter)
- [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023)](https://arxiv.org/abs/2306.05685)
- [Amazon Managed Grafana — CloudWatch Data Source](https://docs.aws.amazon.com/grafana/latest/userguide/using-amazon-cloudwatch-in-AMG.html)
