Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.



ML Observability on EKS: Logs, Metrics and Tracing Head-to-Head

May 31, 2026 · 9 min · expert · AI-assisted


ML workloads on EKS generate telemetry volumes that expose the limits of any observability pipeline not designed for that profile. In this article I compare four collection and routing approaches for logs and metrics, focusing on real cost, diagnostic latency and fitness for regulated financial environments.

When a distributed training job on EKS starts diverging silently (gradients exploding, GPU workers idle, data throughput halved), the time between the first anomalous signal and a confirmed diagnosis is determined largely by the quality of the observability pipeline you chose to build. In financial environments where every GPU instance-hour costs between $3 and $32 (p3.2xlarge to p4d.24xlarge), that diagnostic window is not an operational detail: it is a direct cost line. This article is an honest bake-off between four observability strategies for ML on EKS: native Fluent Bit, OpenTelemetry Collector (OTel), CloudWatch Container Insights with Enhanced Observability, and Datadog Agent with DogStatsD. Each carries a radically different cost, latency and operational complexity profile, and the wrong choice at scale can cost more than the workload you are trying to monitor.

Why ML workloads on EKS are a special observability case

A stateless inference pod generates predictable telemetry: a few hundred log lines per minute, CPU/memory metrics, request latency. A distributed training job with PyTorch DDP or Ray Train across 32 GPU nodes is an entirely different category.

First, log volume is non-linear with worker count. Each rank emits epoch progress, checkpointing, gradient norm and NCCL collective logs. With 32 workers and 10-minute epochs, it is common to see 50–200 MB/min of unstructured stdout arriving at the collection DaemonSet — before any business-application log.

Second, GPU metrics have high cardinality. DCGM Exporter exposes ~40 metrics per GPU (SM utilization, memory bandwidth, NVLink throughput, tensor core activity, ECC errors). On a p4d.24xlarge node with 8 A100s that is 320 time series per node, scraped every 15 s. Across 20 nodes you have 6,400 active series. That still fits inside CloudWatch's first custom-metric pricing tier (the first 10,000 metrics, of which only 10 are free), but every one of those series is billable, so the custom-metrics budget blows up quickly if you do not filter at the source.
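The cardinality arithmetic above is easy to re-run for other cluster shapes. The sketch below is a minimal illustration, assuming one time series per (metric, GPU) pair; the function name and the 10-metric allow-list are hypothetical and not part of DCGM Exporter.

```python
def dcgm_series(metrics_per_gpu: int, gpus_per_node: int, nodes: int = 1) -> int:
    """Active time series, assuming one series per (metric, GPU) pair."""
    return metrics_per_gpu * gpus_per_node * nodes


# p4d.24xlarge: 8 GPUs per node, ~40 DCGM metrics per GPU (figures from the text).
per_node = dcgm_series(40, 8)
cluster = dcgm_series(40, 8, nodes=20)

# Hypothetical allow-list of 10 metrics applied at the source.
filtered_cluster = dcgm_series(10, 8, nodes=20)

print(per_node, cluster, filtered_cluster)
```

Filtering from 40 to 10 metrics cuts the series count by a factor of four regardless of node count, which is why the filter belongs in the collector rather than in the backend.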

Third, distributed tracing in training differs from microservice tracing. There is no request trace; there is an execution graph of CUDA operations, collective communication and data I/O. OpenTelemetry still lacks native semantics for this — you must manually instrument PyTorch hooks or use MLflow Tracing, which is a separate layer.

These three characteristics — explosive log volume, high-cardinality GPU metrics and absent native tracing — define the evaluation criteria for this bake-off.

The four contenders: architecture and positioning

Fluent Bit (native EKS DaemonSet) is the default log collector in the aws-for-fluent-bit add-on. It tails /var/log/containers/*.log, parses JSON or regex, and routes to CloudWatch Logs, S3, Kinesis Data Streams or Firehose. Backpressure configuration via mem_buf_limit and storage.type filesystem is critical: without it, a 200 MB/min burst of training logs will OOM the DaemonSet on memory-constrained nodes. Its strength is lightness — ~50 MB memory in normal operation — and native IAM integration via IRSA. Its weakness is that it collects no metrics or traces; it is a pure log collector.
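To get a feel for why the buffer limit matters, the following sketch estimates how long an in-memory buffer lasts during a burst. It is a back-of-the-envelope model, not Fluent Bit's actual buffering logic: the 256 MB limit is taken from the architecture diagram in this article, and treating the full 200 MB/min as arriving at a single collector is an assumption.

```python
def buffer_headroom_seconds(buf_limit_mb: float,
                            ingest_mb_per_min: float,
                            drain_mb_per_min: float = 0.0) -> float:
    """Seconds until a buffer of buf_limit_mb fills, given ingest and drain rates."""
    net = ingest_mb_per_min - drain_mb_per_min
    if net <= 0:
        return float("inf")  # the output keeps up, so the buffer never fills
    return buf_limit_mb / net * 60.0


# Output fully stalled (for example, the backend is throttling).
stalled = buffer_headroom_seconds(256, 200)

# Output degraded but still draining 150 MB/min.
degraded = buffer_headroom_seconds(256, 200, drain_mb_per_min=150)

print(round(stalled, 1), round(degraded, 1))
```

With roughly a minute of headroom in the stalled case, a filesystem-backed buffer is what absorbs a longer backend outage instead of node memory.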

OpenTelemetry Collector (OTel) is the most versatile contender. Deployed as a DaemonSet or Deployment via opentelemetry-operator, it unifies logs (via filelog receiver), metrics (via prometheusreceiver scraping DCGM) and traces (via otlp receiver) in a single pipeline. Operational cost is higher: the pipeline needs tuning of batch processor (send_batch_size, timeout), memory_limiter and parallel exporters. But the ability to route to multiple backends simultaneously — CloudWatch, S3 via OTLP/Parquet, Jaeger — without duplicating agents is the real architectural differentiator.

CloudWatch Container Insights with Enhanced Observability for EKS is the AWS managed option. Enabled via the amazon-cloudwatch-observability add-on, it automatically installs a pre-configured OTel Collector, DCGM Exporter and Fluent Bit. Operational cost is zero, but financial cost is high: Enhanced Observability charges $0.009 per vCPU-hour and $0.009 per GB memory-hour per monitored node — on a 20-node p4d.24xlarge cluster (96 vCPUs each), that is ~$17,280/month in observability alone.

Datadog Agent with DogStatsD is the most complete enterprise option. The datadog/datadog Helm chart installs the Agent as a DaemonSet with log collection, metrics (including native DCGM integration), APM and NPM. The differentiator is ML Observability (formerly Weights & Biases integration) and automatic correlation between logs, metrics and traces. Cost is $15–23/host/month depending on tier, plus log ingestion at $0.10/GB after the free tier.

Figure: ML Observability Pipelines on EKS: Four Approaches in Parallel. Each column represents a collection strategy. GPU nodes and DCGM Exporter are shared. Arrows show telemetry flow to the analytics backend.

EKS workload layer: Training Pod (PyTorch DDP / Ray); DCGM Exporter (/metrics on :9400); stdout/stderr (/var/log/containers)
EKS collection DaemonSets: Fluent Bit (mem_buf_limit=256MB); OTel Collector (filelog + prom + otlp); CW add-on (amazon-cloudwatch-obs); Datadog Agent (DogStatsD on :8125)
AWS managed backends: CloudWatch Logs (/aws/eks/ml-cluster); CloudWatch Metrics (custom namespace ML/GPU); S3 + Parquet (long-term archive); Kinesis Data Streams (hot-path routing)
External SaaS backends: Datadog SaaS (ML Observability); Jaeger / Tempo (distributed traces)

Technical Comparison: Four ML Observability Strategies on EKS

Criterion	Native Fluent Bit	OTel Collector	CW Container Insights Enhanced	Datadog Agent
Signals covered	Logs only	Logs + Metrics + Traces	Logs + Metrics (GPU via DCGM)	Logs + Metrics + Traces + APM
Memory overhead per node	~50 MB	150–400 MB (pipeline size)	200–350 MB (bundle)	300–600 MB (full agent)
Monthly cost (20 p4d.24xlarge nodes)	~$180 (CW Logs ingestion)	~$400–800 (CW EMF + Logs)	~$17,280 (Enhanced vCPU/mem fee)	~$460 + log ingestion
Diagnostic latency (P99)	30–90 s (CW Logs Insights query)	10–30 s (backend dependent)	15–45 s (CW dashboards)	5–15 s (Live Tail + alerts)
GPU metrics support (DCGM)	Not native	Via prometheusreceiver	Native (add-on installs DCGM)	Native (DCGM integration)
Compliance / data sovereignty	High (data stays in AWS)	High (configurable per backend)	High (data stays in AWS)	Medium (data leaves to SaaS)
Operational complexity	Low	High (pipeline YAML, tuning)	Low (managed)	Medium (Helm + API key rotation)
Vendor lock-in	AWS (moderate)	Minimal (open standard)	AWS (high)	Datadog (high)

The real problem with Enhanced Observability: hidden vCPU cost

The amazon-cloudwatch-observability add-on with Enhanced Observability is the simplest option to enable: a single eksctl create addon call and you have DCGM, Fluent Bit and OTel Collector pre-configured. But the pricing model is a trap for ML clusters.

Enhanced Observability charges per vCPU-hour and GB-memory-hour per monitored node, regardless of how many metrics you actually use. A p4d.24xlarge has 96 vCPUs and 1,152 GB RAM. At $0.009/vCPU-hour, the vCPU dimension alone costs $0.864/hour per node. With 20 nodes running 24/7 (720 hours a month) that is $12,441/month, and the memory dimension adds another ~$4,976/month in this estimate. Total: ~$17,417/month, in line with the ~$17,280 headline figure used elsewhere in this article.

For comparison, the GPU compute cost of those 20 nodes is ~$471,900/month (at $32.77/hour per node, 720 hours). So observability represents ~3.7% of compute cost, which might seem reasonable until you realise that a standalone OTel Collector with EMF exporter to CloudWatch Metrics covers 90% of the same use cases for ~$800/month.
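The cost figures in this section can be reproduced with a few lines of arithmetic. The sketch below uses only the rates and instance sizes quoted in the article and a 720-hour month; it is an illustration, not a pricing calculator. One thing it makes visible: the ~$4,976/month memory figure is not what $0.009 per GB-hour times 1,152 GB would produce, so the script backs out the per-GB-hour rate that figure implies instead of assuming one.

```python
HOURS_PER_MONTH = 720        # 30 days x 24 h
NODES = 20
VCPUS_PER_NODE = 96          # p4d.24xlarge
MEM_GB_PER_NODE = 1152       # p4d.24xlarge
VCPU_RATE = 0.009            # USD per vCPU-hour, as quoted in the article
NODE_HOURLY = 32.77          # USD per node-hour, as quoted in the article

node_hours = NODES * HOURS_PER_MONTH

# vCPU dimension of the Enhanced Observability fee.
vcpu_monthly = VCPUS_PER_NODE * VCPU_RATE * node_hours

# Memory dimension as quoted; derive the effective rate it corresponds to.
quoted_memory_monthly = 4976.0
implied_mem_rate = quoted_memory_monthly / (MEM_GB_PER_NODE * node_hours)

observability_monthly = vcpu_monthly + quoted_memory_monthly
compute_monthly = NODE_HOURLY * node_hours
share_pct = observability_monthly / compute_monthly * 100

# Headline ratio against the manual pipeline estimate.
ratio_vs_manual = 17280 / 820

print(f"vCPU fee:        ${vcpu_monthly:,.0f}/month")
print(f"implied mem rate: ${implied_mem_rate:.4f}/GB-hour")
print(f"total fee:       ${observability_monthly:,.0f}/month")
print(f"compute:         ${compute_monthly:,.0f}/month ({share_pct:.1f}% overhead)")
print(f"ratio vs manual: {ratio_vs_manual:.0f}x")
```

Swapping in your own instance type and node count is the quickest way to see whether the per-vCPU model is negligible (small CPU nodes) or dominant (large GPU nodes) before enabling the add-on.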

Practical recommendation: use Enhanced Observability only on dev/staging clusters with smaller nodes (m5, c5) where the per-vCPU cost is negligible. On production clusters with large GPU instances, build the OTel pipeline manually with memory_limiter set to 80% of the container limit, batch processor with send_batch_size: 1000 and timeout: 10s, and prometheusreceiver scraping DCGM every 30 s (not 15 s — you halve metric volume without meaningful diagnostic loss).
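Two of the numbers in that recommendation are simple to derive for your own cluster. The sketch below is a hypothetical helper, assuming a 512 MiB container limit for the Collector (not a value from the article), that computes the 80% memory_limiter ceiling and compares sample volume at 15 s and 30 s scrape intervals.

```python
def memory_limit_mib(container_limit_mib: int, fraction: float = 0.8) -> int:
    """memory_limiter ceiling as a fraction of the container memory limit."""
    return int(container_limit_mib * fraction)


def samples_per_hour(series: int, scrape_interval_s: int) -> int:
    """Samples produced per hour for a fixed number of series."""
    return series * (3600 // scrape_interval_s)


limit = memory_limit_mib(512)            # 512 MiB is an assumed container limit
at_15s = samples_per_hour(320, 15)       # 320 series per p4d.24xlarge node
at_30s = samples_per_hour(320, 30)

print(limit, at_15s, at_30s)
```

The halving is exact because sample volume scales inversely with the scrape interval; the trade-off is only in how quickly a short GPU stall shows up.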

Numbers that matter in production ML clusters

~960 time series per GPU node at full label cardinality: DCGM exposes ~40 metrics × 8 GPUs × 3 dimensions (node, GPU, process) on a p4d.24xlarge

21× monthly cost ratio, Enhanced vs manual OTel (20 GPU nodes): $17,280 (Enhanced) vs ~$820 (OTel + CW EMF + Fluent Bit standalone)

<15 s alert latency with Datadog Live Tail + monitor threshold, compared to 60–120 s with CloudWatch Metric Alarms at high cardinality

OTel Collector: the pipeline worth the operational cost

The OpenTelemetry Collector is the only one of the four contenders that routes all three signals to multiple backends without agent duplication. In regulated financial environments this is often mandatory: you need to send logs to CloudWatch (7-year regulatory retention via S3 Glacier), metrics to an internal analytics backend (Prometheus + Thanos or Amazon Managed Service for Prometheus) and traces to Jaeger or Tempo, all simultaneously, with different delivery guarantees.

The pipeline configuration I use in production for ML workloads has three critical stages:

1. Parallel receivers with memory isolation: filelog receiver with start_at left at end rather than beginning (prevents re-reading historical logs on pod restart), prometheusreceiver with scrape_interval: 30s, and the operator's target allocator enabled to distribute scraping across multiple Collectors when the cluster has >50 nodes.

2. Chained processors with circuit breaker: memory_limiter as the first processor (limit at 80% of the container's resources.limits.memory), followed by the resource processor to add k8s.cluster.name, ml.job.id and gpu.node.type attributes, which are essential for cross-signal correlation. The batch processor comes last, not first: placing batch before memory_limiter is the most common mistake I see in architecture reviews.

3. Exporters with retry and persistent queue: awscloudwatchlogs exporter with log_stream_name derived from k8s.pod.name (avoids per-stream throttling), and awsemf exporter with explicit metric_declarations to avoid sending all 320 DCGM series to CloudWatch. Select the 8–12 that actually matter, for example DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_MEM_COPY_UTIL, DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL and DCGM_FI_DEV_ECC_DBE_VOL_TOTAL. retry_on_failure with max_elapsed_time: 300s and sending_queue with storage: file_storage ensure a temporary CloudWatch throttle does not lose training data.
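The three stages above can be sketched as a Collector configuration built in Python and checked before it is rendered to YAML. This is a minimal sketch under stated assumptions, not the production configuration: the queue directory, the dimension set, the localhost:9400 scrape target and the job/pod values are hypothetical, and whether a given exporter accepts every key shown depends on the collector-contrib version you run.

```python
import json

# The four DCGM metrics named in the text; a real allow-list would hold 8-12.
ESSENTIAL_DCGM = [
    "DCGM_FI_DEV_GPU_UTIL",
    "DCGM_FI_DEV_MEM_COPY_UTIL",
    "DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL",
    "DCGM_FI_DEV_ECC_DBE_VOL_TOTAL",
]

# memory_limiter first, batch last.
PROCESSOR_ORDER = ["memory_limiter", "resource", "batch"]


def build_config(cluster: str, job_id: str, pod_name: str) -> dict:
    """Return a Collector config as a plain dict (dump to YAML or JSON to use it)."""
    resilience = {
        "retry_on_failure": {"enabled": True, "max_elapsed_time": "300s"},
        "sending_queue": {"enabled": True, "storage": "file_storage"},
    }
    return {
        # Hypothetical directory for the persistent queue.
        "extensions": {"file_storage": {"directory": "/var/lib/otelcol/queue"}},
        "receivers": {
            "filelog": {"include": ["/var/log/containers/*.log"], "start_at": "end"},
            "prometheus": {"config": {"scrape_configs": [{
                "job_name": "dcgm",
                "scrape_interval": "30s",
                "static_configs": [{"targets": ["localhost:9400"]}],
            }]}},
        },
        "processors": {
            "memory_limiter": {"check_interval": "1s", "limit_percentage": 80},
            "resource": {"attributes": [
                {"key": "k8s.cluster.name", "value": cluster, "action": "upsert"},
                {"key": "ml.job.id", "value": job_id, "action": "upsert"},
            ]},
            "batch": {"send_batch_size": 1000, "timeout": "10s"},
        },
        "exporters": {
            "awscloudwatchlogs": {
                "log_group_name": "/aws/eks/ml-cluster",
                "log_stream_name": pod_name,
                **resilience,
            },
            "awsemf": {
                "namespace": "ML/GPU",
                "metric_declarations": [{
                    "dimensions": [["ClusterName", "gpu"]],  # hypothetical dimensions
                    "metric_name_selectors": ESSENTIAL_DCGM,
                }],
            },
        },
        "service": {
            "extensions": ["file_storage"],
            "pipelines": {
                "logs": {"receivers": ["filelog"],
                         "processors": PROCESSOR_ORDER,
                         "exporters": ["awscloudwatchlogs"]},
                "metrics": {"receivers": ["prometheus"],
                            "processors": PROCESSOR_ORDER,
                            "exporters": ["awsemf"]},
            },
        },
    }


def check_processor_order(config: dict) -> None:
    """Raise ValueError unless memory_limiter is first and batch is last everywhere."""
    for name, pipeline in config["service"]["pipelines"].items():
        procs = pipeline["processors"]
        if procs[0] != "memory_limiter" or procs[-1] != "batch":
            raise ValueError(f"pipeline {name!r} has unsafe processor order: {procs}")


config = build_config("ml-cluster", "job-123", "trainer-0")
check_processor_order(config)
print(json.dumps(config["service"]["pipelines"], indent=2))
```

Running a check like check_processor_order in CI is one inexpensive way to catch the batch-before-memory_limiter mistake before it reaches a cluster.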

Decision Matrix: Which Strategy for Which Context

Native Fluent Bit

Pros

Minimal overhead (~50 MB); ideal for nodes where memory is contested by GPU containers
Native IRSA integration; no static credentials
Routing to Kinesis for real-time alert hot-path

Cons

Covers logs only; GPU metrics require a separate agent
No native correlation between logs and metrics

Use as a complement to another collector, not as a standalone solution

OTel Collector

Pros

Unifies logs, metrics and traces in a single DaemonSet
Open standard; no lock-in; exports to any backend
21× lower cost than Enhanced Observability on large GPU clusters
target_allocator scales scraping horizontally

Cons

High configuration complexity; pipeline mistakes cause silent data loss
Requires expertise in OTel pipeline YAML and memory tuning

Best choice for production on medium-to-large GPU clusters with multi-backend requirements

CW Container Insights Enhanced

Pros

Zero operation; AWS-managed add-on
Native DCGM without additional configuration
Ideal for teams without OTel expertise

Cons

Prohibitive cost on large GPU instances (~$17k/month for 20 p4d nodes)
Strong CloudWatch lock-in; difficult to migrate historical data

Acceptable only on dev/staging clusters with smaller instances; never on large-scale GPU production

Datadog Agent

Pros

Lowest diagnostic latency (<15 s with Live Tail)
ML Observability with automatic run/experiment correlation
Native APM for serving pipelines (Triton, TorchServe)

Cons

Data leaves AWS; problematic in environments with financial data sovereignty restrictions
Cost grows linearly with hosts; can exceed OTel+CW on large clusters
API key rotation requires additional security process (Secrets Manager + External Secrets Operator)

Best for ML teams that need fast diagnosis and have no data sovereignty restrictions

Security and governance: what changes when ML logs contain sensitive data

In financial environments, training logs frequently contain information that should not leave the security perimeter: customer IDs in validation datasets, feature values derived from transactions, or simply the model name and version (which is sensitive competitive information). This adds a layer of requirements that most observability comparisons ignore.

CloudWatch Logs with KMS: all four collectors can send to CloudWatch Logs, but the data is only encrypted with your own key if the log group has a kms_key_id configured. With Fluent Bit, set auto_create_group Off and pre-create the group with aws logs create-log-group --kms-key-id arn:aws:kms:.... OTel Collector requires the log group to exist before the first PutLogEvents, which is a common race condition in clusters that scale rapidly.

IAM with context conditions: the collector's IRSA policy must have Condition: StringEquals: aws:RequestedRegion: [region] to prevent cross-region exfiltration, and aws:SourceVpc if you use VPC Endpoints for CloudWatch. The Datadog Agent stores the API key in a Kubernetes Secret — use External Secrets Operator with AWS Secrets Manager and 90-day automatic rotation, not a manual kubectl create secret.
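As an illustration of the condition keys mentioned above, the sketch below builds a policy document for the collector's role. The action list, the Sid, the region and the VPC ID are assumptions chosen for the example; scope Resource down to specific log groups in a real policy. Note that aws:SourceVpc is only populated when the request arrives through a VPC endpoint.

```python
import json
from typing import Optional


def collector_policy(region: str, vpc_id: Optional[str] = None) -> dict:
    """Build an IAM policy document restricted by region and, optionally, source VPC."""
    string_equals = {"aws:RequestedRegion": region}
    if vpc_id is not None:
        string_equals["aws:SourceVpc"] = vpc_id
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "CollectorWriteOnly",  # hypothetical statement ID
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents",
                "cloudwatch:PutMetricData",
            ],
            "Resource": "*",  # narrow this to specific log group ARNs in practice
            "Condition": {"StringEquals": string_equals},
        }],
    }


# Hypothetical region and VPC ID.
policy = collector_policy("sa-east-1", vpc_id="vpc-0123456789abcdef0")
print(json.dumps(policy, indent=2))
```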

Data masking in the pipeline: OTel Collector has the transform processor with OTTL (OpenTelemetry Transformation Language) that allows masking fields before sending: replace_pattern(body, "customer_id=\\d+", "customer_id=REDACTED"). Fluent Bit has the lua filter for the same purpose. Datadog has Sensitive Data Scanner on the backend side, but that means the data has already left AWS — for regulated environments, masking must happen in the collector, not the backend.
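Before deploying a masking rule to the collector, it helps to test the pattern against sample log lines. The sketch below applies the same regular expression as the OTTL statement above using Python's re module; it only validates the pattern and is not a replacement for the transform processor. The sample log line is invented for the example.

```python
import re

# Same pattern as the OTTL replace_pattern call in the text.
CUSTOMER_ID = re.compile(r"customer_id=\d+")


def mask(line: str) -> str:
    """Replace every customer_id=<digits> occurrence with a redacted marker."""
    return CUSTOMER_ID.sub("customer_id=REDACTED", line)


sample = "epoch=3 customer_id=48213 loss=0.41 customer_id=7"
masked = mask(sample)
print(masked)
```

A pattern anchored on digits will miss IDs that contain letters or hyphens, so it is worth running a sample of real validation logs through the check first.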

Retention and immutability: for compliance with financial regulations (BACEN, CVM, SOX), configure S3 Object Lock in COMPLIANCE mode on log archive buckets with a 7-year retention period. Fluent Bit with S3 output supports s3_key_format /%Y/%m/%d/%H/ for temporal partitioning that facilitates audit queries with Athena.
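The key format above uses strftime-style placeholders, so the resulting prefix for any timestamp can be previewed in Python. This is only a sketch of the date expansion; the function name is hypothetical and Fluent Bit's own tag and index placeholders are not modelled.

```python
from datetime import datetime, timezone


def archive_prefix(ts: datetime) -> str:
    """Expand the /%Y/%m/%d/%H/ pattern for a given timestamp."""
    return ts.strftime("/%Y/%m/%d/%H/")


prefix = archive_prefix(datetime(2026, 5, 31, 14, 7, tzinfo=timezone.utc))
print(prefix)
```

Hour-level prefixes like this let an Athena query restrict its scan to the partitions that cover an audit window.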

The most expensive architecture mistake: DCGM scraping without cardinality filtering

I have seen teams send all ~40 DCGM Exporter metrics to CloudWatch Metrics without metric_declarations. On a 30-node p4d.24xlarge cluster that generates 9,600 custom time series. CloudWatch bills custom metrics per metric per month ($0.30 each in the first 10,000-metric tier), but the real cost is not that. It is the API cost: PutMetricData has a limit of 1,000 values per call and 150 TPS per account. With 15 s scraping and 9,600 series you need at least ~10 full-size calls per cycle across the cluster, and more in practice because each node's Collector flushes its own smaller batches. Under load you start seeing ThrottlingException, which OTel Collector handles with exponential retry, increasing diagnostic latency exactly when you need it most. Filter to 8–12 essential metrics at the source, not the destination.
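A quick way to see what filtering buys is to compute the minimum number of API calls per scrape cycle before and after. The sketch below uses the 1,000-values-per-call figure quoted above as a parameter and assumes perfectly packed calls, so real call counts will be higher; the 10-metric allow-list is hypothetical.

```python
import math


def min_calls_per_cycle(series: int, values_per_call: int = 1000) -> int:
    """Lower bound on API calls needed to ship one scrape cycle."""
    return math.ceil(series / values_per_call)


NODES, GPUS_PER_NODE = 30, 8

unfiltered_series = NODES * GPUS_PER_NODE * 40   # all ~40 DCGM metrics
filtered_series = NODES * GPUS_PER_NODE * 10     # hypothetical 10-metric allow-list

unfiltered_calls = min_calls_per_cycle(unfiltered_series)
filtered_calls = min_calls_per_cycle(filtered_series)

print(unfiltered_series, unfiltered_calls)
print(filtered_series, filtered_calls)
```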

Anti-patterns I have seen in ML/EKS architecture reviews

Enabling Enhanced Observability on GPU production clusters without calculating the per-vCPU-hour cost upfront
Placing batch processor before memory_limiter in the OTel pipeline — the batch accumulates data in memory before the limiter can act
Using start_at: beginning in the filelog receiver without a storage extension — causes re-reading of all log history on every Collector restart
Storing Datadog API keys in Kubernetes Secrets without automatic rotation via External Secrets Operator
Not configuring mem_buf_limit in Fluent Bit on GPU nodes — a training log burst can OOM the DaemonSet and stop collection for all pods on the node
Sending PII data from validation datasets to SaaS backends without masking in the collector

My curation note

Senior Solutions Architect

In production I build the pipeline as OTel Collector with prometheusreceiver for DCGM (scrape at 30 s, not 15 s) + awsemf exporter with explicit metric_declarations for the 10 GPU metrics that actually matter + Fluent Bit only for the critical-log hot-path via Kinesis. Datadog is reserved for environments where the ML team needs <15 s diagnosis and there is no data sovereignty restriction — which in Brazilian banks is rarely the case. The most expensive lesson I have learned: observability cost on GPU clusters is not marginal — on a large-scale training cluster it can easily exceed the cost of a senior engineer per month if you do not consciously size the pipeline from the first deploy.

Final Recommendation: Composition, Not a Single Choice

OTel Collector + Fluent Bit hot-path

There is no single winner in this bake-off because the four contenders solve different problems. The architecture I recommend for production ML clusters in financial environments is a deliberate composition:

Collection layer: OTel Collector as the primary DaemonSet, with opentelemetry-operator managing the lifecycle. Configure filelog receiver for logs, prometheusreceiver for DCGM (30 s, filtered to 10 metrics), and otlp receiver for serving traces. Use memory_limiter as the first processor, resource for ML attribute enrichment, and batch last.

Dual routing: awsemf exporter for critical metrics in CloudWatch (operational alerts), awscloudwatchlogs for KMS-encrypted logs, and the contrib awss3 exporter to land data in S3, converted to Parquet for long-term archival and Athena analysis. For clusters with serving trace requirements, add the awsxray exporter to send traces to AWS X-Ray.

Fluent Bit as hot-path: keep the Fluent Bit DaemonSet only for routing critical logs (errors, OOM, checkpoint failures) to Kinesis Data Streams → Lambda → PagerDuty. This guarantees <30 s alert latency without OTel Collector overhead on the critical path.
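The hot-path split described above comes down to a classification rule on each log line. The sketch below shows one way to prototype that rule in Python before translating it into a Fluent Bit filter; the patterns and the "hot"/"bulk" labels are hypothetical examples of errors, OOM events and checkpoint failures, not an exhaustive list.

```python
import re

# Hypothetical patterns for lines that should reach the paging path.
HOT_PATH = re.compile(
    r"\b(ERROR|CUDA out of memory|OOMKilled|checkpoint.*failed)\b",
    re.IGNORECASE,
)


def route(line: str) -> str:
    """Return 'hot' for lines that should page someone, 'bulk' for everything else."""
    return "hot" if HOT_PATH.search(line) else "bulk"


lines = [
    "rank 3 epoch 12 loss=0.412 grad_norm=1.8",
    "RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB",
    "checkpoint save failed at step 4000",
]
for line in lines:
    print(route(line), "|", line)
```

Keeping the pattern list short matters here: every line that matches goes to Kinesis and Lambda, so a broad pattern turns the hot path back into a bulk path.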



Analyzed source: EKS ML core logging
