ML observability on EKS with logs, metrics and cost
An architecture analysis of ML observability on EKS with logs, metrics and cost, connecting AWS, security, operations, cost and evolution. The focus is turning a current technical signal into an applicable design for critical systems.
Why this topic matters
ML observability on EKS with logs, metrics and cost is a useful example of how modern architecture is no longer only a technology choice. The signal comes from the AWS Architecture Blog, but the senior-architect read is about connecting product, security, operations, cost and evolution. In critical environments, especially financial systems, the solution must start with clear boundaries, telemetry, limits and an operational story that makes sense.
The central point is avoiding a beautiful diagram that fails during the first incident. I look at EKS, MLOps and observability and ask: which business value is protected, which risk grows with scale, which part should be managed by cloud services and which part needs to be governed as an internal product?
Architecture decisions
My preference is to split the problem into four layers. The first one is the edge: authentication, authorization, WAF, rate limits, cache and experience. The second is the domain layer, where rules, policies and orchestrations live behind versioned contracts. The third is the data and event platform, responsible for history, auditability, traceability and integration. The fourth is operations, with metrics, logs, traces, runbooks and automation.
This separation reduces coupling and improves the conversation with security, product and operations. It also prevents a local decision from becoming structural debt. When the architecture involves data governance, observability, lineage and cost per domain, every boundary needs an owner, an SLO, an estimated cost and an evolution criterion.
Applying it on AWS
On AWS, I would design the first increment with managed services and simple guardrails. CloudFront and AWS WAF protect the entry point. API Gateway or an ALB standardizes contracts. Lambda, ECS or EKS runs domain capabilities, chosen by workload profile. EventBridge and Step Functions help create auditable and idempotent flows. S3, DynamoDB, Aurora, OpenSearch, Glue and Athena enter according to the data pattern. CloudWatch, X-Ray, CloudTrail and logs in S3 close the visibility loop.
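To make the visibility loop concrete, here is a minimal sketch of a model-serving pod emitting a latency metric as a structured log line in the CloudWatch embedded metric format. It assumes the pod's stdout is already shipped to CloudWatch Logs by a log agent on the cluster. The namespace, dimension names and metric name are hypothetical placeholders, not values from the referenced blog.

```python
import json
import time


def emf_record(namespace, dimensions, metric_name, value, unit="Milliseconds"):
    """Build one log record in the CloudWatch embedded metric format (EMF)."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set, listing the keys defined below.
                    "Dimensions": [list(dimensions)],
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        # The metric value is a top-level member named after the metric.
        metric_name: value,
    }
    # Dimension values are also top-level members of the record.
    record.update(dimensions)
    return record


record = emf_record(
    namespace="MLPlatform/Inference",  # hypothetical namespace
    dimensions={"Service": "fraud-scoring", "ModelVersion": "v3"},
    metric_name="InferenceLatency",
    value=42.5,
)
# Printing one JSON object per line keeps the log and the metric in one event.
print(json.dumps(record))
```

The design choice worth noting is cost: each distinct combination of dimension values becomes its own custom metric, so identifiers such as request or user IDs belong in the log fields and not in the dimensions.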
When AI appears in the design, Bedrock must be treated as a controlled capability, not as a shortcut. Prompts, tools, sources, cost limits, evaluation and fallback need versioning. Model calls need timeout, quota, cache where useful, abuse protection and logging without sensitive data.
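The controls above can be sketched as a thin wrapper around the model call. This is an illustration under stated assumptions, not a Bedrock client: `invoke` is any callable you supply (for example, a function that wraps your SDK call), and `GuardedModelClient`, its parameters and the fallback value are hypothetical names. The wrapper applies a timeout, a per-minute quota, a cache keyed by prompt version, and logs a hash of the prompt instead of its text.

```python
import hashlib
import logging
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

logger = logging.getLogger("model_gateway")


class GuardedModelClient:
    """Hypothetical wrapper; single-threaded callers assumed (no locking)."""

    def __init__(self, invoke, timeout_s=5.0, max_calls_per_minute=60,
                 fallback="UNAVAILABLE"):
        self._invoke = invoke
        self._timeout_s = timeout_s
        self._max_calls = max_calls_per_minute
        self._fallback = fallback
        self._cache = {}
        self._call_times = []
        self._pool = ThreadPoolExecutor(max_workers=4)

    def _within_quota(self, now):
        # Keep only the calls made during the last 60 seconds.
        self._call_times = [t for t in self._call_times if now - t < 60.0]
        return len(self._call_times) < self._max_calls

    def ask(self, prompt_version, prompt):
        # The key identifies the request without storing the prompt text.
        key = hashlib.sha256(f"{prompt_version}:{prompt}".encode()).hexdigest()
        if key in self._cache:
            return self._cache[key]

        now = time.monotonic()
        if not self._within_quota(now):
            logger.warning("quota_exceeded version=%s key=%s", prompt_version, key[:12])
            return self._fallback
        self._call_times.append(now)

        future = self._pool.submit(self._invoke, prompt)
        try:
            answer = future.result(timeout=self._timeout_s)
        except FutureTimeout:
            # The worker thread may still finish later; its result is ignored.
            logger.warning("model_timeout version=%s key=%s", prompt_version, key[:12])
            return self._fallback
        except Exception:
            logger.exception("model_error version=%s key=%s", prompt_version, key[:12])
            return self._fallback

        self._cache[key] = answer
        logger.info("model_ok version=%s key=%s", prompt_version, key[:12])
        return answer


def fake_model(prompt):
    # Stand-in for a real model call, used only to exercise the wrapper.
    return f"summary of {len(prompt)} chars"


client = GuardedModelClient(fake_model, timeout_s=2.0, max_calls_per_minute=30)
print(client.ask("summarize-v1", "Customer reported a failed transfer."))
```

Passing the prompt version into the cache key means a new prompt revision never reuses answers produced by the previous one, which keeps versioning and caching consistent with each other.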
Trade-offs
The main trade-off is speed versus governance. Serverless and managed services can accelerate delivery, but architecture accountability cannot be outsourced. There is also cost versus observability depth: logging everything is expensive and risky, while logging too little leaves operations blind. The balance comes from semantic events, sampling, sensitive-data redaction and purpose-based retention.
Evolution is another important point. A good solution today must accept changes in volume, regulation, teams and external dependencies. That is why I like short ADRs, OpenAPI or AsyncAPI contracts, resilience tests, Well-Architected reviews and dashboards with cost per unit of value.
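A minimal sketch of that balance, assuming a small in-process helper that builds semantic events: it redacts sensitive fields, samples by level and tags each event with a retention period by purpose. The key names, sample rates and retention values are illustrative assumptions, and the card pattern is deliberately naive.

```python
import random
import re

# Hypothetical policy values; tune them per domain and regulation.
SENSITIVE_KEYS = {"email", "card_number", "document_id", "prompt"}
CARD_PATTERN = re.compile(r"\b\d{13,19}\b")  # naive match for card-like numbers
SAMPLE_RATES = {"ERROR": 1.0, "WARN": 1.0, "INFO": 0.1, "DEBUG": 0.01}
RETENTION_DAYS = {"audit": 365, "operations": 30, "debug": 7}


def redact(fields):
    """Mask sensitive keys and card-like numbers inside free text."""
    clean = {}
    for key, value in fields.items():
        if key in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, str):
            clean[key] = CARD_PATTERN.sub("[REDACTED]", value)
        else:
            clean[key] = value
    return clean


def build_event(name, level, purpose, fields, rng=random.random):
    """Return a semantic event, or None when the event is sampled out."""
    if rng() >= SAMPLE_RATES.get(level, 1.0):
        return None
    return {
        **redact(fields),
        # Envelope keys are written last so payload fields cannot override them.
        "event": name,
        "level": level,
        "purpose": purpose,
        "retention_days": RETENTION_DAYS[purpose],
    }


event = build_event(
    "payment.scored",
    "ERROR",
    "audit",
    {"email": "a@example.com", "note": "card 4111111111111111 declined", "score": 0.9},
)
print(event)
```

Errors and warnings are kept at a rate of 1.0 so incidents stay fully visible, while high-volume informational events carry most of the sampling.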
Maturity signals
I would consider the solution mature when it has explicit limits, actionable alarms, clear runbooks, least privilege, failure tests, a deployment pipeline and rollback criteria. It also needs to show impact: response time, availability, cost per transaction, error rate, audit coverage and user satisfaction.
Conclusion
ML observability on EKS with logs, metrics and cost shows that strong architecture is a discipline of choices. Technology matters, but the difference is designing systems that explain their limits, protect data, scale with predictable cost and improve with evidence. This is the kind of architecture I like to build: practical, secure, well operated and ready to change.
Reference architecture
The design separates experience, domain, data/events and operations to reduce coupling and improve traceability.
Reference flow for controlled execution
Protected edge
CloudFront / WAF: Filters traffic, applies limits, protects the origin and preserves user experience.
Contracted API
API Gateway: Exposes versioned contracts, authentication and consumption policies by audience.
Domain workflow
Lambda / ECS / Step Functions: Orchestrates business rules with idempotency, retries and clear rollback.
Data and events
S3 / DynamoDB / EventBridge: Stores state, events and audit trails with encryption and defined retention.
Operate and improve
CloudWatch / Athena: Turns telemetry into metrics, alerts, unit costs and continuous learning.
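The idempotency and retry behaviour of the domain workflow step can be sketched as follows. This is a simplified illustration: the in-memory dictionary stands in for a durable store (in practice something like a table with conditional writes), and `IdempotentExecutor` and `TransientError` are hypothetical names.

```python
import time


class TransientError(Exception):
    """Raised by an action when a retry is expected to help."""


class IdempotentExecutor:
    def __init__(self, max_attempts=3, base_delay_s=0.1, sleep=time.sleep):
        self._results = {}  # stand-in for a durable idempotency store
        self._max_attempts = max_attempts
        self._base_delay_s = base_delay_s
        self._sleep = sleep

    def run(self, idempotency_key, action):
        # A repeated request with the same key returns the stored result
        # instead of executing the business action a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]

        for attempt in range(1, self._max_attempts + 1):
            try:
                result = action()
                break
            except TransientError:
                if attempt == self._max_attempts:
                    raise  # retries exhausted; the caller decides on rollback
                # Exponential backoff: base, 2x base, 4x base, ...
                self._sleep(self._base_delay_s * 2 ** (attempt - 1))

        self._results[idempotency_key] = result
        return result


attempts = {"count": 0}


def flaky_transfer():
    # Fails twice with a transient error, then succeeds.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("downstream timeout")
    return {"status": "completed"}


executor = IdempotentExecutor(sleep=lambda seconds: None)  # no real waiting here
first = executor.run("transfer-123", flaky_transfer)
second = executor.run("transfer-123", flaky_transfer)
print(first, second, attempts["count"])
```

Only errors marked as transient are retried. Anything else propagates immediately, which keeps retries from hiding real defects or repeating a non-recoverable operation.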
C4 architecture view
C4 view to make boundaries, responsibilities, technologies and evolutionary contracts explicit.
Containers
Web/API Edge
CloudFront, WAF, API Gateway: Public layer with protection, authentication, rate limits and routing.
Domain Platform
Lambda, ECS/EKS, Step Functions: Container for domain capabilities, policies and orchestration.
Knowledge and Data
S3, DynamoDB, OpenSearch, Aurora: Container for data, history, search, evidence and auditability.
Operations Plane
CloudWatch, CloudTrail, Athena: Container for observability, security, cost and continuous improvement.
Components
Policy evaluator
IAM / rules engine: Evaluates access, risk, quotas and exceptions before execution.
Workflow coordinator
Step Functions: Coordinates steps, compensations, timeouts and safe reprocessing.
Telemetry normalizer
OpenTelemetry / CloudWatch: Standardizes logs, metrics and traces for operations and later analysis.
Cost guardrail
Budgets / custom metrics: Links consumption to value unit and triggers anomaly alerts.
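For the cost guardrail, a custom metric can be as simple as cost divided by units of value, compared against recent history. The sketch below uses a plain z-score check as an illustrative stand-in. It is not how AWS Budgets or any managed anomaly detection works, and the figures are invented for the example.

```python
from statistics import mean, pstdev


def cost_per_unit(total_cost, units):
    """Cost per unit of value, for example dollars per inference."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total_cost / units


def is_anomalous(history, today, threshold_sigmas=3.0):
    """Flag today's unit cost when it deviates strongly from recent history."""
    if len(history) < 2:
        return False  # not enough data to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > threshold_sigmas


# Invented figures: unit cost for the previous four days, in dollars.
history = [0.0020, 0.0021, 0.0019, 0.0020]
today = cost_per_unit(total_cost=90.0, units=20_000)

if is_anomalous(history, today):
    print(f"unit cost anomaly: {today:.4f} per inference")
```

Tracking the unit cost instead of the total bill separates healthy growth (more inferences at the same unit cost) from regressions (the same traffic at a higher unit cost).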
Code and contracts
API contract
OpenAPI: Defines schema, errors, authentication and limits for consumers.
Architecture decision
ADR: Records context, decision, alternatives and consequences.
Implementation checklist
Practical items to turn the analysis into an execution plan.
Define context and ADR
Record objective, alternatives, decision and operational consequences.
Draw C4 and contracts
Make containers, components, OpenAPI/AsyncAPI and trust boundaries explicit.
Create guardrails
Implement minimum IAM, quotas, WAF/rate limits, budgets and input validation.
Instrument telemetry
Standardize logs, metrics, traces, auditability and cost per unit.
Test failures
Validate timeout, retry, rollback, graceful degradation and restoration.
Evolve by metrics
Use operations and business data to prioritize the next increment.
Anti-patterns to avoid
- Starting with the tool before defining boundaries, risks and metrics.
- Logging sensitive data or full prompts without clear purpose and retention.
- Scaling without cost limits, quotas, observability and rollback.
- Mixing product, domain, data and operations responsibilities in a single block.
AWS Well-Architected lens
A pillar-based read that turns architecture decisions into sustainable operations.
Operational excellence
Define SLOs, runbooks, actionable alarms and ownership before increasing scope.
Security
Apply least privilege, encryption, input validation and logs without sensitive data.
Reliability
Model failures, quotas, timeouts, retries, idempotency and tested recovery.
Performance efficiency
Choose services by workload profile and validate bottlenecks with tests and telemetry.
Cost optimization
Use cost per value unit, budgets, anomalies and consumption limits per route.
Sustainability
Reduce unnecessary processing, retain data by purpose and prefer efficient automation.