Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.

What technical topics does Fernando work with?

Fernando works with AWS, Kubernetes, Kafka, Data Mesh, Amazon Bedrock, RAG, DevSecOps, observability, financial systems and architecture communication using C4, ADRs and trade-off analysis.

Is Fernando available for professional conversations?

Fernando is currently building at Banco Itaú and is open to thoughtful conversations about architecture, cloud, AI, engineering leadership, community, podcasts and technical collaboration.

The AI Architect Track

Module 5 · Apply + Project · Lesson 21/22

From prototype to production

What separates a pretty demo from a reliable AI system in production.

5 min read

Every AI demo works on the first presentation. The problem starts when you try to put it into production and the model hallucinates, costs explode, timeouts fire, and nobody knows why. The gap between a pretty prototype and a reliable system isn't small — but it's crossable if you know what to add and in what order.

The valley between demo and production

An AI prototype has three characteristics that make it deceptively good: you control the inputs, you're present to fix failures on the spot, and you have no real volume. Remove any of those three and the system starts showing its cracks.

The "valley" is the set of problems that only appear in production: unexpected inputs that break the prompt, variable model latency that blows the SLA, costs that rise with real usage, sensitive data leaking through context, and no visibility to diagnose any of it.

The good news: these problems are known and have engineering solutions. It's not magic — it's the same discipline you already apply to APIs and data pipelines, adapted for systems that have a non-deterministic component in the middle. Previous lessons covered each piece in isolation (evaluation in lesson 09, guardrails in lesson 10, tools in lesson 07). This lesson connects everything into a readiness journey.

Prototype-to-production journey

Each phase adds a layer of reliability. You don't need everything on day one, but you need to know where you are on the map.

🧪 Phase 1: Prototype

Fixed prompt in code
Single model, no fallback

🔬 Phase 2: Quality

Continuous evals (golden set)
Prompt versioning

🔭 Phase 3: Observability

Traces / spans per call
Token and cost metering

🛡️ Phase 4: Security

Guardrails (input + output)
PII redaction before the context

⚙️ Phase 5: Resilience

Timeout + retry with exponential backoff
Model fallback or default response
Idempotency in tool calls

🔄 Phase 6: Improvement loop

Feedback collection (thumbs / flags)
Signal for adjusting the prompt / model

What to add — and why each piece matters

Continuous evaluation is your safety belt. Without a golden set of cases tested on every deploy, you don't know if a prompt change improved or broke behavior. Keep at least 30–50 representative cases and run evals in CI before any promotion.
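A minimal sketch of what an eval gate in CI could look like. The names `EvalCase`, `run_golden_set` and `fake_model` are hypothetical, and `fake_model` is only a stand-in for your real model call; the two cases are illustrative, not a real golden set.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def run_golden_set(
    cases: List[EvalCase],
    generate: Callable[[str], str],
    min_pass_rate: float = 1.0,
) -> Tuple[bool, List[str]]:
    """Run every case through `generate`; return (gate_passed, failed_case_names)."""
    failures = [c.name for c in cases if not c.check(generate(c.prompt))]
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate >= min_pass_rate, failures


# Hypothetical stand-in for the real model call.
def fake_model(prompt: str) -> str:
    if "balance" in prompt:
        return "Your balance is available in the app."
    return "I don't know."


GOLDEN_SET = [
    EvalCase("balance_happy_path", "How do I check my balance?", lambda out: "app" in out),
    EvalCase("refuses_unknown", "What is the CEO's home address?", lambda out: "don't know" in out),
]

ok, failed = run_golden_set(GOLDEN_SET, fake_model)
print("gate passed" if ok else f"gate failed: {failed}")
```

In a pipeline, the script would exit with a non-zero status when `ok` is false, so the promotion step never runs.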

Observability with traces is different from traditional logging. You need spans that cover model call latency, tokens consumed (input and output), the prompt version used, and the guardrail result. Tools like Amazon CloudWatch, Langfuse, or OpenTelemetry with a custom exporter can handle this. Without it, you're flying blind.
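To make the span fields concrete, here is a small standard-library sketch. It is not the OpenTelemetry or Langfuse API: `model_span` and the in-memory `SPANS` list are hypothetical stand-ins for a real tracer and exporter, and the token counts are placeholder values.

```python
import time
from contextlib import contextmanager

SPANS = []  # hypothetical stand-in for a real exporter


@contextmanager
def model_span(prompt_version: str):
    """Record one span per model call, even when the call raises."""
    span = {"name": "model_call", "prompt_version": prompt_version}
    start = time.perf_counter()
    try:
        yield span
    finally:
        span["latency_ms"] = (time.perf_counter() - start) * 1000
        SPANS.append(span)


with model_span("support-v3") as span:
    # The real model call would go here; the values below are placeholders.
    span["input_tokens"] = 412
    span["output_tokens"] = 96
    span["guardrail_result"] = "passed"
```

Everything you need to debug a bad response later (latency, tokens, prompt version, guardrail outcome) ends up on a single record.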

Prompt versioning feels like bureaucracy until the day you need to roll back. Treat prompts like code: hash, changelog, and rollback. Don't let the prompt live only in someone's head or in a .env with no history.
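One way to picture "hash, changelog, and rollback" is the sketch below. `PromptRegistry` is a hypothetical in-memory class for illustration only; in practice the history would live in Git or a prompt management tool.

```python
import hashlib


def prompt_hash(text: str) -> str:
    """Short content hash that identifies a prompt version."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


class PromptRegistry:
    """Append-only list of versions plus a pointer to the active one."""

    def __init__(self):
        self.versions = []  # each entry: {"hash", "text", "changelog"}
        self.active = -1

    def publish(self, text: str, changelog: str) -> str:
        version = {"hash": prompt_hash(text), "text": text, "changelog": changelog}
        self.versions.append(version)
        self.active = len(self.versions) - 1
        return version["hash"]

    def current(self) -> dict:
        return self.versions[self.active]

    def rollback(self) -> dict:
        """Point back at the previous version without deleting history."""
        if self.active <= 0:
            raise RuntimeError("no earlier version to roll back to")
        self.active -= 1
        return self.current()


registry = PromptRegistry()
v1 = registry.publish("You are a helpful banking assistant.", "initial version")
v2 = registry.publish("You are a concise banking assistant.", "shorter answers")
restored = registry.rollback()
```

Because the hash is derived from the prompt text, the same value can be attached to every trace to show exactly which version produced a response.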

Cost limits should be hard limits, not alerts. Define a token ceiling per user per hour and a global daily ceiling. Without this, a misconfigured agent loop or an abusive user can generate an unexpected bill before the next monitoring cycle.
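A hard limit can be as simple as refusing the call before it is made. The sketch below assumes fixed hourly and daily windows and a single process with in-memory counters; `TokenBudget` and the ceiling values are hypothetical, and a real deployment would need a shared store and eviction of old buckets.

```python
import time
from collections import defaultdict


class TokenBudget:
    """Hard ceilings: tokens per user per hour, and tokens globally per day."""

    def __init__(self, per_user_hourly: int, global_daily: int, clock=time.time):
        self.per_user_hourly = per_user_hourly
        self.global_daily = global_daily
        self.clock = clock
        self._user_hour = defaultdict(int)  # (user_id, hour_bucket) -> tokens
        self._day = defaultdict(int)        # day_bucket -> tokens

    def try_spend(self, user_id: str, tokens: int) -> bool:
        """Reserve tokens if both ceilings allow it; otherwise refuse the call."""
        now = self.clock()
        hour, day = int(now // 3600), int(now // 86400)
        if self._user_hour[(user_id, hour)] + tokens > self.per_user_hourly:
            return False
        if self._day[day] + tokens > self.global_daily:
            return False
        self._user_hour[(user_id, hour)] += tokens
        self._day[day] += tokens
        return True


fake_now = [0.0]  # controllable clock so the example is deterministic
budget = TokenBudget(per_user_hourly=1000, global_daily=1500, clock=lambda: fake_now[0])
first = budget.try_spend("alice", 800)   # within both ceilings
second = budget.try_spend("alice", 300)  # would exceed alice's hourly ceiling
```

The check happens before the model call, so a runaway agent loop stops at the ceiling instead of showing up on the next bill.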

Graceful degradation means that when the primary model fails or is too slow, the system responds with something useful — whether a smaller model, a cached response, or an honest message that the service is temporarily limited. Silence or a 500 error is worse than a partial response.
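The fallback order described above can be sketched as a simple chain. The function name, the provider callables and the wording of `FALLBACK_MESSAGE` are all assumptions for illustration; in a real system each failure would also be recorded in the trace.

```python
from typing import Callable, Dict, List, Tuple

FALLBACK_MESSAGE = "The assistant is temporarily limited. Please try again in a few minutes."


def answer_with_degradation(
    prompt: str,
    providers: List[Tuple[str, Callable[[str], str]]],
    cache: Dict[str, str],
) -> Tuple[str, str]:
    """Try each provider in order, then the cache, then an honest static message."""
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception:
            continue  # move on to the next, cheaper option
    if prompt in cache:
        return "cache", cache[prompt]
    return "static", FALLBACK_MESSAGE


# Hypothetical providers: the primary always times out, the small model answers.
def primary(prompt: str) -> str:
    raise TimeoutError("primary model timed out")


def small_model(prompt: str) -> str:
    return f"(short answer) {prompt}"


source, text = answer_with_degradation(
    "What is my card limit?",
    [("primary", primary), ("small", small_model)],
    cache={},
)
```

Returning the source alongside the text lets the caller tell the user, and the trace, that the answer came from a degraded path.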

In practice

Senior Solutions Architect

In practice, the most common mistake I see is teams that instrument the front-end but forget to trace what happens inside the agent loop. You see the request come in and the response go out, but you don't see how many tool calls were made, which one was slow, or which one returned incorrect data. Trace every step of the agent like a distributed transaction — because that's exactly what it is.
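As a rough illustration of tracing inside the agent loop, the sketch below records one step per tool call under a shared trace id. `AgentTrace` and the two tools are hypothetical; a real setup would emit child spans through your tracing library instead of appending to a list.

```python
import time
import uuid


class AgentTrace:
    """Collects one record per tool call, all tied to the same trace id."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.steps = []

    def run_tool(self, name, fn, *args):
        start = time.perf_counter()
        status = "ok"
        try:
            return fn(*args)
        except Exception:
            status = "error"
            raise
        finally:
            self.steps.append({
                "trace_id": self.trace_id,
                "tool": name,
                "status": status,
                "latency_ms": (time.perf_counter() - start) * 1000,
            })

    def slowest(self) -> dict:
        return max(self.steps, key=lambda s: s["latency_ms"])


trace = AgentTrace()
trace.run_tool("lookup_customer", lambda cid: {"id": cid}, "c-1")
trace.run_tool("fetch_statement", lambda: time.sleep(0.05) or [])  # simulated slow tool
print(len(trace.steps), "tool calls; slowest:", trace.slowest()["tool"])
```

With this in place you can answer the three questions above directly: how many tool calls ran, which one was slow, and which one failed.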

Reliability, security, and the improvement loop

Timeouts and retries on model calls are not optional. Large models have variable latency — p99 can be 3–5x the p50. Set an aggressive timeout (e.g., 15s for synchronous responses) and implement retry with exponential backoff and jitter. But beware: retry without idempotency in tool calls can cause duplicate side effects. If the tool writes to the database or sends an email, it needs to be idempotent or you need a deduplication mechanism.
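The interaction between retries and side effects is easier to see in code. This is a sketch under assumptions: the timeout itself is expected to be enforced by your model client and surface as `TimeoutError`, and `call_with_retry`, `IdempotentTool`, the key format and the email tool are hypothetical names. The dedupe table is in memory; a real one would be persisted.

```python
import random
import time


def call_with_retry(fn, max_attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Retry `fn` on TimeoutError using exponential backoff with full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))  # jitter spreads out simultaneous retries


class IdempotentTool:
    """Wraps a side-effecting tool so a repeated key returns the stored result."""

    def __init__(self, fn):
        self.fn = fn
        self._results = {}

    def __call__(self, idempotency_key, *args):
        if idempotency_key not in self._results:
            self._results[idempotency_key] = self.fn(*args)
        return self._results[idempotency_key]


sent = []
send_email = IdempotentTool(lambda to: sent.append(to) or f"sent:{to}")
attempts = {"n": 0}


def flaky_step():
    """The tool runs, then the step times out on the first two attempts."""
    attempts["n"] += 1
    result = send_email("order-42-confirmation", "ana@example.com")
    if attempts["n"] < 3:
        raise TimeoutError("timed out after the tool already ran")
    return result


result = call_with_retry(flaky_step, sleep=lambda seconds: None)  # no real sleeping here
```

The step runs three times, but the email goes out once, because the second and third attempts hit the stored result for the same idempotency key.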

Security and privacy have two fronts. The first is protection against prompt injection (covered in lesson 10), which requires guardrails on both input and output. The second is data privacy: never send PII to the model unless the task truly requires it. Redact before assembling the context, and review what goes into stored conversation history.
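A deliberately simple sketch of redaction before context assembly. The patterns (email, Brazilian CPF in its formatted form, phone number) are illustrative assumptions and far from complete; regular expressions alone miss names, addresses and unformatted identifiers, so treat this as the shape of the step, not a PII detector.

```python
import re

# Order matters: the CPF pattern runs before the looser phone pattern.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "CPF": re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text: str) -> str:
    """Replace each match with a typed placeholder such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def build_context(user_message: str, history: list) -> str:
    """Redact every piece of text before it is joined into the prompt context."""
    return "\n".join(redact(part) for part in [*history, user_message])


raw = "Contact ana.silva@example.com or +55 11 91234-5678, CPF 123.456.789-09."
print(redact(raw))  # Contact [EMAIL] or [PHONE], CPF [CPF].
```

Running redaction inside `build_context` means stored history is covered too, not just the latest message.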

The improvement loop is what separates systems that stay good from systems that go stale. Collect structured feedback (thumbs up/down, incorrect response flags), store it with the corresponding trace, and use it to update the golden eval set and eventually adjust prompts or switch models. Without this loop, you're operating in the dark — and quality will silently degrade with model changes or real-world data drift.
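Linking feedback to traces can be sketched as follows. `Feedback`, `golden_set_candidates` and the trace fields are hypothetical; the point is only that a thumbs-down carries a trace id, which lets you recover the exact input and output and queue it for review as a new eval case.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Feedback:
    trace_id: str
    rating: str      # "up" or "down"
    flag: str = ""   # for example "incorrect"


def golden_set_candidates(feedback: List[Feedback], traces: Dict[str, dict]) -> List[dict]:
    """Turn negative feedback into eval candidates, one per trace."""
    seen = set()
    candidates = []
    for fb in feedback:
        if fb.rating != "down" or fb.trace_id not in traces or fb.trace_id in seen:
            continue
        seen.add(fb.trace_id)
        trace = traces[fb.trace_id]
        candidates.append({
            "trace_id": fb.trace_id,
            "input": trace["input"],
            "rejected_output": trace["output"],
            "flag": fb.flag,
        })
    return candidates


traces = {
    "t1": {"input": "What is the transfer fee?", "output": "There is no fee."},
    "t2": {"input": "How do I block my card?", "output": "Use the app."},
}
feedback = [
    Feedback("t1", "down", "incorrect"),
    Feedback("t2", "up"),
    Feedback("t1", "down", "incorrect"),  # duplicate report for the same trace
]
candidates = golden_set_candidates(feedback, traces)
```

A person still decides what the expected answer should be before a candidate joins the golden set; the code only makes sure nothing flagged gets lost.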

Lesson 17 (Bedrock AgentCore) and lesson 18 (Knowledge Bases) show how AWS manages parts of this cycle. But the responsibility of defining what "good" means and closing the feedback loop is always yours.

Put in order

Order the path to production

From prototype to an operable system.

1. Working prototype with prompt + model
2. Add guardrails, observability and cost caps
3. Add evaluation (evals) and measure quality/cost
4. Operate, monitor and continuously improve

Production readiness checklist

1. Evals in CI
Golden set with at least 30 cases. No deploy without passing evals.
2. Versioned prompts
Hash + changelog + ability to roll back in under 5 minutes.
3. Per-call traces
Latency, tokens, prompt version and guardrail result, all in one traceable span.
4. Active guardrails
Input and output validation. PII redacted before entering the context.
5. Cost limits configured
Hard token limit per user/hour and global daily ceiling. Alert at 80% of the ceiling.
6. Timeout + retry + fallback
Timeout defined, retry with backoff, and fallback response for when the primary model fails.
7. Idempotent tool calls
Every tool with a side effect has an idempotency key or deduplication mechanism.
8. Active feedback loop
Feedback collection linked to the trace. Defined process to update evals and prompts.

Frequently asked questions

Do I need all of this before the first production deploy?

No, but you need a minimum subset: basic evals, timeout/retry, and input guardrails. The rest you add in the first weeks. What you can't do is go to production with none of these pieces.

How do I version prompts without a dedicated tool?

Git solves the basic problem. Put prompts in text files in the repository, give them descriptive names, and record the changelog in commit messages. If you want more control, Amazon Bedrock Prompt Management and Langfuse both offer built-in versioning.

What is a 'golden set' of evals in practice?

It's a collection of (input, expected output or evaluation criterion) pairs that represents the most important cases in your system — including happy paths, edge cases, and cases that previously failed. You run these cases on every change and compare the result to the expected, whether by exact match, LLM-as-judge, or a custom metric.

What comes next

The next lesson is the guided project: you'll build a complete system — from RAG to agent — applying each of these layers. The checklist from this lesson is your acceptance criterion. If the project passes all items, you have a system that can genuinely go to production.

References

AWS Well-Architected: Machine Learning Lens
Amazon Bedrock: Guardrails
Amazon Bedrock: Prompt Management
Langfuse: Open Source LLM Observability
OpenTelemetry for LLM Observability (OTEL Semantic Conventions)
Building Production-Ready RAG Applications (AWS Blog)
