From prototype to production
What separates a pretty demo from a reliable AI system in production.
5 min read
Every AI demo works on the first presentation. The problem starts when you try to put it into production and the model hallucinates, costs explode, timeouts fire, and nobody knows why. The gap between a pretty prototype and a reliable system isn't small — but it's crossable if you know what to add and in what order.
The valley between demo and production
An AI prototype has three characteristics that make it deceptively good: you control the inputs, you're present to fix failures on the spot, and you have no real volume. Remove any of those three and the system starts showing its cracks.
The "valley" is the set of problems that only appear in production: unexpected inputs that break the prompt, variable model latency that blows the SLA, costs that rise with real usage, sensitive data leaking through context, and no visibility to diagnose any of it.
The good news: these problems are known and have engineering solutions. It's not magic — it's the same discipline you already apply to APIs and data pipelines, adapted for systems that have a non-deterministic component in the middle. Previous lessons covered each piece in isolation (evaluation in lesson 09, guardrails in lesson 10, tools in lesson 07). This lesson connects everything into a readiness journey.
Prototype-to-production journey
Each phase adds a layer of reliability. You don't need everything on day one, but you need to know where you are on the map.
- Prompt fixo · no código
- Modelo único · sem fallback
- Evals contínuos · (golden set)
- Versionamento · de prompts
- Traces / spans · por chamada
- Medição de tokens · e custo
- Guardrails · (input + output)
- PII redaction · antes do contexto
- Timeout + retry · exponencial
- Fallback de modelo · ou resposta padrão
- Idempotência · em tool calls
- Coleta de feedback · (thumbs / flags)
- Sinal para ajuste · de prompt / modelo
What to add — and why each piece matters
Continuous evaluation is your safety belt. Without a golden set of cases tested on every deploy, you don't know if a prompt change improved or broke behavior. Keep at least 30–50 representative cases and run evals in CI before any promotion.
Observability with traces is different from traditional logging. You need spans that cover: model call latency, tokens consumed (input + output), which prompt version was used, and the guardrail result. Tools like AWS CloudWatch, Langfuse, or OpenTelemetry with a custom exporter handle this. Without it, you're flying blind.
Prompt versioning feels like bureaucracy until the day you need to roll back. Treat prompts like code: hash, changelog, and rollback. Don't let the prompt live only in someone's head or in a .env with no history.
Cost limits should be hard limits, not alerts. Define a token ceiling per user per hour and a global daily ceiling. Without this, a misconfigured agent loop or an abusive user can generate an unexpected bill before the next monitoring cycle.
Graceful degradation means that when the primary model fails or is too slow, the system responds with something useful — whether a smaller model, a cached response, or an honest message that the service is temporarily limited. Silence or a 500 error is worse than a partial response.
In practice, the most common mistake I see is teams that instrument the front-end but forget to trace what happens inside the agent loop. You see the request come in and the response go out, but you don't see how many tool calls were made, which one was slow, or which one returned incorrect data. Trace every step of the agent like a distributed transaction — because that's exactly what it is.
Reliability, security, and the improvement loop
Timeouts and retries on model calls are not optional. Large models have variable latency — p99 can be 3–5x the p50. Set an aggressive timeout (e.g., 15s for synchronous responses) and implement retry with exponential backoff and jitter. But beware: retry without idempotency in tool calls can cause duplicate side effects. If the tool writes to the database or sends an email, it needs to be idempotent or you need a deduplication mechanism.
Security and privacy have two fronts. The first is protection against prompt injection — covered in lesson 10 — which requires guardrails on both input and output. The second is data privacy: never send PII directly to the model without need. Redact before assembling the context, and review what goes into stored conversation history.
The improvement loop is what separates systems that stay good from systems that go stale. Collect structured feedback (thumbs up/down, incorrect response flags), store it with the corresponding trace, and use it to update the golden eval set and eventually adjust prompts or switch models. Without this loop, you're operating in the dark — and quality will silently degrade with model changes or real-world data drift.
Lesson 17 (Bedrock AgentCore) and lesson 18 (Knowledge Bases) show how AWS manages parts of this cycle. But the responsibility of defining what "good" means and closing the feedback loop is always yours.
Order the path to production
From prototype to an operable system.
- 1Working prototype with prompt + model
- 2Add guardrails, observability and cost caps
- 3Add evaluation (evals) and measure quality/cost
- 4Operate, monitor and continuously improve
Production readiness checklist
- 1
Evals in CI
Golden set with at least 30 cases. No deploy without passing evals.
- 2
Versioned prompts
Hash + changelog + ability to roll back in under 5 minutes.
- 3
Per-call traces
Latency, tokens, prompt version, guardrail result — all in one traceable span.
- 4
Active guardrails
Input and output validation. PII redacted before entering the context.
- 5
Cost limits configured
Hard token limit per user/hour and global daily ceiling. Alert before 80% of the ceiling.
- 6
Timeout + retry + fallback
Timeout defined, retry with backoff, and fallback response for when the primary model fails.
- 7
Idempotent tool calls
Every tool with a side effect has an idempotency key or deduplication mechanism.
- 8
Active feedback loop
Feedback collection linked to the trace. Defined process to update evals and prompts.
Frequently asked questions
Do I need all of this before the first production deploy?
No, but you need a minimum subset: basic evals, timeout/retry, and input guardrails. The rest you add in the first weeks. What you can't do is go to production with none of these pieces.
How do I version prompts without a dedicated tool?
Git solves the basic problem. Put prompts in text files in the repository, with semantic names and a changelog in the commit. If you want more control, Bedrock Prompt Management or Langfuse have native versioning.
What is a 'golden set' of evals in practice?
It's a collection of (input, expected output or evaluation criterion) pairs that represents the most important cases in your system — including happy paths, edge cases, and cases that previously failed. You run these cases on every change and compare the result to the expected, whether by exact match, LLM-as-judge, or a custom metric.
What comes next
The next lesson is the guided project: you'll build a complete system — from RAG to agent — applying each of these layers. The checklist from this lesson is your acceptance criterion. If the project passes all items, you have a system that can genuinely go to production.