Playbook: AI, ML, Deep Learning, LLM and Agents — the Map for Choosing the Right Tool
A practical layered guide for engineers and architects who need to stop defaulting to 'let's use an LLM'. The correct map is AI → ML → Deep Learning → GenAI/LLM → Agents, each layer a subset of the previous — not synonyms. The right question is never 'which model?', it's 'what is the simplest tool that solves this well?'.
Every week someone proposes an LLM for a problem that a logistic regression classifier would solve at 99% accuracy for $0.001 per thousand calls. The cost isn't just financial — it's latency, unpredictability, hallucination surface, and operational complexity you didn't need. This playbook is the missing map: the real layers of the AI field, what each one solves, and the decision process to arrive at the simplest tool that works.
What you'll be able to decide after this playbook
Quick Reference Glossary
- Artificial Intelligence (AI): the entire field, meaning any technique enabling machines to simulate human cognitive capabilities (perception, reasoning, decision-making). Includes rule systems, search, optimization, ML, and everything else.
- Machine Learning (ML): subset of AI where the system learns patterns from data without being explicitly programmed for each rule. Includes supervised, unsupervised, and reinforcement learning.
- Deep Learning (DL): subset of ML using neural networks with multiple (deep) layers to learn hierarchical representations. Dominant in computer vision, NLP, and audio.
- Generative AI (GenAI): subset of Deep Learning focused on generating new content (text, image, code, audio) from learned patterns. LLMs are the most visible case.
- Large Language Model (LLM): large-scale language model trained on massive text corpora via self-supervised learning. Specialized in natural language understanding and generation. E.g.: GPT-4, Claude, Titan.
- Embeddings: dense vector representations of text, images, or other data in high-dimensional space. Enable semantic search, clustering, and RAG without requiring a generative LLM in the critical path.
- AI Agent: LLM equipped with tools (APIs, searches, code execution) and memory (session context, vector store), capable of planning and executing action sequences to achieve a goal.
The mental model that unlocks everything: nested layers, not synonyms
The most expensive mistake I see in design reviews isn't technical — it's semantic. When someone says 'let's use AI for this', they might be proposing anything from an if/else rule system to an autonomous agent with access to external APIs. That ambiguity produces wrong architectures because the team isn't talking about the same thing.
The correct map is of nested layers, where each layer is a strict subset of the previous:
- AI is the entire field. Everything that follows is AI.
- ML is the dominant approach within AI: systems that learn from data. But AI includes rule systems, symbolic planning, and classical optimization that are not ML.
- Deep Learning is ML with deep neural networks. Powerful for unstructured data (text, image, audio), but not the only way to do ML. A gradient boosting model on tabular data is ML, not DL.
- GenAI is Deep Learning with a generative objective — producing new content. Includes LLMs, but also image diffusion, audio synthesis, and code generation.
- LLM is GenAI specialized in language. It's the most visible subset today, but it remains a subset.
- Agents are LLMs with tools and memory. They are not a different type of model — they are a system architecture that uses an LLM as a reasoning engine.
Why does this matter in practice? Because tool selection should start from the outermost layer (simplest) and only descend to inner layers when there is evidence the previous layer is insufficient. Going straight to LLM or agent without this evaluation is like using a scalpel when scissors solve it — and paying the price in cost, latency, and risk.
Layered Map: AI → ML → DL → GenAI → LLM → Agents
Each layer is a strict subset of the previous. The decision of which to use starts from outside in — from simplest to most complex. Agents are not 'more AI' than rules; they are more complex and more expensive.
- Explicit Rules · if/else, regex, heuristics
- Search & Optimization · A*, LP, constraint solvers
- 🤖 ML (Machine Learning) · Learns from data
- Classical ML · Regression, SVM, GBM, KNN
- 🧠 Deep Learning · Deep neural networks
- Computer Vision · CNN, YOLO, ResNet
- Audio & Speech · Whisper, WaveNet
- ✨ GenAI · Content generation
- Image Generation · Diffusion, DALL-E
- 💬 LLM · Language Models
- Base LLM · Claude, GPT-4, Titan, Llama
- Embeddings · Semantic search, RAG
- 🤝 Agents · LLM + Tools + Memory
- Reasoning Engine · LLM as orchestrator
- Tools · APIs, code, search
- Memory · Context, vector store
When each layer is the right answer
The question that structures the decision is not 'which AI model should I use?' — it's 'what is the simplest tool that solves this problem with acceptable quality?'. Simple here is not pejorative: it means lower operational cost, higher predictability, smaller failure surface, and easier to audit.
Explicit rules win when the domain is small and stable, conditions are enumerable, and you need 100% explainability. CPF validation, field formatting, routing by document type — these are rule wins. Cost: near zero. Latency: microseconds. Audit: trivial.
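As an illustration of a rule win, CPF validation fits in a few lines of deterministic code. This is a minimal sketch of the standard two-check-digit (mod 11) scheme; the function name `is_valid_cpf` is a hypothetical choice for this example, and it only checks the number's format and check digits, not whether the CPF is actually registered.

```python
import re
from typing import Sequence


def is_valid_cpf(cpf: str) -> bool:
    """Return True if `cpf` has 11 digits and both check digits match."""
    digits = [int(ch) for ch in re.sub(r"\D", "", cpf)]
    # Reject wrong lengths and repeated sequences such as 111.111.111-11.
    if len(digits) != 11 or len(set(digits)) == 1:
        return False

    def check_digit(prefix: Sequence[int]) -> int:
        # Weights run from len(prefix) + 1 down to 2.
        weights = range(len(prefix) + 1, 1, -1)
        remainder = sum(d * w for d, w in zip(prefix, weights)) % 11
        return 0 if remainder < 2 else 11 - remainder

    return (
        check_digit(digits[:9]) == digits[9]
        and check_digit(digits[:10]) == digits[10]
    )
```

No training data, no model endpoint, and the audit trail is the code itself.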
Classical ML (regression, trees, gradient boosting) wins when you have sufficient labeled data, the problem is classification or regression on structured data, and you need cost predictability. A fraud classifier on tabular data with XGBoost frequently outperforms an LLM in accuracy, costs a fraction of the price, and produces feature importances that the compliance team can audit. Amazon SageMaker has mature infrastructure for this pattern.
Deep Learning (non-generative) enters when data is unstructured — images, audio, text for classification — and volume justifies training cost. Object detection in camera images, audio transcription, sentiment classification at scale: these are DL cases, not LLM cases.
Embeddings deserve special mention because they are often the right answer for problems that seem to need an LLM. Semantic search, deduplication, document clustering, recommendation — all of this can be solved with embeddings + vector search (Amazon OpenSearch, pgvector) without a generative LLM in the critical path. Embedding cost is an order of magnitude lower than generation.
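To make the "no generative LLM in the critical path" point concrete, here is a minimal sketch of semantic search as nearest-neighbor lookup by cosine similarity. The three-dimensional vectors and document names are invented for illustration; in practice the vectors would come from an embedding model and the lookup from a vector store, and the helper names `cosine_similarity` and `top_k` are assumptions of this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy embeddings; real ones have hundreds of dimensions.
INDEX = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "password-reset": [0.0, 0.2, 0.9],
}

# Pretend this is the embedding of "how do I get my money back?"
query = [0.8, 0.2, 0.1]
results = top_k(query, INDEX, k=2)
```

Here `results` ranks `refund-policy` first. The only model call in this flow is the one that embeds the query; no text is generated.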
LLM is the right tool when the problem requires understanding or generating natural language in an open domain: summarization, entity extraction from free text, draft generation, Q&A over documents with RAG. But even here, the first question should be: 'would a smaller, fine-tuned model solve it?' A 7B parameter model fine-tuned on your domain frequently outperforms GPT-4 zero-shot on your specific case, at 10-50x lower cost.
Agents are the right answer when the problem requires multiple reasoning steps, use of external tools, and the flow is not deterministic enough to be encoded as a workflow. Autonomous research, assistants that execute actions in external systems, analysis pipelines that need to decide which data to collect. Agent cost is high — latency of multiple LLM round-trips, accumulated token cost, observability complexity. Use when the value justifies it.
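The round-trip cost is easier to see in code. The sketch below shows the basic shape of an agent loop with a hard iteration limit and a per-step trace. Everything here is hypothetical: `run_agent`, `stub_planner`, and the `lookup_order` tool are illustration names, and the planner is a stub standing in for what would be an LLM call in a real system.

```python
def run_agent(goal, planner, tools, max_steps=5):
    """Run a plan/act loop with a hard iteration limit and a per-step trace.

    `planner(goal, trace)` returns ("call", tool_name, argument) or
    ("finish", answer, ""). Each planner invocation represents one LLM round-trip.
    """
    trace = []
    for step in range(1, max_steps + 1):
        action, name, arg = planner(goal, trace)
        if action == "finish":
            # For "finish", the second tuple element carries the final answer.
            trace.append({"step": step, "action": "finish", "output": name})
            return name, trace
        result = tools[name](arg)  # one tool call per step
        trace.append({"step": step, "action": name, "input": arg, "output": result})
    return None, trace  # limit reached without an answer


def stub_planner(goal, trace):
    """Hypothetical stand-in for the LLM: look up once, then answer."""
    if not trace:
        return ("call", "lookup_order", "42")
    return ("finish", f"Order status: {trace[-1]['output']}", "")


TOOLS = {"lookup_order": lambda order_id: "shipped"}

answer, trace = run_agent("Where is order 42?", stub_planner, TOOLS)
```

Even this trivial task takes two planner calls. The `trace` list is the minimum observability hook, and `max_steps` keeps a confused planner from looping indefinitely.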
Rules vs. Classical ML vs. LLM: when to use each
| Criterion | Explicit Rules | Classical ML (e.g. XGBoost) | LLM (e.g. Claude, GPT-4) |
|---|---|---|---|
| Cost per call | ~$0 (local compute) | ~$0.0001–$0.001 | ~$0.002–$0.06 (varies by model) |
| Typical latency | < 1ms | 1–50ms | 500ms–10s (generation) |
| Output predictability | Deterministic | Deterministic at inference; probabilistic scores, calibratable | Stochastic, hard to guarantee |
| Hallucination risk | Zero | Zero (classifies, doesn't generate) | Present; requires mitigation |
| Explainability / Audit | Trivial: code is the doc | Good: feature importance, SHAP | Hard: black box by nature |
| Training data required | None | Hundreds to thousands of labeled examples | None (zero-shot) or few (few-shot) |
| Open domain / free language | Not supported | Limited to engineered features | Strong suit |
| Maintenance over time | High if domain changes | Periodic retraining needed | Prompt engineering + continuous monitoring |
| When to use | Small, stable, auditable domain | Structured data, high volume, compliance | Free language, generation, open domain |
The decision process: outside in
The sequence of questions below is what I run mentally in any design review involving 'AI'. It's not bureaucracy — it's the shortcut to avoid over-engineering.
1. Is the problem enumerable?
If you can write all conditions in code without losing relevant coverage, use rules. Field validation, routing by type, formatting — rules. Done.
2. Is the data structured and do you have labels?
If yes, evaluate classical ML first. Gradient boosting (XGBoost, LightGBM) on tabular data is often the best available classifier, with negligible inference cost and reasonable explainability. Amazon SageMaker supports this pattern with Autopilot, managed endpoints, and drift monitoring.
3. Is the data unstructured (image, audio) but the output is structured (classification, detection)?
Specialized DL — not LLM. A computer vision model for defect detection on a production line costs a fraction of a multimodal LLM and is more accurate in the specific domain.
4. Does the problem involve natural language but the output is structured (classification, extraction)?
Consider embeddings + classifier or a smaller fine-tuned language model before going to generative LLM. Amazon Bedrock offers embedding models (Titan Embeddings) that solve semantic search without generation.
5. Does the problem require generation or comprehension in an open domain?
Now LLM makes sense. But still ask: does zero-shot solve it? Does few-shot solve it? Would fine-tuning on the domain be better? Is RAG sufficient? The smaller and more specialized the model, the lower the cost and the higher the predictability.
6. Does the problem require multiple steps, external tools, and non-deterministic reasoning?
Now agents make sense. But implement with observability from the start — logs for each step, tool call tracing, iteration limits. Amazon Bedrock Agents offers this pattern with native integration to Lambda, Knowledge Bases, and Action Groups.
This sequence is not dogma — it's a filter. You can skip steps if you have clear evidence. But the burden of proof is on whoever proposes the more complex layer.
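The six questions above can be sketched as a first-match filter. This is only an illustration of the outside-in order, not a substitute for judgment: the `ProblemProfile` fields and the `simplest_tool` function are hypothetical names. One design choice is mine rather than the sequence's: the agent check runs before the LLM check, because a problem that needs tools and non-deterministic steps usually also answers yes to question 5.

```python
from dataclasses import dataclass


@dataclass
class ProblemProfile:
    """Yes/no answers to the six questions, in order."""
    enumerable: bool = False                            # Q1
    structured_labeled_data: bool = False               # Q2
    unstructured_input_structured_output: bool = False  # Q3
    language_input_structured_output: bool = False      # Q4
    open_domain_language: bool = False                  # Q5
    needs_tools_and_nondeterministic_steps: bool = False  # Q6


def simplest_tool(p: ProblemProfile) -> str:
    """Return the first (simplest) layer whose question is answered yes."""
    if p.enumerable:
        return "explicit rules"
    if p.structured_labeled_data:
        return "classical ML"
    if p.unstructured_input_structured_output:
        return "specialized deep learning"
    if p.language_input_structured_output:
        return "embeddings + classifier or small fine-tuned model"
    if p.needs_tools_and_nondeterministic_steps:
        return "agent"
    if p.open_domain_language:
        return "LLM"
    return "undefined: refine the problem statement"
```

For example, `simplest_tool(ProblemProfile(enumerable=True, open_domain_language=True))` returns `"explicit rules"`: an earlier yes wins, which is the burden-of-proof principle in code form.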
Decision Process: from Problem to Tool
Step 1 — Define the problem precisely
Write in one sentence: what is the input, what is the expected output, what is the success metric. 'Use AI to improve experience' is not a problem — it's an intention. 'Classify support tickets into 8 categories with F1 > 0.90' is a problem.
Step 2 — Test the rules hypothesis
Ask: 'can I cover 95%+ of cases with explicit rules?' If yes, implement rules and monitor uncovered cases. Add ML only when edge cases are costly enough to justify it.
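A minimal sketch of testing the rules hypothesis: route tickets with regex rules, then measure what fraction of a sample the rules cover and collect the misses for review. The patterns, queue names, and sample tickets are invented for illustration.

```python
import re

# Hypothetical routing rules: (pattern, destination queue).
RULES = [
    (re.compile(r"\b(refund|chargeback)\b", re.I), "billing"),
    (re.compile(r"\b(password|login|2fa)\b", re.I), "account"),
    (re.compile(r"\b(delivery|tracking|shipment)\b", re.I), "logistics"),
]


def route(ticket: str):
    """Return the queue of the first matching rule, or None if uncovered."""
    for pattern, queue in RULES:
        if pattern.search(ticket):
            return queue
    return None


def coverage(tickets):
    """Return (fraction of tickets matched by some rule, uncovered tickets)."""
    uncovered = [t for t in tickets if route(t) is None]
    return 1 - len(uncovered) / len(tickets), uncovered


sample = [
    "I want a refund",
    "Cannot login after password change",
    "Where is my shipment?",
    "Your app is ugly",
]
ratio, misses = coverage(sample)
```

On this toy sample the rules cover 3 of 4 tickets, and `misses` holds the uncovered one. Run the same measurement on real traffic to see whether the 95% bar is met and what the uncovered cases look like.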
Step 3 — Evaluate the data profile
Structured data with labels → classical ML. Unstructured data with structured output → specialized DL. Text with structured output → embeddings + classifier. Text with free output → LLM. No historical data and open domain → LLM zero/few-shot.
Step 4 — Estimate cost and volume
Calculate: monthly call volume × cost per call for each option. A classifier at $0.001/1k vs LLM at $0.05/1k = 50x difference. At 10M calls/month: $10 vs $500. Document this number before any decision.
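The arithmetic is simple enough to keep as a small script next to the design doc. This sketch reproduces the numbers from this step; the function name `monthly_cost` is a hypothetical choice, and the prices are the illustrative ones used here, not quotes from any provider.

```python
def monthly_cost(calls_per_month: int, usd_per_1k_calls: float) -> float:
    """Monthly cost in USD given a price per thousand calls."""
    return calls_per_month / 1000 * usd_per_1k_calls


CALLS = 10_000_000  # 10M calls/month

classifier_usd = monthly_cost(CALLS, 0.001)  # about $10
llm_usd = monthly_cost(CALLS, 0.05)          # about $500
ratio = llm_usd / classifier_usd             # about 50x

print(f"classifier: ${classifier_usd:,.2f}  LLM: ${llm_usd:,.2f}  ratio: {ratio:.0f}x")
```

Swapping in your own volume and measured prices takes seconds, and the resulting line belongs in the decision record.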
Step 5 — Evaluate explainability and compliance requirements
Financial, healthcare, or people-decision systems often require auditable explainability. LLMs are black boxes by nature. If the explainability requirement is strong, classical ML with SHAP or rules are the answer — not LLM.
Step 6 — Prototype and measure before architecting
Before designing the full architecture, validate the hypothesis with a minimal prototype. An sklearn classifier on 200 labeled examples takes 2 hours to test. An LLM zero-shot with 20 evaluation examples takes 1 hour. Measure F1, cost, and latency before committing to the architecture.
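For the measurement itself, a dependency-free F1 helper is enough to score a small evaluation set from any prototype, whether rules, a classifier, or an LLM. The sketch below computes macro-averaged F1; the choice of macro averaging is an assumption of this example (the right averaging depends on how much each class matters to you), and `macro_f1` is a hypothetical name.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, with F1 = 2TP / (2TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy evaluation set: one "a" ticket was misclassified as "b".
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
score = macro_f1(y_true, y_pred)
meets_target = score > 0.90
```

Scoring every candidate against the same labeled examples with the same function makes the comparison between layers a number rather than an opinion.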
Step 7 — Define fallback strategy and monitoring
Any model in production needs: (1) continuously monitored quality metrics, (2) confidence threshold with fallback to human or rule, (3) retraining or prompt update process when quality degrades. This is true for classical ML, LLM, and agents.
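Point (2) can be sketched as a thin wrapper around whatever produces class probabilities. The threshold value, the `decide` function, and the `human_review` route are hypothetical; the right threshold should come from measuring precision at different cutoffs on your own evaluation data.

```python
def decide(probabilities: dict, threshold: float = 0.8) -> dict:
    """Accept the top prediction only if its confidence clears the threshold.

    `probabilities` maps label -> score from any model. Below the threshold,
    the item is routed to a fallback (here, human review) instead of
    trusting the model.
    """
    label, confidence = max(probabilities.items(), key=lambda kv: kv[1])
    if confidence >= threshold:
        return {"label": label, "source": "model", "confidence": confidence}
    return {"label": None, "source": "human_review", "confidence": confidence}


confident = decide({"fraud": 0.93, "legit": 0.07})
uncertain = decide({"fraud": 0.55, "legit": 0.45})
```

Logging the `source` field over time also gives a cheap quality signal: a rising share of fallbacks suggests drift before the accuracy metrics confirm it.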
Anti-patterns that produce wrong architectures
1. LLM as default. Treating LLM as the obvious answer for any text-related problem is the most expensive anti-pattern of the moment. A fine-tuned BERT sentiment classifier costs 20x less, has 10x lower latency, and doesn't hallucinate. Use LLM when generation or open-domain comprehension is genuinely needed.
2. Confusing concepts = wrong architecture. Calling a classifier an 'agent', or proposing a 'deep learning model' when you mean 'LLM', isn't just semantic imprecision: it leads to wrong infrastructure decisions, wrong cost estimates, and wrong service choices.
3. Agents for deterministic flows. If the flow can be mapped as a DAG of steps with known conditions, use Step Functions or a workflow engine, not an agent. Agents introduce non-determinism where you don't want it. Reserve agents for problems where the action space is not enumerable in advance.
4. Ignoring token cost in agents. An agent with 5 LLM round-trips, each with 2k tokens of context, costs 10k tokens per task. At volume, this is significant. Measure the average cost per completed task before going to production.
5. Premature fine-tuning. Fine-tuning is expensive (training compute + quality labeled data) and creates a model that needs to be maintained. Before fine-tuning, exhaust zero-shot, few-shot, and RAG. Fine-tuning makes sense when you have hundreds of high-quality examples and the base model consistently fails in the domain.
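The token arithmetic from the fourth anti-pattern can be sketched as follows. The price per thousand tokens and the task volume are hypothetical placeholders, and the function names are illustrative. The estimate also assumes a flat context size per round-trip; agents that append history to each call will consume more, so treat the result as a lower bound.

```python
def tokens_per_task(round_trips: int, tokens_per_call: int) -> int:
    """Total tokens for one task, assuming constant context per round-trip."""
    return round_trips * tokens_per_call


def monthly_token_cost(tasks_per_month, round_trips, tokens_per_call, usd_per_1k_tokens):
    """Monthly USD cost given a (hypothetical) flat price per 1k tokens."""
    total_tokens = tasks_per_month * tokens_per_task(round_trips, tokens_per_call)
    return total_tokens / 1000 * usd_per_1k_tokens


per_task = tokens_per_task(round_trips=5, tokens_per_call=2_000)  # 10,000 tokens

# Hypothetical volume and price, for illustration only.
monthly_usd = monthly_token_cost(
    tasks_per_month=100_000,
    round_trips=5,
    tokens_per_call=2_000,
    usd_per_1k_tokens=0.01,
)
```

With these placeholder numbers the monthly bill lands around $10,000, which is the kind of figure worth seeing before production rather than after.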
Rule of Thumb
Always choose the simplest tool that solves it well. A $0.001 classifier that gets 98% right beats a $0.05 LLM that hallucinates 2% confidently — especially in production, where the 2% hallucination shows up exactly in the most sensitive cases. Complexity must be justified by the problem, not by enthusiasm for the technology.
Over the past two years, I've reviewed dozens of architecture proposals where LLM was the first line of the solution. In at least half the cases, a 30-minute conversation about the real problem revealed that a classifier, vector search, or even explicit rules solved it better, with lower cost and risk. What concerns me isn't enthusiasm for LLMs; it's the absence of an elimination process. The question 'why not a simpler solution?' should be mandatory in any AI design review.
In practice, what I do: I always start with a precise problem definition and a comparative cost estimate. If the LLM is 10x more expensive than the alternative, the burden of proof is on whoever proposes the LLM, and 'it's more modern' is not proof. I also maintain a clear distinction between what is a language problem (where LLM has a real advantage) and what is a classification or search problem disguised as language (where embeddings or classical ML win).
For agents specifically: I only recommend them when the flow genuinely cannot be mapped as a deterministic workflow. Most 'agents' I see in production are actually RAG pipelines with a few logic steps, which would be more reliable, cheaper, and observable as Step Functions + Lambda + Bedrock than as an autonomous agent.
The field is evolving fast, and smaller models are becoming increasingly capable. But the principle of choosing the simplest tool that solves it well is timeless, and it's what separates engineering from cargo cult.
Verdict
AI is not a product — it's a field. ML is not a synonym for Deep Learning. LLM is not a synonym for AI. Agent is not a synonym for LLM. Confusing these terms is not just academic imprecision: it leads to over-engineered architectures, unnecessary costs, and systems that fail in ways you didn't anticipate. The correct map is of nested layers, and the decision direction is always outside in — from the simplest tool to the most complex, with the burden of proof on whoever proposes the next layer. A well-trained classifier that gets 98% right and costs $0.001 per thousand calls is not 'less AI' than an LLM. It's better engineering.