Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.

What technical topics does Fernando work with?

Fernando works with AWS, Kubernetes, Kafka, Data Mesh, Amazon Bedrock, RAG, DevSecOps, observability, financial systems and architecture communication using C4, ADRs and trade-off analysis.

Is Fernando available for professional conversations?

Fernando is currently building at Banco Itaú and is open to thoughtful conversations about architecture, cloud, AI, engineering leadership, community, podcasts and technical collaboration.

The AI Architect Track

Module 1 · Foundations· Lesson 02/22

How a model learns: training, parameters and inference

The intuition behind training and using a model, no calculus — just what an architect needs.

5 min read

Listen — Fernando's cloned voice

0:008:10

Before using any AI model in production, you need to understand one fundamental distinction: training a model and using a model are two completely different operations — with opposite costs, actors, and architecture decisions. This lesson breaks that difference down intuitively, no math required, and connects it directly to what you'll do day to day.

Training: turning billions of knobs until the error drops

Imagine a panel with billions of rotary knobs. Each knob is a parameter (also called a weight). The model starts with random values on those knobs — it knows nothing. You then feed it examples: texts, images, question-answer pairs. For each example, the model produces an output, you measure how wrong it is (that's the loss function), and an algorithm called backpropagation turns each knob a tiny bit in the direction that reduces that error.

Repeat this billions of times, across trillions of tokens of text, on GPU clusters running for weeks or months. At the end, the parameters have converged to values that make the model produce useful outputs for an enormous variety of inputs. This process is pre-training — expensive, slow, done by labs like Anthropic, Meta, Google, Amazon. You don't do this.

The result of training is a file (or set of files) with all those weights saved. That is literally the model. Everything the model 'knows' is encoded in those numbers — not in a separate database, not in explicit rules. Just in the weights.

From training to inference: a model's lifecycle

Training happens once (or rarely). Inference happens every time your application makes a call. They are separate worlds.

🏋️ Fase de Treino — Labs de IA

Dados de treino · Trillhões de tokens
Cluster de GPUs · Semanas / meses
Backpropagation · Ajuste de pesos
Arquivo de pesos · (o modelo treinado)

🔧 Fine-tuning Opcional — Você / Lab

Dados específicos · do domínio
Fine-tuning · Horas / dias

⚡ Fase de Inferência — Sua Aplicação

Sua aplicação · (API call)
Endpoint do modelo · (ex: Bedrock)
Resposta gerada · (tokens de saída)

Inference: what you actually do in production

Inference is the act of using the trained model to produce an output from an input. You send a prompt, the model passes that text through the weights (now fixed, no adjustment) and returns a response. It's a read operation on the parameters, not a write.

From a cost perspective, the difference is brutal. Training a large model costs tens of millions of dollars in compute. A single inference call costs fractions of a cent. That's why the division of responsibility makes sense: labs invest in training, you consume inference via API.

From a latency perspective, inference is what determines the user experience. Each generated token requires a forward pass through the weights — larger models have more weights, so more operations per token, so more latency. That's why choosing the right model for the use case matters: a 7-billion-parameter model responds much faster than a 70-billion one, and for many tasks the quality is sufficient. We'll explore this decision in depth in the Amazon Bedrock lesson.

In practice, when you integrate an LLM into your application — whether via AWS Bedrock, OpenAI, or any other provider — you are 100% of the time doing inference. Training already happened before, far away from you.

Training vs. Fine-tuning vs. Inference

	Aspect	Pre-training	Fine-tuning	Inference
Who does it	Labs (Anthropic, Meta…)	You / partner lab	You (via API)	—
Cost	Very high (millions $)	Medium (hundreds–thousands $)	Low (fractions of ¢ per call)	—
Frequency	Very rare (once per version)	Occasional (per domain/task)	Continuous (each request)	—
Weights change?	Yes — created from scratch	Yes — adjusted from base	No — read-only	—
Do you need this?	No	Rarely	Always	—

Fine-tuning: when it's worth it and when it's waste

Fine-tuning is an additional, shorter training run on top of an already pre-trained model. You take the existing weights as a starting point and adjust them with examples from your specific domain. The model doesn't forget what it learned — it specializes.

This sounds tempting. In practice, most use cases don't need fine-tuning. Why? Because modern models are incredibly capable via prompt. If you want the model to respond in formal Portuguese, just say so in the prompt. If you want a specific output format, demonstrate it in the prompt. The prompting lesson will detail this — but the point here is: fine-tuning has cost, pipeline complexity, regression risk, and requires quality data. It's not the first tool to reach for.

Fine-tuning makes sense when: (a) you need a very specific style or vocabulary that the prompt can't inject reliably; (b) you want to reduce prompt size in production to save tokens; (c) you have proprietary data that teaches behaviors the base model doesn't have. Outside those cases, well-crafted prompting — and eventually RAG to inject external knowledge — solves it. We'll cover RAG in lesson 06.

Match

Match concept to definition

Tap a concept, then its definition.

In practice: you are an inference consumer

Senior Solutions Architect

In practice, in 95% of the projects I've architected or reviewed, the right decision was: use a foundation model via API (Bedrock, for example), invest in prompting and RAG, and only consider fine-tuning if you have clear evidence that the base model doesn't cut it. The most common mistake I see is jumping straight to fine-tuning as a solution to a poorly written prompt problem. It's like replacing the car's engine because the driver can't drive.

What to take away from this lesson

Parameters (weights) are the model — billions of numbers adjusted during training to minimize error.

Training is expensive, rare, and done by labs. Inference is cheap, continuous, and what you do in production.

During inference, weights don't change — the model only reads parameters to generate the response.

Larger models have more parameters → more latency and cost per call. Size matters in architecture.

Fine-tuning adjusts weights on top of a base model. It's optional and rarely necessary — good prompting solves more.

On AWS, you consume inference via Bedrock — no model infrastructure to manage, no training.

Frequently asked questions

If weights don't change during inference, how does the model 'learn' from the conversation context?

It doesn't learn — it remembers, within the context window. The conversation history is passed as part of the input on each call. The weights remain fixed; what changes is what you put in the prompt. This is a central point of lesson 03 (context and tokens) and lesson 12 (agent memory).

Does fine-tuning permanently change the weights?

Yes, for that version of the model. The original base model is not altered — you generate a new artifact of adjusted weights. Techniques like LoRA do this more efficiently, training only a fraction of the parameters.

Can I train my own LLM from scratch?

Technically yes. Practically, the compute, data, and expertise cost puts this out of reach for the vast majority of companies. The sensible architecture decision is to use existing foundation models and customize them via prompt, RAG, or fine-tuning when needed.

Quiz

Quick check

1. In production, what do you normally do and pay for?

References

AWS — What is machine learning? (foundational concepts)Amazon Bedrock — Foundation Models overview AWS Blog — Fine-tuning vs. prompt engineering: choosing the right approach Andrej Karpathy — Neural Networks: Zero to Hero (YouTube, reference for backprop intuition)

Previous Next lesson