How a model learns: training, parameters and inference
The intuition behind training and using a model, no calculus — just what an architect needs.
5 min read
Before using any AI model in production, you need to understand one fundamental distinction: training a model and using a model are two completely different operations — with opposite costs, actors, and architecture decisions. This lesson breaks that difference down intuitively, no math required, and connects it directly to what you'll do day to day.
Training: turning billions of knobs until the error drops
Imagine a panel with billions of rotary knobs. Each knob is a parameter (also called a weight). The model starts with random values on those knobs — it knows nothing. You then feed it examples: texts, images, question-answer pairs. For each example, the model produces an output, you measure how wrong it is (that's the loss function), and an algorithm called backpropagation turns each knob a tiny bit in the direction that reduces that error.
Repeat this billions of times, across trillions of tokens of text, on GPU clusters running for weeks or months. At the end, the parameters have converged to values that make the model produce useful outputs for an enormous variety of inputs. This process is pre-training — expensive, slow, done by labs like Anthropic, Meta, Google, Amazon. You don't do this.
The result of training is a file (or set of files) with all those weights saved. That is literally the model. Everything the model 'knows' is encoded in those numbers — not in a separate database, not in explicit rules. Just in the weights.
From training to inference: a model's lifecycle
Training happens once (or rarely). Inference happens every time your application makes a call. They are separate worlds.
- Dados de treino · Trillhões de tokens
- Cluster de GPUs · Semanas / meses
- Backpropagation · Ajuste de pesos
- Arquivo de pesos · (o modelo treinado)
- Dados específicos · do domínio
- Fine-tuning · Horas / dias
- Sua aplicação · (API call)
- Endpoint do modelo · (ex: Bedrock)
- Resposta gerada · (tokens de saída)
Inference: what you actually do in production
Inference is the act of using the trained model to produce an output from an input. You send a prompt, the model passes that text through the weights (now fixed, no adjustment) and returns a response. It's a read operation on the parameters, not a write.
From a cost perspective, the difference is brutal. Training a large model costs tens of millions of dollars in compute. A single inference call costs fractions of a cent. That's why the division of responsibility makes sense: labs invest in training, you consume inference via API.
From a latency perspective, inference is what determines the user experience. Each generated token requires a forward pass through the weights — larger models have more weights, so more operations per token, so more latency. That's why choosing the right model for the use case matters: a 7-billion-parameter model responds much faster than a 70-billion one, and for many tasks the quality is sufficient. We'll explore this decision in depth in the Amazon Bedrock lesson.
In practice, when you integrate an LLM into your application — whether via AWS Bedrock, OpenAI, or any other provider — you are 100% of the time doing inference. Training already happened before, far away from you.
Training vs. Fine-tuning vs. Inference
| Aspect | Pre-training | Fine-tuning | Inference | |
|---|---|---|---|---|
| Who does it | Labs (Anthropic, Meta…) | You / partner lab | You (via API) | — |
| Cost | Very high (millions $) | Medium (hundreds–thousands $) | Low (fractions of ¢ per call) | — |
| Frequency | Very rare (once per version) | Occasional (per domain/task) | Continuous (each request) | — |
| Weights change? | Yes — created from scratch | Yes — adjusted from base | No — read-only | — |
| Do you need this? | No | Rarely | Always | — |
Fine-tuning: when it's worth it and when it's waste
Fine-tuning is an additional, shorter training run on top of an already pre-trained model. You take the existing weights as a starting point and adjust them with examples from your specific domain. The model doesn't forget what it learned — it specializes.
This sounds tempting. In practice, most use cases don't need fine-tuning. Why? Because modern models are incredibly capable via prompt. If you want the model to respond in formal Portuguese, just say so in the prompt. If you want a specific output format, demonstrate it in the prompt. The prompting lesson will detail this — but the point here is: fine-tuning has cost, pipeline complexity, regression risk, and requires quality data. It's not the first tool to reach for.
Fine-tuning makes sense when: (a) you need a very specific style or vocabulary that the prompt can't inject reliably; (b) you want to reduce prompt size in production to save tokens; (c) you have proprietary data that teaches behaviors the base model doesn't have. Outside those cases, well-crafted prompting — and eventually RAG to inject external knowledge — solves it. We'll cover RAG in lesson 06.
Match concept to definition
Tap a concept, then its definition.
In practice, in 95% of the projects I've architected or reviewed, the right decision was: use a foundation model via API (Bedrock, for example), invest in prompting and RAG, and only consider fine-tuning if you have clear evidence that the base model doesn't cut it. The most common mistake I see is jumping straight to fine-tuning as a solution to a poorly written prompt problem. It's like replacing the car's engine because the driver can't drive.
What to take away from this lesson
Frequently asked questions
If weights don't change during inference, how does the model 'learn' from the conversation context?
It doesn't learn — it remembers, within the context window. The conversation history is passed as part of the input on each call. The weights remain fixed; what changes is what you put in the prompt. This is a central point of lesson 03 (context and tokens) and lesson 12 (agent memory).
Does fine-tuning permanently change the weights?
Yes, for that version of the model. The original base model is not altered — you generate a new artifact of adjusted weights. Techniques like LoRA do this more efficiently, training only a fraction of the parameters.
Can I train my own LLM from scratch?
Technically yes. Practically, the compute, data, and expertise cost puts this out of reach for the vast majority of companies. The sensible architecture decision is to use existing foundation models and customize them via prompt, RAG, or fine-tuning when needed.
Quick check
1. In production, what do you normally do and pay for?