Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.


The AI Architect Track

Module 1 · Foundations · Lesson 03/22

What an LLM is: tokens, context and the next token

How a large language model actually works inside — and the limits that imposes.

6 min read

An LLM doesn't know anything — it predicts. At every step, the model looks at everything that came before and bets on the most likely next piece of text. Understanding this completely changes how you design AI systems: the limits, the costs, and the failures stop being surprises and become direct consequences of how the model works.

The core idea: predicting the next token

A Large Language Model is, at its core, a very large mathematical function trained to answer one simple question: given everything that came before, what is the most likely next token?

This function was tuned on trillions of tokens of text — books, code, web pages, scientific articles. During training (covered in the previous lesson), the model adjusted billions of parameters to get progressively better at this prediction. The result is a network that captured language patterns at industrial scale.

At inference time, the process is sequential and autoregressive: the model generates a token, appends that token to the context, then generates the next one — repeating until it hits a stop signal or the token limit. There is no 'thinking before answering'. The model doesn't plan the whole sentence and then write it. It builds the response token by token, left to right.
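The loop described above can be sketched in a few lines. This is a toy illustration, not how a real model is implemented: `TOY_MODEL` is an invented lookup table that only looks at the last token, whereas a real LLM scores the next token from the whole context using a Transformer. The names `next_token` and `generate` are hypothetical.

```python
# Toy "model": a lookup table from the last token to next-token scores.
# The table and its tokens are invented for illustration; a real LLM
# conditions on the whole context and computes these scores with a Transformer.
TOY_MODEL = {
    "<start>": {"the": 0.9, "a": 0.1},
    "the": {"model": 0.7, "token": 0.3},
    "model": {"predicts": 0.8, "<stop>": 0.2},
    "predicts": {"tokens": 0.6, "<stop>": 0.4},
    "tokens": {"<stop>": 1.0},
}


def next_token(context: list[str]) -> str:
    """Greedy choice: return the highest-scoring token after the last one."""
    scores = TOY_MODEL[context[-1]]
    return max(scores, key=scores.get)


def generate(max_new_tokens: int = 10) -> list[str]:
    """Autoregressive loop: predict, append, repeat until stop or limit."""
    context = ["<start>"]
    for _ in range(max_new_tokens):
        token = next_token(context)
        if token == "<stop>":
            break
        context.append(token)  # the new token becomes input for the next step
    return context[1:]


print(generate())
```

Note that nothing in the loop revisits earlier tokens: each step only appends, and both the stop signal and the token limit end the generation.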

This has a direct implication: the model cannot 'go back' and fix something it said at the start of the response. Once a token is generated, it stays in the context. That's why techniques like chain-of-thought work: prompting the model to reason out loud before giving the final answer improves quality, because the intermediate reasoning becomes context for the next tokens.

Flow: from prompt to generated token

The complete pipeline of an LLM call — from text input to output token, repeated at every generation step.

📥 Input

Prompt · raw text

🔤 Tokenization

Tokenizer · BPE / WordPiece
Token IDs · [1042, 318, 257, ...]

🧠 Model (Transformer)

Embeddings · one vector per token
Attention Layers · relate every token to all the others
Logits · one score per candidate token (softmax turns scores into probabilities)

🎲 Sampling

Temperature · (0 = deterministic)
Next token · chosen

🔁 Autoregressive loop

Append token · to the context
Stop token? · Yes → final response

Tokens and context window: the unit of everything

A token is not a word. It's a chunk of text — usually a sub-word. The word architecture might become two tokens: arch + itecture. An emoji can be three tokens. Code tends to have more tokens per character than English prose. In Portuguese, expect to spend more tokens than the English equivalent for the same content.
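To make the sub-word idea concrete, here is a minimal sketch of a greedy longest-match splitter. It is a simplification in the spirit of WordPiece, not a real BPE or WordPiece implementation, and `TOY_VOCAB` is a hypothetical vocabulary picked so that the `architecture` example above works. Real tokenizers learn their vocabularies from data and will split text differently.

```python
# Hypothetical sub-word vocabulary, chosen so the example in the text works.
TOY_VOCAB = {
    "arch", "itecture", "token", "s",
    "a", "c", "e", "h", "i", "k", "n", "o", "r", "t", "u",
}


def tokenize(word: str, vocab: set[str] = TOY_VOCAB) -> list[str]:
    """Greedy longest-match split of one word into known sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            # No known piece covers this character: emit an unknown marker.
            pieces.append("<unk>")
            start += 1
    return pieces


print(tokenize("architecture"))
print(tokenize("tokens"))
```

Words that the vocabulary covers poorly fall back to many small pieces, which is one intuition for why less-represented languages tend to cost more tokens.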

Why does this matter? Because everything in an LLM is measured in tokens: API cost (you pay per input and output token), generation speed (tokens per second), and the context window limit.

The context window is the maximum number of tokens the model can 'see' at once: prompt, history, and the response being built, all together. Modern models have windows of 128k, 200k or more tokens. That sounds like a lot, but filling it has a price: latency increases, cost increases, and the model's attention quality in very long contexts can degrade. This is the 'lost in the middle' phenomenon: the model tends to pay more attention to the beginning and end of the context than to the middle.

For architecture, the practical rule is: don't dump everything into the context just because it fits. Select what's relevant. That's exactly what RAG does (lesson 06) — instead of putting a thousand documents in the context, you retrieve the three most relevant ones. A lean context is cheaper, faster, and often more accurate.
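The "select what's relevant" step can be sketched as a top-k ranking. This is only a stand-in: it scores documents by shared words with the query, whereas a real RAG pipeline would typically use embedding similarity and a vector index. The function names and sample documents are hypothetical.

```python
def score(query: str, document: str) -> int:
    """Count distinct query words that also appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words)


def select_context(query: str, documents: list[str], k: int = 3) -> list[str]:
    """Return the k documents that share the most words with the query."""
    ranked = sorted(documents, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:k]


docs = [
    "refund policy for credit card purchases",
    "office holiday calendar",
    "how to dispute a credit card charge",
    "cafeteria menu for the week",
]
print(select_context("credit card refund", docs, k=2))
```

Only the selected documents would be placed in the prompt; the rest never consume context budget.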

What you need to remember about tokens and context

Token ≠ word. One word can be 1, 2, or more tokens depending on the language and tokenizer.

Cost, latency, and context limit are all measured in tokens — not words, characters, or requests.

Context window = everything the model sees at once. Prompt + history + output all consume the same budget.

Long context degrades attention in the middle. Put critical information at the beginning or end.

Portuguese uses more tokens than English for the same content — factor this into your cost estimates.

Put in order

Order an LLM response flow

Put the steps in the order they happen.

1. The token is appended and the process repeats until done
2. The model predicts the most likely next token
3. The prompt text is split into tokens
4. The tokens enter the model's context window

Why LLMs hallucinate — and what that means for you

Hallucination is not a bug. It's a feature of the mechanism.

The model was trained to generate plausible text, not true text. It has no access to a database of verifiable facts. When you ask 'what is the tax ID of company X?', the model doesn't query any source — it generates the sequence of tokens that, statistically, would most likely follow that question in the text it was trained on. If the correct ID wasn't well represented in training, the model will invent one that looks right.

The problem is that the model doesn't know what it doesn't know. The probability distribution over tokens carries no 'I'm confident / I'm not confident' bit. The model generates a correct fact and a fabricated fact with the same fluency.

For architecture, this has three direct consequences:

Never use an LLM as a primary source of facts. If the data needs to be correct (name, number, date, regulation), it must come from a verified source — and you inject it into the context via RAG or tool calling.
Evaluate outputs, don't just generate. Production systems need evals (lesson 09) and guardrails (lesson 10) to detect when the model has gone off the rails.
Temperature controls randomness. Temperature zero makes the model always pick the most probable token — more deterministic, less creative. High temperature spreads probability across more options — more variation, more risk of drift. For factual tasks, low temperature. For creative generation, higher temperature.

Understanding hallucination as a structural property — not an occasional glitch — is what separates people who design robust AI systems from people who are surprised when the model makes things up.
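The effect of temperature described above can be seen with a small sketch of softmax sampling. The logits are made-up scores for three candidate tokens, and the function names are hypothetical. Treating temperature 0 as a plain argmax is a common convention assumed here, since dividing by zero is not defined.

```python
import math
import random


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits: list[float], temperature: float, rng: random.Random) -> int:
    """Pick a token index: argmax at temperature 0, weighted draw otherwise."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

As the temperature rises, probability mass moves from the top-scoring token toward the others, which is where both variety and drift come from.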

In practice: what changes in your design

Senior Solutions Architect

In practice, when I start designing a system with an LLM, the first questions I ask are: what's the average context size I'll be sending? How many output tokens do I expect? That gives me the expected cost and latency before writing a single line of code. Then: which parts of the response need to be factual and verifiable? Those parts never live in the model's head — they come from external sources injected into the context. The model is the language engine, not the database. When you internalize that separation, the architecture becomes much clearer.
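That first back-of-the-envelope estimate might look like the sketch below. Every number in `ModelProfile` is a placeholder, not a real price or speed; substitute your provider's published figures. The latency figure covers only output generation and ignores prompt processing and network time, so treat it as a lower bound.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    """Hypothetical pricing and speed figures; replace with your provider's."""
    usd_per_1k_input_tokens: float
    usd_per_1k_output_tokens: float
    output_tokens_per_second: float


def estimate_call(profile: ModelProfile, input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (cost in USD, generation time in seconds) for one call."""
    cost = (
        input_tokens / 1000 * profile.usd_per_1k_input_tokens
        + output_tokens / 1000 * profile.usd_per_1k_output_tokens
    )
    seconds = output_tokens / profile.output_tokens_per_second
    return cost, seconds


profile = ModelProfile(0.003, 0.015, 50.0)  # placeholder numbers, not real prices
cost, seconds = estimate_call(profile, input_tokens=4000, output_tokens=500)
print(f"~${cost:.4f} per call, ~{seconds:.1f}s to generate")
```

Multiplying the per-call cost by expected daily volume gives a budget figure before any code is written.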

Flashcards

LLM terms

Tap a card to flip it.

Frequently asked questions

Do larger models hallucinate less?

Generally yes: larger models have better calibration and more factual knowledge absorbed in training. But hallucination doesn't disappear. Even the best models fabricate facts in domains poorly represented in training or when forced to answer something they don't know. Size reduces hallucination; it doesn't eliminate it.

Can I use an LLM for complex mathematical reasoning?

With care. LLMs have improved a lot in math, especially with chain-of-thought and reasoning models like OpenAI's o-series or Claude 3.7. But for critical calculations, the right pattern is to use tool calling (lesson 07) to call a real calculator or code — not rely solely on token generation.
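As a sketch of the "real calculator" side of that pattern, the function below evaluates a plain arithmetic expression with Python's `ast` module instead of trusting generated digits. The name `calculator_tool` and the idea that the model passes the expression as a string are assumptions for illustration; how a tool is registered and invoked depends on the provider's tool-calling API and is not shown.

```python
import ast
import operator

# Only these arithmetic operators are allowed; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node: ast.AST) -> float:
    """Recursively evaluate a parsed arithmetic expression."""
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
    ):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_evaluate(node.operand)
    raise ValueError("unsupported expression")


def calculator_tool(expression: str) -> float:
    """Hypothetical tool handler: compute the expression in real code."""
    tree = ast.parse(expression, mode="eval")
    return _evaluate(tree.body)


print(calculator_tool("1234 * 5678 + 1024"))
```

Walking the syntax tree with an allow-list, rather than calling `eval`, means names, attribute access, and function calls in the expression are rejected with a `ValueError`.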

Does temperature 0 guarantee identical responses every time?

Almost. Temperature zero is deterministic in the sense of always choosing the highest-probability token, but distributed inference implementations can introduce minimal variations due to floating-point rounding. For practical purposes, temperature 0 is reproducible enough for testing and evals.
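The rounding effect is easy to reproduce: floating-point addition is not associative, so summing the same numbers in a different order can give a slightly different result. The snippet below shows only that underlying arithmetic fact; how it plays out inside a particular inference stack depends on the hardware and how the work is split up.

```python
# The same three numbers, summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)  # False on IEEE 754 doubles
print(left_to_right, right_to_left)
```

When two candidate tokens have nearly identical scores, a difference this small can be enough to change which one ranks first.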

What happens when the context exceeds the window?

The API returns an error or silently truncates, depending on the implementation. You need to manage this actively: summarize old history, use sliding windows, or select only the relevant context. This is part of agent memory design — we'll cover it in detail in lesson 12.
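A sliding window over chat history can be sketched as follows. `count_tokens` is a crude stand-in that counts whitespace-separated words; in practice you would use the tokenizer that matches your model. The function names and the budget numbers are hypothetical.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def fit_history(system_prompt: str, history: list[str], window: int, reserved_for_output: int) -> list[str]:
    """Keep the most recent messages that fit the remaining token budget."""
    budget = window - reserved_for_output - count_tokens(system_prompt)
    kept: list[str] = []
    for message in reversed(history):  # walk from newest to oldest
        cost = count_tokens(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    kept.reverse()  # restore chronological order
    return kept


history = ["first message here", "second one", "third and latest message"]
print(fit_history("be concise", history, window=12, reserved_for_output=4))
```

Reserving output tokens up front matters because the response shares the same window as the prompt and history. Summarizing the dropped messages instead of discarding them is a natural extension.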

References

Andrej Karpathy, Intro to Large Language Models (YouTube)
OpenAI Tokenizer (tiktokenizer)
Lost in the Middle: How Language Models Use Long Contexts (Stanford)
Amazon Bedrock, Model inference parameters
Attention Is All You Need (Transformer original paper)
