Lesson 3: Predict the Next Token
The core mechanic
This is the single most important concept in this course. Given a sequence of tokens, the model predicts the most probable next token. Then the next. Then the next. That is all it does.
There is no reasoning engine, no understanding module, no intent parser. The entire output of a language model is produced one token at a time by repeatedly asking: “Given everything so far, what token is most likely to come next?”
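To make this concrete, here is a toy illustration of a single prediction step. The prompt, the vocabulary, and every probability below are invented for the example; a real model assigns a probability to each of tens of thousands of tokens.

```python
# Hypothetical distribution over the next token after "The cat sat on the".
# All numbers are made up for illustration.
next_token_probs = {
    "mat": 0.41,
    "floor": 0.17,
    "sofa": 0.12,
    "roof": 0.05,
    # ... thousands more tokens, each with a small probability
}

# Greedy decoding simply takes the highest-probability token.
best = max(next_token_probs, key=next_token_probs.get)
print(best)  # → mat
```

Appending `"mat"` to the sequence and asking the same question again is all that "generation" is.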
How it works
1. You provide an input (your prompt), which is tokenised
2. The model processes all input tokens through its neural network
3. The network outputs a probability distribution over its entire vocabulary
4. The token with the highest probability (or a weighted random selection) is chosen
5. That token is appended to the sequence
6. Steps 2-5 repeat until the model produces a stop token or hits a limit
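The loop above can be sketched in a few lines. The `fake_model` below is a hard-coded stand-in for the neural network (real models compute the distribution from the full sequence, not just the last token), but the generation loop itself has the same shape as real inference code.

```python
def fake_model(tokens):
    """Stand-in for the network: returns P(next token | sequence so far).
    Real models condition on the whole sequence; this toy only looks at
    the last token, and its probabilities are invented."""
    transitions = {
        "<start>": {"hello": 0.9, "world": 0.1},
        "hello": {"world": 0.8, "!": 0.2},
        "world": {"!": 0.7, "<stop>": 0.3},
        "!": {"<stop>": 1.0},
    }
    return transitions[tokens[-1]]

def generate(prompt_tokens, max_tokens=10):
    tokens = list(prompt_tokens)              # step 1: the tokenised prompt
    for _ in range(max_tokens):               # step 6: repeat until the limit
        probs = fake_model(tokens)            # steps 2-3: distribution over vocab
        next_tok = max(probs, key=probs.get)  # step 4: pick the most probable
        if next_tok == "<stop>":              # stop token ends generation
            break
        tokens.append(next_tok)               # step 5: append and loop
    return tokens

print(generate(["<start>"]))  # → ['<start>', 'hello', 'world', '!']
```

Swapping `max(...)` for a weighted random draw gives the sampling behaviour described in the Temperature section.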
Why it works so well
The quality of the output comes from two things:
- Scale of training data — Models are trained on enormous amounts of text (books, websites, code, documentation)
- Pattern internalization — Through training, the model learns grammar, facts, code syntax, reasoning patterns, and much more
When the model generates a well-structured function or a coherent paragraph, it is not “thinking” — it is producing the sequence of tokens that statistically follows patterns it saw during training.
Temperature
The temperature parameter controls how the model selects tokens:
| Temperature | Behaviour |
|---|---|
| 0.0 | Always picks the most probable token (deterministic, repetitive) |
| 0.2–0.4 | Mostly predictable, good for code and factual content |
| 0.7–0.9 | More creative, good for writing and brainstorming |
| 1.0+ | Highly random, often incoherent |
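Mechanically, temperature divides the model's raw scores (logits) before they are turned into probabilities: low values sharpen the distribution toward the top token, high values flatten it. The sketch below uses a tiny made-up logit table; real inference engines do the same arithmetic on tensors.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Scale logits by 1/temperature, softmax, then draw one token.
    `logits` maps token -> raw score; all values here are illustrative."""
    if temperature == 0.0:
        # Limit case: greedy decoding, always the most probable token.
        return max(logits, key=logits.get)
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    return rng.choices(list(probs), weights=list(probs.values()))[0]

logits = {"mat": 2.0, "floor": 1.0, "roof": -1.0}
print(sample_with_temperature(logits, 0.0))  # always "mat"
# At 0.2 the score gap is amplified five-fold, so "mat" wins almost every draw.
# At 1.0+ the distribution flattens and any token can surface.
```

Dividing by a small temperature stretches the gaps between scores, which is why low temperatures read as "predictable" and high ones as "random".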
In ReArch, agent temperature defaults to 0.2 because predictable, near-deterministic output is what you want for code generation.
Implications
Understanding next-token prediction helps explain common AI behaviours:
- Hallucination — The model produces plausible-sounding but incorrect information because the “most likely next token” is not necessarily the “most factually correct next token”
- Verbosity — Models tend to over-explain because training data contains many examples of detailed explanations
- Context sensitivity — The quality of output depends heavily on the quality and specificity of input