Lesson 3: Predict the Next Token

This is the single most important concept in this course. Given a sequence of tokens, the model predicts the most probable next token. Then the next. Then the next. That is all it does.

There is no reasoning engine, no understanding module, no intent parser. The entire output of a language model is produced one token at a time by repeatedly asking: “Given everything so far, what token is most likely to come next?”

  1. You provide an input (your prompt) which is tokenised
  2. The model processes all input tokens through its neural network
  3. The network outputs a probability distribution over its entire vocabulary
  4. The token with the highest probability (or a weighted random selection) is chosen
  5. That token is appended to the sequence
  6. Steps 2-5 repeat until the model produces a stop token or hits a limit
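The loop above can be sketched in a few lines of Python. This is a toy illustration, not a real model: the `toy_model` lookup table stands in for the neural network in steps 2–3, and the tiny `VOCAB` and its probabilities are invented for the example.

```python
import random

# Toy vocabulary. A real model's vocabulary has tens of thousands of tokens.
VOCAB = ["the", "cat", "sat", "on", "mat", "<stop>"]

def toy_model(context):
    """Return a probability for each token in VOCAB, given the context.

    A real model computes this distribution with a neural network over the
    whole context; this hard-coded table only looks at the last token.
    """
    table = {
        "the": [0.05, 0.50, 0.05, 0.05, 0.30, 0.05],
        "cat": [0.05, 0.05, 0.70, 0.05, 0.05, 0.10],
        "sat": [0.10, 0.05, 0.05, 0.70, 0.05, 0.05],
        "on":  [0.80, 0.05, 0.05, 0.02, 0.05, 0.03],
        "mat": [0.02, 0.02, 0.02, 0.02, 0.02, 0.90],
    }
    return table[context[-1]]

def generate(prompt_tokens, max_tokens=10, greedy=True):
    tokens = list(prompt_tokens)                    # step 1: tokenised prompt
    for _ in range(max_tokens):                     # step 6: repeat until limit
        probs = toy_model(tokens)                   # steps 2-3: distribution
        if greedy:
            idx = probs.index(max(probs))           # step 4: highest probability
        else:
            idx = random.choices(range(len(VOCAB)), weights=probs)[0]
        if VOCAB[idx] == "<stop>":                  # step 6: stop token
            break
        tokens.append(VOCAB[idx])                   # step 5: append and continue
    return tokens

print(generate(["the"]))
```

With `greedy=True` the run is fully deterministic: starting from `["the"]`, the model repeatedly picks the single most probable continuation. Passing `greedy=False` switches to weighted random selection, the variant mentioned in step 4.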

The quality of the output comes from two things:

  • Scale of training data — Models are trained on enormous amounts of text (books, websites, code, documentation)
  • Pattern internalization — Through training, the model learns grammar, facts, code syntax, reasoning patterns, and much more

When the model generates a well-structured function or a coherent paragraph, it is not “thinking” — it is producing the sequence of tokens that statistically follows patterns it saw during training.

The temperature parameter controls how the model selects tokens:

Temperature | Behaviour
----------- | ---------
0.0         | Always picks the most probable token (deterministic, repetitive)
0.2–0.4     | Mostly predictable, good for code and factual content
0.7–0.9     | More creative, good for writing and brainstorming
1.0+        | Highly random, often incoherent
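Mechanically, temperature divides the model's raw scores (logits) before they are turned into probabilities, so low values sharpen the distribution and high values flatten it. A minimal sketch, with invented logits for a three-token vocabulary:

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Divide logits by temperature, softmax, then sample an index.

    Low temperature sharpens the distribution toward the top token;
    high temperature flattens it, giving unlikely tokens more chance.
    """
    if temperature == 0.0:
        # Greedy: always the highest-scoring token (deterministic).
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                               # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=probs)[0]

logits = [2.0, 1.0, 0.1]  # toy scores: token 0 is strongly preferred
print(sample_with_temperature(logits, 0.0))  # temperature 0.0: always index 0
print(sample_with_temperature(logits, 0.8))  # occasionally picks index 1 or 2
```

At temperature 0.0 the function degenerates to an argmax, matching the first table row; as temperature rises past 1.0, the three probabilities approach uniform and the choice becomes close to random.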

In ReArch, agent temperature defaults to 0.2 because code generation benefits from output that is consistent and close to deterministic.

Understanding next-token prediction helps explain common AI behaviours:

  • Hallucination — The model produces plausible-sounding but incorrect information because the “most likely next token” is not necessarily the “most factually correct next token”
  • Verbosity — Models tend to over-explain because training data contains many examples of detailed explanations
  • Context sensitivity — The quality of output depends heavily on the quality and specificity of input