Lesson 3: Predict the Next Token

This is the single most important concept in this course. Given a sequence of tokens, the model predicts the most probable next token. Then the next. Then the next. That is all it does.

There is no reasoning engine, no understanding module, no intent parser. The entire output of a language model is produced one token at a time by repeatedly asking: “Given everything so far, what token is most likely to come next?”

  1. You provide an input (your prompt) which is tokenised
  2. The model processes all input tokens through its neural network
  3. The network outputs a probability distribution over its entire vocabulary
  4. The token with the highest probability (or a weighted random selection) is chosen
  5. That token is appended to the sequence
  6. Steps 2-5 repeat until the model produces a stop token or hits a limit
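The loop above can be sketched in a few lines of Python. This is a toy illustration, not a real model: the `toy_model` lookup table stands in for the neural network in steps 2–3, and the tiny `VOCAB` and its probabilities are invented for the example.

```python
import random

# Toy vocabulary. A real model's vocabulary has tens of thousands of tokens.
VOCAB = ["the", "cat", "sat", "on", "mat", "<stop>"]

def toy_model(context):
    """Return a probability for each token in VOCAB, given the context.

    A real model computes this distribution with a neural network over the
    whole context; this hard-coded table only looks at the last token.
    """
    table = {
        "the": [0.05, 0.50, 0.05, 0.05, 0.30, 0.05],
        "cat": [0.05, 0.05, 0.70, 0.05, 0.05, 0.10],
        "sat": [0.10, 0.05, 0.05, 0.70, 0.05, 0.05],
        "on":  [0.80, 0.05, 0.05, 0.02, 0.05, 0.03],
        "mat": [0.02, 0.02, 0.02, 0.02, 0.02, 0.90],
    }
    return table[context[-1]]

def generate(prompt_tokens, max_tokens=10, greedy=True):
    tokens = list(prompt_tokens)                    # step 1: tokenised prompt
    for _ in range(max_tokens):                     # step 6: repeat until limit
        probs = toy_model(tokens)                   # steps 2-3: distribution
        if greedy:
            idx = probs.index(max(probs))           # step 4: highest probability
        else:
            idx = random.choices(range(len(VOCAB)), weights=probs)[0]
        if VOCAB[idx] == "<stop>":                  # step 6: stop token
            break
        tokens.append(VOCAB[idx])                   # step 5: append and continue
    return tokens

print(generate(["the"]))
```

With `greedy=True` the run is fully deterministic: starting from `["the"]`, the model repeatedly picks the single most probable continuation. Passing `greedy=False` switches to weighted random selection, the variant mentioned in step 4.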

The quality of the output comes from two things:

  • Scale of training data — Models are trained on enormous amounts of text (books, websites, code, documentation)
  • Pattern internalization — Through training, the model learns grammar, facts, code syntax, reasoning patterns, and much more

When the model generates a well-structured function or a coherent paragraph, it is not “thinking” — it is producing the sequence of tokens that statistically follows patterns it saw during training.

The temperature parameter controls how the model selects tokens:

Temperature | Behaviour
----------- | ---------
0.0         | Always picks the most probable token (deterministic, repetitive)
0.2–0.4     | Mostly predictable, good for code and factual content
0.7–0.9     | More creative, good for writing and brainstorming
1.0+        | Highly random, often incoherent
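Mechanically, temperature divides the model's raw scores (logits) before they are turned into probabilities, so low values sharpen the distribution and high values flatten it. A minimal sketch, with invented logits for a three-token vocabulary:

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Divide logits by temperature, softmax, then sample an index.

    Low temperature sharpens the distribution toward the top token;
    high temperature flattens it, giving unlikely tokens more chance.
    """
    if temperature == 0.0:
        # Greedy: always the highest-scoring token (deterministic).
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                               # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=probs)[0]

logits = [2.0, 1.0, 0.1]  # toy scores: token 0 is strongly preferred
print(sample_with_temperature(logits, 0.0))  # temperature 0.0: always index 0
print(sample_with_temperature(logits, 0.8))  # occasionally picks index 1 or 2
```

At temperature 0.0 the function degenerates to an argmax, matching the first table row; as temperature rises past 1.0, the three probabilities approach uniform and the choice becomes close to random.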

In ReArch, agent temperature defaults to 0.2 because code generation benefits from output that is consistent and close to deterministic.

Understanding next-token prediction helps explain common AI behaviours:

  • Hallucination — The model produces plausible-sounding but incorrect information because the “most likely next token” is not necessarily the “most factually correct next token”
  • Verbosity — Models tend to over-explain because training data contains many examples of detailed explanations
  • Context sensitivity — The quality of output depends heavily on the quality and specificity of input