AI 101 for PMs · Chapter 2

How LLMs Work

The generation layer of the stack. Eleven concepts that turn a large language model from a black box into a system you can reason about, budget for, and debug — without the matrix math.

Why this chapter matters

Chapter 1 was about the training layer — how a model learns, and why a single accuracy number hides its failure modes. This chapter is about the generation layer: what happens every time a deployed LLM answers a user.

The same discipline carries over, one layer up. A model's fluent, confident sentence is itself a number you shouldn't trust on its own. You have to know what it was measured against: was the answer grounded in retrieved evidence, or recalled from memory? Did the critical fact sit where the model actually reads, or was it buried in the middle? Is the latency driven by the prompt, or by every token the model has to generate? By the end you'll be able to ask the question that exposes each failure before it ships.

Each concept pairs a plain-English explanation with a live widget. Play with every one — the interaction is where the idea sticks.

Chapter summary — the throughline

The spine of this whole course: never trust a single number until you know what it was measured against — at every layer of the stack. Chapter 1 applied it to training. Here it applies to generation, and the same trap reappears in three disguises:

  • Fluency is not truth. A confident sentence is a plausibility score, not a correctness score. When a RAG feature is wrong, the first suspect is retrieval quality, not model fluency — most RAG failures are retrieval failures. The number to interrogate is "did we fetch the right chunks?", not "is the LLM smart enough?".
  • A big context window is a ceiling, not a guarantee. "200K tokens" tells you what the model can see, not how well it uses the middle of it. Lost-in-the-middle means the disciplined move is to retrieve the few relevant pieces, not dump everything in and trust the headline window size.
  • Latency is driven mostly by output tokens, not prompt size. Generation is sequential: each token needs a full forward pass that can't start until the previous token exists, whereas the prompt is processed in a single parallel pass. So the lever is shorter outputs (and streaming for perceived speed), not a shorter prompt. Inference cost, not training cost, is the recurring bill that dominates at scale.
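To make the latency point concrete, here is a rough back-of-the-envelope sketch. It assumes a simple two-phase model (one fast parallel pass over the prompt, then one sequential step per output token). The function name `estimate_latency_s` and both throughput figures are hypothetical, chosen only for illustration; real numbers vary widely by model and hardware.

```python
# Illustrative latency model. All throughput numbers are hypothetical
# placeholders, not measurements of any real model or provider.

def estimate_latency_s(prompt_tokens: int, output_tokens: int,
                       prefill_tokens_per_s: float = 5000.0,
                       decode_tokens_per_s: float = 50.0) -> float:
    """Rough end-to-end latency: parallel prompt pass + sequential generation."""
    # The prompt is read in one parallel pass, so it is comparatively cheap per token.
    prefill_s = prompt_tokens / prefill_tokens_per_s
    # Each output token needs its own forward pass, one after another.
    decode_s = output_tokens / decode_tokens_per_s
    return prefill_s + decode_s

baseline = estimate_latency_s(prompt_tokens=2000, output_tokens=500)
half_prompt = estimate_latency_s(prompt_tokens=1000, output_tokens=500)
half_output = estimate_latency_s(prompt_tokens=2000, output_tokens=250)

print(f"baseline:    {baseline:.1f}s")
print(f"half prompt: {half_prompt:.1f}s")
print(f"half output: {half_output:.1f}s")
```

Under these assumed rates, halving the prompt saves about 0.2 seconds, while halving the output saves about 5 seconds. The exact figures will differ in practice, but the asymmetry is the point: the output length is the lever worth pulling first.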

Tokenisation, attention, embeddings, temperature, hallucination, and tool calling are all just the mechanics that explain why these three hold. Learn the mechanics so you can name the lever — grounding, retrieval, temperature, tools, output length — instead of saying "the model lied."

Chapter 2 quiz

Eleven questions, one per concept. Score 80% or higher to mark this chapter complete. Miss any and you'll get deep links back to the exact cards to review.