Why this chapter matters
Chapters 1 and 2 gave you the machinery: how models learn, and how LLMs turn text into tokens and tokens back into answers. This chapter is about the layer that decides whether any of it is good enough to ship — evaluation — plus the two forces that bound every real launch: safety and cost.
The throughline of the whole course shows up here in its sharpest form: never trust a single number until you know what it was measured against. An accuracy figure, a benchmark win, an "it felt better in the demo" impression: each is a hypothesis, not proof. The eval harness on your own data is how a hypothesis becomes evidence.
Each concept has a live widget — drag, toggle, and step through it. The interaction is the lesson; the prose just frames it.
The throughline
Never trust a single number until you know what it was measured against — at every layer of the stack. Chapter 1 made that point about the data split; Chapter 2 about token counts and context. Here it lands on the evaluation layer.
A public benchmark score is a hypothesis about capability, not proof of product fit. The proof is the full eval harness on your own data: a held-out set that looks like production, scored automatically on every change, with safety treated as a ship-blocking dimension and cost measured in the same currency as latency and carbon — tokens. Right-size the model, ground the answers, set the refusal boundary deliberately, and wire it all into a harness so a regression can't slip past while everyone is admiring the headline number.
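To make the harness idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a reference implementation: the names `Example`, `Report`, `run_eval`, and `should_ship`, the fixed `REFUSAL` string, the exact-match scoring, and the whitespace token count are all hypothetical stand-ins. A real harness would likely use the model's own tokenizer and a richer scorer.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumption: the model signals a refusal with this exact string.
REFUSAL = "I can't help with that."


@dataclass
class Example:
    prompt: str
    expected: str
    must_refuse: bool = False  # safety case: the model should decline


@dataclass
class Report:
    accuracy: float        # share of non-safety cases answered correctly
    safety_failures: int   # safety cases where the model did not refuse
    avg_tokens: float      # mean prompt + answer tokens per example


def count_tokens(text: str) -> int:
    # Crude whitespace proxy; swap in the model's tokenizer for real counts.
    return len(text.split())


def run_eval(model: Callable[[str], str], examples: List[Example]) -> Report:
    """Score a model on a held-out set for quality, safety, and token cost."""
    correct = 0
    quality_cases = 0
    safety_failures = 0
    total_tokens = 0
    for ex in examples:
        answer = model(ex.prompt)
        total_tokens += count_tokens(ex.prompt) + count_tokens(answer)
        if ex.must_refuse:
            if answer.strip() != REFUSAL:
                safety_failures += 1
        else:
            quality_cases += 1
            if answer.strip().lower() == ex.expected.strip().lower():
                correct += 1
    accuracy = correct / quality_cases if quality_cases else 0.0
    avg_tokens = total_tokens / len(examples) if examples else 0.0
    return Report(accuracy, safety_failures, avg_tokens)


def should_ship(candidate: Report, baseline: Report, max_avg_tokens: float) -> bool:
    """Gate a change: safety blocks first, then regressions, then cost."""
    if candidate.safety_failures > 0:
        return False  # safety is ship-blocking, whatever the accuracy
    if candidate.accuracy < baseline.accuracy:
        return False  # a quality regression cannot hide behind a headline number
    return candidate.avg_tokens <= max_avg_tokens
```

Running `run_eval` on every change and passing the result to `should_ship` alongside the last accepted `Report` is one way to keep all three dimensions in view at once. The ordering of the checks is a design choice: putting safety first means a high accuracy figure can never offset a failed refusal case.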
Chapter 3 quiz
Answer all five questions. You need 80% (four of five correct) to mark the chapter complete. Each wrong answer links back to the exact card to revise.