Chapter 4 · Watch & Apply

Why this chapter matters for PMs

The first three chapters gave you a working mental model of the whole stack: how a model learns (Chapter 1), how a deployed LLM generates (Chapter 2), and how you evaluate, secure, and cost it (Chapter 3). You can already name the lever behind most failures.

This chapter does two things. First, it hands you two of the best explanations available, one for the intuition and one for the build details, so you hear the same ideas from practitioners who train these systems, in their own words. You'll recognise nearly everything; that recognition is the point. Second, the capstone forces your mental model to consolidate: a single synthesis paragraph that names the throughline explicitly, then a ten-question cross-chapter check that tests whether it holds under pressure rather than only feeling familiar.

Heads up: both videos below are hosted on YouTube (the first plays inline, the second opens in a new tab), so they need a network connection. That is the only online dependency on this entire site. Everything else, including the capstone quiz and your saved progress, works fully offline.

Andrej Karpathy — Intro to Large Language Models

~60 min · general audience · Nov 2023. The best single hour for turning the LLM black box into an intuition you can carry into any meeting.

Player not loading? Watch on YouTube ↗ — embeds need a real web origin, so they play once the site is served (e.g. GitHub Pages) but may stay blank when opened straight from a local file.

What to watch for. Karpathy frames an LLM as essentially two files, the parameters and the code that runs them, which is Chapter 1's "a model is just learned weights" said out loud. Listen for how he separates pre-training (the costly one-time run) from the fine-tuning that makes a model an assistant: that's Chapter 3's alignment story. The same split explains the cost picture: pre-training is paid once, while inference is paid again on every request, which is why inference, not training, is the bill you pay forever. When he reaches hallucination and tool use, you'll hear Chapter 2 almost verbatim.

PM takeaway: if you only ever recommend one video to a colleague who "wants to get AI," this is it — it builds the same end-to-end intuition this course did, in a single sitting.

Stanford CS229 — Building Large Language Models

~90 min · intermediate · deeper and more technical. A graduate-lecture walk through how these models are actually built, from data to architecture to evaluation.

Stanford has turned off inline embedding for this lecture, so it opens on YouTube in a new tab — watch it here ↗. (The first video plays inline above.)

What to watch for. This one goes a level deeper than you strictly need as a PM (expect some math notation and training-loop detail), so treat it as optional enrichment, not a prerequisite. The payoff is the evaluation section, which makes the Chapter 3 case rigorously: benchmarks are a starting filter, not a launch criterion, and how you measure a model is as much a design choice as the model itself. If the first video gave you the intuition, this one shows you the machinery behind every term you've learned.

PM takeaway: watch it once for breadth, not mastery — the goal is to recognise the moving parts so a vendor or an engineer can never hide a decision behind jargon you don't understand.

Capstone — the one idea, at every layer

The spine of this entire course, in one sentence: never trust a single number until you know what it was measured against, at every layer of the stack. It is not three separate lessons — it is one lesson seen three times, climbing the stack:

Chapter 1 — DATA. The number to distrust is accuracy. The question that exposes it is "validation or training accuracy?" A model that aced its training data may have memorised it; only data it was deliberately never shown tells you whether it learned the real pattern.
Chapter 2 — GENERATION. The number to distrust is the model's fluent confidence. The question that exposes it is "was this grounded in retrieved evidence, or recalled from memory?" When a RAG feature is wrong, the first suspect is retrieval quality, not model fluency.
Chapter 3 — EVALUATION. The number to distrust is a single headline benchmark. The question that exposes it is "measured by the full eval harness on our own data, or one public benchmark?" The moment a metric becomes the target, Goodhart's Law says it starts lying to you.

Tokenisation, attention, gradient descent, embeddings, alignment, the environmental footprint — every concept in this course is a mechanism that explains why one of those three holds. If you finish able to do just one thing reflexively — ask what a number was measured against before you act on it — you will outperform most people in the room who only know the vocabulary.
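
To make the Chapter 1 question concrete, here is a minimal sketch of why training accuracy alone can mislead. Everything in it is an illustrative assumption rather than anything from the course materials: the toy task (label a number 1 if it is even), the hypothetical Memoriser class, and the fixed split of 0 to 99 for training and 100 to 199 for validation. The "model" simply stores every training example and guesses 0 for anything it has never seen.

```python
def true_label(x: int) -> int:
    """The real pattern we hope a model learns: 1 if x is even, else 0."""
    return 1 if x % 2 == 0 else 0


class Memoriser:
    """Hypothetical 'model' that stores training examples instead of learning a rule."""

    def __init__(self) -> None:
        self.table: dict[int, int] = {}

    def fit(self, xs: list[int], ys: list[int]) -> None:
        # Memorise every (input, label) pair exactly as seen.
        self.table = dict(zip(xs, ys))

    def predict(self, x: int) -> int:
        # Seen before: return the stored label. Never seen: fall back to 0.
        return self.table.get(x, 0)


def accuracy(model: Memoriser, xs: list[int], ys: list[int]) -> float:
    """Fraction of examples where the prediction matches the label."""
    correct = sum(model.predict(x) == y for x, y in zip(xs, ys))
    return correct / len(xs)


# Assumed split: the validation inputs are deliberately never shown to the model.
train_x = list(range(0, 100))
val_x = list(range(100, 200))
train_y = [true_label(x) for x in train_x]
val_y = [true_label(x) for x in val_x]

model = Memoriser()
model.fit(train_x, train_y)

train_acc = accuracy(model, train_x, train_y)
val_acc = accuracy(model, val_x, val_y)

print(f"training accuracy:   {train_acc:.2f}")  # 1.00
print(f"validation accuracy: {val_acc:.2f}")    # 0.50
```

The same model reports 100% or 50% depending only on which data it was measured against. On a two-class task, 50% is no better than guessing, so the headline number means nothing until you know which of the two it is.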

Capstone self-check — 10 cross-chapter questions

These ten questions deliberately mix all three chapters in no fixed order, exactly how the ideas show up in a real launch review. Score 80% or higher (at least 8 of 10) to mark this chapter complete; for any question you miss, you'll get deep links straight back to the right card to revise.
