Interactive Learning

LLM101

One mechanism. Five depths. A depth-adjustable explainer for large language models — and a worked example of capability-first design.

How does next-token prediction become something that looks like intelligence?

Capability Matters Lens

Case Exemplar: Building an LLM explainer as a capability problem

Method exemplar. Not an incident case: this shows how the flywheel and the human-AI authoring loop work in practice, rather than diagnosing a failure.

Most "explain how AI works" projects treat the artifact as content. LENS treats it as a capability intervention, and asks a different first question: what should a person be able to do differently afterward, in an environment where that difference has consequence?

The artifact is llm101 (github.com/wrgr/llm101): a depth-adjustable explainer for large language models. One mechanism, five views, two arcs. A reader chooses how deep to go, from a single metaphor to the formal account, and the content is assembled on the fly from tagged segments. At the time of writing it is a working architecture with one fully authored module (Prediction) and the rest scaffolded.

The capability

The target is not "knows facts about LLMs." It is the ability to reason correctly about what a model's output is: a prediction of plausible next tokens, not a retrieval of stored truth. This is where operational harm concentrates. A user who believes the model retrieves will trust a fabricated citation; a user who understands prediction will verify before acting. It is not an accident that the one fully authored module is Prediction. The capability the whole system exists to build is the one instantiated first.

The design as an engineering decision

Five views, two arcs, and a sparse segment engine that assembles the talk at any depth from tagged beats. Depth-adjustment is a hypothesis, not a preference: it claims you can build the same capability across very different starting points without making anyone re-read. The fading scaffolds sit on a citation layer grounded in the learning sciences, kept deliberately separate from the content citations grounded in machine learning and language literature. Design as testable hypothesis, not design as taste.
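As a minimal sketch of what assembling from tagged beats could look like, assume each beat carries the range of views in which it appears. The names Beat, min_view, max_view, and assemble are hypothetical, and the beat texts are placeholders; none of this is taken from the llm101 repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Beat:
    """One tagged segment. Field names are illustrative assumptions."""
    order: int     # position in the sequence
    text: str      # placeholder content, not the repository's wording
    min_view: int  # shallowest view (1..5) that shows this beat
    max_view: int  # deepest view (1..5) that shows this beat


def assemble(beats: List[Beat], view: int) -> List[str]:
    """Return the beat texts visible at one view, in sequence order."""
    if not 1 <= view <= 5:
        raise ValueError("view must be between 1 and 5")
    visible = [b for b in beats if b.min_view <= view <= b.max_view]
    return [b.text for b in sorted(visible, key=lambda b: b.order)]


BEATS = [
    Beat(1, "A model guesses the next word, like autocomplete.", 1, 2),
    Beat(2, "The model assigns a probability to every possible next token.", 2, 5),
    Beat(3, "Output is sampled from that distribution, not looked up.", 3, 5),
    Beat(4, "Formally, the model estimates P(next token | previous tokens).", 5, 5),
]

for text in assemble(BEATS, 3):
    print(text)
```

Under this assumption a design gap is repaired by changing a beat's min_view, max_view, or order, while its text stays untouched.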

The human-AI loop

This is the spine of the case. The authoring flywheel treats AI as both creative partner and epistemic risk in the same motion. As partner, it drafts beats from tagged objectives faster than a human would. As risk, it seeds plausible-looking citations and invents sources that do not exist. The system is built so that the model's confidence cannot become decision-grade on its own: the repository instructs a verification pass against Crossref, arXiv, and DOI records before publishing, and labels the meaning-arc sources as candidate until checked. That verification gate is the move made concrete. The AI produces; the human closes the gap against primary literature; only verified output is allowed to seed the next module. The conversation is structured, not open-ended.
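The gate can be sketched as a status that only a human check can change. This is an illustration under stated assumptions, not the repository's mechanism: Status, Citation, Draft, mark_verified, and can_seed_next_module are hypothetical names, the citation keys are placeholders, and no network lookup is performed. The code only records that a person checked a source against one of the three places the text names.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

# The three check targets named in the text; treated here as plain labels.
ALLOWED_SOURCES = {"Crossref", "arXiv", "DOI"}


class Status(Enum):
    CANDIDATE = "candidate"  # default for anything the AI drafted
    VERIFIED = "verified"    # set only after a human check


@dataclass
class Citation:
    key: str
    status: Status = Status.CANDIDATE
    checked_against: Optional[str] = None
    checked_by: Optional[str] = None


@dataclass
class Draft:
    module: str
    citations: List[Citation] = field(default_factory=list)


def mark_verified(citation: Citation, source: str, reviewer: str) -> None:
    """Record that a named person confirmed the citation against a source."""
    if source not in ALLOWED_SOURCES:
        raise ValueError(f"unknown verification source: {source}")
    citation.status = Status.VERIFIED
    citation.checked_against = source
    citation.checked_by = reviewer


def blocking(draft: Draft) -> List[str]:
    """Keys of citations that are still candidate."""
    return [c.key for c in draft.citations if c.status is not Status.VERIFIED]


def can_seed_next_module(draft: Draft) -> bool:
    """Only fully verified drafts may seed the next module."""
    return not blocking(draft)


draft = Draft("Prediction", [Citation("placeholder-1"), Citation("placeholder-2")])
mark_verified(draft.citations[0], "Crossref", reviewer="author")
print(blocking(draft), can_seed_next_module(draft))
```

The design choice worth noting is that nothing in the sketch lets the drafting step set VERIFIED; the status changes only through a call that names a reviewer.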

Evidence, honestly

The flywheel is two loops, and only one is spinning.

The authoring flywheel is closed and running: it produces verified content on every pass.

The learning flywheel is designed but not spinning. The per-module checks are its intended instrument: they would tell you whether a given depth level actually built the capability. They are ungraded and undeployed. With one authored module and no readers, there is no learner evidence yet. Naming this is the point, not a disclaimer around it.

Adaptation: turning the learning flywheel by hand

The learning flywheel's first turn was never going to be an analytics pipeline. It is a person reading feedback and making an attribution call. That call is the teachable move, and it is exactly the one adjacent programs skip.

The discipline is: do not patch, attribute first. Every reader miss is one of three things, and they take different repairs. This is gap attribution at artifact scale.

  • Content gap: the explanation is wrong or incomplete. Fix the beat.
  • Design gap: the explanation is right but shown at the wrong depth or in the wrong order. Fix the segment engine's view range or sequencing, not the beat.
  • Framing gap: the metaphor itself misleads. A scaffold that makes prediction feel like lookup is actively teaching the wrong model. Fix the scaffold.

Same observed miss, three different fixes. Naming which one you are in before touching the file is the judgment the exemplar is built to teach.

Three caveats keep this honest:

  1. Include a "no change" verdict. Not every miss is a defect. The point of five views is that not every reader clears every level; sometimes the right adaptation is routing the reader down a view, not rewriting the beat. Over-adapting to feedback pulls the artifact toward the mean and blunts the expert levels. Attribute-first is what protects against that: if the gap is the reader's foundation, the fix is routing, not revision.
  2. AI reappears at the revision step, and so does the gate. Feed it the miss and it proposes a rewrite, with the same epistemic risk: it will confidently propose a patch that misdiagnoses the gap, usually defaulting to "add more words" when the real fault was sequence or metaphor. The gate moves from authoring to revision. The AI proposes the adaptation; the human validates the attribution before it ships.
  3. Early evidence is thin by construction. Ungraded, optional, self-selected checks at near-zero N are a weak signal. The highest-value evidence available today is verbal: one reader saying "I still don't see why prediction isn't just lookup" out-informs a hundred slider events. Lead with that until traffic makes behavior mean something.
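Attribute-first, together with the first two caveats, can be sketched as a small record that refuses to name a repair until a human has validated the attribution. All names here (Gap, REPAIR, Proposal, next_action) are hypothetical; the repair strings paraphrase the lists above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Gap(Enum):
    CONTENT = "content"      # explanation wrong or incomplete
    DESIGN = "design"        # right explanation, wrong depth or order
    FRAMING = "framing"      # the metaphor itself misleads
    NO_CHANGE = "no_change"  # the miss is not a defect in the artifact


# Each verdict maps to a different repair; the same miss can land in any of them.
REPAIR = {
    Gap.CONTENT: "fix the beat",
    Gap.DESIGN: "fix the view range or sequencing, not the beat",
    Gap.FRAMING: "fix the scaffold",
    Gap.NO_CHANGE: "route the reader down a view; leave the beat alone",
}


@dataclass
class Proposal:
    miss: str                           # what the reader said or did
    attributed_gap: Gap                 # proposed by a human or by the AI
    proposed_by: str                    # "human" or "ai"
    validated_by: Optional[str] = None  # person who confirmed the attribution


def next_action(p: Proposal) -> str:
    """Name a repair only after a human has validated the attribution."""
    if p.validated_by is None:
        return "hold: attribution not yet validated by a human"
    return REPAIR[p.attributed_gap]


p = Proposal(
    miss="I still don't see why prediction isn't just lookup",
    attributed_gap=Gap.CONTENT,  # the 'add more words' default
    proposed_by="ai",
)
print(next_action(p))

# A human re-attributes the miss and signs off before anything ships.
p.attributed_gap = Gap.FRAMING
p.validated_by = "author"
print(next_action(p))
```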

Instrumentation and ethics: what we chose to measure, and what we did not

Capture is the easy part. The slider, the glossary popovers, and the checks are already client-side events; attaching the current view and the term identity to each is an afternoon's work. The real question is persistence: a static GitHub Pages build has nowhere for events to land, so "can we track this" is a design and ethics decision before it is an engineering one. Three shapes:

  • Local only. Trivial, no infrastructure, but one reader on one device with no aggregation. Fine for a reader reviewing their own path; useless as evidence.
  • Ambient aggregate analytics. Easy to wire and gives counts across readers, cookieless if chosen carefully. It also crosses the surveillance line the program itself interrogates, even anonymized. Turning ambient tracking on by default on a public explainer is a choice we would have to defend in our own seminar.
  • Consented study mode. A flag that logs richly only in facilitated or opted-in sessions and stays dark otherwise. Minimal, purpose-bound, defensible. It lets the artifact model the thesis it teaches rather than contradict it.

The decision follows from the evidence position, not from what is technically possible. At one authored module with no deployment, instrumentation is premature. The signals cheapest to log — slider jumps and glossary clicks — are the weak-N ones, and an event is not a construct: a reader jumping from view one to view four could be bored, lost and fleeing, or an expert skipping ahead. The click cannot tell you which. You are back to attribution, and telemetry cannot make that call for you.

So the instrumentation is built for interpretation, not volume, and only inside consented study mode: log the checks and the slider and glossary events so each can be paired with what the reader said aloud. The pairing is what makes an event mean anything. Ambient public analytics stays off until there is a question only counts can answer.
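A consented study mode could look roughly like the sketch below: a logger that stays dark unless consent is set, and that keeps each event next to what the reader said. In the real artifact this logic would live client-side; Python is used here only to show the shape. Event, StudyLog, and their fields are hypothetical names, and "attention" is a placeholder glossary term.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Event:
    kind: str                         # "slider", "glossary", or "check"
    view: int                         # view the reader was on
    term: Optional[str] = None        # glossary term, if any
    said_aloud: Optional[str] = None  # the reader's own words, paired later


@dataclass
class StudyLog:
    consented: bool = False  # dark by default
    events: List[Event] = field(default_factory=list)

    def record(self, event: Event) -> bool:
        """Store the event only inside a consented session."""
        if not self.consented:
            return False
        self.events.append(event)
        return True

    def pair_last(self, remark: str) -> None:
        """Attach the reader's remark to the most recent event."""
        if not self.events:
            raise ValueError("no event to pair the remark with")
        self.events[-1].said_aloud = remark

    def unpaired(self) -> List[Event]:
        """Events that still lack the remark needed to interpret them."""
        return [e for e in self.events if e.said_aloud is None]


public = StudyLog()                  # an ordinary public visit
print(public.record(Event("slider", view=4)), len(public.events))

session = StudyLog(consented=True)   # a facilitated, opted-in session
session.record(Event("slider", view=4))
session.pair_last("I still don't see why prediction isn't just lookup")
session.record(Event("glossary", view=4, term="attention"))
print(len(session.events), len(session.unpaired()))
```

The unpaired list is the useful output: an event with no remark attached is exactly the kind of click that cannot say whether the reader was bored, lost, or skipping ahead.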

One carryover from the spine: if these events are later handed to a model to find the patterns, it will confidently narrate noise and default to tidy stories the data does not support. The same gate applies at the analysis step. The model proposes the reading; the human validates the attribution before it touches a beat.

A restrained instrumentation decision, stated with its reasoning, is a stronger artifact than a dashboard: what we chose to measure, what we chose not to, and why.

Boundary

One authored module, no deployment, learning flywheel unproven. The case earns its place as a method exemplar, not as an outcomes claim. Its value is the demonstration it does make: AI held as partner and risk under an explicit verification gate, at authoring, at revision, and at analysis; capability defined by consequence rather than content coverage; the flywheel practiced as a production discipline before it is claimed as a learning-analytics result.


Produced with AI synthesis and verified by the author. Content and citation claims are subject to human review against primary sources before external use.