From 00116d393e4da686e7c3673220649c600d139f88 Mon Sep 17 00:00:00 2001
From: Shayan Rais
Date: Thu, 7 May 2026 12:48:20 +0500
Subject: [PATCH] reformat slides 12 and 13 to match slide 10 pattern; trim slide 12 heading
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Slides 12 ("How an LLM tokenizes input") and 13 ("What the LLM actually
sees: integer token IDs") restructured to use the heading-with-separator
pattern modeled after slide 17 (and just-applied to slide 10):

- uses default styling (no inline overrides) → border-bottom separator
- Outer flex-centering wrapper dropped
- SVG-internal title promoted to slide heading; SVG-internal subtitle
  promoted to single-line bold caption
- Figure max-width: 860px → 1100px (slide 12); 960px → 1100px (slide 13)

Slide 12 heading shortened to "How an LLM tokenizes input" (was "How an
LLM tokenizes input and generates text autoregressively"); the longer form
was wrapping to two lines on a projector. Autoregressive generation is
already covered on slide 10; slide 12's caption makes clear this slide's
focus is tokenization specifically.

Slide 13 heading trimmed: "What the LLM actually sees: integer token IDs"
(dropped "(advanced view)" parenthetical, which read as redundant
scaffolding in heading position).

Slide 13 caption: chose the comma-as-ID-11 line over the abstract
sequence-of-integers definition. The math notation from the SVG subtitle
was deliberately not promoted: it's been removed from the SVG entirely
(see paired commit b667fc5).

Co-Authored-By: Claude
---
 .../claude-code-best-practice/index.html | 60 +++++++------------
 1 file changed, 22 insertions(+), 38 deletions(-)

diff --git a/presentation/claude-code-best-practice/index.html b/presentation/claude-code-best-practice/index.html
index 521c0b2..3ce3142 100644
--- a/presentation/claude-code-best-practice/index.html
+++ b/presentation/claude-code-best-practice/index.html
@@ -522,50 +522,34 @@
-
- - -

Tokens in, tokens out

- - -
- Animated diagram combining tokenization and autoregressive generation: the BPE-tokenized prompt feeds into the LLM, which generates the answer token-by-token using the same shared vocabulary. -
- Input and output share the same vocabulary — tokenization shapes what the model even “sees”.
- “Anthropic” becomes “Anth” + “ropic” because that’s how it appears most often in training data. -
-
- -
+

How an LLM tokenizes input

+
+ Animated diagram combining tokenization and autoregressive generation: the BPE-tokenized prompt feeds into the LLM, which generates the answer token-by-token using the same shared vocabulary. +
+ BPE chops words into subword tokens — same color = same word, gray = punctuation. +
+
-
- - -

What the model actually sees

- - -
- Animated diagram showing the 32 integer token IDs the model receives: e.g. 28133 for 'Does', 17554 for ' Chat', 162016 for 'GPT', 97481 for ' Claude'. Generated tokens are also shown as IDs. Vocab size V ≈ 200,000. -
- The model never reads text — it reads a sequence of integers, each one an index into a vocabulary of ~200,000 entries.
- Notice the comma is always ID 11 — the same punctuation mark maps to the same integer, everywhere, every time. -
-
- -
+

What the LLM actually sees: integer token IDs

+
+ Animated diagram showing the 32 integer token IDs the model receives: e.g. 28133 for 'Does', 17554 for ' Chat', 162016 for 'GPT', 97481 for ' Claude'. Generated tokens are also shown as IDs. Vocab size V ≈ 200,000. +
+ Notice the comma is always ID 11 — the same punctuation mark maps to the same integer, everywhere, every time. +
+