From 9a2657c431a0c504bfb4db4bbb057d887ae59855 Mon Sep 17 00:00:00 2001
From: Shayan Rais
Date: Thu, 7 May 2026 11:58:57 +0500
Subject: [PATCH] =?UTF-8?q?add=20llm-animation-tokenids.svg=20=E2=80=94=20?=
 =?UTF-8?q?advanced=20tokenization=20view=20with=20integer=20IDs?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Animated SVG showing what the LLM actually receives: integer token IDs
(one layer deeper than llm-advanced.svg). Each of the 32 input tokens
displays the ID prominently with the token text in small italic
underneath (e.g., 28133 "Does", 17554 " Chat", 162016 "GPT",
97481 " Claude", 29683 " Anth", 71571 "ropic"). Same 7-iteration
autoregressive loop; generated tokens also shown as IDs. Vocab size
labeled V ≈ 200,000. Title formula: f: ℤᵏ → ℝⱽ; next_id = argmax(f(ids)).
ViewBox 1360×600 (wider than the other LLM SVGs).

Co-Authored-By: Claude
---
 .../assets/llm/llm-animation-tokenids.svg | 157 ++++++++++++++++++
 1 file changed, 157 insertions(+)
 create mode 100644 presentation/assets/llm/llm-animation-tokenids.svg

diff --git a/presentation/assets/llm/llm-animation-tokenids.svg b/presentation/assets/llm/llm-animation-tokenids.svg
new file mode 100644
index 0000000..2819bd8
--- /dev/null
+++ b/presentation/assets/llm/llm-animation-tokenids.svg
@@ -0,0 +1,157 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ What the LLM actually sees: integer token IDs (advanced view)
+
+
+ BPE encodes text → integer IDs.
The model is a function f: ℤᵏ → ℝⱽ ; next_id = argmax(f(ids))
+
+
+ ITERATION 1 / 7
+ ITERATION 2 / 7
+ ITERATION 3 / 7
+ ITERATION 4 / 7
+ ITERATION 5 / 7
+ ITERATION 6 / 7
+ ITERATION 7 / 7
+
+
+ INPUT TOKEN IDs (k = 32, vocab V ≈ 200,000)
+
+
+ Prompt encoded as 32 IDs (large) with token text below (small italic)
+ 28133Does
+ 17554 Chat
+ 162016GPT
+ 11,
+ 97481 Claude
+ 11,
+ 29683 Anth
+ 71571ropic
+ 11,
+ 451 Ll
+ 42804ama
+ 11,
+ 391 Mi
+ 2534str
+ 280al
+ 11,
+ 115613 Gemini
+ 11,
+ 326 and
+ 4651 Per
+ 12081plex
+ 536ity
+ 722 all
+ 1199 use
+ 20445 Byte
+ 10316-
+ 1517Pair
+ 70820 Encoding
+ 350 (
+ 33B
+ 3111PE
+ 20707)?
+
+ Generated token IDs (autoregressive feedback)
+ 12814*Yes
+ 11,
+ 722 all
+ 328* of
+ 1295* them
+ 656* do
+ 13.
+
+
+
+
+
+
+
+ LLM
+ f: ℤᵏ → ℝⱽ
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ no characters inside the box — only integers
+
+
+
+
+
+
+ PREDICTED NEXT TOKEN ID
+
+ argmax over V ≈ 200,000 logit dimensions
+ next_token_id = 12814* ↓ decodes to "Yes"
+ next_token_id = 11 ↓ decodes to ","
+ next_token_id = 722 ↓ decodes to " all"
+ next_token_id = 328* ↓ decodes to " of"
+ next_token_id = 1295* ↓ decodes to " them"
+ next_token_id = 656* ↓ decodes to " do"
+ next_token_id = 13 ↓ decodes to "."
+ decoding text is post-processing — the model never produces strings
+
+
+
+
+
+
+ next_token_id appended to input_ids → next forward pass
+
+
+
+ * Response IDs are illustrative estimates; prompt IDs are from OpenAI's o200k_base tokenizer.
+
+
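
The loop the animation depicts (next_id = argmax(f(ids)), with each predicted ID appended to input_ids before the next forward pass) can be sketched in a few lines of Python. This is a minimal illustration and not code from the patch: `argmax`, `generate`, `toy_model`, and the 16-entry vocabulary are hypothetical stand-ins for the real model and its V ≈ 200,000 vocabulary.

```python
from typing import Callable, List, Sequence

V = 16  # toy vocabulary size (assumption; the SVG labels the real one V ≈ 200,000)


def argmax(values: Sequence[float]) -> int:
    """Return the index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def generate(f: Callable[[List[int]], List[float]],
             ids: Sequence[int],
             steps: int) -> List[int]:
    """Greedy autoregressive decoding over integer IDs only."""
    ids = list(ids)
    for _ in range(steps):
        logits = f(ids)           # f maps k integer IDs to V real-valued logits
        next_id = argmax(logits)  # pick the highest-scoring token ID
        ids.append(next_id)       # feed it back in for the next forward pass
    return ids


def toy_model(ids: List[int]) -> List[float]:
    """Hypothetical stand-in for the LLM: favours (last_id + 1) mod V."""
    target = (ids[-1] + 1) % V
    return [1.0 if i == target else 0.0 for i in range(V)]


if __name__ == "__main__":
    # Seven iterations, matching the seven frames of the animation.
    print(generate(toy_model, [3, 7], steps=7))
```

No strings appear anywhere in the loop; turning the resulting IDs back into text would be a separate decoding step performed by the tokenizer.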