-
-
-
-
-
- Input and output share the same vocabulary — tokenization shapes what the model even “sees”.
- “Anthropic” becomes “Anth” + “ropic” because those subword pieces were frequent enough in the tokenizer's training data to earn their own vocabulary entries, while the whole word was not.
-
-
-
-
+
How an LLM tokenizes input
+
+
+
+ BPE chops words into subword tokens — same color = same word, gray = punctuation.
+
+
-
-
-
-
What the model actually sees
-
-
-
-
-
- The model never reads text — it reads a sequence of integers, each one an index into a vocabulary of ~200,000 entries.
- Notice the comma is always ID 11 — the same punctuation mark maps to the same integer, everywhere, every time.
-
-
-
-
+
What the LLM actually sees: integer token IDs
+
+
+
+ Notice the comma is always ID 11 — the same punctuation mark maps to the same integer, everywhere, every time.
+
+
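
The text-to-integer mapping described in the captions above can be sketched in a few lines of Python. This is a toy illustration, not a real tokenizer: the vocabulary `TOY_VOCAB`, the helpers `encode` and `decode`, and every ID except the comma's are hypothetical, and greedy longest-match lookup stands in for the merge procedure that actual BPE uses. The comma is assigned ID 11 only to mirror the caption.

```python
# Toy vocabulary. Entries and IDs are illustrative assumptions;
# "," is given ID 11 only to mirror the caption above.
TOY_VOCAB = {
    "Anth": 0,
    "ropic": 1,
    " builds": 2,
    " safe": 3,
    " models": 4,
    " helpful": 5,
    ",": 11,
    ".": 13,
}


def encode(text, vocab=TOY_VOCAB):
    """Map text to token IDs by greedy longest-match lookup.

    Real BPE applies learned merge rules instead; this is a simplification.
    """
    ids = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i:]!r}")
    return ids


def decode(ids, vocab=TOY_VOCAB):
    """Invert the mapping: token IDs back to text."""
    inverse = {token_id: piece for piece, token_id in vocab.items()}
    return "".join(inverse[token_id] for token_id in ids)


sentence = "Anthropic builds safe, helpful models."
token_ids = encode(sentence)
print(token_ids)
print(decode(token_ids) == sentence)
```

Under these assumptions the model-facing input is only the list of integers. The same string always yields the same IDs, and decoding them recovers the original text exactly.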