ACL2025 Pagnoni: Patches Scale Better Than Tokens

One-Liner
“Grouping tokens into patches scales better than plain tokenization”
Motivation / Novelty
typical byte-level LMs are very expensive because their sequences have many tokens
it's hard to go beyond 4-6 bytes per token: Zipf’s Law
so, we model them as token patches

Notable Methods
token patch
“how do we segment the byte sequence into patches?” Insight: group predictable tokens after every hard choice! i.e., once you train a model, there are “obvious”
patcher and unpatcher cross-attend
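
A minimal sketch of the "group predictable tokens after every hard choice" idea from the token patch note. It assumes a trained model supplies one uncertainty score per position (for example, next-step entropy). The function name `segment_into_patches`, the threshold value, and the scores themselves are hypothetical illustrations, not the paper's implementation.

```python
from typing import List


def segment_into_patches(uncertainty: List[float], threshold: float) -> List[List[int]]:
    """Split positions 0..n-1 into contiguous patches.

    A new patch starts at every position whose uncertainty exceeds
    `threshold` (a "hard choice"); the predictable positions that follow
    are appended to that same patch.
    """
    patches: List[List[int]] = []
    current: List[int] = []
    for position, score in enumerate(uncertainty):
        # A hard choice closes the running patch and opens a new one.
        if score > threshold and current:
            patches.append(current)
            current = []
        current.append(position)
    if current:
        patches.append(current)
    return patches


# Made-up scores: high values mark hard choices, low values mark "obvious" steps.
scores = [3.1, 0.2, 0.1, 0.3, 2.8, 0.4, 0.1, 2.9, 0.2]
patches = segment_into_patches(scores, threshold=2.0)
print(patches)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8]]
```

With these scores, nine positions collapse into three patches, so a model operating on patches would take three steps instead of nine. A lower threshold gives more, shorter patches; a higher one gives fewer, longer patches.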

Key Figs

New Concepts

Notes