---
title: "ICLR2025 Neitemeier: Hierarchical Autoregressive Transformers"
source: https://www.jemoka.com/posts/kbhiclr2025_neitemeier_hierachical_autoregressive_transformers/
---

“A byte-level transformer, with some compression”

Key insight: put a [CLS] token in front of every word to train a small “tokenizer”, run a normal transformer over the [CLS] tokens, and then autoregressively decode out the individual bytes.

## Method

### Hierarchical Autoregressive Transformers

We put a [CLS] in front of every word, so the input looks like:

`[CLS] M y _ [CLS] n a m e _ [CLS] i s`

We then run a small encoder over each word's sequence. Then we take the encoded [CLS] tokens, run a normal transformer over them, and autoregressively decode out the single bytes.

### Dynamically Allocating Compute

You can dynamically allocate [CLS] tokens.
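
As a rough illustration of the input layout described under Hierarchical Autoregressive Transformers, the sketch below inserts a `[CLS]` marker in front of every word and spells the rest out symbol by symbol. It is not the paper's code: the function name `add_cls_markers`, splitting on whitespace, and rendering each space as `_` attached to the preceding word are all assumptions made to reproduce the example above, and it works on characters rather than raw bytes for readability.

```python
CLS = "[CLS]"  # marker placed in front of every word (name assumed)


def add_cls_markers(text: str) -> list[str]:
    """Return a flat symbol sequence with a [CLS] marker before each word.

    Assumption: words are separated by whitespace, and the separator is
    rendered as "_" at the end of every word except the last one.
    """
    words = text.split()
    symbols: list[str] = []
    for i, word in enumerate(words):
        symbols.append(CLS)        # one [CLS] per word
        symbols.extend(word)       # one symbol per character of the word
        if i < len(words) - 1:
            symbols.append("_")    # visible stand-in for the space
    return symbols


if __name__ == "__main__":
    seq = add_cls_markers("My name is")
    print(" ".join(seq))
    # The backbone transformer would see one position per [CLS],
    # while the small encoder and decoder operate on the symbols.
    print(seq.count(CLS), "word-level positions for", len(seq) - seq.count(CLS), "symbols")
```

In this sketch the number of `[CLS]` markers is the length of the sequence the main transformer attends over, which is where the "some compression" relative to a plain byte-level model comes from.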