Suggest edit — Byte-Pair Encoding

Title

Name

Note

---
title: "Byte-Pair Encoding"
type: concept
related: [Bpe, Morpheme]
source: https://www.jemoka.com/posts/kbhbpe/
confidence: high
status: active
---

BPE is a common Subword Tokenization scheme.
Training choose two symbols that are most frequency adjacent merge those two symbols as one symbol throughout the text repeat to step \(1\) until we merge \(k\) times v = set(corpus.characters()) for i in range(k): tl, tr = get_most_common_bigram(v) tnew = f&#34;{tl}{tr}&#34; v.push(tnew) corpus.replace((tl,tr), tnew) return v Most commonly, BPE is not ran alone: it usually run inside space separation systems. Hence, after each word we usually put a special _ token which delineates end of word.
Hence: &ldquo;pink fluffy unicorn dancing on rainbows&rdquo; becomes
p i n k _ f l u f f y _ u n i c o r n _ d a n c i n g _ o n _ r a i n b o w s Inference During inference time, we apply our stored merges in the order we learned them. As in, if we merged er first during training, we should do that first during inference before merging say n er.
Frequent subwords often ends up being morphemes.