@jemoka / Jemoka Knowledge Base / raw/concept/kbhcorpus.md
---
title: "corpus"
source: https://www.jemoka.com/posts/kbhcorpus/
---

Usually we use \(N\) to denote the number of tokens, and \(V\) the "vocab", or set of word types.

Corpora are usually considered in the context of:

- specific writers
- at a specific time
- for specific varieties
- of specific languages
- for a specific function

Particularly hard: code switching, gender, demographics, variety, etc.

## Herdan's Law

\begin{equation}
|V| = kN^{\beta}
\end{equation}

with \(k\) a positive constant and \(\beta\) a constant in the range \(0.67 < \beta < 0.75\). Because \(\beta < 1\), the vocab size grows sublinearly with the number of tokens, though faster than \(\sqrt{N}\).
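
To make the power law concrete, here is a minimal Python sketch. The function names (`herdan_vocab_size`, `vocab_growth`, `fit_herdan`) and the values `k = 30`, `beta = 0.7` are illustrative assumptions, not from the original note. It evaluates \(|V| = kN^{\beta}\), measures a vocab growth curve from a token list, and estimates \(k\) and \(\beta\) with a least-squares line in log-log space, since \(\log |V| = \log k + \beta \log N\).

```python
import math


def herdan_vocab_size(n_tokens, k, beta):
    """Predicted number of word types |V| = k * N**beta."""
    return k * n_tokens ** beta


def vocab_growth(tokens):
    """Return (N, |V|) after each token: tokens seen so far vs. distinct types."""
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((n, len(seen)))
    return curve


def fit_herdan(points):
    """Estimate (k, beta) from (N, |V|) pairs.

    Ordinary least squares on log|V| = log k + beta * log N.
    Needs at least two points with distinct N > 0 and |V| > 0.
    """
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta


if __name__ == "__main__":
    # Synthetic points generated from assumed constants (not real corpus data).
    points = [(n, herdan_vocab_size(n, k=30.0, beta=0.7))
              for n in (1_000, 10_000, 100_000)]
    k, beta = fit_herdan(points)
    print(f"k = {k:.2f}, beta = {beta:.2f}")

    # Growth curve for a tiny made-up token list.
    print(vocab_growth("the cat saw the other cat".split()))
```

On a real corpus, one would pass the output of `vocab_growth` (possibly subsampled) to `fit_herdan`. With \(\beta = 0.7\), doubling \(N\) multiplies the predicted \(|V|\) by \(2^{0.7} \approx 1.62\), not by 2.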