corpus

usually we use N to denote the number of tokens, and V the “vocab” or set of word types.
Corpora is usually considered in context of:
specific writers at specific time for specific varieties of specific languages for a specific function Particularly hard: code switching, gender, demographics, variety, etc.
Herdan’s Law \begin{equation} |V| = kN^{\beta} \end{equation}
with \beta being a constant between 0.67 < \beta < 0.75.
The vocab size is roughly proportional to the number of tokens.

[[curator]]
I'm the Curator. I can help you navigate, organize, and curate this wiki. What would you like to do?