@jemoka / Jemoka Knowledge Base / wiki/papers/moe_review.md
---
title: "MOE_REVIEW Paper Notes"
type: paper-collection
source-count: 13
status: active
---

# MOE_REVIEW Paper Notes

13 papers reviewed.

### MOEReview Fedus: Switch Transformers

At scale, with regularization (including dropout), k=1 on expert routing is fine!

*[Full note](https://www.jemoka.com/posts/kbhmoereview_fedus_switch_transformers/)*

### MOEReview Gale: MegaBlocks

Standard MoEs either waste computation by padding unused capacity within each expert, or drop tokens assigned to an expert when it exceeds capacity (i.e. truncate so that we don’t have to pad too much).

Method: instead of padding or dropping, leverage efficient block-sparse multiplication to support variably-sized experts.

*[Full note](https://www.jemoka.com/posts/kbhmoereview_gale_megablocks/)*

### MOEReview Kaushik: Universal Subspace Hypothesis

One-Liner: There’s a low-rank “shared” universal subspace across many pretrained LMs, which could thus be leveraged to adapt a model to new tasks more easily.

Notable Methods: Did a PCA, and projected variance from one architecture to others (i.e. LoRAs trained for different things).

*[Full note](https://www.jemoka.com/posts/kbhmoereview_kaushik_universal_subspace_hypothesis/)*

### MOEReview Krajewski: Scaling Laws for MoE

Define “granularity” as:

\begin{equation}
G = \frac{d_{\text{ff}}}{d_{\text{expert}}}
\end{equation}

At \(G=1\), we have a dense model; at \(G>1\), we have some kind of MoE.

Notice how the scaling laws are mostly linear: tiny experts, yay!

*[Full note](https://www.jemoka.com/posts/kbhmoereview_krajewski_scaling_laws_for_moe/)*

### MOEReview Li: Branch-Train-Merge

- take a weighted parameter average of the existing experts (or copy the new experts)
- train each expert independently

Then, at inference, we can either use domain-conditioned averaging between the experts, or average the parameters of the experts.
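
The two inference options for Branch-Train-Merge can be sketched roughly as follows. This is a minimal NumPy sketch, not the paper's implementation: the function names, the dict-of-arrays parameter layout, and the toy numbers are all hypothetical. The exact domain-conditioned formula is not reproduced in these notes, so `ensemble_outputs` simply assumes a weighted mixture of the experts' output distributions.

```python
import numpy as np


def average_expert_params(experts, weights):
    """Weighted average of expert parameters.

    experts: list of dicts mapping parameter name -> np.ndarray (same shapes).
    weights: one non-negative weight per expert; normalized here to sum to 1.
    """
    weights = np.asarray(weights, dtype=float)
    if len(experts) != len(weights):
        raise ValueError("need exactly one weight per expert")
    weights = weights / weights.sum()
    return {
        name: sum(w * expert[name] for w, expert in zip(weights, experts))
        for name in experts[0]
    }


def ensemble_outputs(expert_probs, domain_weights):
    """Mix per-expert output distributions (assumed form, see the lead-in).

    expert_probs: (n_experts, vocab) array, each row a distribution.
    domain_weights: (n_experts,) weights, e.g. an estimated domain posterior.
    """
    expert_probs = np.asarray(expert_probs, dtype=float)
    domain_weights = np.asarray(domain_weights, dtype=float)
    domain_weights = domain_weights / domain_weights.sum()
    return domain_weights @ expert_probs


# Toy usage with two hypothetical experts.
experts = [{"w": np.array([0.0, 2.0])}, {"w": np.array([4.0, 6.0])}]
merged = average_expert_params(experts, weights=[1, 3])
mixed = ensemble_outputs([[0.5, 0.5], [0.1, 0.9]], [0.5, 0.5])
```

Parameter averaging yields a single model with the inference cost of one expert, while output ensembling has to run every expert that receives non-zero weight.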

*[Full note](https://www.jemoka.com/posts/kbhmoereview_li_branch_train_merge/)*

### MOEReview Pan: Dense Training Sparse Inference

Train the experts densely, then keep only the top-k during inference.

*[Full note](https://www.jemoka.com/posts/kbhmoereview_pan_dense_training_sparse_inference/)*

### MOEReview Rajbhandari: DeepSpeed MoE

Proposes: more MoEs at later layers + a shared expert.

*[Full note](https://www.jemoka.com/posts/kbhmoereview_rajbhandari_deepspeed_moe/)*

### MOEReview Sharma: LAZER

One-Liner: Getting rid of low singular value components in weights actually improves model performance.

Motivation: Previous work has shown that pruning SVD components works without significant performance degradation. This work shows that by choosing more carefully where to prune, we can obtain better-than-baseline performance.

Notable Methods: We do this by trying all reductions based on \(\qty(\tau, \ell, \rho)\) tuples where we have \(\tau\) being the parameter type (projs q, k, v, attn ou...

*[Full note](https://www.jemoka.com/posts/kbhmoereview_sharma_lazer/)*

### MOEReview Shen: ModuleFormer

The ol’ load balancing loss. Instead of training a router with explicitly labeled data for each expert, a load balancing + load concentration loss induces the modularity in data.

Insight: we want to maximize the mutual information between tokens and modules. For the router \(m \sim g\qty(\cdot \mid x)\) (“which module \(m\) should we assign, given token \(x\)”), we write:

\begin{equation} \ell_{MI} = \underbrace{\sum_{m=1}^{N} p\qty(m) \log p\qty(m)}_{-H\qty(m)} - \frac{1}{|X|...

*[Full note](https://www.jemoka.com/posts/kbhmoereview_shen_moduleformer/)*

### MOEReview Sukhbaatar: Branch-Train-MiX

It’s MOEReview Li: Branch-Train-Merge, but with MoEs now. Each layer is combined by standard MoE routing with a weight that is tuned.
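
To make "combined by standard MoE routing with a weight that is tuned" concrete, here is a minimal NumPy sketch of top-k routing over a set of already-trained expert feed-forward functions. It is a sketch under assumptions, not the paper's code: `moe_combine`, `router_w`, the toy experts, and renormalizing the gate over the selected experts are all illustrative choices. Only `router_w` stands in for the newly tuned weight.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def moe_combine(x, expert_ffns, router_w, k=2):
    """Route each token to its top-k experts and mix their outputs.

    x: (tokens, d) activations.
    expert_ffns: list of callables mapping a (d,) vector to a (d,) vector;
        these stand in for the separately trained experts.
    router_w: (d, n_experts) router weight (the part assumed to be tuned).
    """
    x = np.asarray(x, dtype=float)
    logits = x @ router_w                        # (tokens, n_experts)
    topk = np.argsort(-logits, axis=-1)[:, :k]   # indices of the k largest logits
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gate = softmax(logits[t, topk[t]])       # renormalize over chosen experts
        for g, e in zip(gate, topk[t]):
            out[t] += g * expert_ffns[e](x[t])
    return out


# Toy usage: three hypothetical "experts" that just scale their input.
toy_experts = [lambda v: 1.0 * v, lambda v: 2.0 * v, lambda v: 3.0 * v]
toy_router = np.array([[0.0, 1.0, 2.0], [0.0, 0.0, 0.0]])
y = moe_combine(np.array([[1.0, 0.0]]), toy_experts, toy_router, k=2)
```

The per-token Python loop keeps the routing logic readable; a real implementation would batch tokens by expert.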

*[Full note](https://www.jemoka.com/posts/kbhmoereview_sukhbaatar_branch_train_mix/)*

### MOEReview Tan: Scattered MoE

A single kernel that scatters the residuals and runs the forward pass at the same time, instead of copying and grouping first.

*[Full note](https://www.jemoka.com/posts/kbhmoereview_tan_scattered/)*

### MOEReview Yun: Inference-Optimal MoEs

“the scaling law (Section 3) shows that more experts (larger E) result in a higher performance; on the other hand, more experts result in a larger inference cost (Section 4.2)”

How do we trade off the cost of more experts (in terms of GPU-seconds, or monetary cost given some per-second GPU cost \(C_0\)) against performance?

So, slight over-training achieves better performance.

Two findings: fewer, bigger experts (4/8) are the most serving-efficient, but cost more to train to the same los...

*[Full note](https://www.jemoka.com/posts/kbhmoereview_yun_inference_optimal_moes/)*

### MOEReview Zhang: Mixture of Attention Heads

Split the \(Q\) projection and the attention output projection into experts, with one router coordinating them. Performs better than MHA.

*[Full note](https://www.jemoka.com/posts/kbhmoereview_zhang_mixure_of_attention_heads/)*
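
The Mixture of Attention Heads idea can be sketched as below: each expert owns a query projection and an output projection, and a single router picks the top-k experts per token. This is a rough NumPy sketch under assumptions the note does not spell out: the key and value projections are shared across experts, gates are renormalized over the selected experts, and there is no causal mask. All names (`moa_layer`, `wq`, `wo`, `wk`, `wv`, `router_w`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def moa_layer(x, wq, wo, wk, wv, router_w, k=2):
    """Attention where the Q and output projections are routed experts.

    x: (T, d) token activations.
    wq: (E, d, h) per-expert query projections.
    wo: (E, h, d) per-expert output projections.
    wk, wv: (d, h) key/value projections, assumed shared by all experts.
    router_w: (d, E) router weight; one router selects both wq[e] and wo[e].
    """
    x = np.asarray(x, dtype=float)
    T, d = x.shape
    h = wk.shape[1]
    keys, values = x @ wk, x @ wv                 # (T, h) each, computed once
    probs = softmax(x @ router_w)                 # (T, E) routing probabilities
    topk = np.argsort(-probs, axis=-1)[:, :k]     # top-k experts per token
    out = np.zeros((T, d))
    for t in range(T):
        gates = probs[t, topk[t]]
        gates = gates / gates.sum()               # renormalize over chosen experts
        for g, e in zip(gates, topk[t]):
            q = x[t] @ wq[e]                      # (h,) expert-specific query
            attn = softmax(q @ keys.T / np.sqrt(h))   # (T,) weights, no causal mask
            out[t] += g * ((attn @ values) @ wo[e])   # expert-specific output proj
    return out
```

With a single expert and `k=1` this reduces to ordinary single-head attention, which is a convenient sanity check on the routing logic.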