TL;DR. In this blog, we walk through the derivation of Muon and attempt to explain it from the directional-sharpness perspective. The final attempt shows that the benefit of Muon over SGD is closely tied to the effective rank of the gradient, a claim consistent with the current literature.
Steepest descent. First, steepest descent seeks an update that minimizes
<aside> 💡
Steepest descent surrogate objective.
$$ f(x) + \langle \nabla f(x), \Delta \rangle + \frac{L}{2} \Vert \Delta \Vert^2, $$
where $\Vert \cdot \Vert$ might not indicate the $\ell_2$-norm and $L$ denotes a sharpness-related parameter.
</aside>
We also need the definition of dual norm.
<aside> 💡
Dual norm. Let $\Vert \cdot \Vert$ be a norm and $\Vert \cdot \Vert_{*}$ its dual:
$$ \Vert x \Vert_{*} = \sup_{\Vert v\Vert=1} \langle x, v \rangle. $$
</aside>
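As a quick numerical sanity check (a sketch using numpy, not from the original post): the dual of the $\ell_1$ norm is the $\ell_\infty$ norm, since the supremum of $\langle x, v \rangle$ over the $\ell_1$ unit ball is attained at a signed standard basis vector.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)

# Dual of the l1 norm: sup over the l1 unit ball of <x, v>.
# The sup is attained at a signed standard basis vector aligned
# with the largest-magnitude coordinate of x.
i = np.argmax(np.abs(x))
v_star = np.zeros_like(x)
v_star[i] = np.sign(x[i])          # ||v_star||_1 = 1

dual_l1 = v_star @ x               # value of the dual norm
assert np.isclose(dual_l1, np.max(np.abs(x)))   # equals ||x||_inf
```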
We then build the core theorem of steepest descent.
<aside> 💡
Steepest descent proposition. For any norm $\Vert \cdot \Vert: \R^d \to \R$ with dual norm $\Vert \cdot \Vert_{*}$, we have
$$ \underset{\Delta}{\arg \min} \left( \langle \nabla f(x), \Delta \rangle + \frac{L}{2} \Vert \Delta \Vert^2 \right) = - \frac{\Vert \nabla f(x) \Vert_{*}}{L} \underset{\Vert v \Vert = 1}{\arg \max} \langle \nabla f(x), v \rangle. $$
</aside>
Proof.
Define $\Delta = cv$ with $\Vert v \Vert = 1$ and $c \ge 0$. Then,
$$ \begin{aligned} \underset{\Delta}{\arg \min} \left( \langle \nabla f(x), \Delta \rangle + \frac{L}{2} \Vert \Delta \Vert^2 \right) &= \underset{\Delta = cv,\, \Vert v \Vert=1}{\arg \min} \left( c\langle \nabla f(x), v \rangle + \frac{c^2L}{2} \right) \\ &= \underset{c \ge 0}{\arg \min} \left( c \min_{\Vert v\Vert=1}\langle \nabla f(x), v \rangle + \frac{c^2L}{2} \right) \underset{\Vert v \Vert=1}{\arg \min}\, \langle \nabla f(x), v \rangle \\ &= \underset{c \ge 0}{\arg \min} \left( -c \Vert \nabla f(x) \Vert_{*} + \frac{c^2L}{2} \right) \underset{\Vert v \Vert=1}{\arg \min}\, \langle \nabla f(x), v \rangle \\ &= - \frac{\Vert \nabla f(x) \Vert_{*}}{L} \underset{\Vert v \Vert=1}{\arg \max}\, \langle \nabla f(x), v \rangle. \end{aligned} $$
End of Proof.
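The proposition is easy to check numerically in the $\ell_2$ case, where the norm is self-dual and $\arg\max_{\Vert v \Vert = 1} \langle \nabla f(x), v \rangle = \nabla f(x) / \Vert \nabla f(x) \Vert$ (a minimal sketch in numpy; `g` stands in for $\nabla f(x)$):

```python
import numpy as np

rng = np.random.default_rng(1)
g = rng.standard_normal(8)   # stands in for grad f(x)
L = 3.0

def surrogate(delta):
    # l2 specialization of the steepest-descent surrogate objective
    return g @ delta + 0.5 * L * delta @ delta

# Closed form from the proposition: -(||g||_2 / L) * (g / ||g||_2) = -g / L.
delta_star = -(np.linalg.norm(g) / L) * (g / np.linalg.norm(g))

# Sanity check: no random perturbation of delta_star does better.
for _ in range(1000):
    noisy = delta_star + 0.1 * rng.standard_normal(8)
    assert surrogate(delta_star) <= surrogate(noisy) + 1e-12
```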
Steepest descent under the spectral norm. Treat both the gradient ( $G = \nabla f(X) \in \R^{m\times d}$ ) and the update ( $\Delta \in \R^{m\times d}$ ) as matrices and equip them with the spectral norm. Then we have
$$ \begin{aligned} \underset{\Delta}{\arg \min} \left( \langle G, \Delta \rangle_F + \frac{L}{2} \Vert \Delta \Vert^2_{op} \right) &= - \frac{\Vert G \Vert_{nuc}}{L} \underset{\Vert W \Vert_{op} = 1}{\arg \max} \langle G, W \rangle. \end{aligned} $$
Note that the dual of the spectral norm is the nuclear norm, and the inner maximization is solved by the (reduced) SVD of $G$:
$$ \begin{aligned} \underset{\Vert W \Vert_{op} = 1}{\arg \max} \langle G, W \rangle &= U V^{\top}, \text{ where } G = U\Sigma V^{\top}. \end{aligned} $$
Therefore, the overall update rule gives
$$ X_{t+1} = X_t - \eta\frac{\Vert G\Vert_{nuc}}{L} UV^\top. $$
Muon coincides with steepest descent under the spectral norm up to scaling, giving $X_{t+1} = X_t - \eta UV^\top.$
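A minimal numpy sketch of this step (variable names are mine, not from the post): compute the reduced SVD of $G$, verify that $UV^\top$ has unit spectral norm and that $\langle G, UV^\top \rangle_F = \Vert G \Vert_{nuc}$, then take the update.

```python
import numpy as np

rng = np.random.default_rng(2)
G = rng.standard_normal((4, 6))        # gradient matrix
X = rng.standard_normal((4, 6))        # current weights

U, S, Vt = np.linalg.svd(G, full_matrices=False)
W = U @ Vt                             # steepest-descent direction under ||.||_op

assert np.isclose(np.linalg.norm(W, ord=2), 1.0)   # ||U V^T||_op = 1
assert np.isclose(np.sum(G * W), np.sum(S))        # <G, U V^T>_F = ||G||_nuc

eta, L = 0.1, 1.0
X_full = X - eta * (np.sum(S) / L) * W   # full steepest-descent rule
X_muon = X - eta * W                     # Muon drops the nuclear-norm scaling
```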
We would like to show Muon’s advantage over SGD via directional-sharpness theory.
Directional sharpness. Recall that in directional-sharpness theory (see Toward Understanding Why Adam Converges Faster Than SGD for Transformers), we seek to minimize the second-order Taylor expansion.
$$ f(x) + \underbrace{\langle \nabla f(x), \Delta \rangle}_{\text{gradient correlation}} + \frac{1}{2} \underbrace{\Delta^{\top} \nabla^2 f(x) \Delta}_{\text{directional sharpness}} + O(\eta^3). $$
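To make the two terms concrete, here is a small numpy sketch (my own example, assuming a quadratic $f(x) = \frac{1}{2} x^\top A x$, for which the Hessian is $A$ and the second-order expansion is exact):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 5))
A = A @ A.T                      # PSD Hessian of f(x) = 0.5 x^T A x

def f(x):
    return 0.5 * x @ A @ x

x = rng.standard_normal(5)
grad = A @ x
delta = -0.01 * grad / np.linalg.norm(grad)   # small SGD-style step

gradient_correlation = grad @ delta           # <grad f(x), Delta>
directional_sharpness = delta @ A @ delta     # Delta^T Hess f(x) Delta

# For a quadratic, the expansion has no higher-order remainder.
second_order = f(x) + gradient_correlation + 0.5 * directional_sharpness
assert np.isclose(second_order, f(x + delta))
```

A smaller directional sharpness for the same gradient correlation means the surrogate permits a larger decrease, which is the lens through which Muon's update will be compared against SGD's.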