TL;DR. In scale-invariant settings, what matters is directional change. To this end, we define directional feature learning, which requires updates to have a non-trivial component orthogonal to the current features. We also derive a necessary condition for directional feature learning. This condition directly implies the Manifold Muon update in https://thinkingmachines.ai/blog/modular-manifolds/.
Inspired by blogs from Jianlin Su and others, I decided to start writing my own notes. This blog is a place to organize some thoughts that I find interesting but not yet fully polished.
In this post, I try to understand a simple question:
What does it really mean for a network to learn features, especially under scale invariance?
Rather than starting from a fully formal theory, I take a bottom-up approach: start from simple setups, identify the right quantities to track, and see what conditions naturally emerge.
Basic Setup. We consider a deep linear network at initialization and update using a batch of inputs $\boldsymbol{X} \in \R^{n_0 \times B}$ (inputs as columns, so the shapes below compose):
$$ \boldsymbol{h}_\ell(\boldsymbol{X}) = \boldsymbol{W}_\ell \boldsymbol{h}_{\ell-1}(\boldsymbol{X}), \quad \forall \ell \in \{2, \dots, L\}; \quad \boldsymbol{h}_1(\boldsymbol{X}) = \boldsymbol{W}_1\boldsymbol{X}, $$
where $\boldsymbol{h}_{\ell}(\boldsymbol{X})\in \R^{n_\ell \times B}$ and $\boldsymbol{W}_{\ell} \in \R^{n_\ell \times n_{\ell-1}}$.
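As a concrete sketch of this forward pass, here is a minimal numpy version with hypothetical widths and batch size (the specific numbers are illustrative, not from the setup above); the weights use a standard $1/\sqrt{n_{\ell-1}}$ Gaussian initialization:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical widths n_0..n_3 and batch size B (illustrative choices).
B, widths = 8, [64, 256, 256, 256]

X = rng.standard_normal((widths[0], B))  # inputs as columns: shape (n_0, B)
Ws = [rng.standard_normal((widths[l], widths[l - 1])) / np.sqrt(widths[l - 1])
      for l in range(1, len(widths))]    # W_l has shape (n_l, n_{l-1})

h = X
for W in Ws:
    h = W @ h  # h_l = W_l h_{l-1}, shape (n_l, B)
print(h.shape)  # (256, 8)
```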
For a vector $\boldsymbol{x}$, we use $\Vert \boldsymbol{x}\Vert_2$ to denote its $\ell_2$-norm. For a matrix $\boldsymbol{A}$, we use $\Vert \boldsymbol{A} \Vert_*$ and $\Vert \boldsymbol{A} \Vert_F$ to denote its spectral norm and Frobenius norm respectively.
Feature Learning. Recall the original definition of feature learning.
<aside> 💡
Desideratum 1 (Feature Learning). We desire that
$$ \Vert \boldsymbol{h}_\ell \Vert_2 = \Theta(\sqrt{n_\ell}) \quad\text{ and }\quad \Vert \Delta \boldsymbol{h}_{\ell} \Vert_2 = \Theta(\sqrt{n_\ell}), \quad \forall \ell \in \{1, \dots, L-1\}. $$
</aside>
Both the activations and their changes should maintain $\Theta(1)$ scale in each entry, regardless of the width.
Spectral Scaling. A sufficient condition to meet the feature learning desideratum under GD is the spectral scaling condition.
<aside> 💡
Condition 1 (Spectral Scaling). Consider applying a gradient update $\Delta \boldsymbol{W}_{\ell}$ to $\boldsymbol{W}_{\ell}$. The spectral norms of these matrices should satisfy:
$$ \Vert \boldsymbol{W}_{\ell} \Vert_* = \Theta\left(\sqrt{\frac{n_{\ell}}{n_{\ell-1}}}\right) \quad\text{ and }\quad \Vert \Delta \boldsymbol{W}_\ell \Vert_* = \Theta\left(\sqrt{\frac{n_\ell}{n_{\ell-1}}}\right), \quad \forall \ell \in \{1, \dots, L\}. $$
</aside>
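A quick numerical sanity check of these scales: with the standard $1/\sqrt{n_{\ell-1}}$ Gaussian initialization (which satisfies the spectral scaling up to constants when widths are comparable), a single layer applied to a $\Theta(\sqrt{n_{\ell-1}})$-norm input produces a $\Theta(\sqrt{n_\ell})$-norm output, i.e. $\Theta(1)$ per entry across widths. This is an illustrative one-layer experiment, not part of the post:

```python
import numpy as np

rng = np.random.default_rng(0)

# The normalized feature norm ||h||_2 / sqrt(n) should stay near 1 as n grows.
for n in [256, 1024, 4096]:
    W = rng.standard_normal((n, n)) / np.sqrt(n)  # ||W||_* = Theta(sqrt(n/n)) = Theta(1)
    x = rng.standard_normal(n)                    # ||x||_2 = Theta(sqrt(n))
    h = W @ x
    print(n, np.linalg.norm(h) / np.sqrt(n))      # approximately 1, independent of width
```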
Directional Feature Learning. In scale-invariant networks, the direction of a feature is more important than its magnitude. This motivates requiring the normalized feature $\bar{\boldsymbol h}_\ell := \boldsymbol h_\ell / \Vert\boldsymbol h_\ell\Vert_2$ to evolve by an order-one amount during training. Formally,
<aside> 💡
Desideratum 2 (Directional Feature Learning). We desire that
$$ \Vert \Delta \bar{\boldsymbol h}_\ell \Vert_2 = \Theta(1) \quad\text{ where }\quad \Delta \bar{\boldsymbol h}_\ell := \frac{\boldsymbol h_\ell + \Delta \boldsymbol{h}_\ell}{\Vert \boldsymbol h_\ell + \Delta \boldsymbol{h}_\ell \Vert_2} - \frac{\boldsymbol h_\ell}{\Vert \boldsymbol{h}_\ell \Vert_2}. $$
</aside>
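The definition can be checked numerically: a purely radial update (rescaling $\boldsymbol h_\ell$) produces no directional change, while an update whose component orthogonal to $\boldsymbol h_\ell$ has $\Theta(1)$ relative size produces $\Vert \Delta \bar{\boldsymbol h}_\ell \Vert_2 = \Theta(1)$. A minimal sketch for a single feature vector, with an arbitrary width and relative update size of $0.5$ (both hypothetical):

```python
import numpy as np

def directional_change(h, dh):
    """Delta h_bar = (h + dh)/||h + dh|| - h/||h|| for one feature vector."""
    new = h + dh
    return new / np.linalg.norm(new) - h / np.linalg.norm(h)

rng = np.random.default_rng(0)
n = 4096
h = rng.standard_normal(n)

# A purely radial update changes no direction (zero up to float error)...
print(np.linalg.norm(directional_change(h, 0.5 * h)))

# ...while an orthogonal component of relative size 0.5 gives an order-one change.
u = rng.standard_normal(n)
u -= (u @ h) / (h @ h) * h                         # project out the h-direction
dh = 0.5 * np.linalg.norm(h) * u / np.linalg.norm(u)
print(np.linalg.norm(directional_change(h, dh)))   # Theta(1), here about 0.46
```

Note that an update satisfying the spectral scaling condition alone need not produce such an orthogonal component; that gap is what the directional desideratum adds.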
Given that,