TL;DR. In scale-invariant settings, what matters is directional change. To this end, we define directional feature learning, which requires updates to have a non-trivial component orthogonal to the current features. We then derive a necessary condition for directional feature learning. This condition directly implies the Manifold Muon update in https://thinkingmachines.ai/blog/modular-manifolds/.

Opening

Inspired by blogs from Jianlin Su and others, I decided to start writing my own notes. This blog is a place to organize some thoughts that I find interesting but not yet fully polished.

In this post, I try to understand a simple question:

What does it really mean for a network to learn features, especially under scale invariance?

Rather than starting from a fully formal theory, I take a bottom-up approach: start from simple setups, identify the right quantities to track, and see what conditions naturally emerge.

Preliminaries

Basic Setup. We consider a deep linear network at initialization and update it using a batch of inputs $\boldsymbol{X} \in \R^{B\times n_0}$:

$$ \boldsymbol{h}_\ell (\boldsymbol{X}) = \boldsymbol{W}_\ell \boldsymbol{h}_{\ell-1}(\boldsymbol{X}), \quad \forall \ell \in \{2, \dots, L\}; \quad \boldsymbol{h}_1(\boldsymbol{X}) = \boldsymbol{W}_1\boldsymbol{X}, $$

where $\boldsymbol{h}_{\ell}(\boldsymbol{X})\in \R^{B \times n_\ell}$ and $\boldsymbol{W}_{\ell} \in \R^{n_\ell \times n_{\ell-1}}$.

For a vector $\boldsymbol{x}$, we use $\Vert \boldsymbol{x}\Vert_2$ to denote its $\ell_2$-norm. For a matrix $\boldsymbol{A}$, we use $\Vert \boldsymbol{A} \Vert_*$ and $\Vert \boldsymbol{A} \Vert_F$ to denote its spectral norm and Frobenius norm respectively.

Feature Learning. Recall the original definition of feature learning.

<aside> 💡

Desideratum 1 (Feature Learning). We desire that

$$ \Vert \boldsymbol{h}_\ell \Vert_2 = \Theta(\sqrt{n_\ell}) \quad\text{ and }\quad \Vert \Delta \boldsymbol{h}_{\ell} \Vert_2 = \Theta(\sqrt{n_\ell}), \quad \forall \ell \in \{1, \dots, L-1\}. $$

</aside>

Both the activations and their changes should maintain $\Theta(1)$ scale on each entry, regardless of the width scaling.
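As a quick sanity check (my own NumPy experiment, not from the post): a vector whose entries are each $\Theta(1)$, e.g. iid standard Gaussian, has $\ell_2$-norm concentrating around $\sqrt{n}$, so the two statements of the desideratum are two views of the same scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# A feature vector with Theta(1) entries has l2 norm Theta(sqrt(n)):
# E[||h||_2^2] = n * E[h_i^2], so ||h||_2 / sqrt(n) stays O(1) as n grows.
for n in [256, 1024, 4096]:
    h = rng.standard_normal(n)                 # each entry is Theta(1)
    print(n, np.linalg.norm(h) / np.sqrt(n))   # ratio stays close to 1
```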

Spectral Scaling. A sufficient condition to meet the feature learning desideratum under GD is the spectral scaling condition.

<aside> 💡

Condition 1 (Spectral Scaling). Consider applying a gradient update $\Delta \boldsymbol{W}_{\ell}$ to $\boldsymbol{W}_{\ell}$. The spectral norms of these matrices should satisfy:

$$ \Vert \boldsymbol{W}_{\ell} \Vert_* = \Theta\left(\sqrt{\frac{n_{\ell}}{n_{\ell-1}}}\right) \quad\text{ and }\quad \Vert \Delta \boldsymbol{W}_\ell \Vert_* = \Theta\left(\sqrt{\frac{n_\ell}{n_{\ell-1}}}\right), \quad \forall \ell \in \{1, \dots, L\}. $$

</aside>
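A hedged numerical sketch of the first part of this condition (my illustration, not from the post): for iid Gaussian entries with variance $1/n_{\ell-1}$ and widths of the same order, the spectral norm is $\approx (\sqrt{n_\ell} + \sqrt{n_{\ell-1}})/\sqrt{n_{\ell-1}}$, hence $\Theta(\sqrt{n_\ell/n_{\ell-1}})$. The widths below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Gaussian init with entry variance 1/n_in: for a random iid matrix the
# largest singular value concentrates near (sqrt(n_out) + sqrt(n_in)) * sigma,
# which with sigma = 1/sqrt(n_in) is a constant multiple of sqrt(n_out/n_in).
n_out, n_in = 4096, 1024
W = rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)
spec = np.linalg.norm(W, 2)          # spectral norm = largest singular value
target = np.sqrt(n_out / n_in)
print(spec / target)                 # O(1) ratio; here roughly 1.5
```

Note the constant depends on the aspect ratio $n_\ell/n_{\ell-1}$; the condition only pins down the width scaling, not the constant.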

Directional Feature Learning

Directional Feature Learning. In scale-invariant networks, the direction of a feature is more important than its magnitude. This motivates requiring the normalized feature $\bar{\boldsymbol h}_\ell := \boldsymbol h_\ell / \Vert\boldsymbol h_\ell\Vert_2$ to evolve by an order-one amount during training. Formally,

<aside> 💡

Desideratum 2 (Directional Feature Learning). We desire that

$$ \Vert \Delta \bar{\boldsymbol h}_\ell \Vert_2 = \Theta(1) \quad\text{ where }\quad \Delta \bar{\boldsymbol h}_\ell := \frac{\boldsymbol h_\ell + \Delta \boldsymbol{h}_\ell}{\Vert \boldsymbol h_\ell + \Delta \boldsymbol{h}_\ell \Vert_2} - \frac{\boldsymbol h_\ell }{\Vert \boldsymbol{h}_\ell \Vert_2}. $$

</aside>
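To make the desideratum concrete, here is a small NumPy sketch (my own illustration, not code from the post): a purely radial update $\Delta\boldsymbol h_\ell \propto \boldsymbol h_\ell$ rescales the feature but leaves $\bar{\boldsymbol h}_\ell$ fixed, so it achieves no directional feature learning; an update with an order-one orthogonal component moves the direction by $\Theta(1)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024
h = rng.standard_normal(n)

def directional_change(h, dh):
    """||Delta h_bar||_2: change of the normalized feature h / ||h||_2."""
    new = (h + dh) / np.linalg.norm(h + dh)
    return np.linalg.norm(new - h / np.linalg.norm(h))

# A purely radial update rescales h but leaves its direction untouched:
print(directional_change(h, 0.5 * h))      # numerically ~0

# An update of the same scale as h, but orthogonal to it, moves the direction:
g = rng.standard_normal(n)
g_perp = g - (g @ h) / (h @ h) * h         # component of g orthogonal to h
dh = g_perp / np.linalg.norm(g_perp) * np.linalg.norm(h)
print(directional_change(h, dh))           # Theta(1); equals sqrt(2 - sqrt(2))
```

This is exactly why a non-trivial orthogonal component in the update is the quantity to track: the radial component of $\Delta\boldsymbol h_\ell$ is invisible to $\bar{\boldsymbol h}_\ell$.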

Given that,