It argues that Adam's second moment actually causes word representations to become narrow and directional (anisotropic).
It shows that Adam minimizes a specific form of sharpness (specifically, the trace of the square root of the Hessian), which is fundamentally different from how SGD behaves.

4. Better Embeddings with Coupled Adam

Splitting Adam
This paper effectively "splits" the Adam algorithm into two distinct components to study them:
It isolates the stochastic direction (the sign of the gradient) from the adaptive step size (the relative variance).
It proposes Coupled Adam to fix this specific side effect.
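The sign/variance decomposition above can be illustrated numerically: per coordinate, Adam's step direction m/sqrt(v) factors into a sign and a magnitude that depends only on the relative variance of the gradient. A minimal sketch of that identity (variable names are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate noisy stochastic gradients for a single parameter:
# observed gradients g_t = g_true + noise.
g_true = 0.5
noise_scale = 2.0
grads = g_true + noise_scale * rng.standard_normal(10_000)

# Adam's moment estimates in the long-averaging-window limit:
m = grads.mean()          # first moment  ~ E[g]
v = (grads ** 2).mean()   # second moment ~ E[g^2]

# The Adam step direction factors into sign and magnitude:
step = m / np.sqrt(v)
sign_part = np.sign(m)

# Relative variance eta^2 = Var[g] / E[g]^2; the step magnitude is
# 1 / sqrt(1 + eta^2): the noisier the gradient, the smaller the step.
eta2 = grads.var() / m ** 2
magnitude = 1.0 / np.sqrt(1.0 + eta2)

assert np.isclose(abs(step), magnitude)
```

The identity holds exactly because E[g^2] = Var[g] + E[g]^2, so the adaptive step size is purely a function of how noisy the gradient is relative to its mean.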
Published in 2025, this paper thereby "splits" the problem of anisotropy in LLM embeddings.
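The proposed fix can be sketched as follows, under the assumption (suggested by the name, not a reference implementation) that Coupled Adam replaces the per-embedding second-moment estimates with a single estimate shared across the vocabulary; the function and variable names are hypothetical, and bias correction is omitted for brevity:

```python
import numpy as np

def coupled_adam_step(emb, m, v_shared, grad, lr=1e-3,
                      beta1=0.9, beta2=0.999, eps=1e-8):
    """One hypothetical Coupled Adam update for an embedding matrix.

    emb:      (vocab, dim) embedding matrix
    m:        (vocab, dim) first-moment estimate (as in standard Adam)
    v_shared: (dim,) second-moment estimate shared across the vocabulary
    grad:     (vocab, dim) gradient of the loss w.r.t. emb
    """
    m = beta1 * m + (1 - beta1) * grad
    # Coupling: average the squared gradients over the vocabulary axis,
    # so every embedding vector sees the same adaptive step size.
    v_shared = beta2 * v_shared + (1 - beta2) * (grad ** 2).mean(axis=0)
    emb = emb - lr * m / (np.sqrt(v_shared) + eps)
    return emb, m, v_shared
```

With a shared second moment, rare tokens no longer receive disproportionately large steps from their own tiny second-moment estimates, which is one plausible way such a coupling could counteract the anisotropy the paper describes.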