Grokking of Diffusion Models: Case Study on Modular Addition
arXiv:2604.17673v1
Abstract: Despite their empirical success, how diffusion models generalize remains poorly understood from a mechanistic perspective. We demonstrate that diffusion models trained with flow-matching objectives exhibit grokking: delayed generalization after overfit