Bradley, Arwen V., Gomez-Uribe, Carlos A. and Vuyyuru, Manish Reddy (2022) Shift-curvature, SGD, and generalization. Machine Learning: Science and Technology, 3(4), 045002. ISSN 2632-2153
Abstract
A longstanding debate surrounds the related hypotheses that low-curvature minima generalize better and that stochastic gradient descent (SGD) discourages curvature. We offer a more complete and nuanced view in support of both hypotheses. First, we show that curvature harms test performance through two new mechanisms, the shift-curvature and bias-curvature, in addition to a known parameter-covariance mechanism. The shift refers to the difference between train and test local minima, and the bias and covariance are those of the parameter distribution. These three curvature-mediated contributions to test performance are reparametrization-invariant even though curvature itself is not. Although the shift is unknown at training time, the shift-curvature, like the other mechanisms, can still be mitigated by minimizing overall curvature. Second, we derive a new, explicit SGD steady-state distribution showing that SGD optimizes an effective potential related to, but different from, the train loss, and that SGD noise mediates a trade-off between low-loss and low-curvature regions of this effective potential. Third, combining our test-performance analysis with the SGD steady state shows that for small SGD noise, the shift-curvature is the dominant mechanism of the three. Our experiments demonstrate the significant impact of shift-curvature on test loss and further explore the relationship between SGD noise and curvature.
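To make the shift-curvature idea concrete, the sketch below works through the second-order intuition in a toy setting: expanding a quadratic stand-in for the test loss around its own minimum, the extra test loss paid at the (shifted) train minimum is roughly ½·shiftᵀHshift, so the same shift costs far more at a high-curvature minimum. This is only an illustrative quadratic example under invented numbers, not the paper's exact decomposition; the shift vector, Hessians, and names such as `w_star_train` below are assumptions made for the demonstration.

```python
import numpy as np

def test_loss(w, w_star_test, H_test):
    """Quadratic stand-in for the test loss: minimum at w_star_test, Hessian H_test."""
    d = w - w_star_test
    return 0.5 * d @ H_test @ d

# Hypothetical shift between the train and test minima (unknown at training time).
shift = np.array([0.3, -0.2])
w_star_test = np.zeros(2)
w_star_train = w_star_test + shift

# Two test-loss landscapes with the same shift but different curvature at the minimum.
H_flat = np.diag([1.0, 1.0])     # low-curvature (flat) minimum
H_sharp = np.diag([25.0, 25.0])  # high-curvature (sharp) minimum

for name, H in [("flat", H_flat), ("sharp", H_sharp)]:
    exact = test_loss(w_star_train, w_star_test, H)   # test loss actually paid at the train minimum
    shift_curvature = 0.5 * shift @ H @ shift         # second-order term: 0.5 * shift^T H shift
    print(f"{name:5s}: test loss at train minimum = {exact:.4f}, "
          f"shift-curvature term = {shift_curvature:.4f}")
```

Because the toy losses are exactly quadratic, the second-order term equals the exact loss here; the point is that the identical shift costs 25 times more test loss at the sharp minimum, which is why reducing curvature at training time can help even though the shift itself is unknown.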
| Item Type: | Article |
|---|---|
| Subjects: | European Scholar > Multidisciplinary |
| Depositing User: | Managing Editor |
| Date Deposited: | 09 Jul 2023 03:31 |
| Last Modified: | 09 Oct 2023 06:01 |
| URI: | http://article.publish4promo.com/id/eprint/2096 |