The table below summarizes the practical differences between these two popular optimization algorithms.
| Aspect | SGD with Momentum | Adam |
|---|---|---|
| Convergence Speed | Slower, steady | Fast initial convergence |
| Final Performance | Often better generalization | May plateau earlier |
| Hyperparameters | Learning rate (α), momentum coefficient (β) | Learning rate (α), β₁, β₂, ε |
| Best Use Cases | Well-tuned models, final training | Quick prototyping, sparse data |
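As a concrete illustration of the hyperparameters listed above, here is a minimal configuration sketch assuming PyTorch (the framework, the tiny `torch.nn.Linear` model, and the specific values are illustrative choices, not from the original; the values shown are common defaults):

```python
import torch

model = torch.nn.Linear(10, 1)  # hypothetical small model, just to have parameters to optimize

# SGD with momentum: two main hyperparameters, the learning rate and the momentum coefficient
sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adam: learning rate plus beta1, beta2, and epsilon
adam = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)
```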
SGD with Momentum uses an exponentially weighted average of past gradients to accelerate convergence and reduce oscillations.
vₜ = β·vₜ₋₁ + (1-β)·∇J(θₜ)
θₜ₊₁ = θₜ - α·vₜ
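A minimal NumPy sketch of the momentum update above; the function name `sgd_momentum_step` and the toy quadratic loss are illustrative assumptions, not part of the original:

```python
import numpy as np

def sgd_momentum_step(theta, v, grad, lr=0.1, beta=0.9):
    """One SGD-with-momentum step:
    v_t = beta * v_{t-1} + (1 - beta) * grad,  theta_{t+1} = theta_t - lr * v_t."""
    v = beta * v + (1 - beta) * grad   # exponentially weighted average of past gradients
    theta = theta - lr * v             # move along the smoothed gradient direction
    return theta, v

# Usage on a toy quadratic loss J(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.array([5.0, -3.0])
v = np.zeros_like(theta)
for _ in range(100):
    grad = theta                       # gradient of the toy loss
    theta, v = sgd_momentum_step(theta, v, grad)
print(theta)                           # approaches the minimum at the origin
```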
Adam (Adaptive Moment Estimation) computes an adaptive learning rate for each parameter using estimates of the first and second moments of the gradients.
mₜ = β₁·mₜ₋₁ + (1-β₁)·gₜ
vₜ = β₂·vₜ₋₁ + (1-β₂)·gₜ²
m̂ₜ = mₜ/(1-β₁ᵗ),  v̂ₜ = vₜ/(1-β₂ᵗ)
θₜ₊₁ = θₜ - α·m̂ₜ/(√v̂ₜ + ε)

where m̂ₜ and v̂ₜ are the bias-corrected moment estimates, which compensate for the moments being initialized at zero.
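Likewise, a minimal NumPy sketch of the Adam update, including the bias-correction step; the name `adam_step` and the toy loss are again illustrative assumptions:

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step following the formulas above; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad**2     # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1**t)                # bias-corrected first moment
    v_hat = v / (1 - beta2**t)                # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive update
    return theta, m, v

# Usage on the same toy quadratic loss J(theta) = 0.5 * ||theta||^2.
theta = np.array([5.0, -3.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 1001):
    grad = theta
    theta, m, v = adam_step(theta, m, v, grad, t, lr=0.05)
print(theta)                                  # approaches the minimum at the origin
```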