Optimizer Comparison: SGD vs Adam

Visualize the performance differences between these popular optimization algorithms

[Interactive demo: hyperparameter controls (0.1, 0.001, 0.9) with charts of Training Loss Over Time and Accuracy Over Time]

Key Differences

Aspect            | SGD with Momentum                  | Adam
Convergence Speed | Slower, steady                     | Fast initial convergence
Final Performance | Often better generalization        | May plateau earlier
Hyperparameters   | Learning rate, momentum            | Learning rate, β₁, β₂, ε
Best Use Cases    | Well-tuned models, final training  | Quick prototyping, sparse data
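
To make the hyperparameter column concrete, here is a minimal sketch, assuming PyTorch and a placeholder linear model, of how each optimizer might be instantiated with the default values shown in this demo:

```python
# Minimal sketch (assumes PyTorch); the model and hyperparameter values are
# illustrative defaults, not a prescribed training setup.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model

# SGD with momentum: two main hyperparameters to tune
sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Adam: adaptive per-parameter learning rates, controlled by betas and eps
adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)
```

In practice the two are usually swapped behind the same training loop, so only the construction line above changes between experiments.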

How They Work

SGD with Momentum

Uses an exponentially weighted average of past gradients to accelerate convergence and dampen oscillations.

vₜ = β·vₜ₋₁ + (1-β)·∇J(θₜ)
θₜ₊₁ = θₜ - α·vₜ
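
A minimal NumPy sketch of this update rule; the quadratic objective, starting point, and step count are illustrative choices, not part of the demo:

```python
# Illustrative sketch of the momentum update v_t = β·v_{t-1} + (1-β)·g_t,
# θ_{t+1} = θ_t - α·v_t; hyperparameter values are assumed defaults.
import numpy as np

def sgd_momentum_step(theta, v, grad, alpha=0.1, beta=0.9):
    v = beta * v + (1 - beta) * grad   # exponentially weighted average of gradients
    theta = theta - alpha * v          # step along the smoothed direction
    return theta, v

# Usage: minimize f(θ) = θ², whose gradient is 2θ
theta, v = np.array([5.0]), np.zeros(1)
for _ in range(100):
    theta, v = sgd_momentum_step(theta, v, 2 * theta)
print(theta)  # approaches 0
```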

Adam (Adaptive Moment Estimation)

Computes adaptive learning rates for each parameter using bias-corrected estimates of the first and second moments of the gradients.

mₜ = β₁·mₜ₋₁ + (1-β₁)·gₜ
vₜ = β₂·vₜ₋₁ + (1-β₂)·gₜ²
m̂ₜ = mₜ/(1-β₁ᵗ),  v̂ₜ = vₜ/(1-β₂ᵗ)
θₜ₊₁ = θₜ - α·m̂ₜ/(√v̂ₜ+ε)
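
A minimal NumPy sketch of the Adam update, including the bias-corrected moments; the quadratic objective and iteration count are again illustrative:

```python
# Illustrative Adam step; hyperparameter defaults follow the values commonly
# used in practice and are assumptions here, not demo output.
import numpy as np

def adam_step(theta, m, v, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad      # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad**2   # second moment (uncentered variance)
    m_hat = m / (1 - beta1**t)              # bias correction for m
    v_hat = v / (1 - beta2**t)              # bias correction for v
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: minimize f(θ) = θ², whose gradient is 2θ (t starts at 1)
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, m, v, 2 * theta, t)
print(theta)  # near 0, on the order of the step size alpha
```

Note how each parameter's effective step is roughly α·sign-like while gradients are large, which is what gives Adam its fast initial convergence in the comparison above.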