I was curious how well the simple momentum step-size approach shown in the first interactive example compares to alternative methods. The function featured in that example is named bananaf (a scaled Rosenbrock "banana" function), defined as
var s = 3                                                // scaling applied to the y coordinate
var x = xy[0]; var y = xy[1]*s
var fx = (1-x)*(1-x) + 20*(y - x*x)*(y - x*x)            // f = (1-x)^2 + 20(y - x^2)^2
var dfx = [-2*(1-x) - 80*x*(-x*x + y), s*40*(-x*x + y)]  // analytic gradient [df/dx, df/dy]
The interactive example uses an initial guess of [-1.21, 0.853] and a fixed 150 iterations, with no convergence test.
From manually fiddling with the (step-size) alpha and (momentum) beta parameters, and editing the code to specify a smaller number of iterations, it seems quite difficult to tune this momentum-based approach so that it gets near the minimum and stays there, without bouncing away, in 50 iterations or fewer.
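For reference, here is the kind of fixed-step momentum loop being tuned -- a minimal Python sketch; the alpha/beta values shown are just placeholders to fiddle with, and the gradient is transcribed from the snippet above:

import numpy as np

S = 3.0  # same y-scaling as the interactive example

def dfx(xy):
    # gradient of (1 - x)^2 + 20*(y - x^2)^2 with y = xy[1]*S
    x, y = xy[0], xy[1] * S
    return np.array([-2*(1 - x) - 80*x*(y - x**2), S * 40 * (y - x**2)])

alpha, beta = 0.003, 0.7         # step size and momentum: the knobs to tune
w = np.array([-1.21, 0.853])     # initial guess from the example
z = np.zeros(2)
for _ in range(150):             # fixed iteration budget, no convergence test
    z = beta * z + dfx(w)        # geometrically-decaying sum of past gradients
    w = w - alpha * z
print(w)                         # the minimiser is at [1.0, 1/3]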
Out of curiosity, I compared minimising this bananaf function with scipy.optimize.minimize, using the same initial guess.
If we force scipy.optimize.minimize to use method='cg', leaving all other parameters at their defaults, it converges to the optimal solution of [1.0, 1./3.], requiring 43 evaluations of fx and dfx.
If we allow scipy.optimize.minimize to use all defaults -- including the default method='bfgs' -- it converges to the optimal solution after only 34 evaluations of fx and dfx.
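For anyone who wants to reproduce this, a minimal sketch of the comparison (assuming my transcription of bananaf above is faithful):

import numpy as np
from scipy.optimize import minimize

S = 3.0

def bananaf(xy):
    x, y = xy[0], xy[1] * S
    return (1 - x)**2 + 20 * (y - x**2)**2

def bananaf_grad(xy):
    x, y = xy[0], xy[1] * S
    return np.array([-2*(1 - x) - 80*x*(y - x**2), S * 40 * (y - x**2)])

x0 = np.array([-1.21, 0.853])
for method in ("CG", "BFGS"):
    res = minimize(bananaf, x0, jac=bananaf_grad, method=method)
    print(method, res.x, res.nfev, res.njev)  # nfev/njev count evaluations of fx and dfx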
Under the hood, scipy's method='cg' and method='bfgs' solvers do not use a fixed step size or momentum to determine the step size; instead they solve a line search problem: identify a step size that satisfies a sufficient decrease condition and a curvature condition (see the Wolfe conditions [1]). Scipy's default line search method -- used for both cg and bfgs -- is a Python port [2] of the dcsrch routine from MINPACK2. A good reference covering line search methods and BFGS is Nocedal & Wright's 2006 book Numerical Optimization.

[1] https://en.wikipedia.org/wiki/Wolfe_conditions
[2] https://github.com/scipy/scipy/blob/main/scipy/optimize/_dcs...
Now try the same experiment in 1 billion dimensions.
shoo 4 hours ago [-]
It's unclear whether increasing the dimensionality is in itself a challenge, provided that the objective function is still convex with a unique global minimum -- like these somewhat problematic Rosenbrock test objective functions used in the article's examples.
On the other hand, if the objective function is very multimodal with many "red herring" local minima, perhaps an optimiser that is very good at finding a local minimum might do worse in practice at globally optimising than an optimiser that sometimes "barrels" out of the basin of a local minimum and accidentally falls into a neighbouring basin around a lower minimum.
I ran a few numerical experiments using scipy's "rosen" test function [1] as the objective, in D = 10,000 dimensions. This function has a unique global minimum of 0, attained at x* = 1_D. I set the initial guess as x0 := x* + eps, where each element eps_i, i = 1,...,D, is noise sampled from N(0, 0.05).
Repeating this over 100 trial problems, using the same initial guess x0 for every method within each trial, the average number of gradient evaluations required for convergence was:
'cg': 248
'l-bfgs-b': 40
'm-001-99': 3337
All methods converged in 100 / 100 trials.
m-001-99 is gradient descent with momentum, using alpha=0.001 and beta=0.99. Setting alpha=0.002 or higher causes momentum to fail to converge. The other two methods are scipy's cg and l-bfgs-b with default parameters (again, under the hood both rely on a port of MINPACK2's dcsrch to determine the step size along the descent direction at each iteration; they do not use momentum updates or a fixed step size). I used l-bfgs-b instead of bfgs to avoid maintaining the dense D*D matrix for the approximate inverse Hessian.
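A sketch of how this benchmark can be set up -- the stopping tolerance, and reading N(0, 0.05) as a standard deviation, are my own assumptions:

import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

D = 10_000
x_star = np.ones(D)                     # rosen's unique global minimiser
rng = np.random.default_rng(0)
x0 = x_star + rng.normal(0.0, 0.05, D)  # perturbed start; 0.05 taken as a std-dev

# scipy's line-search-based methods with default parameters
for method in ("CG", "L-BFGS-B"):
    res = minimize(rosen, x0, jac=rosen_der, method=method)
    print(method, res.success, res.njev)  # njev = number of gradient evaluations

# fixed-step gradient descent with momentum ("m-001-99")
alpha, beta = 0.001, 0.99
w, z = x0.copy(), np.zeros(D)
njev = 0
while njev < 50_000:
    g = rosen_der(w)
    njev += 1
    if np.linalg.norm(g) < 1e-5:        # illustrative stopping rule
        break
    z = beta * z + g
    w = w - alpha * z
print("momentum", njev, "gradient evaluations")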
One point in momentum's favour was robustness to higher noise levels in the initial guess. If the noise used to define x0 is increased to N(0, 1), the cg and l-bfgs-b methods fail to converge in around 20% of trial problems, while momentum fails a smaller fraction of the time, provided the fixed step size is set small enough -- though it still requires a very large number of gradient evaluations to converge.

[1] https://docs.scipy.org/doc/scipy/reference/generated/scipy.o...
Distill.pub consistently has such high-quality content. It's a shame they don't seem to be active anymore.
imvg 15 hours ago [-]
I agree. Just the use of animations for explanations was a huge step forward. I wonder why the flagship ML/AI conferences have not adopted the Distill digital template for papers yet. I think that would be the first step; the quality would follow.
gwern 14 hours ago [-]
The quality would not follow because Distill.pub publications take literally hundreds of man-hours for the Distill part. Highly skilled man-hours too, to make any of this work on the Web reliably. (Source: I once asked Olah basically the same question: "How many man-hours of work by Google-tier web developers could those visualizations possibly be, Michael? 10?")
michael_nielsen 13 hours ago [-]
I've been wondering at what point AI assistants are going to reduce that to a manageable level. It's unfortunately not obvious what the main bottlenecks are, though Chris and Shan might have a good sense.
gwern 11 hours ago [-]
It might be doable soon, you're right. But there seems to be a substantial weakness in vision-language models, where they have a bad time with anything involving screenshots, tables, schematics, or visualizations, compared to real-world photographs. (This is also, I'd guess, partially why Claude/Gemini do so badly on Pokemon screenshots without a lot of hand-engineering. Abstract pixel art in a structured UI may be a sort of worst-case scenario for whatever it is they do.) So that makes it hard to do any kind of feedback, never mind letting them try to code interactive visualization stuff autonomously.
muro 9 hours ago [-]
Only skimmed through the article for now, but have to give props to the author - it's beautifully made.
Why Momentum Works - https://news.ycombinator.com/item?id=14034426 - April 2017 (95 comments)