11 Comments

A footnote mentions a code repository linked at the end of the essay, but I can't seem to find one. Can you provide a link?


I'm confused about the first graph (and some of the subsequent ones) - how can 1 - R^2 ever be greater than 1? Are you actually just plotting mean squared error instead?


R^2 can be negative! When it is, that just means the fit is worse than the naive estimator of "always predict the mean of all the outputs in the dataset (ignoring the inputs)". Of course, this would imply that we have a very bad estimator indeed. That's why it only occurs at the beginning of training, where we are essentially measuring the fit of a random function.
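
For anyone who wants to see it concretely, here is a quick numeric sketch (toy numbers, not from the essay's experiments), using the usual definition R^2 = 1 - SS_res / SS_tot:

```python
import numpy as np

# A predictor that is worse than "always predict the mean" makes SS_res
# larger than SS_tot, so R^2 goes negative and 1 - R^2 exceeds 1.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([4.0, 3.0, 2.0, 1.0])   # anti-correlated: a very bad fit

ss_res = np.sum((y_true - y_pred) ** 2)           # 20
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # 5
r2 = 1 - ss_res / ss_tot

print(r2)      # -3.0
print(1 - r2)  # 4.0 -> greater than 1, like the early-training points
```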


Ah, that's right, thanks! I knew this at one point but since I've never actually seen it, I forgot it was technically possible. I should have just done a quick Google search, but at least now I'll probably never forget this!


As you mentioned, one of the issues with deep learning is interpretability. If working in predictive modelling or similar scenarios, is there any way we can determine what these features learned by deep learning are?

For example, in your regression example where you engineered new features, you could determine which of those new features were 'important' or significant. You also mention DL is better than me at _finding_ good features, but is there any way it can tell me what the features it has learned actually are?


Unfortunately, no. There are people who have tried to do this, and people who claim they can do this, but in my opinion, nobody has succeeded.


Are you familiar with, and does that also apply to https://www.alignmentforum.org/posts/nmxzr2zsjNtjaHh7x/actually-othello-gpt-has-a-linear-emergent-world ?

Or by "success" do you mean being able to do it in some more automated/reliable way?


After gradient descent "memorizes the training set", why would it then move to a more elegant explanation as described here? It seems this would not improve the loss function over the training set, which is what gradient descent is optimizing.


If it were truly the case that it reached 0 train loss, there would be no gradient, and that objection would hold. But what I mean by "total memorization" is something closer to "95% accuracy on the train set", which is a much weaker condition and still leaves room for further train-set learning. (I'm aware this is very handwavy -- unfortunately, the theory is simply not complete enough to say anything more specific with any confidence.)

Also, FWIW, since the loss is usually squared-error between real values or cross-entropy, true 0 train loss is never actually achieved, merely approached, so the parameters are always changing at least somewhat.
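
A tiny sketch of that last point (toy numbers, softmax + cross-entropy on a single example):

```python
import numpy as np

logits = np.array([10.0, -5.0, -5.0])      # model is very confident in class 0
probs = np.exp(logits) / np.exp(logits).sum()

target = 0
loss = -np.log(probs[target])              # ~6e-7: tiny, but still positive

# The gradient w.r.t. the logits is (probs - one_hot): small but nonzero,
# so gradient descent keeps nudging the parameters even this late.
grad = probs - np.eye(3)[target]
print(loss, grad)
```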


One explanation is that different optimizers seem to 'prefer' different minima in the loss landscape, even when all of those minima correspond to zero training loss - this has been called "implicit regularization" by some.

Alternatively, if you think of 'memorizes the training set' as zero training error (but with a non-zero penalty term for model complexity), then there might be many models that achieve zero training error, but more parameterized models will be able to further reduce the complexity penalty in the loss function (which may improve generalization).
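
A minimal sketch of that second point (my own toy numbers, with an explicit L2 penalty standing in for the complexity term):

```python
import numpy as np

# Over-parameterized linear model: one data point, two weights, so every w
# with w[0] + w[1] == 2 already has zero training error. The weight-decay
# term still produces gradients, pulling w toward the minimum-norm solution.
x = np.array([1.0, 1.0])
y = 2.0
lam, lr = 0.01, 0.1

w = np.array([3.0, -1.0])   # fits exactly: w @ x == 2
for _ in range(5000):
    data_grad = (w @ x - y) * x    # zero whenever the point is fit exactly
    penalty_grad = 2 * lam * w     # nonzero: keeps shrinking the weights
    w -= lr * (data_grad + penalty_grad)

print(w, w @ x)   # ends near [0.99, 0.99], close to the minimum-norm [1, 1]
```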
