11 Comments

A footnote mentions a code repository linked at the end of the essay, but I can't seem to find one. Can you provide a link?

I'm confused about the first graph (and some of the subsequent ones) - how can 1 - R^2 ever be greater than 1? Are you actually just plotting mean squared error instead?

Author · Apr 10, 2023 (edited)

R^2 can be negative! When it is, that just means the fit is worse than the naive estimator of "always predict the mean of all the outputs in the dataset (ignoring the inputs)" -- and when R^2 is negative, 1 - R^2 is indeed greater than 1. Of course, this would imply that we have a very bad estimator indeed. That's why it only occurs at the beginning of training, where we are essentially measuring the fit of a random function.
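
Here's a quick sketch if you want to see it happen; the data and the "random function" predictor are made up for illustration:

```python
# Toy demonstration that R^2 can go negative, so 1 - R^2 can exceed 1.
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
y_true = rng.normal(size=100)

# Predictions from a "random function", like an untrained network:
y_pred = rng.normal(loc=5.0, size=100)

print(r2_score(y_true, y_pred))      # well below 0
print(1 - r2_score(y_true, y_pred))  # hence greater than 1
```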

Ah, that's right, thanks! I knew this at one point but since I've never actually seen it, I forgot it was technically possible. I should have just done a quick Google search, but at least now I'll probably never forget this!

As you mentioned, one of the issues with deep learning is interpretability. If we're working in predictive modelling or similar scenarios, is there any way to determine what the features learned by deep learning actually are?

For example, in your regression example where you engineered new features, you could determine which of the new features were 'important' or statistically significant (rough sketch of what I mean below). You also mention DL is better than me at _finding_ good features, but is there any way it can tell me what the features it has learned are?
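
To make that concrete, here's a minimal sketch of the kind of inspection I mean; the features and coefficients are invented for illustration:

```python
# With hand-engineered features and a linear model, the learned
# coefficients are directly inspectable. Toy data for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))   # e.g. columns: area, area^2, age
y = 2.0 * X[:, 0] - 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)
print(model.coef_)  # roughly [2.0, 0.0, -0.5]: importance is legible
```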

Author

Unfortunately, no. There are people who have tried to do this, and people who claim they can do this, but in my opinion, nobody has succeeded.

Are you familiar with, and does that also apply to https://www.alignmentforum.org/posts/nmxzr2zsjNtjaHh7x/actually-othello-gpt-has-a-linear-emergent-world ?

Or by "success" do you mean being able to do it in some more automated/reliable way?

After gradient descent "memorizes the training set", why would it then move to a more elegant explanation as described here? It seems this would not improve the loss function over the training set, which is what gradient descent is optimizing.

Author · Mar 20, 2023 (edited)

If the model truly reached 0 train loss, there would be no gradient, and your objection would hold. But what I mean by "total memorization" is something closer to "95% accuracy on the train set", which is a much weaker condition, and still leaves room for further train-set learning. (I'm aware this is very handwavy -- unfortunately, the theory is simply not complete enough to say something more specific with any confidence.)

Also, FWIW, since the loss is usually squared error on real-valued outputs or cross-entropy, a true train loss of 0 is never actually achieved, merely approached, so the parameters are always changing at least somewhat.
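
A toy illustration of that last point; the probability is an invented number:

```python
# Cross-entropy on a confidently correct prediction is small but
# strictly positive, so the gradient never vanishes entirely.
import numpy as np

p_correct = 0.999             # model's probability on the true class
loss = -np.log(p_correct)     # ~0.001: positive, never exactly 0
grad = -1.0 / p_correct       # d(loss)/dp: still nonzero
print(loss, grad)
```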

One explanation is that different optimizers seem to 'prefer' different minima in the loss landscape, even when all of those minima correspond to zero training loss - this has been called "implicit regularization" by some.

Alternatively, if you think of 'memorizes the training set' as zero training error (but with a non-zero penalty term for model complexity), then there may be many models that achieve zero training error, and heavily parameterized models can keep moving among them to further reduce the complexity penalty in the loss function (which may improve generalization). Toy sketch below.
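
A sketch of "many zero-training-error solutions, differing in complexity"; the underdetermined linear setup is invented for illustration:

```python
# In an underdetermined linear problem, many weight vectors fit the
# training data exactly, but they differ in complexity (L2 norm).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))    # 5 examples, 20 parameters
y = rng.normal(size=5)

w_min, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimum-norm interpolator

# Add a null-space direction: still zero training error, larger norm.
_, _, Vt = np.linalg.svd(X)
w_big = w_min + 3.0 * Vt[-1]    # Vt[-1] lies in the null space of X

print(np.allclose(X @ w_min, y), np.linalg.norm(w_min))
print(np.allclose(X @ w_big, y), np.linalg.norm(w_big))
```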
