You're confusing naive Bayes with the entirety of Bayesian statistics here. In the example given, you are correct that you cannot apply naive Bayes. But you can still have a Bayesian regression model and apply Bayes' rule; you're just assuming a model. There's nothing special about deep learning here: it just assumes a very complex and flexible model, but you can frame it as a regression anyway.

No, you've missed my point. There *is* something special about deep learning. It is a technique for making predictions about individuals, a regime where Bayesian updates are vacuous.

And my point is that this is only true of Bayesian updates in naive Bayes; you can make Bayesian individual-level predictions if you use other schemes.

The interesting bits of deep learning cannot be understood as Bayesian updates: https://jacobbuckman.com/2020-01-22-bayesian-neural-networks-need-not-concentrate/

If you have some specific other Bayesian method in mind for handling these cases, please by all means share it.

I don't know enough about deep learning to comment on that specifically. However, any type of regularisation should in theory be expressible as a prior. I'm not sure what the standard practice is for dealing with the non-identifiability you discuss in the linked post, though; that is, in the non-Bayesian setting, how do you decide which of your functions giving identical losses or likelihoods is optimal?

Your contention that uninformative priors are standard practice is pretty disputed, and in practice depends on application. Many practitioners would argue that you should fully encode your actual prior beliefs (eg: from previous studies) within the prior. This is especially true when you have non-identifiability in the model, like in the example you linked, although the lack of interpretability of the parameters within deep learning might hamper this.

Coming back to the current post: while the above discussion is interesting, I don't see how it's relevant to my previous comments. I am contending that it is perfectly possible to come up with individual-level predictions within a Bayesian framework; I am not making any claim regarding deep learning. The most generic answer here is some kind of regression model. Multilevel modelling (also known as partial pooling or hierarchical modelling) can be used to take care of individual-level covariates if needed, which allows predicting at previously unseen values of those covariates.

You can always apply Bayes' rule backwards, so yes, any learning rule "should theoretically be able to be reformulated as a prior" in the vacuous sense that there is an implicit P(f) that we can compute using P(f)=P(f|d)P(d)/P(d|f). But that doesn't mean we are doing Bayesian reasoning.
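The inversion above can be checked numerically. This is a toy sketch with a made-up discrete joint over three candidate functions and two possible datasets; the point is only that the implicit P(f) always exists, which says nothing about whether it encodes any real prior belief.

```python
import numpy as np

# Toy discrete setting: 3 candidate functions f, 2 possible datasets d.
# Start from an arbitrary (made-up) joint P(f, d); everything else is derived.
joint = np.array([[0.10, 0.15],   # P(f0, d0), P(f0, d1)
                  [0.30, 0.05],
                  [0.20, 0.20]])

p_d = joint.sum(axis=0)            # P(d)
p_f = joint.sum(axis=1)            # the "implicit prior" P(f)
post = joint / p_d                 # P(f|d): columns sum to 1
lik = joint / p_f[:, None]         # P(d|f): rows sum to 1

# Bayes' rule run backwards: P(f) = P(f|d) * P(d) / P(d|f), for any d.
for d in range(2):
    recovered = post[:, d] * p_d[d] / lik[:, d]
    assert np.allclose(recovered, p_f)

print("implicit prior:", p_f)      # exists, but tells us nothing about *why*
```

The identity holds for any dataset column, which is exactly the vacuousness being pointed at: every learning rule induces *some* P(f) this way.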

If you have a dataset which includes some datapoints <A=1, B=0> and other datapoints <A=0, B=1>, there is no principled, data-driven way to use Bayes to make predictions about <A=1, B=1>. The Bayesian solutions come down either to hand-selecting the prior or to hand-engineering the features, e.g. hard-coding "if A=1 and B=1, ignore B". (The latter is in essence what hierarchical modeling would do.)
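A minimal sketch of the point, with hypothetical labels chosen for illustration: two hypotheses fit the observed <A=1,B=0> and <A=0,B=1> points equally well, so the data cannot move the posterior between them, and the prediction at the unseen <A=1,B=1> is purely whatever the prior said.

```python
import numpy as np

# Training data: only <A=1,B=0> and <A=0,B=1> combinations are observed.
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
y = np.array([1, 1, 0, 0])

# Two deterministic hypotheses, both consistent with every training point:
h_use_A = lambda a, b: a          # "the label is A"
h_use_B = lambda a, b: 1 - b      # "the label is NOT B"

for h in (h_use_A, h_use_B):
    assert all(h(a, b) == t for (a, b), t in zip(X, y))  # identical likelihood

# Equal likelihoods mean posterior = prior; so the prediction at the unseen
# point <A=1,B=1> is determined entirely by the hand-selected prior weights.
prior = {"h_use_A": 0.5, "h_use_B": 0.5}   # hand-selected!
p_y1 = prior["h_use_A"] * h_use_A(1, 1) + prior["h_use_B"] * h_use_B(1, 1)
print(p_y1)   # pure prior, zero information from the data
```

Shifting the prior weights shifts the prediction at <A=1,B=1> one-for-one, which is the "hand-selecting the prior" escape hatch described above.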

The problem with including hand-anything in your model is its reliance on human knowledge instead of learning from data. It's not a learning algorithm, and the bitter lesson is that it always loses to learning from data. http://www.incompleteideas.net/IncIdeas/BitterLesson.html

But isn't there a lot of arbitrariness in deep learning models?

We could train different models with different architectures and different hyper-parameters, and get different results, right? Why is this better than the "ad-hoc-ness" of the Bayesian approach?

You get slightly different results, sure -- if you are trying to eke out the last 0.5% of accuracy to win a benchmark competition, maybe you think this matters. But these differences are largely irrelevant to the big picture. Almost any good model will learn more-or-less the same function from the same data. (By "good" I am referring to the fact that there are lots of ways to fuck things up completely, and we need to make sure we aren't doing any of those.)

In fact, the functions learned are *so* similar that you can actually conduct black-box adversarial attacks on trained neural networks! Meaning: if you have a trained model for e.g. telling cats from dogs, and I go and train my own model on the same dataset (with no knowledge of the arbitrary decisions you made), and then search for adversarial examples on my model, the images I find will often be adversarial examples for your model, too. This even sort-of-works if I also *construct my own dataset of cats and dogs*! Isn't that wild? The more I make sure my model is identical to yours, the more reliably adversarial examples will transfer, but it's really surprising the extent to which you can mess with things and still get decent transfer. (Ctrl-f "transferability" in https://arxiv.org/pdf/1912.01667.pdf for citations to various papers that study this.)
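Here is a deliberately tiny caricature of the transfer phenomenon, using logistic regression in place of neural networks (everything here -- data, seeds, step sizes -- is made up for illustration). Two models are trained with different random initializations; an FGSM-style perturbation crafted against one still hurts the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data (a stand-in for a cats-vs-dogs dataset).
X = np.vstack([rng.normal(-1, 1, (200, 2)), rng.normal(+1, 1, (200, 2))])
y = np.concatenate([np.zeros(200), np.ones(200)])

def train(seed, steps=500, lr=0.1):
    """Logistic regression via gradient descent; `seed` controls the init,
    standing in for the arbitrary choices that differ between the models."""
    w = np.random.default_rng(seed).normal(size=2)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

w_yours, w_mine = train(1), train(2)      # two independently trained models

# FGSM-style perturbation computed ONLY from *my* model's weights...
x = np.array([1.5, 1.5])                  # a clearly class-1 input
x_adv = x - 0.5 * np.sign(w_mine)         # step against my model's gradient

# ...nevertheless lowers *your* model's class-1 logit: the attack transfers.
margin = lambda w, v: v @ w
assert margin(w_yours, x_adv) < margin(w_yours, x)
```

Linear models make the effect trivial (both learn nearly the same weight direction); the surprising empirical fact in the cited survey is that it persists for deep nonlinear networks.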

I think here it's important to distinguish between normative ideals and practical algorithms. Bayes' rule is the normative ideal of prediction because it follows from the axioms of probability, and if you don't follow the axioms of probability you get Dutch-booked. Solomonoff's universal induction solves the problem of where your initial prior comes from. However, there are no practical algorithms that are explicitly Bayesian, can handle lots of nonlinear features, and exhibit good generalization, so we resort to deep learning, which is a good enough approximation to Solomonoff induction. N.b. https://www.inference.vc/everything-that-works-works-because-its-bayesian-2/
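The Dutch book is concrete enough to compute. A toy example (events and numbers invented for illustration): an agent whose credences sum to more than 1 will buy bets at its own prices and lose money no matter which outcome occurs.

```python
# An agent whose credences violate the probability axioms:
cred = {"rain": 0.6, "no_rain": 0.6}     # sums to 1.2, not 1.0

# A bookie sells the agent a $1 bet on each outcome at the agent's own
# prices (a bet paying $1 if the event occurs costs credence * $1).
cost = cred["rain"] + cred["no_rain"]    # agent pays $1.20 up front

# Exactly one of the two outcomes occurs, so the agent is paid $1 back.
for outcome in ("rain", "no_rain"):
    payout = 1.0
    assert payout - cost < 0             # guaranteed loss either way

print(f"sure loss: ${cost - 1.0:.2f}")
```

This is the content of "normative ideal": coherent credences are exactly the ones that can't be exploited this way.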

Solomonoff isn't a computable algorithm, and its incomputability hides precisely the interesting part: how to tell which of a variety of data-compatible functions to select first. That decision is the essence of real learning, of real belief-updates. And Solomonoff doesn't give the correct answer. (Surprisingly, the length-first ordering -- often described as an instantiation of Occam's razor -- plays almost no role in that argument. It is just an arbitrary way of enumerating all functions, and the argument still goes through under any enumeration, losing no strength. That should be enough to make you raise an eyebrow. The fact that Solomonoff is both uncomputable in theory and completely intractable in practice should make you raise the second.)

Solomonoff does give the correct answer. If an oracle handed to us the MDL program for a deep learning dataset like C4 *it would absolutely destroy any practically feasible deep learning approach*. Perfect compression discerns every single bit of structure in the data and would have insane and seemingly magical generalization ability.

I'm not sure which argument you're claiming doesn't fall apart if you discard the universal prior in favor of some other enumeration of programs, but the universal prior is absolutely essential for sound induction. Suppose you are observing a long sequence of coin flips, your hypothesis space is all programs, and you want to converge to the belief that the coin is fair iff it is fair. If you observe a fair coin for a very long time and see the sequence "HTTTHT...", how do you avoid converging to the hypothesis "this is a rigged coin that always lands in the sequence HTTTHT..."? After all, such an overfit hypothesis will always achieve better likelihood than the true hypothesis "the coin is fair" for a sufficiently long sequence of flips. The way you do so is by demanding that you gain one bit of explanation for every bit required to specify your hypothesis. Every other way of trading off fit against hypothesis complexity will fail to converge to the truth in some circumstance.
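The coin argument can be made numeric with a two-part (MDL-style) code-length comparison. The constants here are invented for illustration; the point is the structure of the tradeoff, not the particular numbers.

```python
# Two-part code lengths, in bits, for n flips of a fair coin.
n = 100
C = 5                         # hypothetical cost of stating "the coin is fair"

fair_data_bits = n            # a fair coin costs 1 bit per flip to encode
memorize_data_bits = 0        # the memorizing hypothesis encodes the data free
memorize_spec_bits = C + n    # ...but needs ~n bits just to state itself

fair_total = C + fair_data_bits
memorize_total = memorize_spec_bits + memorize_data_bits

# Max-likelihood alone always prefers the overfit "rigged coin" hypothesis:
assert memorize_data_bits < fair_data_bits
# But charging one bit of fit for every bit of specification means the
# memorizer never pulls ahead, no matter how long the sequence gets:
assert memorize_total >= fair_total
```

Memorization buys exactly one bit of likelihood per bit of specification, so under the universal prior it breaks even at best, and the fair-coin hypothesis survives.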

I still think your claim that "Bayes leaves something to be desired when it comes to updating one's beliefs" is wrong, unless the "something" is computability, which I don't think you're claiming it is. I also think that "most of the learning that these models do cannot be understood via Bayes’ theorem" is wrong, because Bayes' theorem is the theoretically optimal way of learning, so any way of learning that's any good must somehow be approximating Bayes' theorem. The claim that prediction is a first-class abstraction for neural nets yet somehow isn't for Bayes is false; in fact, Solomonoff induction is most often posed as predicting a sequence of bits, which is a fully general way to state prediction problems.

I don't think that algorithms which are explicitly Bayesian are useful when it comes to building practical AI systems. In fact, in undergrad I worked with a Tenenbaum-style Bayesian ML group for a while and eventually quit to do pure deep learning because I realized their research programme wasn't going anywhere. However, anything that works has an explanation in terms of how it manages to approximate Bayes. For deep learning, there are actually such explanations: https://docs.google.com/presentation/d/1JLCCvE805ZVwrrdKAdW9jiVcNN4GLlYA0Qns1rIPrtc/edit?usp=sharing

> Solomonoff does give the correct answer. If an oracle handed to us the MDL program for a deep learning dataset like C4 *it would absolutely destroy any practically feasible deep learning approach*. Perfect compression discerns every single bit of structure in the data and would have insane and seemingly magical generalization ability.

That's not true in general. The performance of course depends on the program, and the MDL program depends on the choice of programming language. So for any specific task (say, C4 language modeling), there is an infinitude of languages for which that language's MDL program has terrible performance -- much worse than deep learning. Of course, there are also some languages where it will do much better. But that choice is exactly what I am identifying as the interesting part of learning, and it's the thing that deep learning gets right.

> I'm not sure which argument you're claiming doesn't fall apart if you discard the universal prior in favor of some other enumeration of programs

Here's a handwavy argument. Consider *any* enumeration of the natural numbers. (The natural enumeration is 1,2,3,...; another one might be 2,1,3,4,5,...; another is 2,1,5,4,3,9,8,7,6,...; etc.) Order all your programs according to this enumeration -- if we use the natural one, this is a shortest-to-longest enumeration, but if we choose a different enumeration, that's no longer true. Also: every incorrect program can be eliminated by a finite number of (unique/counterfactual) data points. Since only a finite number of programs precede the correct one in the sequence we've chosen, we will eventually eliminate all of them, and after that we will have selected the correct program.
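The handwavy argument above is easy to make concrete in a finite setting. A sketch (the hypothesis class and target are chosen arbitrarily for illustration): enumerate all 16 boolean functions of two bits, shuffle the enumeration however you like, eliminate on contradicting data, and you still land on the true function.

```python
import itertools
import random

inputs = list(itertools.product([0, 1], repeat=2))

# All 16 boolean functions of two bits, each represented by its truth table.
hypotheses = [dict(zip(inputs, tt))
              for tt in itertools.product([0, 1], repeat=4)]

target = {(a, b): a ^ b for (a, b) in inputs}   # ground truth: XOR

def learn(order, data):
    """Return the first not-yet-eliminated hypothesis in the given order."""
    for h in order:
        if all(h[x] == label for x, label in data):
            return h

# Enough data to eliminate every rival of the target.
data = [(x, target[x]) for x in inputs]

# Any enumeration order works: elimination always converges to the truth.
for seed in range(5):
    order = hypotheses[:]
    random.Random(seed).shuffle(order)
    assert learn(order, data) == target
```

The order only affects which wrong hypotheses get entertained along the way, not the endpoint -- which is the claimed sense in which shortest-to-longest is arbitrary.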

This is essentially the same guarantee standard shortest-to-longest Solomonoff induction gives. Indeed, the fact that this guarantee is preserved under reorderings is basically the same reason it is preserved under choice of programming language. And yes, the gap can be bounded, but IMO the bound hides literally everything interesting about learning. The guarantees of Solomonoff induction are basically just a rigorized statement of "exhaustive elimination will eventually get you what you are looking for".

To summarize, the real substance of learning is about moving beyond exhaustive elimination: how do we learn *faster* than the rate at which we learn by exhaustive elimination? This is impossible to guarantee in full generality (No Free Lunch) and yet deep learning manages to do it (on real-world problems). Bayes has nothing useful to say about this.

I think a reasonable interpretation of Solomonoff induction is that it states "you can't learn faster than exhaustive elimination in program space". If you're learning faster than exhaustive elimination in program space, you're "cheating" by exploiting prior knowledge about the structure of the problem, which at some point in the past must've been learned by someone by exhaustive elimination.

That said, I don't even think that deep learning is doing anything more interesting than "exhaustive elimination of circuits with a description length prior". You can achieve perfect training accuracy on random labels! You're clearly not exploiting any particular property of real-world problems.

One of the reasons I believed in language models back in the days when they sucked and weren't SOTA on anything was my conviction that fast learning is only possible after you've spent a long time doing brute-force learning.

> You can achieve perfect training accuracy on random labels!

This is precisely why we know that NNs *must* be doing something beyond exhaustive search. They clearly have the capacity to represent all sorts of functions. When we train them on real data, they have plenty of functions to choose from, and yet they choose the ones that generalize well. They get the right answers well before the wrong answers have been exhausted!

Somehow, NNs are -- as you say -- exploiting something about the problems. But they solve every real-world problem we throw at them! So they must be exploiting something fundamental about the world. Whatever they are doing, it contains the essence of learning.

One final comment: your claim that Bayes can't predict, or that it is vacuous in certain circumstances, stems from your choice of hypothesis space rather than from anything to do with the learning algorithm. If you make your hypothesis space all data-generating programs, your stated issues with Bayes go away. In fact, if you made your hypothesis space all neural networks of a fixed architecture, you'd do much better than SGD.

I of course agree that it's possible to reframe the examples I used here as a function learning problem, and it's certainly possible to look at learning through a Bayesian lens. But it's not useful to do so, as Bayes *has nothing interesting to say* in this regime. And the empirical success of deep learning all-but-proves that there *is something interesting to be said*.

(In contrast, in the regime where we have multiple samples and observe correlations, Bayes *does* have something interesting to say, and *should* be used. That's why I phrased it as "leaves something to be desired".)

The point of my original comment was to draw a strong distinction between "has something to say" and "should be used". The entire literature on algorithmic probability is something to say. Viewing language models as lossless compressors helps explain why larger models generalize better, which is certainly something to say. I absolutely do not think we use explicitly Bayesian algorithms for most learning problems. I am an applied researcher and my research programme is 100% committed to pure DL.

Analogy: differential equations have something to say about baseball, but you should never try to solve differential equations while playing baseball. Your neural circuits have some pretty slick approximations that will get you the answer you want.