Where Bayes Falls Short
Bayes' theorem is a useful tool in some domains, but leaves something to be desired when it comes to updating one's beliefs.
Main point: the powerful learning abilities of deep learning algorithms are the best model we have for intelligence, and most of the learning that these models do cannot be understood via Bayes’ theorem.
Simpson’s paradox is a well-known statistical phenomenon wherein separating data into different groups before analysis can wildly change the conclusions. This can be surprising and confusing at first. But with just a small twist of perspective the paradoxical nature vanishes. The phenomenon is a natural consequence of the fact that the sign of the correlation encodes the direction of a belief update, and that direction depends on one’s initial beliefs.
Let’s start by understanding the paradox itself. It is easiest to explain with a simple illustrative example. Imagine you are a high school administrator who is considering whether to expand the athletics program, and so you are curious about the relationship between academic and athletic performance in the current student body. You survey the students on these two axes, perhaps by measuring sprinting speed and score on a standardized test. The results look like this:
All three plots show the same data [1], but stratified in different ways. In the first plot, we see a positive correlation between academic and athletic performance. But in the second plot, we see that for students in each grade, there is actually a negative correlation between academic and athletic performance. However, the third plot once again shows a positive correlation in each subgroup, provided we distinguish between varsity and non-varsity athletes.
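To see the reversal concretely, here is a minimal sketch (with entirely made-up numbers) that generates data of this shape and prints the overall and per-grade correlations; the grade structure and coefficients are arbitrary illustrative choices, not a claim about real students:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: higher grades are, on average, both more athletic and more
# academically advanced, but *within* each grade the two trade off slightly.
grades, athletic, academic = [], [], []
for grade in (9, 10, 11, 12):
    n = 100
    ath = rng.normal(loc=grade, scale=0.5, size=n)                     # grade-level baseline
    aca = grade - 0.8 * (ath - grade) + rng.normal(scale=0.3, size=n)  # within-grade trade-off
    grades.append(np.full(n, grade))
    athletic.append(ath)
    academic.append(aca)

grades = np.concatenate(grades)
athletic = np.concatenate(athletic)
academic = np.concatenate(academic)

print("overall correlation:", np.corrcoef(athletic, academic)[0, 1])   # positive
for g in (9, 10, 11, 12):
    m = grades == g
    print(f"grade {g} correlation:", np.corrcoef(athletic[m], academic[m])[0, 1])  # negative
```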
So, what’s going on here? Is athletic ability positively or negatively correlated with academic ability? Which breakdown of the data is the correct one? The key to resolving this paradox is understanding what a correlation really is.
A correlation is a recipe for a belief update. If we observe a positive correlation between A and B (from some population), then observing that A is large for some individual (sampled from that population) should cause us to believe that B is also likely to be large for that individual.
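To make the recipe concrete in the simplest possible setting: if A and B are jointly Gaussian (or if we only ask for the best linear prediction), then after observing A = a, our guess for B moves from its prior mean μ_B to

$$\mathbb{E}[B \mid A = a] \;=\; \mu_B + \rho\,\frac{\sigma_B}{\sigma_A}\,(a - \mu_A),$$

so for an above-average observation of A, the revision is upward or downward exactly according to the sign of the correlation ρ. This covers only the linear/Gaussian case, but it is the cleanest illustration of why the sign of a correlation encodes the direction of an update.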
It’s intuitively clear that a belief update needs to take into account what we already know. Learning a fact about an individual tells us more about their true nature, but the change in our beliefs (which takes us closer to that true nature) depends on our starting point. It follows naturally that updates starting from different background information might go in different directions.
To return to our working example, we see that “is athletic ability positively or negatively correlated with academic ability?” is an ill-posed question. If we reframe the question as “if I learn that a student is athletic, should I revise my guess of their academic performance up or down?”, the answer becomes obvious: it depends on what your initial guess was (& why).
If you know nothing about a student, you should guess that they are of average athletic & academic ability. If you then learn that they are of above-average athletic ability, you should guess that they are of above-average academic ability. (This is a revision upwards, because of a positive correlation.)
If you know only that a student is in 10th grade, you should guess that they have 10th-grade-average athletic & academic ability. If you then learn that they are of above-10th-grade-average athletic ability, you should guess that they are of below-10th-grade-average academic ability. (This is a revision downwards, thanks to an observed negative correlation.)
If you know that a student is in 10th grade and a varsity athlete, you should guess that they have 10th-grade-varsity-athlete-average athletic & academic ability. If you then learn that they are of above-10th-grade-varsity-athlete-average athletic ability, you should guess that they are also of above-10th-grade-varsity-athlete-average academic ability. (Update upwards; positive correlation.)
…et cetera. It is perfectly intuitive that sometimes the same piece of new information will correct an overestimate, and other times it will correct an underestimate, leading to the sign of the correlation flipping. To me, this is a fully satisfactory resolution of Simpson’s paradox.
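Continuing the synthetic-data sketch from earlier (so the numbers are purely illustrative), the first two of these updates can be read off directly as differences of conditional means:

```python
import numpy as np

# Reuses the `grades`, `athletic`, and `academic` arrays from the earlier sketch.

def revision(mask):
    """Change in our guess of academic ability after learning that a student is
    more athletic than the average of the reference group defined by `mask`."""
    base = academic[mask].mean()                          # initial guess: the group average
    above = mask & (athletic > athletic[mask].mean())     # new info: above-group-average athleticism
    return academic[above].mean() - base

everyone = np.ones_like(grades, dtype=bool)
print("knowing nothing:", revision(everyone))             # positive: revise upwards
print("knowing 10th grade:", revision(grades == 10))      # negative: revise downwards
```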
I will refer to this as the Bayesian approach, because although I haven’t actually gone into detail about how the belief updates are computed, suffice it to say they use the same basic tools as other forms of Bayesian reasoning.
Something interesting happens when we extend this argument all the way down to arbitrarily-small categories. Consider:
If you know that a student is in 9th grade, is 6’1” tall, weighs 143 pounds, is named Manfred, was born in June, has a pet iguana, never visited Europe, loves Marvel movies (especially Spiderman), has great posture, and once got detention for lighting a fire with a magnifying glass — you should guess that they have ??? athletic & ??? academic ability. If you then learn that they are of above-???-average athletic ability, you should guess that they are of ??? academic ability.
The Bayesian approach just falls apart here. The group of students with these properties contains only one [2] member: Manfred. It’s not at all clear how we should estimate his athletic & academic ability a priori, and even less clear how we should update those estimates if we learn his score on one of those two axes. Should we just ignore most of the details, and default to the assumption that he is at the 9th-grade average? It’s probably helpful to incorporate the information that he is tall, which seems related to his athletic ability. And maybe also his weight? — but is he heavy-for-a-9th-grader, or light-for-a-6’1”-student? But it seems like we can safely ignore his pet iguana and his birthday (but…can we really? what justifies this?)
These decisions feel difficult to make in a principled way. Any Bayesian approach to the analysis will require choosing the various reference classes in an ad-hoc fashion — but as we have seen, these choices can have a big impact on the outcome of the analysis, so using this methodology, we can engineer almost any conclusion we want. This makes any conclusions we might reach feel a bit…meaningless. Is there any non-arbitrary way to tackle this problem?
In fact, there is: deep learning. Taking a step back, recognize that the problem setting we have arrived at is simply regression. Each individual student is a data point; our goal is to predict some quantity (academic ability), given other information about the student (grade, height, name, etc.). Each detail about the student’s life is an input feature to the regression. And the best way to solve a regression task is deep learning.
This approach absolves us of the need to make ad-hoc choices about what features to include. We can simply use all the available details about each student to make each prediction; given sufficient data, the learning algorithm itself will recognize which details are irrelevant to the task at hand, and learn a model which is invariant to those details. Note that, unlike the Bayesian approach, we don’t require data for every specific subcategory — there’s still just one Manfred. Instead, deep learning lets us leverage generalization. By training a model that can predict the academic performance of any student, we can leverage information from other, similar-but-not-identical students to make good predictions about Manfred’s academic performance.
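As a rough illustration of what this might look like in practice, here is a minimal PyTorch sketch of such a regressor. Everything in it (the feature encoding, the architecture, the hyperparameters) is an arbitrary placeholder; the point is simply that every available detail goes in as an input feature, and training decides what matters:

```python
import torch
import torch.nn as nn

# Assumes each student has been encoded as a fixed-length feature vector
# (grade, height, weight, one-hot buckets for birth month, pets, etc.) and
# that `targets` holds the academic-ability scores we want to predict.

class StudentRegressor(nn.Module):
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def train(model, features, targets, epochs=200, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(features), targets)
        loss.backward()
        opt.step()
    return model
```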
With this approach, what is the analogue of the correlation? In other words, how might we answer the question: “if I learn that a student is athletic, should I revise my guess of their academic performance up or down?” In the deep learning approach, we see that this question is somewhat ill-posed, since the relationship could be nonlinear: for example, learning that this particular student is a little more athletic than expected could cause us to revise our guess upwards, but learning that they are way more athletic than expected could cause us to revise it downwards. [3] But we can still find an answer, although it will not be a simple up/down. Given a student profile — say, Manfred’s — we can create many new datapoints by plugging in various values of athleticism (while keeping the rest of the profile the same), and measure the model’s predictions; subtracting off the initial uninformed prediction, the resulting plot gives us the complete picture of possible updates.
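Here is a sketch of that sweep, reusing the placeholder regressor above. The feature index, the grid of athleticism values, and the choice of an “unknown” placeholder as the uninformed baseline are all assumptions for illustration (the masking scheme in footnote [4] would handle the baseline more cleanly):

```python
import torch

def update_curve(model, profile, athletic_idx, values, null_value=0.0):
    """For a fixed student profile, sweep the athleticism feature over `values`
    and report how the prediction moves relative to an 'uninformed' baseline
    (here, the prediction with athleticism set to a placeholder null value)."""
    base_profile = profile.clone()
    base_profile[athletic_idx] = null_value
    with torch.no_grad():
        baseline = model(base_profile.unsqueeze(0)).item()
        curve = []
        for v in values:
            x = profile.clone()
            x[athletic_idx] = v
            curve.append(model(x.unsqueeze(0)).item() - baseline)
    return curve
```

Plotting `curve` against `values` gives the full, possibly nonlinear, picture of how learning Manfred’s athleticism would move our guess.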
This is undoubtedly a bit of a convoluted and awkward way to extract information about how to update one’s beliefs. But rather than viewing that awkwardness as an indictment of the approach, I prefer to view it as an indictment of the abstraction. In the Bayesian framework, updating takes center stage as a first-class abstraction; but when we move to the more flexible deep-learning approach, we see that a prediction is the first-class abstraction, and an “update” is just the difference between two predictions. So yes, it’s a bit awkward to extract the update formula from the model, but there’s no reason to care about the update formula anyways. The interesting bit is figuring out how to make predictions: how to go from observed data to a set of beliefs.
Bayesian updates are not wrong. If we trained a neural network on the high school dataset [4], the resulting model would likely exhibit Bayes’ rule. That is, if we condition and query it in exactly the same fashion as the Bayesian approach, it will produce the same predictions:
Given only the feature “grade=10th”, the network will output the 10th-grade-average athletic & academic ability. Given the feature set “grade=10th; athletics_percentile=65”, the network will output below-10th-grade-average academic ability.
almost as though it is internally applying a Bayesian update. Because, of course, it is: we are training the neural network to output the conditional distribution that maximizes the likelihood of the data, and Bayes’ theorem tells us how to calculate what that distribution is.
But Bayesian updates are inadequate. All of the interesting bits of what neural networks do, they do in the non-Bayesian regime. When we add a new datapoint to the dataset, we update our predictions on every possible input we might see, and the vast majority of these are datapoints like Manfred — one-off individuals, with so much detail and specificity that reference classes are everywhere and Bayesian updates are vacuous. And yet, deep learning gives us a reliable and scalable recipe for incorporating the new datapoint into the model, a way to improve our predictions for every one of these unique individuals.
Why does deep learning work so well? How does gradient descent find solutions that generalize? Nobody knows. The mystery of generalization is the deepest and most important problem in the field of deep learning. But the empirical success of this approach all but proves that it must be doing something right.
When we discover what that is, we will begin to truly understand the proper way to update beliefs in response to evidence. In the meantime…well, I guess there’s always Bayes’ rule.
[1] This might seem like a contrived example, but it is not. This pattern of reversals is fundamental, and can be found in nearly any data set, provided we select the right grouping. Dynomight has an excellent write-up which illustrates this point.
[2] If we had an arbitrarily large population, like a school with an infinite number of students, we would eventually encounter many students matching this description, and then the Bayesian approach would once again work. But this is not realistic; the number of potential groups grows exponentially with the number of attributes, so it would take a truly absurd amount of data to reach complete coverage.
[3] Actually, this can be true of the Bayesian analysis too, for example by bucketing students into “very athletic”, “kind-of athletic”, “medium athletic”, etc., and then computing the updates on each group individually. However, this introduces even more ad-hoc-ness.
[4] In practice, the model I am describing here would need to be trained in a slightly different way from normal prediction models. Typical deep learning models require a fixed input and output space; in my examples, I am assuming we can supply variable sets of input features and predict any of the output variables. It’s a bit non-standard, but it wouldn’t be hard to train a network capable of doing these things. E.g., one might do dataset augmentation where certain input features are randomly masked with a “null” value, and use those features as prediction targets.
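For what it’s worth, that masking augmentation might look something like the following sketch (the placeholder value and masking rate are arbitrary choices):

```python
import torch

def mask_features(x, null_value=0.0, p=0.3):
    """Randomly replace some input features with a 'null' placeholder so the
    model learns to predict from whatever subset of features is available;
    the masked-out originals can double as extra prediction targets."""
    mask = torch.rand_like(x) < p
    masked = torch.where(mask, torch.full_like(x, null_value), x)
    return masked, mask
```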
You're confusing naive Bayes with the entirety of Bayesian statistics here. In the example given, you are correct that you cannot apply naive Bayes. But you can still have a Bayesian regression model and apply Bayes' rule; you're just assuming a model. There's nothing special about deep learning here: it just assumes a very complex and flexible model, but you can frame it as a regression anyway.
But isn't there a lot of arbitrariness in deep learning models?
We could train different models with different architectures and different hyper-parameters, and get different results, right? Why is this better than the "ad-hoc-ness" of the Bayesian approach?