Gradient Descent Is Not Learning, But Evolution
When a human improves their skills or knowledge, we call that process “learning”. And now that we have developed a set of techniques for imbuing computers with knowledge, it is only natural that we call that process “learning” as well.
But a moment’s consideration reveals many differences between a human learning and a neural network being trained by gradient descent. To highlight a few:
Deep learning has two distinct phases, training and inference. Human learning seemingly has no such distinction. (Some suggest wake/sleep, but while sleep does indeed seem helpful for human learning, it is certainly not required.)
To train a broadly intelligent AI with deep learning (“pre-training”), we leverage the entire Internet’s worth of text data — several orders of magnitude more text than a comparably-linguistically-proficient human would ever see.
Humans can learn new skills from an explanation or a handful of examples. In contrast, training a new skill into a neural network with gradient descent (“fine-tuning”) requires a large dataset and meaningful computation. Doing so with reinforcement learning (“post-training”) requires vast amounts of computation and trial-and-error interaction.
Neural networks often exhibit catastrophic forgetting, a phenomenon where fine-tuning or post-training on one skill craters performance on another. Human learning, though imperfect, seems relatively durable; skills may fade over time, but they do not suddenly vanish.
It is universally acknowledged among AI researchers that stark contrasts exist between human learning and AI learning, many of which favor the humans. However, there is wide diversity of opinion on what this implies.
Some view these differences as evidence of a fundamental inadequacy of deep learning, a smoking gun revealing that this paradigm cannot possibly deliver on its promise of human-level intelligence. Others view these differences as clues towards potential avenues of improvement, and seek biologically-inspired algorithmic tweaks to close some of these gaps. And many simply dismiss the differences as irrelevant, often with sentiments like, “airplanes may not flap their wings like birds, but that does not stop them from flying.”
I have a different perspective. Human learning and AI training seem so different simply because the community has been using the wrong analogy.
Gradient descent is analogous to human evolution. In-context learning is analogous to human learning.
This new analogy does not suffer from any of the issues discussed above. For example, while human learning may not have two distinct phases, human intelligence clearly does. These two phases, evolution/learning, map nicely onto the two phases of a neural model, training/inference.
The first phase, evolution, takes place over billions of years, and compresses an unfathomable amount of ancestral human sensory experience into DNA. It is slow to incorporate new information, and can experience catastrophic forgetting. Analogously, pre-training is an expensive and time-consuming endeavor that compresses a massive amount of human-generated text into the parameters of a neural network. There is even a nice mathematical connection: gradient descent and evolution are two examples of locally-greedy optimization.1
The second phase, learning, takes place over the course of each human’s lifetime, and results in some particular brain with its intelligence, knowledge, and skills. It can be run on a single brain with modest energy requirements, and involves processing a stream of experiences to update knowledge in real-time. Similarly, inference involves processing a stream of context tokens in real-time, on a (relatively) small number of GPUs. The choice of tokens controls the subsequent behavior of the model, quickly giving the model new skills (think e.g. prompt engineering, in-context learning from examples, etc).
Personally, I have found this switch in perspective to be illuminating. When thinking through issues in the field or making predictions about its future, one often finds oneself reasoning by analogy, and this analogy has been quite useful to reason with. Of course, I am not claiming that it is perfect,2 and ultimately, I do expect that it will need to be replaced in turn. But for now, it is the best framing I have found. It also has some interesting implications, which I will explore in a future post.
In fact, with a handful of assumptions, the limiting behavior of evolution is gradient descent: the expected change in any trait follows the gradient of fitness. Computing gradients in closed-form, as backpropagation does, is simply a vastly more efficient strategy for calculating this update than natural selection. (Although neural networks can be trained by evolution as well.)
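This connection can be made concrete with a small numerical sketch. The code below uses a toy quadratic fitness landscape (the landscape and all numbers are illustrative assumptions, not anything from the post): it draws random “mutations” around a parent, weights each by its fitness advantage, and averages. This is the evolution-strategies estimator, and with enough samples it recovers the same gradient that closed-form calculus gives directly.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5])  # hypothetical optimum of the toy landscape

def fitness(w):
    # Toy quadratic fitness landscape; its analytic gradient is -2 * (w - target).
    return -np.sum((w - target) ** 2)

def es_gradient(w, sigma=0.1, n=10_000):
    # Evolution as gradient estimation: sample random mutations, weight each
    # by its fitness advantage over the parent, and average. In the limit of
    # many samples this converges to the gradient of fitness.
    eps = rng.normal(size=(n, w.size))
    advantage = np.array([fitness(w + sigma * e) for e in eps]) - fitness(w)
    return (advantage[:, None] * eps).mean(axis=0) / sigma

w = np.zeros(3)
est = es_gradient(w)             # noisy estimate via "natural selection"
true_grad = -2 * (w - target)    # exact gradient via closed-form calculus
```

The point of the sketch is the contrast in cost: the selection-based estimate needs thousands of fitness evaluations to approximate what one closed-form computation produces exactly, which is the sense in which backpropagation is the vastly more efficient strategy.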
To play devil’s advocate, here is one example of a lens through which this new analogy is worse than the old one: the parameter counts of LLMs are (already!) larger than the size of human DNA. Similarly, the capabilities of an AI at “birth” (i.e. the start of an inference session) are obviously far more advanced than those of a human baby. Perhaps we are bundling more information into our parameters than biology has into DNA.
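A back-of-envelope calculation makes the size gap explicit. The numbers below are illustrative round figures (the human genome is roughly 3.1 billion base pairs at 2 bits each; the 70B-parameter model is a hypothetical example, stored in 16-bit floats):

```python
# Human genome: ~3.1e9 base pairs, 2 bits per base (A/C/G/T).
genome_bits = 3.1e9 * 2
genome_gb = genome_bits / 8 / 1e9   # bits -> bytes -> gigabytes

# A hypothetical 70B-parameter LLM at 2 bytes (fp16) per parameter.
llm_params = 70e9
llm_gb = llm_params * 2 / 1e9

print(f"genome ≈ {genome_gb:.2f} GB, LLM ≈ {llm_gb:.0f} GB")
# The model's raw parameter storage exceeds the genome by two orders of magnitude.
```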


The only reason LLM parameter counts are so high is because we are using dense layers throughout. This makes training tractable and allows big matrix multiplications to be used to easily achieve high degrees of parallelism in inference, but the "real" core model is probably much smaller and sparser and deeper. It's hard to find it, but it does seem to be in there. So I'm not convinced anything useful is shown by comparing parameter counts with genome size.
I've definitely had similar thoughts before. It's not apples-to-apples to say e.g. that CNNs need sooo many examples to learn the difference btwn a cat and a dog while we only need a handful. Because our ancestors have seen sooo many examples, and evolution imbues us with their knowledge, while the CNN starts as a blob of random weights.
In any case, it's a damn shame that we catastrophically forgot how to breathe underwater after tiktaalik left the sea.