This short essay is about a misleading but somewhat popular argument used to dismiss artificial neural networks, and it describes an alternative view of the role of training with backprop versus evolution.

There is (literally) more than meets the eye

First, the misleading argument:

“My two-year-old nephew only had to see two dogs to then recognize all dogs thereafter. A neural net needs to see thousands of dogs to reach the same level of accuracy. Therefore neural nets are too sample inefficient to be good models of the brain.”

The sleight of hand is in the first sentence, which is only partially true. The nephew has seen millions of dogs through the eyes of his ancestors.[1] If even just a couple of ancestors had not been able to learn to quickly recognize dogs, they would have been eaten by wolves. No ancestors, no nephew.[2]

Gradient descent as taking the role of evolution

Let’s put doggos and wolves aside for a second, and think about learning across multiple timescales. For humans we typically think of at least two: evolution (acting on the DNA) and learning during a lifetime (mostly acting on synaptic plasticity). Two corresponding mechanisms are sometimes seen in artificial neural nets: the iterations that researchers do on architectures roughly play the role of evolution (e.g. Perceptron -> AlexNet -> ResNets -> etc.), while the training of neural nets on datasets corresponds to the learning that happens within a lifetime. This analogy is further supported by the common interpretation of backprop + gradient descent as learning by slowly adjusting synaptic weights, just like biological neurons do. If we embrace this classic view of synaptic plasticity, we have to conclude that neural networks are indeed very sample inefficient compared to human brains within a lifetime.

But here is an alternative view: gradient descent on network weights does not correspond to the learning that occurs within the lifetime of the individual (i.e. synaptic plasticity), but acts at the same level as evolution.

Note that we are not saying that gradient descent and evolution are mechanisms that act in a similar way; in fact, they are about as different as mechanisms get. We are saying that the role of training with gradient descent can be seen as equivalent to the role evolution played for us: distilling the right inductive biases to then recognize all dogs after seeing only two of them. This is what already happens when fine-tuning neural nets: they need very little data, in some cases no new data at all, and at times even no training at all (as shown e.g. by GPT-3 in the classic example below).
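As a toy illustration of the two timescales (a made-up sketch with synthetic data, not anything from the essay): a long "evolutionary" phase of gradient descent can distill the relevant structure out of thousands of samples, after which "lifetime learning" needs only two labeled examples to calibrate itself:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 50

# Hidden structure: the label depends only on one direction in input space.
w_true = rng.normal(size=DIM)

def make_data(n):
    X = rng.normal(size=(n, DIM))
    y = (X @ w_true > 0).astype(float)
    return X, y

# --- "Evolution": gradient descent distills structure from lots of data ---
X_pre, y_pre = make_data(2000)
w = np.zeros(DIM)
for _ in range(200):  # plain logistic-regression gradient descent
    p = 1.0 / (1.0 + np.exp(-(X_pre @ w)))
    w -= 0.1 * X_pre.T @ (p - y_pre) / len(y_pre)

# --- "Lifetime": two labeled examples, one per class ---
while True:
    X_few, y_few = make_data(2)
    if 0.0 in y_few and 1.0 in y_few:
        break

# The two examples only tell us which side of the learned feature
# corresponds to which label; the feature itself came from "evolution".
f_few = X_few @ w
flip = f_few[y_few == 1][0] < f_few[y_few == 0][0]

X_test, y_test = make_data(1000)
scores = X_test @ w
pred = (scores < 0).astype(float) if flip else (scores > 0).astype(float)
accuracy = (pred == y_test).mean()
```

In this sketch the two examples carry almost no information on their own; they generalize only because the pre-trained feature `w` already encodes the task structure, which is the essay's point about the nephew.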

And indeed, GPT-3 has been the target of the same criticisms leveled against large-scale vision models. The dismissal usually goes along these lines:

“Sure, GPT-3 is impressive, but it has read way more text than any human could possibly read in a lifetime, so it doesn’t count”.

Again, this ignores that:

  1. A human of today has implicitly read, heard, and spoken billions of words through the eyes, ears, and mouths of their ancestors;
  2. We could instead look at GPT-3’s massive pre-training as serving the same role that evolution had for humans in the development of the structures for language in our brains. Again, this is not in the sense that evolution and GPT-3’s pre-training are technically similar, but their roles can be seen as roughly the same: they are both large optimization processes that distill a lot of experience into an efficient learner. The playing field is level only after pre-training, be it evolution for the nephew or gradient descent for neural nets.[3]

Looking forward

What do we get from embracing this interpretation of gradient descent on weights as taking the role of evolution?

  • Optimism for the future. For example, our largest vision models, (pre-)trained on hundreds of millions of images, have seen a tiny, tiny fraction of the “images” the nephew’s ancestors saw. There is no reason to think performance should plateau anytime soon as data and compute increase.
  • Relief from guilt. There is no need to feel bad about drowning models in data, as if it took them a step further away from “brain-like” networks: evolution was a costly and incredibly sample-inefficient process that led to us, and with pre-training we are running a fast-forwarded, streamlined alternative to evolution.

This short essay is the result of long chats with Alex Neitz. I also thank Niki Kilbertus and Matej Balog for their kind feedback.


[1] How many dogs? And other creatures and objects? And what about other people that were not the nephew’s ancestors, aren’t they part of the optimization too? More on this in an upcoming essay.

[2] Sometimes this objection has been met with another objection: “But what about smartphones? No one’s ancestors had iPhones.” That’s true, and we clearly did learn to generalize beyond the set of objects our ancestors interacted with, just like dogs learn to play with plush toys. Moreover, iPhones are designed by people for people, so they had better be things we can learn to recognize and use quickly, or they wouldn’t sell very well. It’s not by chance that iPhones fit in a hand, weigh less than 100 kg, etc.

[3] Btw, I think it’s really interesting that the best artificial meta-learner we have (GPT-3) is not explicitly trained with any fancy meta-learning algorithm or objective.
