Opinions on ways to improve current AI models can be coarsely grouped into 2 schools of thought.
Proponents of the scaling paradigm think that declaring good incentives (the reward, in reinforcement-learning terms) and scaling the number of parameters or the amount of training data[1] is all that is needed to improve them[2]. A somewhat more radical view is that of Geoffrey Hinton, who thinks next-word prediction (i.e. pre-training alone) is all you need to advance.
Conversely, proponents of the engineering paradigm think that new architectures or scientific breakthroughs are required to advance the field. These include dedicated world models, championed by voices such as Yann LeCun.
I instead have 3 principles to decide whether to engineer a solution or let the network come up with one:
For the rest of this short essay, I will review (sometimes speculatively) ML techniques in light of these principles. Regarding the 1st principle, for example, the AI equivalent of a connectome can emerge on its own, so one shouldn't engineer specialised networks or the hubs that connect them[0].
Considering the 2nd, reliable composability (the ability to compose representations) requires recurrence in depth, as a model would otherwise need to be infinitely deep to account for every possible combination. So-called reasoning or thinking models add recurrence at the network level by allowing the last layers to serve as input to the first ones after decoding. This is IMO an inelegant way of adding recurrence in depth, as it only allows feedback after full passes and as such doesn't afford fine-grained modularity. Allowing individual layers to choose dynamically which layer they connect to, and to exit dynamically, would likely see the emergence of subnetwork modules that can be combined, as in brains, where feedback connections make up ~60% of all connections[3]. It would also respect the first 2 principles, by thinking in latent space instead of language (and thereby being potentially more expressive).
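To make this concrete, here is a minimal PyTorch sketch of the idea, entirely my own illustration rather than an established architecture: a pool of blocks, a router that picks which block to apply next, and a halting gate for dynamic exit. All module names and sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicDepthNet(nn.Module):
    """Recurrence in depth: blocks are reused, chosen and exited dynamically."""
    def __init__(self, d_model=256, n_blocks=4, max_steps=8):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                          nn.Linear(d_model, d_model))
            for _ in range(n_blocks)
        ])
        self.router = nn.Linear(d_model, n_blocks)  # which block to apply next
        self.halt = nn.Linear(d_model, 1)           # when to exit
        self.max_steps = max_steps

    def forward(self, x):  # x: (batch, d_model)
        for _ in range(self.max_steps):
            # Hard routing for clarity; training would use a soft mixture or
            # a straight-through estimator instead.
            choice = F.softmax(self.router(x), dim=-1).argmax(dim=-1)
            update = torch.stack([self.blocks[int(i)](x[b])
                                  for b, i in enumerate(choice)])
            x = x + update  # residual step; the same blocks are reusable at every step
            if torch.sigmoid(self.halt(x)).mean() > 0.5:  # dynamic exit
                break
        return x

x = torch.randn(2, 256)
print(DynamicDepthNet()(x).shape)  # torch.Size([2, 256])
```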
Considering the 3 principles together, another hunch is that stacking alternating attention and feed-forward blocks is unnecessary: the same low-to-high progression of representations (the justification for the stacking that I read[4] is that each block is more abstract than the previous one) could emerge by allowing a single recurrent-in-depth network to call an attention layer whenever it thinks it needs to[5].
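A rough sketch of that hunch, again an assumption of mine rather than a published design: a single block reused in depth, whose attention layer is applied only to the degree a learned gate asks for it.

```python
import torch
import torch.nn as nn

class GatedAttentionRecurrence(nn.Module):
    """One block reused in depth; attention is applied only as much as a gate allows."""
    def __init__(self, d_model=256, n_heads=4, n_steps=6):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.gate = nn.Linear(d_model, 1)  # "do I need to re-read the context now?"
        self.n_steps = n_steps

    def forward(self, x):  # x: (batch, seq, d_model)
        for _ in range(self.n_steps):
            g = torch.sigmoid(self.gate(x.mean(dim=1, keepdim=True)))  # (batch, 1, 1)
            attn_out, _ = self.attn(x, x, x, need_weights=False)
            x = x + g * attn_out   # attend only to the extent the gate asks (cf. [5])
            x = x + self.ffn(x)    # the same weights serve every depth step
        return x

x = torch.randn(2, 10, 256)
print(GatedAttentionRecurrence()(x).shape)  # torch.Size([2, 10, 256])
```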
Another architecture design that can be analysed through this lens is the mixture of experts (MoE). Instead of using a single dense network, MoE combines smaller dense networks that are sparsely activated by a router network. Experts, however, still have broader activation, and are thus more computationally expensive, than a single sparse network of the same size (trained to be sparse via regularisation). MoE also typically uses only 2 experts per token, potentially lowering performance (the experts not being specialised on a macroscopic topic). It fails the 1st principle.
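For concreteness, here is a minimal top-2 routed MoE layer; real implementations differ in the details (load balancing and capacity factors are omitted), so treat this as a sketch of the structure described above rather than any particular system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopTwoMoE(nn.Module):
    """A router picks 2 experts per token; the other experts stay inactive."""
    def __init__(self, d_model=256, n_experts=8, d_hidden=512):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):  # x: (n_tokens, d_model)
        top_w, top_idx = self.router(x).topk(2, dim=-1)   # keep only 2 experts per token
        top_w = F.softmax(top_w, dim=-1)
        out = torch.zeros_like(x)
        for k in range(2):                                 # each of the 2 selected slots
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e                  # tokens routed to expert e
                if mask.any():
                    out[mask] += top_w[mask, k:k + 1] * expert(x[mask])
        return out

tokens = torch.randn(16, 256)
print(TopTwoMoE()(tokens).shape)  # torch.Size([16, 256])
```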
One insight from research[6] is that input dimension correlates negatively with 0-shot generalisation. Variational auto-encoders force such a dimension reduction, but it is unclear whether the 3rd or the 1st principle applies to using them, as the publication doesn't study them for this purpose. My hunch is that it would be a form of connectome design and as such would be better left to emerge within the model, if needed.
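As a reminder of what "forcing dimension reduction" looks like in practice, a minimal VAE encoder whose latent width is far smaller than its input; the sizes here are arbitrary illustrations, not taken from the cited study.

```python
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    """Everything downstream only ever sees the low-dimensional latent code."""
    def __init__(self, d_in=784, d_latent=16):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(d_in, 256), nn.GELU())
        self.mu = nn.Linear(256, d_latent)       # mean of q(z|x)
        self.logvar = nn.Linear(256, d_latent)   # log-variance of q(z|x)

    def forward(self, x):
        h = self.backbone(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterisation trick: sample z while keeping gradients.
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

x = torch.randn(4, 784)
print(VAEEncoder()(x).shape)  # torch.Size([4, 16])
```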
Episodic memory is crucial to the formation of semantic knowledge in brains and to their learning efficiency, yet it is mostly absent from current models outside of a few research publications. It is also unclear to me whether the 3rd or the 1st principle would apply to a dedicated network mimicking the role of the hippocampus, or which principle would take precedence over the other. It is however clear to me that such 'active' learning cannot emerge in a mere backpropagation-of-error-gradients setting, and would need a different learning algorithm, probably involving a dedicated learning/feedback network[7] and possibly making use of neuromodulation.
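One concrete, if much simpler, instance of a dedicated feedback pathway is feedback alignment, where a separate set of fixed random feedback weights carries the error back instead of the transposed forward weights. The sketch below is illustrative only and doesn't capture the hippocampal or neuromodulatory ideas above.

```python
import torch

d_in, d_hidden, d_out, lr = 32, 64, 10, 1e-2
W1 = torch.randn(d_in, d_hidden) * 0.1    # forward weights, layer 1
W2 = torch.randn(d_hidden, d_out) * 0.1   # forward weights, layer 2
B = torch.randn(d_out, d_hidden) * 0.1    # fixed random feedback weights (the "feedback network")

x = torch.randn(8, d_in)
target = torch.randn(8, d_out)

for _ in range(100):
    h = torch.tanh(x @ W1)                # forward pass
    y = h @ W2
    e = y - target                        # output error
    delta_h = (e @ B) * (1 - h ** 2)      # error routed back through B, not W2.T
    W2 -= lr * h.T @ e                    # local, Hebbian-like weight updates
    W1 -= lr * x.T @ delta_h
```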
Curriculum learning sidesteps this framework of principles by being a training technique rather than an architectural choice. It is however more performant than training on mingled data and as such should be used.
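A minimal sketch of what curriculum ordering might look like, with a hypothetical difficulty function (sequence length here) standing in for whatever criterion one would actually use.

```python
def curriculum_batches(examples, difficulty, batch_size=32):
    """Yield batches ordered from easiest to hardest instead of mingled."""
    ordered = sorted(examples, key=difficulty)
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]

# 'difficulty' is a hypothetical scoring function: sequence length, loss under a
# small proxy model, etc. Here, longer text is simply treated as harder.
data = [("short text", 0), ("a much longer and harder example sentence", 1)]
for batch in curriculum_batches(data, difficulty=lambda ex: len(ex[0]), batch_size=2):
    pass  # train_step(batch) would go here
```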
[0]: akin to the flexible hubs of the neuroscience theory of the same name
[1]: this one won’t move the needle IMO as most of the models are already trained on all the internet and books, but this is nonetheless a held opinion as some advocate training on synthetic data
[2]: for a notable introduction, see Richard Sutton's essay 'The Bitter Lesson', which argues that methods leveraging increased compute will eventually overtake those that don't
[3]: https://www.sciencedirect.com/science/article/pii/S089360801930396X
[5]: likewise, humans are free to re-read a passage whenever it helps their reflection, but don't do so systematically
[6]: https://www.biorxiv.org/content/10.1101/2024.09.30.615925v2.full.pdf (fig. 7c); the effect is far stronger than that of the number of trained tasks
[7]: a network that would modify the weights of the main one. A schematic example can be found in fig. 1a of https://www.nature.com/articles/s41583-020-0277-3, under 'Backprop-like learning with feedback network' (I can't reproduce it here due to copyright)