Opinions on ways to improve current AI models can be coarsely grouped into 2 schools of thought.
Proponents of the scaling paradigm think that declaring good incentives (the reward, in reinforcement-learning terms) and scaling the number of parameters or the amount of training data[1] is all that is needed to improve them[2]. A somewhat more radical view is that of Geoffrey Hinton, who thinks next-word prediction (i.e. pre-training alone) is all you need to advance.
Conversely, proponents of the engineering paradigm think that new architectures or scientific breakthroughs are required to advance the field. These include dedicated world models, championed by voices such as Yann LeCun.
I instead have 3 principles to decide whether to engineer a solution or let the network come up with one:
For the rest of this short essay, I will review (sometimes speculatively) ML techniques in light of these principles. Regarding the 1st principle, for example, the AI equivalent of a connectome can emerge on its own, so one shouldn't engineer specialised networks or the hubs that connect them[0].
Considering the 2nd, reliable composability (the ability to compose representations) requires recurrence in depth, as a model would otherwise need to be infinitely deep to account for every possible combination. So-called reasoning or thinking models add recurrence at the network level by allowing the last layers to serve as input to the first ones after decoding. This is IMO an inelegant way of adding recurrence in depth, as it only allows feedback after full passes and as such doesn't afford fine-grained modularity. Allowing individual layers to choose dynamically which layer they connect to, and to exit dynamically, would likely see the emergence of subnetwork modules that can be combined, as in brains, where feedback connections make up ~60% of all connections[3]. It would also respect the first 2 principles, by thinking in latent space instead of language (and thereby being potentially more expressive).
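To make this concrete, here is a minimal PyTorch sketch of the idea, entirely my own illustration rather than an established architecture: a pool of blocks, a router that picks which block to apply next, and a halting gate for dynamic exit. All module names and sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicDepthNet(nn.Module):
    """Recurrence in depth: blocks are reused, chosen and exited dynamically."""
    def __init__(self, d_model=256, n_blocks=4, max_steps=8):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                          nn.Linear(d_model, d_model))
            for _ in range(n_blocks)
        ])
        self.router = nn.Linear(d_model, n_blocks)  # which block to apply next
        self.halt = nn.Linear(d_model, 1)           # when to exit
        self.max_steps = max_steps

    def forward(self, x):  # x: (batch, d_model)
        for _ in range(self.max_steps):
            # Hard routing for clarity; training would use a soft mixture or
            # a straight-through estimator instead.
            choice = F.softmax(self.router(x), dim=-1).argmax(dim=-1)
            update = torch.stack([self.blocks[int(i)](x[b])
                                  for b, i in enumerate(choice)])
            x = x + update  # residual step; the same blocks are reusable at every step
            if torch.sigmoid(self.halt(x)).mean() > 0.5:  # dynamic exit
                break
        return x

x = torch.randn(2, 256)
print(DynamicDepthNet()(x).shape)  # torch.Size([2, 256])
```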
Considering the 3 principles together, another hunch is that stacking alternating attention and feed-forward blocks is unnecessary: the same low-to-high progression of representations (the justification for the stacking that I read[4] is that each block is more abstract than the previous one) could emerge by allowing a single recurrent-in-depth network to call an attention layer whenever it thinks it needs to[5].
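A rough sketch of that hunch, again an assumption of mine rather than a published design: a single block reused in depth, whose attention layer is applied only to the degree a learned gate asks for it.

```python
import torch
import torch.nn as nn

class GatedAttentionRecurrence(nn.Module):
    """One block reused in depth; attention is applied only as much as a gate allows."""
    def __init__(self, d_model=256, n_heads=4, n_steps=6):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.gate = nn.Linear(d_model, 1)  # "do I need to re-read the context now?"
        self.n_steps = n_steps

    def forward(self, x):  # x: (batch, seq, d_model)
        for _ in range(self.n_steps):
            g = torch.sigmoid(self.gate(x.mean(dim=1, keepdim=True)))  # (batch, 1, 1)
            attn_out, _ = self.attn(x, x, x, need_weights=False)
            x = x + g * attn_out   # attend only to the extent the gate asks (cf. [5])
            x = x + self.ffn(x)    # the same weights serve every depth step
        return x

x = torch.randn(2, 10, 256)
print(GatedAttentionRecurrence()(x).shape)  # torch.Size([2, 10, 256])
```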
Another architecture design that can be analysed through this lens is the mixture of experts (MoE). Instead of using a single dense network, MoE combines smaller dense networks that are sparsely activated by a router network. Experts, however, still have broader activation, and are thus more computationally expensive, than a single sparse network of the same size (trained to be sparse via regularisation). MoE also typically uses only 2 experts per token, potentially lowering performance (the experts not being specialised on a macroscopic topic). It fails the 1st principle.
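For concreteness, here is a minimal top-2 routed MoE layer; real implementations differ in the details (load balancing and capacity factors are omitted), so treat this as a sketch of the structure described above rather than any particular system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopTwoMoE(nn.Module):
    """A router picks 2 experts per token; the other experts stay inactive."""
    def __init__(self, d_model=256, n_experts=8, d_hidden=512):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):  # x: (n_tokens, d_model)
        top_w, top_idx = self.router(x).topk(2, dim=-1)   # keep only 2 experts per token
        top_w = F.softmax(top_w, dim=-1)
        out = torch.zeros_like(x)
        for k in range(2):                                 # each of the 2 selected slots
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e                  # tokens routed to expert e
                if mask.any():
                    out[mask] += top_w[mask, k:k + 1] * expert(x[mask])
        return out

tokens = torch.randn(16, 256)
print(TopTwoMoE()(tokens).shape)  # torch.Size([16, 256])
```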
One insight from research[6] is that input dimension correlates negatively with 0-shot generalisation. Variational auto-encoders force such a dimension reduction, but it is unclear whether the 3rd or the 1st principle applies to using them, as the publication doesn't study them for this purpose. My hunch is that it would be a form of connectome design and as such would be better left to emerge within the model, if needed.
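As a reminder of what "forcing dimension reduction" looks like in practice, a minimal VAE encoder whose latent width is far smaller than its input; the sizes here are arbitrary illustrations, not taken from the cited study.

```python
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    """Everything downstream only ever sees the low-dimensional latent code."""
    def __init__(self, d_in=784, d_latent=16):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(d_in, 256), nn.GELU())
        self.mu = nn.Linear(256, d_latent)       # mean of q(z|x)
        self.logvar = nn.Linear(256, d_latent)   # log-variance of q(z|x)

    def forward(self, x):
        h = self.backbone(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterisation trick: sample z while keeping gradients.
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

x = torch.randn(4, 784)
print(VAEEncoder()(x).shape)  # torch.Size([4, 16])
```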
Episodic memory is crucial to the formation of semantic knowledge in brains and to their learning efficiency, yet it is mostly absent from current models outside of a few research publications. It is also unclear to me whether the 3rd or the 1st principle would apply to a dedicated network mimicking the role of the hippocampus, or which principle would take precedence over the other. It is however clear to me that such 'active' learning cannot emerge in a mere backpropagation-of-error-gradients setting, and would need a different learning algorithm, probably involving a dedicated learning/feedback network[7] and possibly making use of neuromodulation.
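One concrete, if much simpler, instance of a dedicated feedback pathway is feedback alignment, where a separate set of fixed random feedback weights carries the error back instead of the transposed forward weights. The sketch below is illustrative only and doesn't capture the hippocampal or neuromodulatory ideas above.

```python
import torch

d_in, d_hidden, d_out, lr = 32, 64, 10, 1e-2
W1 = torch.randn(d_in, d_hidden) * 0.1    # forward weights, layer 1
W2 = torch.randn(d_hidden, d_out) * 0.1   # forward weights, layer 2
B = torch.randn(d_out, d_hidden) * 0.1    # fixed random feedback weights (the "feedback network")

x = torch.randn(8, d_in)
target = torch.randn(8, d_out)

for _ in range(100):
    h = torch.tanh(x @ W1)                # forward pass
    y = h @ W2
    e = y - target                        # output error
    delta_h = (e @ B) * (1 - h ** 2)      # error routed back through B, not W2.T
    W2 -= lr * h.T @ e                    # local, Hebbian-like weight updates
    W1 -= lr * x.T @ delta_h
```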
Curriculum learning sidesteps this framework of principles by being a training technique rather than an architectural choice. It is however more performant than training on mingled data and as such should be used.
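A minimal sketch of what curriculum ordering might look like, with a hypothetical difficulty function (sequence length here) standing in for whatever criterion one would actually use.

```python
def curriculum_batches(examples, difficulty, batch_size=32):
    """Yield batches ordered from easiest to hardest instead of mingled."""
    ordered = sorted(examples, key=difficulty)
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]

# 'difficulty' is a hypothetical scoring function: sequence length, loss under a
# small proxy model, etc. Here, longer text is simply treated as harder.
data = [("short text", 0), ("a much longer and harder example sentence", 1)]
for batch in curriculum_batches(data, difficulty=lambda ex: len(ex[0]), batch_size=2):
    pass  # train_step(batch) would go here
```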
[0]: akin to the flexible hubs of the neuroscience theory of the same name
[1]: this one won’t move the needle IMO as most of the models are already trained on all the internet and books, but this is nonetheless a held opinion as some advocate training on synthetic data
[2]: for a notable introduction, see Richard Sutton's essay 'The Bitter Lesson', which argues that methods leveraging increased compute will eventually overtake those that don't
[3]: https://www.sciencedirect.com/science/article/pii/S089360801930396X
[5]: likewise, humans are free to re-read a passage whenever it helps their reflection, but don't do so systematically
[6]: https://www.biorxiv.org/content/10.1101/2024.09.30.615925v2.full.pdf (fig. 7c); the effect is far stronger than that of the number of trained tasks
[7]: a network that would modify the weights of the main one. A schematic example can be found in fig. 1a of https://www.nature.com/articles/s41583-020-0277-3, under 'Backprop-like learning with feedback network' (I can't reproduce it here due to copyright)