Translation Models


Response to some bad arguments about the state of AI:

While it’s unlikely that simply scaling LLMs will get us to the fluid intelligence that we would expect for “true” AGI (I am partial to Francois Chollet’s line of reasoning there), I think a lot of the limitations you’ve outlined on common sense reasoning or world knowledge are already largely solved or invalidated. Given a broad enough training set (especially with multi-modal input, self-play, and synthetic data), very little is actually out of distribution, and the work on mechanistic interpretability and representation engineering (e.g. Golden Gate Claude) shows that LLMs are already doing a lot more than autocomplete (e.g. learning both abstract and concrete concepts).

Note that when you talk about Stable Diffusion’s text encoder, you are talking about a tiny CLIP model with 400M parameters (with far fewer layers, attention heads, and hidden dimensions to boot). As a point of reference on the scale difference, the current generation of frontier models is in the 1T-parameter class (thousands of times larger), and the next generation will be larger still. The current crop of models also has increasingly large context windows: the latest Gemini models have 1M+ token context windows. RAG and other grounding, function calling, and multi-agent systems will likely fill in the rest of the gaps (remember, everything you’re hearing from ChatGPT is currently stream of consciousness, literally the first thing that pops into its head, so to speak). These new models have impressive working memories and access to long-term memories, and between in-context learning and dynamic loading of adapters, I think they will be a lot more flexible than you assume.

In any case, I guess we’ll see in a couple of years where we land. While it’s good to be skeptical/even-keeled (especially in light of the ridiculous amounts of hype and, like you said, the amount of fraud and slop we’ll be inundated with), I think that it’s worth remembering Tesler’s Theorem: “AI is whatever hasn’t been done yet.”

@kipandcop1 With state space models, there can be more efficient models with effectively infinite context length (see Griffin, Mamba 2, and Samba for some of the more interesting recent work; the Samba paper was literally published today). I think these will absolutely help us move towards tokenless/bytestream models (see also the Megabyte architecture). Embodiment and MIMO are part of it as well (incidentally, SSMs crush on continuous signal data like audio, video, and time series).
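
To make the "effectively infinite context" point concrete, here is a minimal sketch of a diagonal linear state space recurrence in plain Python. It is only an illustration under simplifying assumptions: the function name `ssm_scan` and the fixed parameters `a`, `b`, `c` are hypothetical, and real models like Mamba make these parameters input-dependent and use parallel scans rather than a Python loop. The thing to notice is that the state `h` stays the same size no matter how long the input sequence is.

```python
def ssm_scan(xs, a, b, c):
    """Run a toy diagonal linear SSM over a scalar input sequence.

    Recurrence (elementwise over the state dimensions):
        h_t = a * h_{t-1} + b * x_t
        y_t = sum(c * h_t)

    a, b, c are lists of equal length (the state size). These are fixed,
    illustrative parameters, not taken from any particular model.
    """
    h = [0.0] * len(a)  # fixed-size state, independent of len(xs)
    ys = []
    for x in xs:
        # Update every state dimension from the previous state and the new input.
        h = [ai * hi + bi * x for ai, hi, bi in zip(a, h, b)]
        # Read out a scalar output from the current state.
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys, h


# Usage: a one-dimensional state with decay 0.5.
ys, h = ssm_scan([1.0, 1.0, 1.0], a=[0.5], b=[1.0], c=[1.0])
print(ys)      # [1.0, 1.5, 1.75]
print(len(h))  # 1, regardless of sequence length
```

Because each step only touches the fixed-size state, memory use does not grow with sequence length, which is the property that makes very long (or streaming) inputs practical. The trade-off is that everything the model remembers has to be compressed into that state.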

I think the line between in-context learning and training is going to blur, but all animals sleep, and I don’t think AI models will be so different if you look at training as a sleep phase (where memories are coalesced, etc.). Also, while I think there’s a lot to be said for evolutionary/biologically inspired approaches, I agree that looking at the computational requirements of projects like OpenWorm or DeepSouth shows just how far biological simulation is from being relevant to current AI timelines.

Speaking of adapters… Lamini Memory Tuning is a fundamentally different fine-tuning approach that effectively teaches any open-source LLM to be near-perfect on facts, while still maintaining its ability to be pretty good at everything else. When the model is supposed to recall a specific fact, Lamini Memory Tuning shifts the entire probability mass to that particular fact (i.e. specific tokens within a particular context), such as the exact SQL schema for your database. This results in output probabilities that are not just closer to the right result, but exactly there. To do this, Lamini Memory Tuning tunes a massive mixture of memory experts on any open-source LLM. Each memory expert acts like a LoRA adapter that functionally operates as memory for the model. Together, the memory experts specialize in a million different ways to ensure faithful and factual accuracy to the data that it was tuned on. Inspired by information retrieval, these million memory experts are equivalent to indices from which the model intelligently retrieves and routes. At inference time, the model retrieves the most relevant experts at each layer and merges back into the base model to respond to the user query.
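
As a rough mental model of the "retrieve the most relevant experts and merge them into the base model" step, here is a toy sketch in plain Python. This is not Lamini's implementation: the `MemoryExpert` class, the dot-product routing against a per-expert `key`, and the `layer_forward` function are all hypothetical names and assumptions made for illustration. Each expert is a LoRA-style low-rank update (B times A) that is added to the base layer's output only when it is selected.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(m, v):
    return [dot(row, v) for row in m]


class MemoryExpert:
    """Hypothetical memory expert: a low-rank update plus a retrieval key."""

    def __init__(self, key, A, B):
        self.key = key  # vector used to score relevance (assumed routing scheme)
        self.A = A      # shape: rank x d_in
        self.B = B      # shape: d_out x rank

    def apply(self, x):
        # LoRA-style update: delta = B @ (A @ x)
        return matvec(self.B, matvec(self.A, x))


def top_k_experts(query, experts, k):
    # Rank experts by dot-product similarity between the query and each key.
    ranked = sorted(experts, key=lambda e: dot(query, e.key), reverse=True)
    return ranked[:k]


def layer_forward(W, x, experts, k=1):
    # Base layer output, plus the updates from the k most relevant experts.
    out = matvec(W, x)
    for expert in top_k_experts(x, experts, k):
        delta = expert.apply(x)
        out = [o + d for o, d in zip(out, delta)]
    return out


# Usage: two experts, each keyed to a different input direction.
W = [[1.0, 0.0], [0.0, 1.0]]
e1 = MemoryExpert(key=[1.0, 0.0], A=[[1.0, 0.0]], B=[[0.0], [1.0]])
e2 = MemoryExpert(key=[0.0, 1.0], A=[[0.0, 1.0]], B=[[5.0], [0.0]])
print(layer_forward(W, [1.0, 0.0], [e1, e2]))  # [1.0, 1.0], e1 selected
print(layer_forward(W, [0.0, 1.0], [e1, e2]))  # [5.0, 1.0], e2 selected
```

With a million experts, the linear sort here would be replaced by some kind of index lookup, which is presumably where the information-retrieval analogy in the description comes from.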

Paper as well

Thoughts on big improvements:

Mechanistic Interpretability


A Comprehensive Mechanistic Interpretability Explainer & Glossary

Representation Engineering

App Frameworks





An Empirical Study of Mamba-based Language Models