In 2020, I was skeptical. By 2021, the writing was on the wall: deep learning at scale is the future. The biggest revelation came from seeing the results of models trained on massive Internet-scale data. Prior to GPT-3, it was hard to comprehend the sheer wealth of information about humanity’s collective thought processes that was embedded in our online interactions. But once we began to harness this knowledge, the results were undeniable. In 2022, discussing the potential of deep learning, I wrote:
The core claim – that a large enough neural network trained on the right dataset can and will capture almost any real-world pattern – seems to be both true and universal.
This take was somewhat controversial at the time, but subsequent developments — the dramatic arrival of ChatGPT, the rise of ultra-scale “foundation model” labs, the ubiquity of AI assistants for writing and coding — make it difficult for any but the most stubborn opponents of scaling to disagree with it now. However, in that same essay, I also identified some obstacles.
There are also certainly still several major unsolved fundamental subproblems, including neural uncertainty and learning with large contexts.
These limitations are as relevant today as they were then. AI progress in the last few years has been nothing short of astonishing, but adoption has nonetheless been hamstrung by complaints of unreliability, distraction, and hallucination. Ad-hoc “solutions” to these issues abound but uniformly fail to deliver on their promises.
Nothing makes for a better research topic than toppling a fundamental obstacle. In 2023, I joined forces with long-time collaborator Carles Gelada and founded Manifest AI, an open-source research lab focused on algorithms, architectures, and datasets for long-context learning. As described on our website:
Our mission is to train a neural network to model all human output.
There are two primary challenges in our pursuit of this goal.
Currently, it’s not technically feasible to train a model that can ingest all the data that we can collect. Limitations around context length, modality, and throughput force us to use only a small subset of the data available to us.
Much of the data we will need has never been collected, curated, and organized into datasets that we can use for training.
The past few years have been spent in pursuit of these goals. In 2024, I didn’t post on Substack, but did write several articles that you may find interesting:
Linear Transformers Are Faster. A small tweak to the formula for attention allows a dramatic computational restructuring, yielding huge speedups on both training and inference at long contexts (a minimal sketch of the idea follows this list).
Symmetric Power Transformers. A generalization of linear transformers, enabling direct control of the tradeoff between expressivity and speed.
Why Gradient Descent Minimizes Training Loss. A non-vacuous theoretical framework for understanding why larger networks learn more quickly.
Compute-Optimal Context Size. Outlining the elements required to design rigorous experiments around the value of long-context training.
LongCrawl64: A Long-Context Natural-Language Dataset. A cleaned and structured subset of Internet data containing only documents of length 64K, useful for assessing the long-context ability of architectures.
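For readers who want a picture of the core trick behind the first two articles, here is a minimal sketch in plain PyTorch. The function names are mine, and causal masking and normalization are omitted for brevity: dropping softmax lets attention be computed as Q(KᵀV) instead of (QKᵀ)V, and power attention generalizes the score from q·k to (q·k)^p.

```python
import torch

def quadratic_attention(Q, K, V):
    # Softmax-free attention computed the usual way: materializes a t x t
    # score matrix, so cost grows quadratically with sequence length t.
    scores = Q @ K.T          # (t, t)
    return scores @ V         # (t, d)

def linear_attention(Q, K, V):
    # The same computation reordered as Q @ (K^T V): the (d, d) "state" has a
    # fixed size, so cost grows only linearly with t.
    state = K.T @ V           # (d, d)
    return Q @ state          # (t, d)

def power_score(q, k, p=2):
    # Power attention replaces the score q.k with (q.k)^p; for even p the
    # score is nonnegative, and the same reordering applies in a lifted
    # feature space (with a larger, but still fixed-size, state).
    return (q @ k) ** p

t, d = 1024, 64
Q, K, V = (torch.randn(t, d, dtype=torch.float64) for _ in range(3))
assert torch.allclose(quadratic_attention(Q, K, V), linear_attention(Q, K, V))
```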
The most exciting outcome of our research thus far is Power Attention, an open-source implementation of a hardware-aware kernel that enables efficient training of strong long-context architectures. Like Flash Attention, Power Attention is a drop-in replacement for the existing attention layer of any Transformer. You should see immediate improvements in learning speed for contexts of length ≥64k.
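To make the “drop-in” claim concrete: the only integration point is the attention call inside each Transformer block. The sketch below shows where that swap happens; the power_attention import and its signature are placeholders of mine, not the library’s documented interface, so consult the Power Attention repo for the real API.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, kernel="sdpa"):
    # The single attention call inside a Transformer block.
    # q, k, v: (batch, heads, seq_len, head_dim). Swapping the kernel here
    # leaves the rest of the model untouched -- that is what "drop-in" means.
    if kernel == "sdpa":
        # Baseline: PyTorch's fused scaled-dot-product attention
        # (dispatches to Flash Attention when available).
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)
    if kernel == "power":
        # Placeholder: the import name and signature below are assumptions,
        # not the package's actual API -- check the repo before using.
        # from power_attention import power_attention
        # return power_attention(q, k, v, causal=True)
        raise NotImplementedError("wire in the power-attention package here")
    raise ValueError(f"unknown kernel: {kernel}")

q, k, v = (torch.randn(1, 8, 1024, 64) for _ in range(3))
out = attention(q, k, v)   # output shape matches the input: (1, 8, 1024, 64)
```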
For those curious to learn more, we’ve released a paper on arXiv that explains the details of Power Attention and provides the theoretical and experimental justification.
Our evidence is strong enough to leave me convinced that power attention is the future of long-context sequence modeling. But ultimately, the true signal will come from the rest of the deep learning community. Our experiments focused on autoregressive natural language modeling, but this is not the marquee domain for long context. If you are training long-context models — on any platform, task, or modality — please reach out! I would love to learn more about what you are working on and see the impact of power attention in your setting.