Transformer architecture powers the most popular public and private AI models today. That raises the question of what comes next: Is this the architecture that will lead to better reasoning, and what might follow transformers? Today, baking intelligence into models requires large volumes of data, GPU compute power and rare talent, which makes them generally costly to build and maintain.
AI deployment started small by making simple chatbots more intelligent. Now, startups and enterprises have figured out how to package intelligence in the form of copilots that augment human knowledge and skill. The next natural step is to package things like multi-step workflows, memory and personalization in the form of agents that can solve use cases in multiple functions including sales and engineering. The expectation is that a simple prompt from a user will enable an agent to classify intent, break down the goal into multiple steps and complete the task, whether it includes internet searches, authentication into multiple tools or learning from past repeat behaviors.
These agents, when applied to consumer use cases, start giving us a sense of a future where everyone can have a personal Jarvis-like agent on their phone that understands them. Want to book a trip to Hawaii, order food from your favorite restaurant, or manage personal finances? A future in which you and I can securely manage these tasks using personalized agents is possible, but, from a technological perspective, we are still far from it.
Is transformer architecture the final frontier?
The transformer architecture’s self-attention mechanism allows a model to weigh the importance of each input token against all tokens in the input sequence simultaneously. This improves a model’s understanding of language and images by capturing long-range dependencies and complex token relationships. However, it also means that computational cost grows quickly with sequence length (for example, with DNA sequences), leading to slow performance and high memory consumption. A few solutions and research approaches to the long-sequence problem include:
- Improving transformers on hardware: A promising technique here is FlashAttention. The FlashAttention paper claims that transformer performance can be improved by carefully managing reads and writes across the different levels of fast and slow memory on the GPU. It does this by making the attention algorithm IO-aware, which reduces the number of reads and writes between the GPU’s high-bandwidth memory (HBM) and its on-chip static random-access memory (SRAM).
- Approximate attention: Self-attention has O(n^2) complexity, where n is the length of the input sequence. Is there a way to reduce this quadratic computational complexity to linear so that transformers can better handle long sequences? Optimizations here include techniques such as Reformer, Performer, Skyformer and others.
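To make the quadratic cost concrete, here is a minimal sketch of naive scaled dot-product attention in plain Python. It is an illustration under simplifying assumptions, not how any production model implements attention: the learned query, key and value projections are omitted, and the function names and toy input are hypothetical. The point to notice is that the score matrix has one entry per pair of tokens, so doubling the sequence length quadruples the work and memory.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Naive scaled dot-product attention; builds the full n x n score matrix."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    # One score per (query, key) pair: this is the O(n^2) part.
    scores = [
        [scale * sum(q_i * k_i for q_i, k_i in zip(q, k)) for k in keys]
        for q in queries
    ]
    weights = [softmax(row) for row in scores]
    # Each output is a weighted average of all value vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        for row in weights
    ]
    return outputs, weights

# Hypothetical toy input: 3 tokens with 2-dimensional embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
print(len(weights), "x", len(weights[0]), "attention weights")
```

IO-aware and approximate methods attack this same score matrix from different directions: the former avoids repeatedly moving it between memory levels, while the latter tries to avoid computing all of it.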
Beyond these optimizations that reduce the complexity of transformers, some alternative models are challenging the dominance of transformers (although it is early days for most):
- State space models: These are a class of models related to recurrent (RNN) and convolutional (CNN) neural networks that compute with linear or near-linear complexity in sequence length. State space models (SSMs) like Mamba can better handle long-distance relationships but lag behind transformers in performance.
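The linear scaling of state space models comes from carrying a fixed-size state forward one token at a time, rather than comparing every token with every other token. The sketch below shows a simple diagonal linear SSM in plain Python. It is an assumption-laden toy: the parameters `a`, `b` and `c` are hypothetical fixed values, whereas a model like Mamba learns its parameters and makes them depend on the input.

```python
def ssm_scan(inputs, a, b, c):
    """Run a diagonal linear state space model over a 1-D input sequence.

    Recurrence per state dimension i:
        h_t[i] = a[i] * h_{t-1}[i] + b[i] * x_t
        y_t    = sum_i c[i] * h_t[i]
    """
    state = [0.0] * len(a)
    outputs = []
    # One pass over the sequence: cost grows linearly with its length.
    for x in inputs:
        state = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, state, b)]
        outputs.append(sum(c_i * h_i for c_i, h_i in zip(c, state)))
    return outputs

# Hypothetical example: a single-dimensional state that decays by half each step.
impulse = [1.0, 0.0, 0.0]
print(ssm_scan(impulse, a=[0.5], b=[1.0], c=[1.0]))
```

Because the state has a fixed size regardless of how many tokens came before, memory use during generation stays constant, which is the "single summary state" that attention lacks.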
These research approaches have now left university labs and are publicly available for everyone to try in the form of new models. Additionally, the latest model releases can tell us about the state of the underlying technology and the viable path for transformer alternatives.
Notable model launches
We continue to hear about the latest and greatest model launches from the usual suspects like OpenAI, Cohere, Anthropic and Mistral. Meta’s foundation model for compiler optimization is notable for its effectiveness at code and compiler optimization.
In addition to the dominant transformer architecture, we’re now seeing production-grade state space models (SSMs), hybrid SSM-transformer models, mixture of experts (MoE) models and composition of experts (CoE) models. These seem to perform well on multiple benchmarks when compared with state-of-the-art open-source models. The ones that stand out include:
- Databricks’ open-source DBRX model: This MoE model has 132B parameters. It has 16 experts, of which 4 are active at any one time during inference or training. It supports a 32K context window and was trained on 12T tokens. Some other interesting details: It took three months, $10M and 3,072 Nvidia GPUs connected over 3.2Tbps InfiniBand to complete pre-training, post-training, evaluation, red-teaming and refining of the model.
- SambaNova Systems’ release of Samba CoE v0.2: This CoE model is a composition of five 7B-parameter experts, of which only one is active at inference time. The experts are all open-source models, and alongside them the model has a router that determines which expert is best for a particular query and routes the request to it. It is blazing fast, generating 330 tokens/second.
- AI21 Labs’ release of Jamba: This is a hybrid transformer-Mamba MoE model. It is the first production-grade Mamba-based model with elements of traditional transformer architecture. “Transformer models have 2 drawbacks: First, its high memory and compute requirements hinders the processing of long contexts, where the key-value (KV) cache size becomes a limiting factor. Second, its lack of a single summary state entails slow inference and low throughput, since each generated token performs a computation on the entire context.” SSMs like Mamba can better handle long-distance relationships but lag behind transformers in performance. Jamba compensates for the inherent limitations of a pure SSM model, offering a 256K context window and fitting 140K of context on a single GPU.
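The "16 experts, 4 active" idea in MoE models, and the single-expert routing in CoE models, can both be sketched as top-k routing. The plain Python below is only an illustration: the router scores, the toy experts and the choice to renormalize weights with a softmax over the selected experts are all assumptions, and none of it is taken from the DBRX, Samba CoE or Jamba implementations.

```python
import math

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(
        range(len(router_logits)), key=lambda i: router_logits[i], reverse=True
    )
    top = ranked[:k]
    # Softmax over the selected experts only, so the weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_logits, k):
    """Combine the outputs of only the k selected experts; the rest never run."""
    return sum(weight * experts[i](x) for i, weight in top_k_route(router_logits, k))

# Hypothetical setup: 16 toy "experts" that each scale their input differently.
experts = [lambda x, s=i: s * x for i in range(16)]
logits = [0.0] * 16
logits[3], logits[7], logits[11], logits[15] = 5.0, 4.0, 3.0, 2.0

print(top_k_route(logits, 4))              # four active experts
print(moe_forward(2.0, experts, logits, 1))  # k=1 resembles routing to a single expert
```

The appeal of this design is that total parameter count can grow with the number of experts while the compute per token depends only on the few experts that are actually selected.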
Enterprise adoption challenges
Although there is immense promise in the latest research and model launches to support transformer architecture as the next frontier, we must also consider the technical challenges that keep enterprises from taking advantage of it:
- Frustration over missing enterprise features: Imagine selling to CXOs without basics like role-based access control (RBAC), single sign-on (SSO) or access to logs (both prompt and output). Models today may not be enterprise-ready, but enterprises are creating separate budgets to make sure they don’t miss out on the next big thing.
- Breaking what used to work: AI copilots and agents make it more complex to secure data and applications. Imagine a simple use case: A video conferencing app that you use daily introduces AI summary features. As a user, you may love the ability to get transcripts after a meeting, but in regulated industries, this enhanced feature can suddenly become a nightmare for CISOs. Effectively, what worked just fine until now is broken and needs to go through additional security review. Enterprises need guardrails in place to ensure data privacy and compliance when SaaS apps introduce such features.
- The constant RAG vs. fine-tuning battle: It is possible to deploy both together, or neither, without sacrificing much. One can think of retrieval-augmented generation (RAG) as a way to make sure facts are presented correctly and the information is up to date, whereas fine-tuning can be thought of as yielding the best model quality. Fine-tuning is hard, which has led some model vendors to recommend against it. It also carries the risk of overfitting, which adversely affects model quality. Fine-tuning seems to be getting squeezed from multiple sides: As model context windows grow and token costs decline, RAG may become a better deployment option for enterprises. In the context of RAG, the recently launched Command R+ model from Cohere is the first open-weights model to beat GPT-4 in the Chatbot Arena. Command R+ is a state-of-the-art RAG-optimized model designed to power enterprise-grade workflows.
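For readers unfamiliar with the mechanics of RAG, the core loop is simple: find the documents most relevant to a question, then hand them to the model as context. The sketch below is deliberately naive and rests on stated assumptions: it ranks documents by shared words instead of using embeddings and a vector store, the helper names and sample documents are hypothetical, and it stops at building the prompt rather than calling any particular model.

```python
def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real embedding model."""
    return set(text.lower().split())

def retrieve(query, documents, top_n=1):
    """Rank documents by how many words they share with the query."""
    query_terms = tokenize(query)
    ranked = sorted(
        documents, key=lambda doc: len(query_terms & tokenize(doc)), reverse=True
    )
    return ranked[:top_n]

def build_prompt(query, documents, top_n=1):
    """Place the retrieved text ahead of the question so the model can ground its answer."""
    context = "\n".join(retrieve(query, documents, top_n))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Hypothetical knowledge base.
docs = [
    "The refund window is 30 days.",
    "Support hours are 9 to 5.",
]
print(build_prompt("What is the refund window?", docs))
```

Because the knowledge lives in the documents rather than in the model weights, updating it means editing the document store, not retraining, which is part of why RAG is attractive when fine-tuning is hard.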
I recently spoke with an AI leader at a large financial institution who claimed that the future doesn’t belong to software engineers but to creative English/art majors who can draft an effective prompt. There may be some element of truth to this comment. With a simple sketch and multi-modal models, non-technical people can build simple applications without much effort. Knowing how to use such tools can be a superpower, and it will help anyone who is looking to excel in their careers.
The same is true for researchers, practitioners and founders. They now have multiple architectures to choose from as they try to make their underlying models cheaper, faster and more accurate. Today, there are numerous ways to adapt models to specific use cases, including fine-tuning techniques and newer breakthroughs like direct preference optimization (DPO), an algorithm that can be thought of as an alternative to reinforcement learning from human feedback (RLHF).
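To give a feel for why DPO is simpler than RLHF, its training signal for one preference pair fits in a few lines: the model is rewarded for raising the likelihood of the preferred response, relative to a frozen reference model, more than it raises the likelihood of the rejected one. The sketch below computes that per-pair loss in plain Python. The function and argument names are hypothetical, the log-probabilities are made-up numbers, and a real training setup would compute them from model outputs and average over a batch.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_margin - rejected_margin)))."""
    # How much more (in log-probability) the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) written as softplus(-z) in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Hypothetical log-probabilities of two responses to the same prompt.
no_preference = dpo_loss(-1.0, -1.0, -1.0, -1.0)
prefers_chosen = dpo_loss(-0.5, -2.0, -1.0, -1.0)
print(no_preference, prefers_chosen)
```

There is no separate reward model and no reinforcement learning loop here, only a classification-style loss over preference pairs, which is the sense in which DPO can be thought of as an alternative to RLHF.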
With so many rapid changes in the field of generative AI, it can feel overwhelming for both founders and buyers to prioritize, and I’m eager to see what comes next from anyone building something new.
Ashish Kakran is a principal at Thomvest Ventures focused on investing in early-stage cloud, data/ml and cybersecurity startups.