Inferact: Building the Infrastructure That Runs Modern AI
a16z Podcast
Summary
This episode discusses the critical but often overlooked challenges of AI inference, the process of running trained AI models.
It highlights the shift from traditional, predictable computing to the dynamic and complex demands of large language models, emphasizing the role of open-source infrastructure like vLLM and the newly formed company Inferact in addressing these issues.
Key Points
- The core challenge in modern AI is not just training models, but efficiently running them, as large language models introduce unpredictable workloads and demands on hardware that was not designed for them.
- vLLM originated as a PhD project to optimize inference for models like Meta's OPT, and in the process revealed significant, complex open problems in making large language models run efficiently.
- Unlike traditional machine learning workloads, which are static and typically batch standardized inputs, large language model inference is dynamic, with variable prompt and output lengths and continuous request streams; this necessitates a step-based processing approach (see the sketch after this list).
- The complexity of AI inference has increased due to factors like model scale (approaching trillions of parameters), diversity in model architectures and hardware, and the emerging paradigm of AI agents requiring more sophisticated state management.
- vLLM has grown into a major open-source project with a large and diverse contributor base, including model providers, hardware companies, and application developers, indicating the importance of a standardized inference layer.
- Inferact, founded by the creators of vLLM, aims to build a universal inference layer and support the open-source ecosystem, believing that open source is critical for AI infrastructure and fosters innovation through community collaboration.
- The company sees itself as building a horizontal layer of abstraction for inference, similar to how operating systems and databases abstract hardware, which is crucial for optimizing AI deployment across diverse models and hardware.
- Companies like Amazon (for its Rufus shopping assistant) and Capture AI are leveraging vLLM for large-scale, real-time deployments, often adopting cutting-edge features rapidly.
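To make the dynamic-workload point above concrete, the following is a minimal, hypothetical sketch of step-based (continuous) batching: the scheduler admits waiting requests and retires finished ones at every decoding step instead of waiting for a fixed batch to drain. All names (`Request`, `decode_step`, `serve`) are illustrative and are not taken from vLLM's codebase.

```python
import random
from collections import deque

class Request:
    """A toy generation request with a variable prompt and output length."""
    def __init__(self, rid, prompt_tokens, max_new_tokens):
        self.rid = rid
        self.tokens = list(prompt_tokens)   # grows by one token per decoding step
        self.remaining = max_new_tokens     # tokens still to generate
        self.done = False

def decode_step(batch):
    """Stand-in for one forward pass: append one fake token to every active request."""
    for req in batch:
        req.tokens.append(random.randint(0, 49_999))
        req.remaining -= 1
        if req.remaining == 0:
            req.done = True

def serve(incoming, max_batch_size=4):
    """Step-based serving loop: admit and retire requests at every step."""
    waiting = deque(incoming)
    running, finished = [], []
    while waiting or running:
        # Admit new requests whenever there is room, without waiting for the batch to drain.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        decode_step(running)
        # Retire finished requests immediately so their slots free up next step.
        finished.extend(r for r in running if r.done)
        running = [r for r in running if not r.done]
    return finished

if __name__ == "__main__":
    reqs = [Request(i, prompt_tokens=range(random.randint(3, 20)),
                    max_new_tokens=random.randint(2, 8)) for i in range(10)]
    for r in serve(reqs):
        print(f"request {r.rid}: generated {len(r.tokens)} total tokens")
```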
Conclusion
The increasing complexity of AI inference, driven by model scale, diversity, and agentic behavior, necessitates robust, open-source infrastructure.
vLLM and Inferact are positioned to lead the development of a universal inference layer, crucial for the future of AI applications across diverse hardware and models.
Open-source collaboration is essential for advancing AI infrastructure, allowing for rapid innovation and broad adoption that proprietary solutions struggle to match.
Discussion Topics
- How can open-source inference engines like vLLM become the universal standard for running AI models across various hardware and applications?
- What are the biggest technical hurdles to overcome in the next 2-3 years for AI inference to keep pace with model advancements?
- As AI agents become more sophisticated, what new infrastructure challenges will arise, and how will they impact the inference layer?
Key Terms
- Inference
- The process of running a trained machine learning model to make predictions or generate outputs based on new input data.
- vLLM
- An open-source inference engine designed to run large language models efficiently (a minimal usage sketch appears after this list).
- Inference Engine
- Software that takes a trained AI model and runs it on computing hardware to produce results.
- Inference Server
- A server that hosts and manages inference engines to handle requests for AI model predictions.
- Autoregressive Language Model
- A type of language model that generates output one token at a time, with each new token depending on the previously generated ones.
- GPU (Graphics Processing Unit)
- A processor originally designed to accelerate graphics rendering, now widely used as a massively parallel accelerator for general-purpose computation, especially AI workloads.
- CNN (Convolutional Neural Network)
- A class of deep neural networks, most commonly applied to analyzing visual imagery.
- Transformer
- A deep learning model architecture that relies on self-attention mechanisms, widely used in natural language processing and other sequence-to-sequence tasks.
- KV Cache
- In transformer models, a cache that stores the attention key and value tensors computed for previously processed tokens, significantly speeding up the generation of subsequent tokens by avoiding recomputation (a toy illustration appears after this list).
- Sparse Attention
- An attention mechanism in transformer models that reduces computational complexity by only considering a subset of input tokens, rather than all of them.
- Linear Attention
- A variation of the attention mechanism that aims to reduce the quadratic complexity of standard attention to linear complexity.
- Tokenizer
- A component that converts text into a sequence of tokens (numerical representations) that an AI model can process, and vice versa.
- Pull Request (PR)
- A mechanism in version control systems like Git where a developer proposes changes to a codebase that can be reviewed and merged by project maintainers.
- ML Infra (Machine Learning Infrastructure)
- The systems, tools, and processes required to develop, deploy, and manage machine learning models at scale.
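As a companion to the vLLM entry above, here is a minimal offline-inference sketch in the spirit of vLLM's documented quickstart; the model name is only an example, and exact API details can differ between vLLM releases.

```python
# Minimal vLLM offline-inference sketch (names follow vLLM's quickstart; versions may differ).
from vllm import LLM, SamplingParams

prompts = [
    "The key challenge in LLM inference is",
    "Open-source infrastructure matters because",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Load a small open-weight model; swap in any model vLLM supports.
llm = LLM(model="facebook/opt-125m")

# generate() batches the prompts internally and returns one result per prompt.
for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```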
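To illustrate the Autoregressive Language Model and KV Cache entries, the toy single-head sketch below computes each token's key and value once, appends them to a cache, and reuses the cache at every later step, which is exactly the recomputation a KV cache avoids. The attention math is deliberately simplified and is not how any production engine is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                      # toy head dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Single-head attention of one query against all cached keys/values."""
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def generate(embeddings, steps=5):
    """Autoregressive loop: each new token attends to all previous ones via the KV cache."""
    K_cache, V_cache = [], []
    x = embeddings[0]
    outputs = []
    for _ in range(steps):
        # Compute this token's key/value once and append to the cache,
        # instead of recomputing K and V for the whole prefix at every step.
        K_cache.append(Wk @ x)
        V_cache.append(Wv @ x)
        q = Wq @ x
        ctx = attend(q, np.stack(K_cache), np.stack(V_cache))
        outputs.append(ctx)
        x = ctx                 # toy stand-in for feeding the new token back in
    return np.stack(outputs)

print(generate(rng.standard_normal((1, d))).shape)   # (5, 16)
```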
Timeline
The technical differences between traditional machine learning workloads and large language model inference, focusing on GPU usage and static vs. dynamic processing.
The origin story of vLLM, stemming from a need to optimize inference for open-weight LLMs and the realization of significant underlying technical challenges.
The growth and management of the vLLM open-source community, highlighting its diverse contributors and rapid expansion.
The significance of the first vLLM meetup and a16z's initial grant funding, fostering a culture of open-source support.
A definition of inference servers and engines, and a breakdown of the core components and request lifecycle.
The increasing difficulty of inference over time, driven by model scale, hardware and model diversity, and the rise of AI agents.
The strong belief in open-source AI over closed-source solutions, emphasizing diversity and historical precedents in computing.
Examples of significant vLLM deployments, including Amazon's Rufus shopping assistant and Capture AI's rapid adoption of new features.
The founding and mission of Inferact, focusing on making vLLM a universal inference engine and supporting the open-source ecosystem.
The role of Professor Ion Stoica as a mentor and co-founder of Inferact, and the lessons learned from him regarding open-source adoption and research.
The major problems Inferact aims to solve, particularly inference at scale, and the types of engineers it is hiring to address these challenges.
Episode Details
- Podcast
- a16z Podcast
- Episode
- Inferact: Building the Infrastructure That Runs Modern AI
- Official Link
- https://a16z.com/podcasts/a16z-podcast/
- Published
- January 22, 2026