
a16z Podcast

Full Title

What Comes After ChatGPT? The Mother of ImageNet Predicts The Future

Summary

Fei-Fei Li and Justin Johnson discuss the launch of Marble, a generative model for 3D worlds, highlighting the shift from language models to spatial intelligence as the next frontier in AI.

They explore what distinguishes spatial intelligence from linguistic intelligence, the potential for Marble to impact fields like gaming, VFX, robotics, and design, and the ongoing challenges in AI development and research ecosystems.

Key Points

  • Spatial intelligence is fundamentally different from language intelligence: it requires models that can reason about, understand, move through, and interact with three-dimensional space, a capability central to how humans engage with the world and one that current language models do not capture.
  • Marble, a new generative model from World Labs, represents a step towards spatial intelligence by creating explorable 3D worlds from text or image inputs, with applications in gaming, VFX, and film, while also being designed as a useful product today.
  • The history of deep learning is closely tied to the scaling of compute, with significant advancements enabling the development of complex models like Marble, which require vastly more computational power than earlier breakthroughs like AlexNet.
  • The balance between open science and proprietary development in AI research is evolving, with academia contributing through open datasets and benchmarks, while industry focuses on productization, creating a diverse but sometimes imbalanced ecosystem.
  • The role of academia in AI research has shifted from training state-of-the-art models to exploring novel ideas and theoretical underpinnings, given the immense computational resources now required by industry labs.
  • Transformers, while powerful, are fundamentally set models, not sequence models, with their perceived sequential nature arising from positional embeddings, suggesting potential for broader applications beyond linear data.
  • The development of spatial intelligence models like Marble involves understanding and generating complex 3D scenes, with ongoing research into how to incorporate physics, dynamics, and causal reasoning, moving beyond simple pattern fitting.
  • The future of AI may involve multimodal models that integrate various forms of intelligence, including spatial and linguistic, potentially leading to universal models that can interact with the world more comprehensively.
  • The definition and requirements of spatial intelligence, contrasted with linguistic intelligence, highlight the importance of embodied experience and direct interaction with the physical world for true understanding.
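The set-model point about Transformers can be made concrete in a few lines: without positional embeddings, self-attention is permutation-equivariant, so token order carries no signal; adding positional embeddings is what injects order. A minimal NumPy sketch (single head, identity projections, all values made up for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # Single-head self-attention with identity Q/K/V projections:
    # every token attends to every other; no positional information.
    scores = X @ X.T / np.sqrt(X.shape[1])
    return softmax(scores, axis=-1) @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))      # 5 tokens, 8-dim embeddings
perm = rng.permutation(5)

out = self_attention(X)
out_perm = self_attention(X[perm])

# Shuffling the input tokens just shuffles the output rows the same way:
# attention treats its input as a set, so order carries no signal.
assert np.allclose(out[perm], out_perm)

# Adding positional embeddings (positions stay fixed while tokens move)
# breaks this symmetry, which is what makes order visible to the model.
P = rng.normal(size=(5, 8))
out_pos = self_attention(X + P)
out_pos_perm = self_attention(X[perm] + P)
assert not np.allclose(out_pos[perm], out_pos_perm)
```

The first assertion is the "set model" claim; the second shows the perceived sequential nature really does arise from the positional embeddings alone.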

Conclusion

Spatial intelligence represents the next significant frontier in AI, moving beyond language models to enable AI systems to understand and interact with the physical world.

Tools like Marble are crucial for developing and demonstrating this spatial intelligence, offering practical applications while paving the way for future advancements.

Continued research and development in multimodal AI, incorporating spatial understanding alongside language and other modalities, is essential for creating more capable and human-like artificial intelligence.

Discussion Topics

  • How can we better foster collaboration between academic research and industry development in AI to ensure both open innovation and practical productization?
  • What are the most significant ethical considerations we must address as AI models become increasingly capable of generating and interacting with realistic 3D environments?
  • Beyond current applications, what are the most groundbreaking or unexpected future uses you envision for advanced spatial intelligence models?

Key Terms

Spatial intelligence
The ability to reason about, understand, and interact with three-dimensional space, including movement, perception, and object manipulation.
Generative model
An AI model that can create new data, such as images, text, or in this case, 3D worlds, based on learned patterns from training data.
Transformers
A type of neural network architecture that has proven highly effective in natural language processing and other areas, characterized by its attention mechanism and ability to process data in parallel.
AlexNet
An early convolutional neural network that significantly contributed to the deep learning revolution by achieving breakthrough performance on the ImageNet image recognition challenge.
GPUs
Graphics Processing Units, specialized electronic circuits designed to rapidly manipulate and alter memory to accelerate the creation of images for display. They are crucial for the intensive computations required by deep learning models.
ImageNet
A large dataset of labeled images used for training and evaluating computer vision algorithms, which played a pivotal role in advancing deep learning.
LLMs
Large Language Models, AI models trained on vast amounts of text data, capable of understanding, generating, and processing human language.
Gaussian splats
A rendering technique that represents scenes as a collection of semi-transparent, oriented particles, allowing for efficient real-time rendering on various devices.
Embodied agents
AI systems that interact with the physical or simulated world through a physical form, requiring a deep understanding of their environment and actions.
World models
AI models designed to understand and represent the dynamics and structure of the world, enabling them to reason about cause and effect and predict future states.

Timeline

00:00:00

Fei-Fei Li introduces the concept of spatial intelligence as the next frontier beyond language models.

00:00:31

Marble is introduced as a generative model for 3D worlds, accepting text or image inputs.

00:01:06

Fei-Fei Li and Justin Johnson's backgrounds and the founding of World Labs are discussed.

00:04:43

The scaling of compute power is identified as a key factor enabling advancements in AI, including world models.

00:05:36

The evolving role of open science and industry-driven productization in the AI ecosystem is debated.

00:07:49

Concerns about academic research being influenced by commercial pressures and resourcing imbalances are raised.

00:09:39

The historical challenges in computer vision and the shift towards generative modeling are recalled.

00:14:37

The early work on image captioning by Fei-Fei Li and her team is described, highlighting its development in parallel with similar work at Google.

00:18:55

The evolution to dense captioning and complex neural network architectures for visual understanding is explained.

00:21:46

The fundamental difference between spatial intelligence and language intelligence is asserted, with spatial intelligence being a more complex, multimodal understanding of the world.

00:22:13

The concept of "pixel maximalism" is discussed, suggesting pixels as a more general and lossless representation of visual data.

00:23:33

The risk of world models fitting observed patterns without true causal understanding is illustrated with an example of predicting planetary orbits.

00:24:58

The distinction between pattern fitting in current deep learning and true causal understanding of physics is emphasized.

00:25:33

Marble's ability to generate realistic scenes versus its understanding of the underlying physics is questioned.

00:26:13

The practical importance of a model understanding physics versus merely rendering plausible outputs is debated based on use cases.

00:27:05

The nature of AI intelligence is contrasted with human intelligence, highlighting the lack of self-awareness and potentially different internal cognition.

00:28:36

The scalability of models and data is seen as key to achieving more robust spatial intelligence and enabling applications like CAD design generation.

00:29:09

The role of traditional physics engines in data generation for training AI models is discussed, acknowledging their limitations and the need for new approaches.

00:30:37

The naming of Marble and its significance as an initial glimpse into World Labs' vision for spatial intelligence are explained.

00:31:41

Marble's capabilities as a generative and interactive 3D world model, suitable for creative industries, are detailed.

00:33:50

The ability of Marble to generate scenes with precise camera control is highlighted as a key feature demonstrating its sense of 3D space.

00:35:07

The native output of Marble, Gaussian splats, and their efficiency for real-time rendering on various devices are explained.
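The efficiency of Gaussian splats comes down to a very simple per-pixel operation: splats are sorted by depth and alpha-composited front to back, with no ray marching or mesh rasterization. A toy sketch of that core step, using screen-space 2D Gaussians and made-up values (an illustration of the general technique, not Marble's actual renderer):

```python
import numpy as np

def gaussian_alpha(px, mean, inv_cov, opacity):
    # A splat's contribution at a pixel falls off as a 2D Gaussian
    # around its projected center, scaled by the splat's opacity.
    d = px - mean
    return opacity * np.exp(-0.5 * d @ inv_cov @ d)

def render_pixel(px, splats):
    # Sort splats nearest-first, then alpha-composite front to back,
    # stopping early once the pixel is effectively opaque.
    color = np.zeros(3)
    transmittance = 1.0
    for s in sorted(splats, key=lambda s: s["depth"]):
        a = gaussian_alpha(px, s["mean"], s["inv_cov"], s["opacity"])
        color += transmittance * a * s["color"]
        transmittance *= 1.0 - a
        if transmittance < 1e-4:
            break
    return color

# Two hypothetical splats: a near red one and a farther blue one.
splats = [
    {"mean": np.array([4.0, 4.0]), "inv_cov": np.eye(2) / 2.0,
     "depth": 1.0, "opacity": 0.8, "color": np.array([1.0, 0.0, 0.0])},
    {"mean": np.array([5.0, 4.0]), "inv_cov": np.eye(2) / 4.0,
     "depth": 2.0, "opacity": 0.9, "color": np.array([0.0, 0.0, 1.0])},
]

# At a pixel on the red splat's center, the nearer red splat dominates,
# with a smaller blue contribution leaking through the remaining transmittance.
c = render_pixel(np.array([4.0, 4.0]), splats)
```

Because each pixel is just a depth sort plus a short weighted sum, this maps well onto GPUs and runs in real time on commodity devices, which is the property the episode highlights.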

00:36:15

The integration of physics and dynamics into Marble, whether by predicting physical properties or by regenerating scenes, is discussed as a future avenue.

00:38:18

The potential for Marble's technology to be used in embodied AI training for robotics, given the need for synthetic data, is explored.

00:41:40

The horizontal nature of Marble's technology and its applicability across various industries, including design and interior remodeling, are emphasized.

00:42:50

The concept of spatial intelligence is defined as the capability to reason, understand, move, and interact in space, contrasted with linguistic intelligence.

00:43:33

Human intelligence is presented as multimodal, encompassing linguistic, spatial, logical, and emotional intelligence.

00:44:45

The historical deduction of DNA structure is used as an example of complex spatial reasoning that is difficult to reduce to pure language.

00:45:17

The effortless nature of human spatial perception and interaction with the world is contrasted with the effort required for language acquisition.

00:47:00

The interplay between spatial and linguistic intelligence is discussed, with language aiding in formalizing spatial understanding.

00:47:38

Spatial intelligence is characterized as the embodied experience of being in 3D space, a modality with higher bandwidth than language narration.

00:49:34

The evolutionary timelines of perception, spatial intelligence, and language development are compared, highlighting the ancient origins of spatial abilities.

00:50:02

The inability of current LLMs to grasp basic physical impossibilities, like an object falling through another, is attributed to their lack of an internal 3D representation.

00:50:49

The need for multimodal models that integrate language and spatial understanding is advocated, as language remains a crucial interface.

00:51:35

The potential for AI to discover physics independently of human knowledge and the constraints imposed by human cognition and technological evolution are pondered.

00:53:14

The emergence of Newtonian laws from observational data in LLMs is considered unlikely, as such laws represent a different abstraction level than token prediction.

00:54:05

The challenge for AI to learn heliocentric models from visual data alone, without explicit guidance, is discussed.

00:55:36

A different learning paradigm, focused on hypothesis testing and world elimination, is proposed as essential for true understanding, akin to human theory of mind.

00:56:37

The core components of Transformers, being set models rather than sequence models, are explained, with order being an add-on through positional embeddings.

00:58:44

A call for intellectually fearless individuals to work on advancing spatial intelligence is made, highlighting World Labs' current research and product development.

01:00:10

Advanced editing features within Marble are highlighted, encouraging users to explore its full capabilities.

01:00:50

Intellectual fearlessness is identified as a key principle for researchers and developers in the spatial intelligence field.

Episode Details

Podcast
a16z Podcast
Episode
What Comes After ChatGPT? The Mother of ImageNet Predicts The Future
Published
December 5, 2025