
Chelsea Finn: Building Robots That Can Do Anything

Y Combinator Startup Podcast

Summary

This podcast episode explores the development of general-purpose robots by Physical Intelligence, aiming to bring advanced intelligence into the physical world. The speaker, Chelsea Finn, details how a combination of large-scale real robot data and innovative pre-training and fine-tuning strategies enables robots to perform complex tasks, generalize to novel environments, and respond to open-ended human commands.

Key Points

  • The traditional approach of building custom robot solutions for each specific application is highly inefficient, requiring bespoke hardware and software, which has historically hindered the widespread adoption of robotics.
  • While data scale is necessary for generalizable robot models, it is not sufficient; the data must possess diversity in behaviors, realism, and direct embodiment to effectively train robots for open-world conditions.
  • A crucial breakthrough in robot training involves pre-training models on all available robot data and then fine-tuning on a curated, high-quality set of task-specific demonstration data, which significantly improves the robot's reliability and performance on complex, dexterous tasks like laundry folding.
  • This foundation model approach enables a single model to be adapted to various complex tasks and even different robot hardware from other companies, reducing the need to start from scratch for new applications.
  • Robots can successfully operate in novel, unseen environments by being trained on a diverse dataset of mobile manipulation data collected from many unique locations, demonstrating a significant generalization capability beyond their training environments.
  • To improve robot comprehension and adherence to language commands, a recipe was developed that uses tokenized actions and prevents the degradation of the pre-trained vision-language model backbone during fine-tuning, leading to a much higher language-following rate.
  • Robots gain the ability to respond to open-ended prompts and situated corrections by training hierarchical policies on synthetic data, where language models generate hypothetical human interactions based on existing robot data, augmenting the real-world dataset.
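The pre-train-then-fine-tune recipe above hinges on curating a consistent, high-quality demonstration set out of a much larger corpus. A minimal sketch of that curation step in Python (the `Episode` fields, task names, and quality threshold are invented for illustration, not the actual Physical Intelligence pipeline):

```python
from dataclasses import dataclass

@dataclass
class Episode:
    task: str       # which task the demonstration shows
    success: bool   # whether the demonstration completed the task
    quality: float  # hypothetical annotator score in [0, 1]

def curate_finetune_set(corpus, task, min_quality=0.8):
    """Select consistent, high-quality demonstrations of one task.

    Pre-training would consume the full corpus; fine-tuning uses
    only this curated subset.
    """
    return [ep for ep in corpus
            if ep.task == task and ep.success and ep.quality >= min_quality]

corpus = [
    Episode("fold_laundry", True, 0.95),
    Episode("fold_laundry", True, 0.60),   # sloppy demo: excluded
    Episode("fold_laundry", False, 0.90),  # failed demo: excluded
    Episode("bus_table", True, 0.90),      # other task: pre-training only
]
finetune_set = curate_finetune_set(corpus, "fold_laundry")
```

The design point is the asymmetry: the pre-training corpus tolerates noise and task variety, while the fine-tuning set trades size for consistency.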

Conclusion

General-purpose robot foundation models present a more scalable and effective path for physical intelligence compared to building highly specialized robots for individual tasks.

Achieving physical intelligence necessitates not only large-scale data but also sophisticated training methodologies like pre-training on diverse data and careful fine-tuning on high-quality, task-specific demonstrations.

Despite significant progress, further research and open-source contributions are essential to overcome existing limitations in robot speed, partial observability, and long-term planning for robust deployment in complex, open-world environments.

Discussion Topics

  • What are the most promising near-term applications for general-purpose robots in industries beyond domestic tasks, and what unique challenges do they present?
  • How can the ethical implications of increasingly autonomous robots in shared human environments be best addressed and regulated as their capabilities advance?
  • What are the biggest open problems in robotics research that the open-source community could contribute to, given the need for more data, better algorithms, and robust infrastructure?

Key Terms

Foundation models
Large AI models trained on vast amounts of data, designed to be adaptable to a wide range of downstream tasks, similar to how large language models work.
Teleoperation
The operation of a robot or machine from a distance, typically with human control, used to collect demonstration data for robot training.
Imitation learning
A machine learning paradigm where an agent learns a policy by observing demonstrations of a task, typically performed by a human.
Pre-training
The initial training phase of a machine learning model on a large, diverse dataset to learn general features and representations.
Fine-tuning
The subsequent training phase where a pre-trained model is further trained on a smaller, task-specific dataset to adapt it for a particular application.
Vision language model (VLM)
An AI model that processes both visual information (images/videos) and natural language, allowing it to understand and reason about both modalities simultaneously.
Diffusion
A class of generative models that learn to reverse a diffusion process (e.g., adding noise) to generate data, often used for continuous action prediction in robotics.
Tokenized actions
Representing continuous robot actions as discrete tokens, similar to how words are tokenized in language models, to make them compatible with transformer-based architectures.
Reinforcement learning (RL)
A machine learning paradigm where an agent learns to make decisions by interacting with an environment and receiving rewards or penalties.
Mobile manipulation
Robotic tasks that involve both locomotion (movement of the robot in an environment) and manipulation (object interaction with arms/grippers).
Static manipulation
Robotic tasks that involve only object interaction with arms/grippers, where the robot's base remains stationary.
Vision-Language Action (VLA) models
Models that map visual observations and natural-language instructions to robot actions; hierarchical variants use a high-level policy to break complex language prompts into atomic language commands, which a low-level policy then executes.
World modeling
In AI, the creation of an internal model of the environment that allows an agent to predict future states or consequences of its actions.
Hallucinate
In AI, when a model generates information that is plausible but not factual or consistent with the input data or real-world constraints.
Retrieval-based systems
AI systems that access external databases or knowledge bases to retrieve relevant information rather than relying solely on internally stored knowledge.

Timeline

00:00:06

The problem with custom robotics solutions is that each application requires building a separate company from scratch, developing unique hardware and software, which has limited widespread adoption.

00:01:00

Simply scaling data from industrial automation, YouTube, or simulation is insufficient for developing general-purpose robots because it lacks the necessary diversity, realism, or direct embodiment for real-world generalization.

00:04:40

A significant breakthrough in robot training involved pre-training a model on all available robot data, then fine-tuning it on a curated, consistent, and high-quality dataset of demonstration data, leading to much more reliable performance on complex tasks.

00:07:47

This foundation model approach allows for leveraging pre-training across multiple robots and tasks, enabling the same recipe to be applied to different dexterous tasks and even other companies' robots without starting from scratch.

00:09:00

Robots can succeed in unseen environments by collecting and pre-training on diverse mobile manipulation data across over 100 unique rooms, which significantly improves generalization performance in novel homes.

00:10:05

To address language following, a modified training recipe was developed that uses tokenized actions and stops gradients that would otherwise degrade the pre-trained vision-language model backbone, raising the language-following rate from roughly 20% to 80%.
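Action tokenization, the first ingredient of that recipe, means quantizing each continuous action dimension into a discrete vocabulary so a transformer can predict actions like text tokens. A minimal sketch (the 256-bin vocabulary and the [-1, 1] action range are illustrative assumptions, not the exact scheme discussed in the episode):

```python
import numpy as np

N_BINS = 256  # hypothetical token vocabulary size per action dimension

def tokenize(actions, low=-1.0, high=1.0):
    """Map continuous actions in [low, high] to discrete token ids."""
    clipped = np.clip(actions, low, high)
    bins = (clipped - low) / (high - low) * (N_BINS - 1)
    return np.round(bins).astype(int)

def detokenize(tokens, low=-1.0, high=1.0):
    """Recover approximate continuous actions from token ids."""
    return tokens / (N_BINS - 1) * (high - low) + low

actions = np.array([-1.0, 0.0, 0.5, 1.0])
tokens = tokenize(actions)
recovered = detokenize(tokens)  # within half a bin of the originals
```

The round-trip error is bounded by half a bin width, which is the price paid for making actions compatible with a language-model-style output head.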

00:13:22

Robots can respond to open-ended prompts and interjections by using language models to generate synthetic human prompts and relabel existing robot data, training a hierarchical policy to break down complex instructions into atomic commands.
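The relabeling idea can be sketched in a few lines: pair each recorded sequence of atomic commands with a synthetic open-ended prompt, then train a high-level policy on those pairs. In this toy version a stub function and a lookup table stand in for the language model and the learned high-level policy; all prompts and task strings are invented for illustration:

```python
def synthesize_training_pairs(episodes, prompt_generator):
    """Relabel atomic-command episodes with synthetic open-ended prompts."""
    return [(prompt_generator(cmds), cmds) for cmds in episodes]

def make_high_level_policy(pairs):
    """'Train' a lookup-table high-level policy on the synthetic pairs.

    The real system learns a model; a dict keeps the sketch runnable.
    """
    table = {prompt: cmds for prompt, cmds in pairs}
    def policy(prompt):
        return table.get(prompt, [])
    return policy

# Stub generator: in the real pipeline a language model writes a
# plausible human request that the recorded commands would satisfy.
def stub_generator(cmds):
    return "clean up: " + ", then ".join(cmds)

episodes = [["pick up the cup", "place it in the sink"],
            ["wipe the table"]]
pairs = synthesize_training_pairs(episodes, stub_generator)
policy = make_high_level_policy(pairs)
```

The key property is that no new robot data is collected: existing episodes are reused under new, more open-ended labels, and the high-level policy learns to emit atomic commands the low-level model already knows how to execute.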

Episode Details

Podcast
Y Combinator Startup Podcast
Episode
Chelsea Finn: Building Robots That Can Do Anything
Published
July 22, 2025