Foundation models for universal embodiment in the real world are needed for the next step after #AGI.
We'll soon have superhuman AIs in the digital space, using all sorts of digital tools: search, knowledge curation, collaboration, and optimizers of various kinds.
However, the real world is messier, as anyone who has worked with robotics knows. It's possible to build expensive, heavy, rigid, standardized bodies for robots, train control models for them with massive-scale reinforcement learning, and try to push their price down with mass production.
But this doesn't scale. Add wheels or another arm and you need to do it all again. Variability in manufacturing and in each robot's condition leaves the models badly calibrated. Real-world situations don't resemble lab conditions closely enough, and tasks fail.
All of this is obvious to people who have worked in robotics, yet those who draw up plans for dynamic multimodal robotics foundation models typically don't seem to appreciate it enough. They aim for lab demos and have a rude awakening later. Short-term plans do not extend to the long term.
I would instead suggest reformulating RL in a modern fashion, without a sparse reward signal and other poorly considered axioms. Instead, we can observe the living world around us, which is full of agency, and learn from all the intent we can mine from it.
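To make this concrete, here is a rough sketch (in PyTorch, with made-up names like IntentMiner and toy dimensions) of what such a dense intent-mining objective could look like: every observed transition from any agent in the world yields a training signal, instead of one sparse reward at the end of an episode.

```python
# A minimal sketch, assuming PyTorch; IntentMiner and the dimensions are
# hypothetical illustrations, not a reference implementation.
import torch
import torch.nn as nn

OBS_DIM, INTENT_DIM = 128, 16

class IntentMiner(nn.Module):
    def __init__(self):
        super().__init__()
        # Infer a latent intent from an observed transition (o_t, o_t+1)...
        self.encode_intent = nn.Sequential(
            nn.Linear(2 * OBS_DIM, 64), nn.ReLU(), nn.Linear(64, INTENT_DIM))
        # ...and use it to predict the next observation.
        self.predict_next = nn.Sequential(
            nn.Linear(OBS_DIM + INTENT_DIM, 64), nn.ReLU(), nn.Linear(64, OBS_DIM))

    def forward(self, obs, next_obs):
        intent = self.encode_intent(torch.cat([obs, next_obs], dim=-1))
        pred = self.predict_next(torch.cat([obs, intent], dim=-1))
        return pred, intent

model = IntentMiner()
obs, next_obs = torch.randn(32, OBS_DIM), torch.randn(32, OBS_DIM)
pred, _ = model(obs, next_obs)
loss = nn.functional.mse_loss(pred, next_obs)  # dense loss at every observed step
loss.backward()
```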
The first person is an attended part of the ocean of agency, not a special domain unrelated to every other first person.
Google Genie showed that we can recover the first person's actions with a clever information bottleneck, but we can similarly extend this to every agent we recognize in the signal, and then define the first person by attention over all the agency in the signal, injecting actions into the attended domains.
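A rough sketch of how that could look, again in PyTorch with hypothetical names (AgentSlotModel, N_SLOTS, and so on) rather than Genie's actual architecture: per-agent slots share one small latent-action codebook (the bottleneck), attention over the slots defines the first person, and our own action is injected into the attended slots.

```python
# A minimal sketch under the assumptions above; not Genie's real architecture.
import torch
import torch.nn as nn

N_SLOTS, OBS_DIM, ACT_DIM, CODEBOOK = 4, 128, 8, 32

class AgentSlotModel(nn.Module):
    def __init__(self):
        super().__init__()
        # One small discrete action vocabulary shared by all recognized agents:
        # the information bottleneck.
        self.codebook = nn.Embedding(CODEBOOK, ACT_DIM)
        self.encode_action = nn.Linear(2 * OBS_DIM, CODEBOOK)   # per-slot inverse dynamics
        self.first_person_query = nn.Parameter(torch.randn(OBS_DIM))
        self.dynamics = nn.Linear(OBS_DIM + ACT_DIM, OBS_DIM)   # per-slot forward model

    def forward(self, slots, next_slots, my_action=None):
        # slots, next_slots: (batch, N_SLOTS, OBS_DIM), one state per recognized agent.
        logits = self.encode_action(torch.cat([slots, next_slots], dim=-1))
        idx = logits.argmax(dim=-1)        # quantize (a real model would use straight-through VQ)
        actions = self.codebook(idx)       # (batch, N_SLOTS, ACT_DIM)

        # "First person" = attention over all the agency in the signal.
        attn = torch.softmax(slots @ self.first_person_query, dim=-1)  # (batch, N_SLOTS)

        if my_action is not None:
            # Inject our own action into the attended slots, weighted by how
            # strongly each slot is claimed as first person.
            actions = actions + attn.unsqueeze(-1) * my_action.unsqueeze(1)

        pred_next = self.dynamics(torch.cat([slots, actions], dim=-1))
        return pred_next, attn

model = AgentSlotModel()
slots, next_slots = torch.randn(2, N_SLOTS, OBS_DIM), torch.randn(2, N_SLOTS, OBS_DIM)
pred_next, attn = model(slots, next_slots, my_action=torch.randn(2, ACT_DIM))
```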
We also get a counterfactual for free by not injecting actions, which lets us recognize whether our injected actions actually made a difference, that is, recognize the span of first-person control.
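The counterfactual is just a second rollout of the same world model with nothing injected; a sketch below, where the step function and threshold are hypothetical stand-ins (in practice step_fn would be the learned dynamics from the previous sketch).

```python
# A minimal sketch of the free counterfactual, assuming a deterministic toy
# world model; all names and thresholds here are illustrative.
import torch

def span_of_control(step_fn, state, my_action, horizon=10, threshold=0.1):
    """Per-step estimate of whether our injected actions made a difference,
    i.e. the span of first-person control."""
    factual, counterfactual = state, state
    controlled = []
    for _ in range(horizon):
        factual = step_fn(factual, my_action)           # inject our action
        counterfactual = step_fn(counterfactual, None)  # inject nothing
        gap = torch.norm(factual - counterfactual, dim=-1)
        controlled.append(gap > threshold)              # did we make a difference?
    return torch.stack(controlled)

def toy_step(state, action):
    nxt = 0.99 * state                                  # shared deterministic dynamics
    if action is not None:
        nxt[..., : action.shape[-1]] += 0.1 * action    # only a few dims respond to us
    return nxt

mask = span_of_control(toy_step, torch.zeros(1, 128), torch.ones(1, 8))
print(mask.float().mean())  # fraction of the rollout under first-person control
```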