RoboPapers – Listen here

Episodes

Ep#90: From Capable Controllers to Deployable Humanoid Systems
15 Jul · RoboPapers
We want humanoid robots to be able to perform complex, long-horizon tasks in the real world, such as putting away the groceries or cleaning a room. This requires diverse loco-manipulation skills, which can be easily parameterized to handle object affordances and interact safely with the world around them.
In HANDOFF, Lizhi Yang proposes a 10-D learned whole body controller, which can be converted to whole body actions, and which can be used by a VLM-driven agentic planner to perform complex, multi-step manipulation actions in the real world.
We then discuss how it’s possible to deploy such controllers in the real world and how to make them safe around people via control barrier functions and safety functions. This enables humanoids which can move safely through dynamic, crowded environments in the real world.
Learn more in Episode 90 of RoboPapers, with Michael Cho and Chris Paxton.
Abstract
For a humanoid robot to be deployed in the real world, the choice of command space (i.e., the interface between task planning and whole-body control) is crucial. Existing whole-body controllers typically demand dense kinematic or spatial references that planners struggle to synthesize from task semantics. We instead propose a compact, explicit interface that is intuitive, general, modular, and expressive enough for diverse loco-manipulation skills. To this end, we introduce HANDOFF, a single humanoid whole-body controller that follows this interface and is distilled via multi-teacher KL distillation under a context-conditioned gating scheme into a mixture-of-experts student from three complementary specialists: whole-body motion tracking with safety-filtered data, locomotion, and fall-recovery. On the Unitree G1, HANDOFF matches state-of-the-art velocity tracking and offers one of the largest robust manipulation workspaces. We further demonstrate hardware feasibility through multiple natural-language-driven task roll-outs, powered by a VLM-driven agentic planner with no task-specific data or controller fine-tuning.
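As a rough illustration of the gated multi-teacher KL distillation the abstract describes, the sketch below weights the KL divergence from each teacher's action distribution to the student's by a gate value. Everything here is an assumption made for illustration: the diagonal Gaussian policies, the KL direction (teacher to student), the hand-written gate weights standing in for a context-conditioned gate, and the function names. It is not HANDOFF's actual training code.

```python
import math


def gaussian_kl(mean_p, std_p, mean_q, std_q):
    """KL(p || q) between diagonal Gaussians, summed over action dimensions."""
    total = 0.0
    for mp, sp, mq, sq in zip(mean_p, std_p, mean_q, std_q):
        total += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return total


def gated_distillation_loss(gate_weights, teachers, student):
    """Gate-weighted sum of KL(teacher_k || student).

    gate_weights: one non-negative weight per teacher (hypothetical stand-in
                  for a context-conditioned gate; expected to sum to 1).
    teachers:     list of (mean, std) pairs, one per specialist teacher.
    student:      (mean, std) pair for the student policy.
    """
    student_mean, student_std = student
    return sum(
        weight * gaussian_kl(t_mean, t_std, student_mean, student_std)
        for weight, (t_mean, t_std) in zip(gate_weights, teachers)
    )
```

With a one-hot gate, the student is pulled only toward the teacher that the current context selects (for example, a fall-recovery specialist after a fall); softer gates blend the supervision.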
Learn More
Project page for HANDOFF: https://lzyang2000.github.io/HANDOFF/
Code: https://github.com/lzyang2000/HANDOFF
ArXiV: https://arxiv.org/abs/2606.06493
Safe-SAGE on ArXiV: https://arxiv.org/abs/2603.05497
SHIELD on ArXiV: https://arxiv.org/abs/2505.11494
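
For readers unfamiliar with control barrier functions (CBFs), here is a minimal sketch of a CBF safety filter. It assumes a 2-D point robot with single-integrator dynamics (velocity is commanded directly) and one circular obstacle, which is far simpler than a humanoid and is not the formulation used in Safe-SAGE or SHIELD. The function name and parameters are hypothetical.

```python
def cbf_safety_filter(position, nominal_velocity, obstacle, radius, alpha=1.0):
    """Minimally change a 2-D velocity command so a barrier condition holds.

    Barrier function: h(x) = ||x - obstacle||^2 - radius^2, safe when h >= 0.
    For dynamics x' = u, the CBF condition is grad_h(x) . u + alpha * h(x) >= 0.
    With a single constraint, the closest safe command has a closed form:
    project the nominal command onto the constraint boundary.
    """
    dx = position[0] - obstacle[0]
    dy = position[1] - obstacle[1]
    h = dx * dx + dy * dy - radius * radius
    grad = (2.0 * dx, 2.0 * dy)

    # How much the nominal command satisfies (>= 0) or violates (< 0) the condition.
    slack = grad[0] * nominal_velocity[0] + grad[1] * nominal_velocity[1] + alpha * h
    if slack >= 0.0:
        return nominal_velocity

    norm_sq = grad[0] ** 2 + grad[1] ** 2
    if norm_sq == 0.0:
        # Exactly at the obstacle center no command satisfies the condition;
        # stopping is an arbitrary fallback chosen for this sketch.
        return (0.0, 0.0)

    scale = -slack / norm_sq
    return (
        nominal_velocity[0] + scale * grad[0],
        nominal_velocity[1] + scale * grad[1],
    )
```

For a robot at (2, 0) heading toward an obstacle of radius 1 at the origin with command (-5, 0), the filter slows the approach to (-0.75, 0); a command pointing away from the obstacle passes through unchanged.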

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com
Ep#89: Contact Grounded Policy
8 Jul · RoboPapers
Contact-rich manipulation is still very challenging for robotics. Problems like opening a jar, or in-hand reorientation of an object, require making repeated contact with different parts of a robot’s hand, and this is hard to do with pure vision. Instead, research is moving towards using tactile sensors in combination with visual policies. But what’s the best way to learn how to handle multi-point contact?
Zhengtong Xu and Yeping Wang tell us about their new work Contact-Grounded Policy (CGP). CGP predicts future robot state and tactile feedback, and converts these predictions into actions for a compliant robot controller so that a four- or five-finger robot hand can perform complex tasks involving precise manipulation, delicate grasping, and tool use.
To learn more, watch Episode #89 of RoboPapers, with Chris Paxton and Jiafei Duan.
Abstract
Contact-rich dexterous manipulation with multi-finger hands remains an open challenge in robotics because task success depends on multi-point contacts that continuously evolve and are highly sensitive to object geometry, frictional transitions, and slip. Recently, tactile-informed manipulation policies have shown promise. However, most use tactile signals as additional observations rather than modeling contact state or how their action outputs interact with low-level controller dynamics. We present Contact-Grounded Policy (CGP), a visuotactile policy that grounds multi-point contacts by predicting coupled trajectories of actual robot state and tactile feedback, and using a learned contact-consistency mapping to convert these predictions into executable target robot states for a compliance controller. CGP consists of two components: (i) a conditional diffusion model that forecasts future robot state and tactile feedback in a compressed latent space, and (ii) a learned contact-consistency mapping that converts the predicted robot state-tactile pair into executable targets for a compliance controller, enabling it to realize the intended contacts. We evaluate CGP using a physical four-finger Allegro V5 hand with Digit360 fingertip tactile sensors, and a simulated five-finger Tesollo DG-5F hand with dense whole-hand tactile arrays. Across a range of dexterous tasks including in-hand manipulation, delicate grasping, and tool use, CGP outperforms visuomotor and visuotactile diffusion-policy baselines.
Learn More
Project page: https://contact-grounded-policy.github.io/
ArXiV: https://arxiv.org/abs/2603.05687

Ep#88: DexSkin: High-Coverage Conformable Robotic Skin for Learning Contact-Rich Manipulation
1 Jul · RoboPapers
Human skin plays an important role in how we interact with the world and robustly manipulate objects. It’s not just important when we can’t see things with our eyes, but when we want to pick up something heavy, or apply a very specific amount of force. So, it makes sense to want to give robots skin.
Enter DexSkin: a soft, deformable electronic skin which can be applied across different surfaces and used to cover robot hands or fingers. Suzannah Wistreich and Baiyu Shi talk to us about their work building DexSkin, showing how it’s useful for policy learning, including online reinforcement learning, and how it can be calibrated and policies transferred across sensors. They also open-sourced their code and methods for building the sensors.
To learn more, watch Episode #88 of RoboPapers now, hosted by Chris Paxton and Jiafei Duan!
Abstract
Human skin provides a rich tactile sensing stream, localizing intentional and unintentional contact events over a large and contoured region. Replicating these tactile sensing capabilities for dexterous robotic manipulation systems remains a longstanding challenge. In this work, we take a step towards this goal by introducing DexSkin. DexSkin is a soft, conformable capacitive electronic skin that enables sensitive, localized, and calibratable tactile sensing, and can be tailored to varying geometries. We demonstrate its efficacy for learning downstream robotic manipulation by sensorizing a pair of parallel jaw gripper fingers, providing tactile coverage across almost the entire finger surfaces. We empirically evaluate DexSkin's capabilities in learning challenging manipulation tasks that require sensing coverage across the entire surface of the fingers, such as reorienting objects in hand and wrapping elastic bands around boxes, in a learning-from-demonstration framework. We then show that, critically for data-driven approaches, DexSkin can be calibrated to enable model transfer across sensor instances, and demonstrate its applicability to online reinforcement learning on real robots. Our results highlight DexSkin's suitability and practicality for learning real-world, contact-rich manipulation. Please see our project webpage for videos and visualizations: https://dex-skin.github.io/
Learn More
ArXiV: https://arxiv.org/abs/2509.18830
Project Page: https://dex-skin.github.io/
Github: https://github.com/sdwistreich/dexskin
Datasets: https://huggingface.co/datasets/swistreich/dexskin

Ep#87: MolmoAct 2: An open foundation for robots that work in the real world
18 Jun · RoboPapers
There are few truly open models in the world, with both weights and data released. However, these models are crucial for research and development of new systems: they help us learn which data is important and help develop new capabilities for deploying robots in the real world.
MolmoAct2 provides a foundation for open research into robotics. It is associated with its own open dataset, an open-data action tokenizer, and a reasoning variant which predicts depth tokens. And people have actually been using it across the community, running experiments in their own labs or homes.
Haoquan Fang and Jiafei Duan tell us more. Watch Episode 87 of RoboPapers, with Michael Cho and Chris Paxton, now!
Abstract
Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today’s systems fall short for real-world deployment. Frontier models are closed; open-weight alternatives are tied to expensive hardware; reasoning-augmented policies pay prohibitive latency for their grounding; and fine-tuned success rates remain below the threshold for dependable use. We present MolmoAct2, a fully open action reasoning model built for practical deployment, advancing its predecessor, MolmoAct, along five axes. (1) MolmoAct2 is built on top of our new Molmo2-ER, a VLM backbone specialized for spatial and embodied reasoning, trained on a 3.3M-sample corpus with a specialize-then-rehearse recipe. (2) We release three new robot datasets spanning low-to-medium cost platforms: MolmoAct2-BimanualYAM Dataset, 720 hours of teleoperated bimanual trajectories that constitute the largest open bimanual dataset to date; MolmoAct2-DROID Dataset, a quality-filtered Franka subset of DROID; and MolmoAct2-SO100/101 Dataset, a quality-filtered SO-100/101 subset. (3) We train and release MolmoAct2-FAST Tokenizer, an open-weight, open-data action tokenizer trained on millions of trajectories across five embodiments. (4) We design a new VLA architecture to graft the discrete-token VLM into the flow-matching continuous-action expert via per-layer key-value (KV) conditioning. (5) We propose MolmoAct2-Think, an adaptive-depth reasoning variant that re-predicts depth tokens only for scene regions that change between timesteps, retaining geometric grounding at a fraction of prior latency. In the most extensive empirical study of any open VLA to date, spanning 7 simulation and real-world benchmarks, MolmoAct2 outperforms strong baselines including π0.5, while Molmo2-ER surpasses GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks. We release model weights, training code, and complete training data.
Learn More
Project page: https://allenai.org/blog/molmoact2
Code: https://github.com/allenai/molmoact2
ArXiV: https://arxiv.org/pdf/2605.02881v1

Ep#86: RISE: Self-Improving Robot Policy with Compositional World Model
12 Jun · RoboPapers
Robot policies must be both reliable and highly capable to be useful; the best way to achieve this level of performance is with reinforcement learning. However, for reinforcement learning you are usually stuck between two difficult options: reinforcement learning in the real world is often risky and expensive, while reinforcement learning in a traditional simulator takes a lot of engineering work and has a persistent sim-to-real gap. What if instead you could train your robot purely in a world model?
RISE by Jiazhi Yang et al. uses a compositional world model to predict the future and evaluate progress. This allows for a self-improving pipeline, which learns a world model from real data and then learns how the robot should perform different tasks. This pipeline results in a data-driven way to improve policy performance from real data but without real-world reinforcement learning.
Watch Episode #86 of RoboPapers, with Chris Paxton and Jiafei Duan, to learn more!
Abstract
Despite the sustained scaling on model capacity and data acquisition, Vision-Language-Action (VLA) models remain brittle in contact-rich and dynamic manipulation tasks, where minor execution deviations can compound into failures. While reinforcement learning (RL) offers a principled path to robustness, on-policy RL in the physical world is constrained by safety risk, hardware cost, and environment reset. To bridge this gap, we present RISE, a scalable framework of robotic reinforcement learning via imagination. At its core is a Compositional World Model that (i) predicts multi-view future via a controllable dynamics model, and (ii) evaluates imagined outcomes with a progress value model, producing informative advantages for the policy improvement. Such compositional design allows state and value to be tailored by best-suited yet distinct architectures and objectives. These components are integrated into a closed-loop self-improving pipeline that continuously generates imaginary rollouts, estimates advantages, and updates the policy in imaginary space without costly physical interaction. Across three challenging real-world tasks, RISE yields significant improvement over prior art, with more than +35% absolute performance increase in dynamic brick sorting, +45% for backpack packing, and +35% for box closing, respectively.
Learn More
Project Page: https://opendrivelab.com/RISE/
ArXiV: https://arxiv.org/abs/2602.11075

Ep#85: Tutor Intelligence
4 Jun · RoboPapers
Collecting robot data at scale is key to deploying working manipulation policies, and the team from Tutor Intelligence is here to tell us about how to accomplish it. Their new announcement: a massive, 100-robot “data factory,” with a behind-the-scenes look at how to build a teleoperation platform and how to make robots and policies that are useful for their customers.
Tutor Intelligence is a full-stack robotics company: they build robot arms, they sell robot arms, they write the software and they train neural networks. Josh Gruenstein, Jesse Michel, Shiraz Khan, and Joe McCalmon join us to tell us more about how they scale both teleop data and human interventions from their teleoperators in order to train the policies they need.
Watch Episode #85 of RoboPapers, with Chris Paxton and Jiafei Duan, to learn more!
Learn More
Blog post: https://tutorintelligence.com/blog/building-a-100-robot-data-factory-toward-factory-ready-ai

Ep#84: Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons
2 Jun · RoboPapers
Learning robust, general-purpose reward functions for robotics unlocks many potential applications, like on-robot reinforcement learning or dataset validation. However, there’s a question of how to actually train such reward functions. Training success/failure prediction leads to ambiguous signals partway through a demonstration — it’s hard to measure progress — making the method unsuitable for reinforcement learning, among other things. Predicting progress, on the other hand, does not give a good way of using failure data.
So why not do both? Robometer combines both progress and preference supervision, resulting in a stable, scalable, and highly general reward learning approach. Anthony Liang, Yigit Korkmaz, and Jesse Zhang join us to tell us more.
Watch Episode #84 of RoboPapers, with Chris Paxton and Jiafei Duan, to learn more!
Abstract
General-purpose robot reward models are typically trained to predict absolute task progress from expert demonstrations, providing only local, frame-level supervision. While effective for expert demonstrations, this paradigm scales poorly to large-scale robotics datasets where failed and suboptimal trajectories are abundant and assigning dense progress labels is ambiguous. We introduce Robometer, a scalable reward modeling framework that combines intra-trajectory progress supervision with inter-trajectory preference supervision. Robometer is trained with a dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints across trajectories of the same task, enabling effective learning from both real and augmented failed trajectories. To support this formulation at scale, we curate RBM-1M, a reward-learning dataset comprising over one million trajectories spanning diverse robot embodiments and tasks, including substantial suboptimal and failure data. Across benchmarks and real-world evaluations, Robometer learns more generalizable reward functions than prior methods and improves robot learning performance across a diverse set of downstream applications. Code, model weights, and videos are available at https://robometer.github.io/.
Learn More
Project page: https://robometer.github.io/
ArXiV: https://arxiv.org/abs/2603.02115
Code on Github: https://github.com/robometer/robometer
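
To make the dual objective from the abstract concrete, here is a minimal sketch that combines a frame-level progress loss with a trajectory-comparison preference loss. The specific choices are assumptions for illustration only: mean squared error for progress, a Bradley-Terry style logistic loss for preferences, the final-frame prediction as the trajectory score, and the `weight` parameter. Robometer's actual losses may differ.

```python
import math


def progress_loss(predicted, target):
    """Mean squared error between per-frame predicted and labeled progress."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def preference_loss(score_preferred, score_rejected):
    """Bradley-Terry style loss: -log(sigmoid(score_preferred - score_rejected)).

    Written as softplus(-margin) in a numerically stable form.
    """
    margin = score_preferred - score_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def dual_objective(expert_pred, expert_target, preferred_pred, rejected_pred, weight=1.0):
    """Progress loss on expert frames plus a weighted preference loss.

    Assumption: a trajectory's score is its predicted progress at the last frame.
    """
    score_preferred = preferred_pred[-1]
    score_rejected = rejected_pred[-1]
    return progress_loss(expert_pred, expert_target) + weight * preference_loss(
        score_preferred, score_rejected
    )
```

The progress term needs dense labels and so applies only to expert data, while the preference term needs only an ordering between two trajectories, which is how failed or suboptimal rollouts can contribute to training.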

Ep#83: PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation
29 May · RoboPapers
Spatial understanding is important to moving around in complex environments and is a huge part of the challenge of generalizing to new scenes. Most world models, however, largely ignore this spatial dimension, focusing on 2D images.
Not PointWorld, though. PointWorld is a 3D world model trained from real and simulated data which can perform a wide variety of manipulation tasks on a real robot, including grasping or handling articulated objects, all without any additional fine tuning. Wenlong Huang joins us to tell us more about what makes this work and how it’s different from other world models.
Watch Episode #83 of RoboPapers, with Chris Paxton and Jiafei Duan, to learn more!
Abstract
Humans anticipate, from a glance and a contemplated action of their bodies, how the 3D world will respond, a capability that is equally vital for robotic manipulation. We introduce PointWorld, a large pre-trained 3D world model that unifies state and action in a shared 3D space as 3D point flows: given one or few RGB-D images and a sequence of low-level robot action commands, PointWorld forecasts per-pixel displacements in 3D that respond to the given actions. By representing actions as 3D point flows instead of embodiment-specific action spaces (e.g., joint positions), this formulation directly conditions on physical geometries of robots while seamlessly integrating learning across embodiments. To train our 3D world model, we curate a large-scale dataset spanning real and simulated robotic manipulation in open-world environments, enabled by recent advances in 3D vision and simulated environments, totaling about 2M trajectories and 500 hours across a single-arm Franka and a bimanual humanoid. Through rigorous, large-scale empirical studies of backbones, action representations, learning objectives, partial observability, data mixtures, domain transfers, and scaling, we distill design principles for large-scale 3D world modeling. With a real-time (0.1s) inference speed, PointWorld can be efficiently integrated in the model-predictive control (MPC) framework for manipulation. We demonstrate that a single pre-trained checkpoint enables a real-world Franka robot to perform rigid-body pushing, deformable and articulated object manipulation, and tool use, without requiring any demonstrations or post-training and all from a single image captured in-the-wild.
Learn More
Project page: https://point-world.github.io/
ArXiV: https://arxiv.org/abs/2601.03782
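
The abstract notes that a fast world model can be placed inside a model-predictive control (MPC) loop. The sketch below shows one simple way that can work: a random-shooting planner that samples action sequences, rolls each one through a forward model, and keeps the cheapest. The planner type, the toy 2-D point dynamics standing in for a learned model, and all names are assumptions for illustration, not PointWorld's planner.

```python
import random


def toy_world_model(point, action):
    """Stand-in for a learned forward model: the point moves by the action."""
    return (point[0] + action[0], point[1] + action[1])


def plan_random_shooting(model, start, goal, horizon=5, num_samples=256,
                         max_step=0.2, seed=0):
    """Sample action sequences, roll them out in the model, keep the best one.

    Cost is the squared distance between the final predicted state and the goal.
    Returns (best_plan, best_cost).
    """
    rng = random.Random(seed)
    best_cost, best_plan = float("inf"), None
    for _ in range(num_samples):
        plan = [
            (rng.uniform(-max_step, max_step), rng.uniform(-max_step, max_step))
            for _ in range(horizon)
        ]
        state = start
        for action in plan:
            state = model(state, action)
        cost = (state[0] - goal[0]) ** 2 + (state[1] - goal[1]) ** 2
        if cost < best_cost:
            best_cost, best_plan = cost, plan
    return best_plan, best_cost
```

In a typical MPC loop only the first action of the best plan is executed, the scene is re-observed, and planning repeats, which is why model inference speed matters.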

Ep#82: SimToolReal: An Object-Centric Policy for Zero-Shot Dexterous Tool Manipulation
27 May · RoboPapers
Humans use tools to perform almost all of the physical work that we do from day to day. However, tools come in many different sizes and shapes, and it’s very difficult to collect human data for them in general. What about training in simulation?
SimToolReal aims to address this through, unsurprisingly, sim-to-real learning. Kushal Kedia and Tyler Lum talk about how it works: they procedurally generate tool-like objects, and then train with the universal objective of moving objects around to different locations. This creates a general-purpose model which can manipulate various tools to perform a variety of tasks in the real world.
Watch Episode #82 of RoboPapers, hosted by Michael Cho and Jiafei Duan, now to learn more!
Abstract
The ability to manipulate tools significantly expands the set of tasks a robot can perform. Yet, tool manipulation represents a challenging class of dexterity, requiring grasping thin objects, in-hand object rotations, and forceful interactions. Since collecting teleoperation data for these behaviors is challenging, sim-to-real reinforcement learning (RL) is a promising alternative. However, prior approaches typically require substantial engineering effort to model objects and tune reward functions for each task. In this work, we propose SimToolReal, taking a step towards generalizing sim-to-real RL policies for tool manipulation. Instead of focusing on a single object and task, we procedurally generate a large variety of tool-like object primitives in simulation and train a single RL policy with the universal goal of manipulating each object to random goal poses. This approach enables SimToolReal to perform general dexterous tool manipulation at test-time without any object or task-specific training. We demonstrate that SimToolReal outperforms prior retargeting and fixed-grasp methods by 37% while matching the performance of specialist RL policies trained on specific target objects and tasks. Finally, we show that SimToolReal generalizes across a diverse set of everyday tools, achieving strong zero-shot performance over 120 real-world rollouts spanning 24 tasks, 12 object instances, and 6 tool categories.
Learn More
Project page: https://simtoolreal.github.io/
ArXiV: https://arxiv.org/abs/2602.16863

Ep#81: mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs
20 May · RoboPapers
Robotics fundamentally involves understanding the dynamics of how things change in the world in response to action and force. This is impossible to learn from static images; instead, it’s far more effective and more data-efficient to learn from video.
Elvis Nava joins us to talk about mimic-video and Mimic Robotics. Mimic-video is part of a new class of video-action models, capable of achieving complex, dexterous bimanual robotic manipulation with relatively little robot data.
One of the key findings from mimic-video is that pretraining on web-scale video allows robots to learn physics priors; as a result, policies train faster, generalize better, and are capable of more impressive dexterity than when training on static images or image-language pairs, as a VLM does.
Watch Episode #81 of RoboPapers with Michael Cho and Chris Paxton to learn more!
Abstract
Prevailing Vision-Language-Action Models (VLAs) for robotic manipulation are built upon vision-language backbones pretrained on large-scale, but disconnected static web data. As a result, despite improved semantic generalization, the policy must implicitly infer complex physical dynamics and temporal dependencies solely from robot trajectories. This reliance creates an unsustainable data burden, necessitating continuous, large-scale expert data collection to compensate for the lack of innate physical understanding. We contend that while vision-language pretraining effectively captures semantic priors, it remains blind to physical causality. A more effective paradigm leverages video to jointly capture semantics and visual dynamics during pretraining, thereby isolating the remaining task of low-level control. To this end, we introduce mimic-video, a novel Video-Action Model (VAM) that pairs a pretrained Internet-scale video model with a flow matching-based action decoder conditioned on its latent representations. The decoder serves as an Inverse Dynamics Model (IDM), generating low-level robot actions from the latent representation of video-space action plans. Our extensive evaluation shows that our approach achieves state-of-the-art performance on simulated and real-world robotic manipulation tasks, improving sample efficiency by 10x and convergence speed by 2x compared to traditional VLA architectures.
Learn More
Project page: https://mimic-video.github.io/
ArXiV: https://arxiv.org/abs/2512.15692

Ep#80: LATENT: Learning Athletic Humanoid Tennis Skills from Imperfect Human Motion Data
14 May · RoboPapers
Sports like tennis are great examples of the sort of dynamic whole-body interaction that’s possible with humanoid robots. But capturing examples of fast, dynamic interactions from humans is really difficult. Enter LATENT, which uses lower-quality human data plus reinforcement learning to teach a robot to play tennis, able to complete back-and-forth volleys at a human level.
LATENT has three steps: (1) collecting imperfect human data like a backswing, (2) using these to learn a latent action space, and (3) training a high-level policy in simulation which can compose these actions and execute tennis skills on a robot.
Haofei Lu and Yunrui Lian join us to tell us about their method. Watch Episode #80 of RoboPapers, with Chris Paxton and Jiafei Duan, now to learn more!
Abstract
Human athletes demonstrate versatile and highly-dynamic tennis skills to successfully conduct competitive rallies with a high-speed tennis ball. However, reproducing such behaviors on humanoid robots is difficult, partially due to the lack of perfect humanoid action data or human kinematic motion data in tennis scenarios as reference. In this work, we propose LATENT, a system that Learns Athletic humanoid TEnnis skills from imperfect human motioN daTa. The imperfect human motion data consist only of motion fragments that capture the primitive skills used when playing tennis rather than precise and complete human-tennis motion sequences from real-world tennis matches, thereby significantly reducing the difficulty of data collection. Our key insight is that, despite being imperfect, such quasi-realistic data still provide priors about human primitive skills in tennis scenarios. With further correction and composition, we learn a humanoid policy that can consistently strike incoming balls under a wide range of conditions and return them to target locations, while preserving natural motion styles. We also propose a series of designs for robust sim-to-real transfer and deploy our policy on the Unitree G1 humanoid robot. Our method achieves surprising results in the real world and can stably sustain multi-shot rallies with human players.
Learn More
Project page: https://zzk273.github.io/LATENT/
ArXiV: https://arxiv.org/pdf/2603.12686
Code: https://github.com/GalaxyGeneralRobotics/LATENT

Ep#79: Rhoda AI - Causal Video Models Are Data-Efficient Robot Policy Learners
6 May · RoboPapers
Training robot foundation models faces two key hurdles: how to get enough data to train an effective model, and how to make sure that new skills can be acquired quickly. The team at Rhoda AI believes that the answer is training Direct Video Action models from web data.
Web data is plentiful, to the point where Rhoda can train their base model on hundreds of years of video data. And then, with the addition of robot data, they can quickly adapt it to new tasks with as little as 20 hours of in-domain data, performing complex, multi-step manipulation tasks with their purpose-built video foundation model. Tongzhou Mu, Eric Chan, and Changan Chen joined us to talk more about their approach.
Watch Episode #79 of RoboPapers, with Michael Cho, Chris Paxton, and Jiafei Duan, to learn more!
Learn More
Blog post: https://www.rhoda.ai/research/direct-video-action

Ep#78: Three Eras of Robot Learning
5 May · RoboPapers
Robotics has changed dramatically over the last eight years. Ted has been involved in the cutting edge of robot learning through this period, spending those eight years at Google Brain/Google DeepMind. And he’s identified three eras of robot learning.
These eras are:
* The Era of Existence Proofs - trying different methods like QT-Opt, on-robot RL
* The Era of Foundation Models - transitioning to data collection and clean objectives (i.e. supervised learning)
* The Era of Scaling - orders of magnitude more data and larger models, enabling reasoning, long-horizon actions, and cross-embodiment transfer
Something succeeds only if everything goes right. Behavior cloning, for example, seemed stuck at 60-70% success rate on key tasks until his team rewrote their learning stack, at which point it hit 95-99%+ success rates.
For most of those eight years, something was wrong. The stack wasn’t quite right, the learning algorithms were wrong, the data didn’t exist. Hardware and operations were not mature enough. But they kept working on these problems, over and over, until they finally arrived at amazing breakthroughs.
Some key trends now:
* Reasoning models for robotics
* Long-horizon, precision-oriented tasks, like making coffee from Physical Intelligence or GPU assembly from Skild
* Cross-embodiment transfer
* Hardware and model co-design
* Results are nice, but capabilities matter even more, and academics are going to have trouble keeping up with the compute and resources available to companies
Watch Episode 78 of RoboPapers, with Michael Cho and Jiafei Duan, to learn more!

Ep#77: DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
29 Apr · RoboPapers
World models have many different uses, from evaluation to training data generation to robot planning. DreamDojo is a new foundation world model that allows for impressively general and long-horizon interaction, generating coherent videos for interaction sequences over a minute long. It works in a wide range of environments and even generalizes to previously-unseen environments.
We talked to Shenyuan Gao and William Liang about how they built DreamDojo, and about what tricks were necessary to scale world model learning on data with sparse action labels, pretraining on 44,000 hours of human data and adapting to a wide variety of robots, environments, and skills.
Watch Episode #77 of RoboPapers with Michael Cho and Chris Paxton now to learn more!
Abstract
Being able to simulate the outcomes of actions in varied environments will revolutionize the development of generalist agents at scale. However, modeling these world dynamics, especially for dexterous robotics tasks, poses significant challenges due to limited data coverage and scarce action labels. As an endeavor towards this end, we introduce DreamDojo, a foundation world model that learns diverse interactions and dexterous controls from 44k hours of egocentric human videos. Our data mixture represents the largest video dataset to date for world model pretraining, spanning a wide range of daily scenarios with diverse objects and skills. To address the scarcity of action labels, we introduce continuous latent actions as unified proxy actions, enhancing interaction knowledge transfer from unlabeled videos. After post-training on small-scale target robot data, DreamDojo demonstrates a strong understanding of physics and precise action controllability. We also devise a distillation pipeline that accelerates DreamDojo to a real-time speed of 10.81 FPS and further improves context consistency. Our work enables several important applications based on generative world models, including live teleoperation, policy evaluation, and model-based planning. Systematic evaluation on multiple challenging out-of-distribution (OOD) benchmarks verifies the significance of our method for simulating open-world, contact-rich tasks, paving the way for general-purpose robot world models.
Learn More
Project Page: https://dreamdojo-world.github.io/
ArXiV: https://arxiv.org/abs/2602.06949
Github: https://github.com/NVIDIA/DreamDojo

Ep#76: OmniXtreme: Breaking the Generality Barrier in High-Dynamic Humanoid Control
27 Apr · RoboPapers
We’ve seen lots of incredible videos of humanoid robots dancing, doing martial arts, running up walls — but these extreme behaviors are usually from individual, highly specialized policies. But now OmniXtreme shows us how to achieve incredible behaviors that push the limits of humanoid motion, by (1) training a flow-based motion generative model, and (2) doing residual RL post-training to handle complex real-world dynamics.
Yunsheng Wang and Shaohang Zhu join us to talk about their work towards general-purpose high performance humanoid robot control.
Watch Episode #76 of RoboPapers, with Michael Cho and Jiafei Duan, now!
Abstract
High-fidelity motion tracking serves as the ultimate litmus test for generalizable, human-level motor skills. However, current policies often hit a "generality barrier": as motion libraries scale in diversity, tracking fidelity inevitably collapses - especially for real-world deployment of high-dynamic motions. We identify this failure as the result of two compounding factors: the learning bottleneck in scaling multi-motion optimization and the physical executability constraints that arise in real-world actuation. To overcome these challenges, we introduce OmniXtreme, a scalable framework that decouples general motor skill learning from sim-to-real physical skill refinement. Our approach uses a flow-matching policy with high-capacity architectures to scale representation capacity without interference-intensive multi-motion RL optimization, followed by an actuation-aware refinement phase that ensures robust performance on physical hardware. Extensive experiments demonstrate that OmniXtreme maintains high-fidelity tracking across diverse, high-difficulty datasets. On real robots, the unified policy successfully executes multiple extreme motions, effectively breaking the long-standing fidelity-scalability trade-off in high-dynamic humanoid control.
Learn More
Project Page: https://extreme-humanoid.github.io/
Github: https://github.com/Perkins729/OmniXtreme
ArXiV: https://arxiv.org/abs/2602.23843

Ep#75: TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics
23 Apr · RoboPapers
Reinforcement learning on robots is highly limited by our ability to design good reward functions; this means that designing strong, generalizable reward functions is a key enabler of progress on real-world reinforcement learning.
But we already have a very general class of models: VLMs. Wouldn’t it be great if you could just use a VLM to generate rewards, then? TOPReward directly generates rewards from the probability of the “True” token of a VLM question-answering response; this makes it easy to implement, incredibly general, and surprisingly powerful. We talked to Shirui Chen and Cole Harrison to learn more.
Watch Episode #75 of RoboPapers now to learn more, with Chris Paxton and Jiafei Duan!
Abstract
While Vision-Language-Action (VLA) models have seen rapid progress in pretraining, their advancement in Reinforcement Learning (RL) remains hampered by low sample efficiency and sparse rewards in real-world settings. Developing generalizable process reward models is essential for providing the fine-grained feedback necessary to bridge this gap, yet existing temporal value functions often fail to generalize beyond their training domains. We introduce TOPReward, a novel, probabilistically grounded temporal value function that leverages the latent world knowledge of pretrained video Vision-Language Models (VLMs) to estimate robotic task progress. Unlike prior methods that prompt VLMs to directly output progress values, which are prone to numerical misrepresentation, TOPReward extracts task progress directly from the VLM's internal token logits. In zero-shot evaluations across 130+ distinct real-world tasks and multiple robot platforms (e.g., Franka, YAM, SO-100/101), TOPReward achieves 0.947 mean Value-Order Correlation (VOC) on Qwen3-VL, dramatically outperforming the state-of-the-art GVL baseline which achieves near-zero correlation on the same open-source model. We further demonstrate that TOPReward serves as a versatile tool for downstream applications, including success detection and reward-aligned behavior cloning.
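To make the core idea concrete, here is a minimal sketch of reading a reward off the probability of the “True” token. It is an illustration under stated assumptions, not the authors' implementation: the function name `true_token_reward`, the per-frame logit dictionaries, and the restriction to two candidate tokens ("True" and "False") are all hypothetical. A real VLM would produce logits over its full vocabulary.

```python
import math

def true_token_reward(logits, true_token="True"):
    """Softmax probability of `true_token` over the candidate tokens in `logits`.

    `logits` is a hypothetical dict mapping candidate answer tokens to raw
    logits, e.g. as read from a VLM asked "Is the task complete?".
    """
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    return exps[true_token] / sum(exps.values())

# Made-up logits for three points in an episode (early, middle, late).
frames = [
    {"True": -2.0, "False": 2.0},
    {"True": 0.0, "False": 0.0},
    {"True": 3.0, "False": -1.0},
]
rewards = [true_token_reward(f) for f in frames]
```

With these made-up numbers the reward rises as the episode progresses, which is the behavior one would want from a progress estimate.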
Learn More
Project Page: https://topreward.github.io/webpage/
ArXiV: https://arxiv.org/abs/2602.19313

Ep#74: Weave Robotics
20 Apr · RoboPapers
Do you want to never fold clothes again? Weave is a robotics startup founded in early 2024, aiming to build useful home robots as a product. We talked with co-founder Kaan Doğrusöz, and learned about his journey building a home robotics startup. We covered building products out of end-to-end learning, the ideal form factor of a home robot, and what the important prerequisites are for deploying AI-enabled robotics in the real world.
Watch episode #74 of RoboPapers, with Chris Paxton and Jiafei Duan, now!
Learn More
Weave Robotics: https://www.weaverobotics.com/
And you can order your Isaac today:

Ep#73: VideoManip: Dexterous Manipulation Policies from RGB Human Videos via 3D Hand-Object Trajectory Reconstruction
18 Apr · RoboPapers
Teaching robots to perform dexterous manipulation tasks currently requires teleoperation, which limits demonstration quality, speed, and scalability. Instead, why not use human videos? The problem is that a human hand isn’t a robot hand, so data must be retargeted using simulation to resolve issues like collisions and interpenetration when controlling the hand.
In VideoManip, Hongyi Chen and co-authors built a system to solve this problem, taking in RGB videos of humans performing manipulation tasks and using them to create accurate simulations with which to learn robot policies.
Watch episode #73 of RoboPapers, hosted by Michael Cho and Chris Paxton, now to learn more!
Abstract
Multi-finger robotic hand manipulation and grasping are challenging due to the high-dimensional action space and the difficulty of acquiring large-scale training data. Existing approaches largely rely on human teleoperation with wearable devices or specialized sensing equipment to capture hand-object interactions, which limits scalability. In this work, we propose VIDEOMANIP, a device-free framework that learns dexterous manipulation directly from RGB human videos. Leveraging recent advances in computer vision, VIDEOMANIP reconstructs explicit 3D robot-object trajectories from monocular videos by estimating human hand poses, object meshes, and retargets the reconstructed human motions to robotic hands for manipulation learning. To make the reconstructed robot data suitable for dexterous manipulation training, we introduce hand-object contact optimization with interaction-centric grasp modeling, as well as a demonstration synthesis strategy that generates diverse training trajectories from a single video, enabling generalizable policy learning without additional robot demonstrations. In simulation, the learned grasping model achieves a 70.25% success rate across 20 diverse objects using the Inspire Hand. In the real world, manipulation policies trained from RGB videos achieve an average 62.86% success rate across seven tasks using the LEAP Hand, outperforming retargeting-based methods by 15.87%.
Learn More
Project page: https://videomanip.github.io/
ArXiV: https://arxiv.org/abs/2602.09013

Ep#72: SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control
15 Apr · RoboPapers
How can we build a general-purpose “foundation model” for robot motion? Zhengyi Luo joins us to talk about SONIC, which uses motion tracking as a foundational task for humanoid robot control, and scales humanoid control training to 9k GPU hours and 100 million frames worth of data. The result: a model with a generally-useful embedding space that can be controlled by a VLA, or from human video, to perform a wide variety of humanoid whole-body-control tasks, including with zero-shot transfer to previously unseen motions.
Watch episode 72 of RoboPapers, with Michael Cho and Jiafei Duan, now!
Abstract
Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited set of behaviors, and are trained on a handful of GPUs. We show that scaling model capacity, data, and compute yields a generalist humanoid controller capable of natural, robust whole-body movements. We position motion tracking as a scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (1.2M to 42M parameters), dataset volume (100M+ frames from 700 hours of motion capture), and compute (21k GPU hours). Beyond demonstrating the benefits of scale, we further show downstream utility through: (1) a real-time kinematic planner bridging motion tracking to tasks such as navigation, enabling natural and interactive control, and (2) a unified token space supporting VR teleoperation and vision-language-action (VLA) models with a single policy. Through this interface, we demonstrate autonomous VLA-driven whole-body loco-manipulation requiring coordinated hand and foot placement. Scaling motion tracking exhibits favorable properties: performance improves steadily with compute and data diversity, and learned policies generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.
Learn More
Project Page: https://nvlabs.github.io/GEAR-SONIC/
ArXiV: https://arxiv.org/abs/2511.07820
Paper PDF: https://nvlabs.github.io/GEAR-SONIC/static/pdf/sonic_paper.pdf

Ep#71: Build Your Own Robot
8 Apr · RoboPapers
Robots, unfortunately, tend to be expensive. And finding a robot that’s both capable of performing a wide variety of mobile manipulation tasks, and is affordable and “hackable”, is extremely difficult. Many different problems need to be addressed, from arm control to navigation to integrating your data collection strategy into hardware design. This can make it difficult for all but the most well-funded teams to “scale” real-world robotics research.
Fortunately, the team behind Build Your Own Robot has a solution. Manan Anjaria, Mahi Shafiullah, Jeff Cui, and Enes Erciyes joined us to talk about how they built a fully open-source mobile manipulator out of off-the-shelf parts, which has a humanlike range of motion and can perform a wide variety of tasks, all while costing only roughly $10,000 to build.
Watch Episode 71 of RoboPapers, with Michael Cho and Chris Paxton, today to learn more!
Abstract
Recent advances in robot learning have generated significant interest in capable platforms that may eventually approach human-level competence. This interest, combined with the commoditization of actuators, has propelled growth in low-cost robotic platforms. However, the optimal form factor for mobile manipulation, especially on a budget, remains an open question. We introduce YOR, an open-source, low-cost mobile manipulator that integrates an omnidirectional base, a telescopic vertical lift, and two arms with grippers to achieve whole-body mobility and manipulation. Our design emphasizes modularity, ease of assembly using off-the-shelf components, and affordability, with a bill-of-materials cost under 10,000 USD. We demonstrate YOR's capability by completing tasks that require coordinated whole-body control, bimanual manipulation, and autonomous navigation. Overall, YOR offers competitive functionality for mobile manipulation research at a fraction of the cost of existing platforms.
Learn More
Project Page: https://yourownrobot.ai/
ArXiV: https://arxiv.org/abs/2602.11150


Episodes

Ep#90: From Capable Controllers to Deployable Humanoid Systems

Ep#89: Contact Grounded Policy

Ep#88: DexSkin: High-Coverage Conformable Robotic Skin for Learning Contact-Rich Manipulation

Ep#87: MolmoAct 2: An open foundation for robots that work in the real world

Ep#86: RISE: Self-Improving Robot Policy with Compositional World Model

Ep#85: Tutor Intelligence

Ep#84: Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons

Ep#83: PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation

Ep#82: SimTooReal: An Object-Centric Policy for Zero-Shot Dexterous Tool Manipulation

Ep#81: mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs

Ep#80: LATENT: Learning Athletic Humanoid Tennis Skills from Imperfect Human Motion Data

Ep#79: Rhoda AI - Causal Video Models Are Data-Efficient Robot Policy Learners

Ep#78: Three Eras of Robot Learning

Ep#77: DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

Ep#76: OmniXtreme: Breaking the Generality Barrier in High-Dynamic Humanoid Control

Ep#75: TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics

Ep#74: Weave Robotics

Ep#73: VideoManip: Dexterous Manipulation Policies from RGB Human Videos via 3D Hand-Object Trajectory Reconstruction

Ep#72: SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control

Ep#71: Build Your Own Robot