Transcript
On top of this, the hand is equipped with a total of 978 self-developed tactile sensors offering up to 15 types of multi-dimensional tactile sensing, and together these sensors generate 3,912 channels of tactile signals, letting the robot hand accurately perceive texture, pressure, and other properties. Paired with an 8-megapixel HD hand-eye camera, the system also handles precise spatial calculations and object recognition, including shape, position, and orientation, and it's supported by a zero-shot position-estimation vision algorithm. As for power, the DEX H13 Gen2 is built for industrial applications, handling payloads of up to five kilograms with a lifespan rated for over one million cycles.
And when it comes to cost, the DEX H13 Gen2 is priced at just under 700 US dollars and features a modular design for easy integration and low-maintenance operation. But that's just the beginning, as Stanford researchers also just released a new system, Human-Object Interaction from Human-Level Instructions, that lets robots perform complex tasks with human-like precision by combining language comprehension with advanced motion generation, allowing intelligent agents to interpret detailed instructions and execute them like a real person. It works by using large language models to break human instructions down into step-by-step action plans, after which a low-level motion generator synchronizes hand, finger, and full-body motions to manipulate objects naturally.
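As a rough illustration of that two-level structure, here is a minimal Python sketch with both the language-model planner and the motion generator stubbed out; the function names and plan format are placeholders for illustration, not the paper's actual interface.

```python
from dataclasses import dataclass

@dataclass
class Subtask:
    action: str   # e.g. "reach", "grasp", "place"
    target: str   # object or location the action applies to

def plan_from_instruction(instruction: str) -> list[Subtask]:
    """Stand-in for the LLM planner: decompose a human-level instruction
    into an ordered list of low-level subtasks."""
    # A real system would prompt a large language model here; this stub
    # returns a fixed plan purely for illustration.
    return [
        Subtask("reach", "mug"),
        Subtask("grasp", "mug"),
        Subtask("place", "shelf"),
    ]

def generate_motion(subtask: Subtask) -> str:
    """Stand-in for the low-level motion generator that synthesizes
    synchronized body, hand, and finger trajectories for one subtask."""
    return f"trajectory for {subtask.action}({subtask.target})"

if __name__ == "__main__":
    for step in plan_from_instruction("put the mug on the shelf"):
        print(generate_motion(step))
```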
But what really sets the system apart is its reinforcement-learning-based physics tracker, which ensures that every movement obeys the laws of physics, with issues like hand-object collisions or floating feet corrected in real time. And when compared to older models like CNET and GRIP, Stanford's system delivers much more precise hand-object interactions and lifelike finger motions, with ablation studies further highlighting its advantages. On top of that, another new AI framework called OKAMI was just released, allowing humanoid robots to replicate complex human manipulation tasks by learning from just a single video. This two-stage method lets robots not only understand the human actions in a task but also recreate them in varied environments.
It starts with stage one, where OKAMI processes a video of a human performing a task to extract a reference manipulation plan, a spatiotemporal abstraction that captures the movement of objects and the human's trajectory between subgoals. To do this, OKAMI uses vision-language models to identify task-relevant objects and track their motion throughout the video. A human reconstruction model then maps the human's body movements, using these trajectories to understand how the task is performed. Subgoals, such as picking or placing an object, are then identified based on changes in object motion. Together, this combination of object tracking, human motion analysis, and subgoal detection forms a detailed plan that the robot can later execute.
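OKAMI's exact detection machinery isn't reproduced here, but one simple way to flag subgoal boundaries from changes in object motion is to threshold the tracked object's speed, as in this illustrative sketch; the threshold value and function name are assumptions, not the paper's implementation.

```python
import numpy as np

def segment_subgoals(object_positions: np.ndarray,
                     fps: float = 30.0,
                     speed_thresh: float = 0.02) -> list[int]:
    """Return frame indices where the tracked object switches between
    moving and stationary -- a crude proxy for subgoal boundaries such as
    'object picked up' or 'object put down'.

    object_positions: (T, 3) array of per-frame object centroids in meters.
    """
    velocities = np.diff(object_positions, axis=0) * fps   # (T-1, 3) in m/s
    speeds = np.linalg.norm(velocities, axis=1)            # per-frame speed
    moving = speeds > speed_thresh                          # boolean per frame
    # A boundary is any frame where the moving/stationary state flips.
    return [i + 1 for i in range(1, len(moving)) if moving[i] != moving[i - 1]]

# Example: an object that sits still, gets carried, then is set down again.
traj = np.concatenate([
    np.zeros((30, 3)),                                      # at rest
    np.linspace([0, 0, 0], [0.3, 0.0, 0.2], 45),            # being moved
    np.tile([0.3, 0.0, 0.2], (30, 1)),                      # at rest again
])
print(segment_subgoals(traj))
```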
In the second stage, OKAMI translates the human action plan into humanoid robot motion, where its object-aware retargeting adapts the human motion to the robot's structure and the real-world task layout. The system first localizes the task-relevant objects in the robot's environment and retrieves the subgoals. Trajectories from the human video are then retargeted to the robot using inverse kinematics and dexterous motion adaptations, and the robot's trajectory is adjusted to the current placement of objects during deployment, ensuring the motion is practical and accurate for the specific scenario. OKAMI has been tested on six diverse manipulation tasks, including picking, pouring, pushing, and bimanual coordination, and the method proved effective across different visual backgrounds, object types, and layouts.
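As a toy example of what that object-aware adjustment can look like, the sketch below warps a demonstrated end-effector path so it ends at the object's newly observed position, leaving the inverse-kinematics step implied; it is a simplified stand-in, not OKAMI's actual retargeting code.

```python
import numpy as np

def warp_trajectory(ee_traj: np.ndarray,
                    demo_object_pos: np.ndarray,
                    new_object_pos: np.ndarray) -> np.ndarray:
    """Shift a demonstrated end-effector trajectory so it terminates at the
    object's location observed at deployment time.

    The offset between the new and demonstrated object positions is blended
    in linearly along the trajectory, so the start pose stays put while the
    final waypoint lands on the relocated object.
    """
    offset = new_object_pos - demo_object_pos                 # (3,)
    weights = np.linspace(0.0, 1.0, len(ee_traj))[:, None]    # 0 at start, 1 at end
    return ee_traj + weights * offset

# Demonstrated reach toward an object at (0.4, 0.0, 0.1); at deployment the
# object is found at (0.5, 0.1, 0.1), so the reach is warped accordingly.
demo_traj = np.linspace([0.0, 0.0, 0.3], [0.4, 0.0, 0.1], 20)
warped = warp_trajectory(demo_traj,
                         np.array([0.4, 0.0, 0.1]),
                         np.array([0.5, 0.1, 0.1]))
print(warped[-1])  # ends at the new object position; inverse kinematics would
                   # then convert each Cartesian waypoint into joint angles.
```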
By generating a large dataset of successful task executions without relying on human teleoperation, OKAMI also enables the training of closed-loop visuomotor policies, which achieved success rates of 83.3% on bagging tasks and 75% on salt-sprinkling tasks. Despite these results, OKAMI is not without limitations: failures can still occur due to inaccuracies in the robot controllers, vision models, or inverse kinematics, leading to issues like missed grasps or unwanted collisions and highlighting areas for future refinement. But there's another breakthrough coming for robot control, as OpenAI just announced its newest O3 reasoning model as this year's frontier problem-solving AI, with several new abilities.
And unlike GPT-4, which relied on reinforcement learning from human feedback, the new O3 model uses a goal-driven reinforcement-learning approach, which helps it perform well in tasks where solutions are clearly defined, such as mathematics and programming. The O3 model can learn against specific objectives and then refine its strategies over time, constructing logical pathways to reach correct solutions. Importantly, this contrasts with traditional language models that simply predict the next word or token in a sequence. As a result, O3 achieves exceptional performance on coding and mathematical benchmarks, where outputs can be directly verified as either right or wrong.
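OpenAI hasn't published O3's training recipe, but the general idea of learning from verifiable outcomes can be sketched as a sample-verify-keep loop like the one below, where only solutions whose final answers check out are retained as training targets; every function here is a hypothetical stub for illustration, not OpenAI's code.

```python
import random

def sample_solution(problem: dict) -> dict:
    """Stand-in for the model proposing a worked solution plus a final answer."""
    answer = random.choice([problem["truth"], problem["truth"] + 1])  # sometimes wrong
    return {"steps": "placeholder reasoning", "answer": answer}

def verify(problem: dict, solution: dict) -> bool:
    """Outcome check: in math or code, the final result can be verified exactly."""
    return solution["answer"] == problem["truth"]

def collect_verified(problems: list[dict], samples_per_problem: int = 8) -> list[dict]:
    """Keep only solutions whose answers verify; these become training targets,
    so the model is rewarded for reaching correct end states rather than for
    merely predicting the next token."""
    kept = []
    for p in problems:
        for _ in range(samples_per_problem):
            s = sample_solution(p)
            if verify(p, s):
                kept.append({"problem": p, "solution": s})
    return kept

print(len(collect_verified([{"truth": 4}, {"truth": 10}])))
```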
As for O3's power, OpenAI credits its success to improvements in computational scaling, and the model's reliance on large-scale compute has also led to the development of the smaller O3 Mini. As for its intelligence, O3 has already achieved extremely impressive results in structured problem-solving, solving approximately 25% of the problems on the rigorous FrontierMath benchmark compared to just 2% for previous top AI models, a level of performance that wasn't expected to be approached for at least another year. And finally, to address growing demands for faster simulation speeds, intricate environments, and robust datasets for tackling complex robotic challenges, researchers have introduced MSHAB, a cutting-edge benchmark for low-level manipulation.
Importantly, MSHAB builds on the Home Assistant Benchmark with a GPU-accelerated implementation that runs over three times faster than previous methods while maintaining efficient GPU memory usage, and the benchmark supports realistic low-level robotic controls for more precise manipulation tasks. It also provides reinforcement learning and imitation learning baselines to serve as comparison points for future research, with a novel rule-based trajectory filtering system refining its RL policy demonstrations for safer and more predictable robot behaviors. This approach, paired with rapid simulation, enables large-scale, controlled data generation and sets a new gold standard for embodied artificial intelligence.
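The benchmark's specific filtering rules aren't listed here, but a rule-based trajectory filter generally looks something like the sketch below, where a demonstration is kept only if it succeeded, avoided collisions, and stayed within velocity limits; the field names and thresholds are illustrative assumptions, not MSHAB's actual rules.

```python
import numpy as np

def passes_rules(traj: dict,
                 max_joint_speed: float = 2.0,
                 max_collisions: int = 0) -> bool:
    """Apply simple safety/quality rules to one recorded trajectory.

    traj is assumed to carry:
      "success"    -- bool, whether the episode achieved its goal
      "collisions" -- int, number of unwanted contact events
      "joint_vel"  -- (T, n_joints) array of joint velocities in rad/s
    """
    if not traj["success"]:
        return False
    if traj["collisions"] > max_collisions:
        return False
    if np.abs(traj["joint_vel"]).max() > max_joint_speed:
        return False
    return True

def filter_demonstrations(trajectories: list[dict]) -> list[dict]:
    """Keep only demonstrations that satisfy every rule, yielding a cleaner
    dataset for imitation learning."""
    return [t for t in trajectories if passes_rules(t)]

# Tiny example: one clean rollout and one that clipped through the scene.
demos = [
    {"success": True, "collisions": 0, "joint_vel": np.full((50, 7), 0.5)},
    {"success": True, "collisions": 3, "joint_vel": np.full((50, 7), 0.5)},
]
print(len(filter_demonstrations(demos)))  # -> 1
```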
For more information on MSHAB, please visit MSHAB.com