OpenAI and Figure just unveiled the world's most advanced combination of robot dexterity, learning, and communication, allowing a machine to hold full conversations, understand its surroundings, plan ahead, and execute complex tasks with human-like precision. On top of this, every action the robot performs in this demo is a learned behavior acquired through rigorous training, not remote control by a human operator. Here's the first part of the demo.

Human: Hey, Figure 01. What do you see right now?
Figure 01: I see a red apple on a plate in the center of the table, a drying rack with cups and a plate, and you standing nearby with your hand on the table.
Human: Great. Can I have something to eat?
Figure 01: Sure thing.
Human: Great. Can you explain why you did what you just did while you pick up this trash?
Figure 01: On it. I gave you the apple because it's the only edible item I could provide you with from the table.
Human: Great. So, based on the scene right now, where do you think the dishes in front of you go next?
Figure 01: The dishes on the table, like that plate and cup, are likely to go into the drying rack next.
Human: Great. Can you put them there?
Figure 01: Of course.
Human: All right. So how do you think you did?
Figure 01: I think I did pretty well. The apple found its new owner, the trash is gone, and the tableware is right where it belongs.
Human: I agree. Thanks, Figure 01.
Figure 01: You're welcome. If there's anything else you need help with, just let me know.

This remarkable feat, once thought to be decades away, was achieved by integrating OpenAI's state-of-the-art multimodal AI, which pairs its most powerful language system with vision so it can understand and process both visual and textual information. By connecting to this model, the Figure 01 robot can describe its environment in rich detail, interpret everyday situations with nuanced understanding, and perform actions based on highly ambiguous, context-dependent requests from humans. What sets this robot even further apart is its capacity not only to listen and respond, but also to plan, reason, and verbally explain the thinking behind its actions, much like a human collaborator. By processing the entire conversation history, including past images and contextual cues, OpenAI's multimodal model enables the robot to generate spoken responses that humans can engage with naturally, fostering a seamless and intuitive interaction. The same model also determines which learned behavior the robot should execute to fulfill a given command, making it a truly autonomous and capable assistant. For instance, when prompted with a seemingly simple request like "Can you put that there?", the robot can interpret the intended action by drawing on previous context and visual information to work out what "that" and "there" refer to, a remarkably human-like cognitive ability that bridges the gap between artificial and natural intelligence. At the heart of the robot's motor control are its visuomotor transformers, neural networks that translate the robot's visual inputs directly into coordinated actions. These transformers process the robot's camera feed at 10 Hz, analyzing visual information in real time, and generate fluid 24-degree-of-freedom actions (wrist poses and finger joint angles) at 200 Hz, enabling smooth, natural movements that blend perception and execution.
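To make that two-tier design concrete, here is a minimal Python sketch of the loop described above: a multimodal model turns the conversation history and the latest camera image into a spoken reply plus the name of a learned behavior, and a fast visuomotor policy then converts roughly 10 Hz camera frames into roughly 200 Hz, 24-degree-of-freedom actions. Every name in it (multimodal_model, visuomotor_policy, hand_over_apple, and so on) is hypothetical; it only illustrates the reported split between a slow, language-level planner and a fast control policy, not Figure's or OpenAI's actual code.

```python
# Hypothetical sketch of the two-tier control described in the demo:
# a multimodal model picks a reply and a learned behavior; a visuomotor
# policy then maps ~10 Hz camera frames to ~200 Hz, 24-DoF actions.

import random
import time


def multimodal_model(history, image):
    """Stand-in for the multimodal model: returns a spoken reply and the
    name of a learned behavior to run. Hard-coded stub for illustration."""
    return ("Sure thing.", "hand_over_apple")


def camera_frame():
    """Stand-in for the robot's camera; returns a fake image."""
    return [[random.random() for _ in range(4)] for _ in range(4)]


def visuomotor_policy(frame, behavior):
    """Stand-in for a visuomotor transformer: maps the latest frame to a
    24-DoF action (wrist poses plus finger joint angles)."""
    return [random.uniform(-1.0, 1.0) for _ in range(24)]


def run_behavior(behavior, duration_s=0.1, frame_hz=10, action_hz=200):
    """Execute the selected behavior: refresh the camera frame at ~10 Hz
    while emitting low-level actions at ~200 Hz."""
    frame = camera_frame()
    for step in range(int(duration_s * action_hz)):
        if step % (action_hz // frame_hz) == 0:  # new image every 20 actions
            frame = camera_frame()
        action = visuomotor_policy(frame, behavior)
        # On real hardware the action would be sent to the actuators here.
        time.sleep(1.0 / action_hz)


history = [("human", "Can I have something to eat?")]
reply, behavior = multimodal_model(history, camera_frame())
print(reply)            # spoken response for the person
run_behavior(behavior)  # learned skill executed by the low-level policy
```

In this split, the language model only has to decide *what* to do and say, while the high-rate policy handles *how* the hand and wrist actually move.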
Powering this cutting-edge system is a robust computational infrastructure built on the latest advancements in cloud computing. The AI models that drive Figure 01's capabilities are trained and run on Microsoft's Azure cloud platform, harnessing the tech giant's vast resources and infrastructure to crunch the massive amounts of data such sophisticated machine learning requires. These advances are partly the result of a strategic partnership between OpenAI and Figure, whose stated goal is to develop next-generation AI models designed specifically for humanoid robots. Their aim is to accelerate the time to market for these machines by teaching them to process natural language with unprecedented fluency and to draw logical conclusions with human-like reasoning. These models, combining the latest advances in natural language processing, computer vision, and robotic control, will likewise be trained and run on Azure, leveraging that infrastructure to unlock new realms of possibility.

With companies like Google DeepMind also making significant strides in robotics using natural language processing, the race to develop truly intelligent autonomous robots capable of understanding and assisting humans in unprecedented ways is heating up rapidly. To this end, Google DeepMind just introduced its Scalable Instructable Multiworld Agent, or SIMA for short, designed to follow natural-language instructions to play video games. This AI doesn't just understand and execute complex tasks within digital environments; it also interprets highly nuanced human language, setting a new standard for AI interaction. SIMA was developed by analyzing gameplay in which players' actions were linked to verbal instructions, which helped it learn to mimic those behaviors through a method known as behavioral cloning. It combines image and text encoders with a transformer architecture, allowing it to process and integrate visual and linguistic inputs, alongside a memory mechanism for recalling past actions (a simplified code sketch of this setup follows below). This design enables SIMA to perform over 600 distinct skills, from basic navigation to menu use, purely by interpreting screen images and language commands, without any direct access to a game's code.

Beyond its gaming prowess, SIMA's potential applications span from virtual training and education to robotics and industrial automation, illustrating a future where AI can offer intuitive, language-based interaction across various domains. This innovation underscores the growing importance of creating AI that can work alongside humans, understanding and acting on instructions expressed in natural language. DeepMind's creation of SIMA not only represents a leap towards more sophisticated language-understanding AI agents, but also emphasizes the need for ethical development and deployment. As these technologies advance, prioritizing transparency and safety will be crucial to their acceptance and beneficial integration into society. Overall, Google's SIMA adds yet another level of natural language understanding, paving the way for a new generation of collaborations between humans and machines and hinting at a future where AI agents become integral to our daily lives, enhancing our capabilities and helping us overcome complex challenges.
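As a rough illustration of the behavioral cloning recipe mentioned above, here is a minimal PyTorch sketch: encode the screen image and the text instruction, fuse them with a small transformer over a short memory of recent frames, and train the agent to predict the keyboard-or-mouse action the human player actually took. The class name, layer sizes, vocabulary, and dummy data are all invented for the example; DeepMind's real SIMA architecture and training setup are far larger and differ in detail.

```python
# Minimal sketch (invented shapes and data) of a SIMA-style behavioral
# cloning step: image + text encoders, transformer fusion over a short
# memory of frames, and a cross-entropy loss on the player's action.

import torch
import torch.nn as nn

NUM_ACTIONS, MEMORY_LEN, DIM = 600, 8, 128  # illustrative sizes only


class SimaLikeAgent(nn.Module):
    def __init__(self):
        super().__init__()
        # Toy image encoder: flatten a 32x32 RGB frame into a DIM vector.
        self.image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, DIM))
        # Toy text encoder: mean of word embeddings for the instruction.
        self.text_encoder = nn.EmbeddingBag(10_000, DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)  # fuses text + frame memory
        self.action_head = nn.Linear(DIM, NUM_ACTIONS)

    def forward(self, frames, instruction_ids):
        # frames: (batch, memory, 3, 32, 32) past and current screen images
        # instruction_ids: (batch, words) token ids of the language instruction
        b, m = frames.shape[:2]
        img_tokens = self.image_encoder(frames.view(b * m, -1)).view(b, m, DIM)
        txt_token = self.text_encoder(instruction_ids).unsqueeze(1)
        fused = self.fusion(torch.cat([txt_token, img_tokens], dim=1))
        return self.action_head(fused[:, -1])  # predict the action for the latest frame


# One behavioral cloning step on a fake batch of recorded gameplay.
agent = SimaLikeAgent()
optimizer = torch.optim.Adam(agent.parameters(), lr=1e-4)
frames = torch.rand(4, MEMORY_LEN, 3, 32, 32)         # dummy screen history
instructions = torch.randint(0, 10_000, (4, 6))       # dummy instruction tokens
player_actions = torch.randint(0, NUM_ACTIONS, (4,))  # what the human actually did

logits = agent(frames, instructions)
loss = nn.functional.cross_entropy(logits, player_actions)  # imitate the player
loss.backward()
optimizer.step()
print(f"behavioral cloning loss: {loss.item():.3f}")
```

The key point the sketch captures is that the agent never touches the game's internals: its only inputs are pixels and language, and its only supervision is the actions humans took while following instructions.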
DeepMind's work with SIMA is a glimpse into a future where the line between human and artificial intelligence continues to blur, promising a world enriched by diligent, adaptable, and communicative AI partners.