Transcript
1X, in collaboration with OpenAI, just shared developments showcasing a series of android robots capable of operating completely autonomously, learning end-to-end from raw data to perform a variety of tasks without human guidance, and the results are shocking. Incredibly, the demonstration video released by 1X features these androids executing tasks such as driving and object manipulation, all while being directed by a single vision-based neural network.
This network processes visual input and outputs actions ten times per second, controlling the robot’s limbs, grippers, and other body parts. The entire video demonstrates these capabilities without teleoperation, video editing, or preprogrammed movements, showing the robots’ abilities in a real-world environment. Even more interestingly, the underlying machine learning models that enable these behaviors were trained on a dataset compiled from a fleet of 30 EVE robots.
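To make the 10 Hz vision-to-action idea concrete, here is a minimal Python sketch of such a control loop, assuming PyTorch. The network architecture, frame size, 20-dimensional action vector, and the camera/actuator helpers are placeholders invented for the example, not 1X’s actual system.

```python
import time

import numpy as np
import torch
from torch import nn

CONTROL_HZ = 10            # actions are emitted roughly ten times per second
STEP_DT = 1.0 / CONTROL_HZ

# Placeholder policy: a small vision network mapping one camera frame to joint commands.
policy = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 96 * 96, 256),
    nn.ReLU(),
    nn.Linear(256, 20),    # e.g. arm joints, grippers, base velocities
)

def get_camera_frame() -> torch.Tensor:
    """Hypothetical camera driver: returns one RGB frame as a (1, 3, 96, 96) tensor."""
    return torch.rand(1, 3, 96, 96)

def send_commands(action: np.ndarray) -> None:
    """Hypothetical actuator interface for limbs and grippers."""
    pass

for _ in range(100):       # a real controller would loop indefinitely
    start = time.monotonic()
    frame = get_camera_frame()
    with torch.no_grad():
        action = policy(frame).squeeze(0).numpy()
    send_commands(action)
    # Sleep out the remainder of the 100 ms control period to hold ~10 Hz.
    time.sleep(max(0.0, STEP_DT - (time.monotonic() - start)))
```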
That fleet dataset was then used to develop a base model capable of understanding a wide array of physical interactions, including domestic chores and social engagement with humans and other robots. The training process then refines this base model for specific task categories, such as door handling or warehouse operations, before further tailoring it to execute particular tasks efficiently. This training methodology is a departure from traditional programming, relying instead on data-driven development.
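As a rough illustration of that staged recipe (a broad base model, then a task-category fine-tune, then a task-specific fine-tune), here is a hedged sketch assuming a generic behavior-cloning setup. The network, toy datasets, loss, and epoch counts are invented; only the three-stage progression comes from the description above.

```python
import copy

import torch
from torch import nn

def behavior_clone(model: nn.Module, dataset, epochs: int, lr: float = 1e-4) -> nn.Module:
    """Fit the policy to (observation, action) pairs by simple regression."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for obs, action in dataset:
            loss = nn.functional.mse_loss(model(obs), action)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

def make_policy() -> nn.Module:
    return nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 20))

# Toy stand-ins for logged demonstrations: batches of (observation, action) tensors.
fleet_data    = [(torch.rand(8, 64), torch.rand(8, 20)) for _ in range(100)]
door_data     = [(torch.rand(8, 64), torch.rand(8, 20)) for _ in range(20)]
one_task_data = [(torch.rand(8, 64), torch.rand(8, 20)) for _ in range(5)]

# Stage 1: broad base model trained on data from the whole fleet.
base_model = behavior_clone(make_policy(), fleet_data, epochs=5)
# Stage 2: refine for a task category (e.g. door handling).
category_model = behavior_clone(copy.deepcopy(base_model), door_data, epochs=3)
# Stage 3: tailor to one specific task.
task_model = behavior_clone(copy.deepcopy(category_model), one_task_data, epochs=2)
```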
In fact, the developers, referred to as Software 2.0 engineers, use data to impart new skills to the robots, bypassing the need for conventional coding. This approach not only streamlines the robots’ skill acquisition but also broadens their potential utility across various sectors. Furthermore, 1X plans to augment the availability of physical labor by deploying these intelligent, safe androids, which are designed specifically to operate in environments built for humans.
Plus, these robots’ humanlike form factor is designed for a new level of automated versatility and adaptability across a wide range of tasks. Specifically, the company’s strategy is to keep working toward full autonomy through end-to-end learning of motor behaviors, using visual inputs and neural networks as the foundation. As 1X progresses in its endeavors, the potential impact on multiple industries, including manufacturing and healthcare, is massive: these androids offer the possibility of addressing labor shortages, undertaking hazardous tasks, and adapting to new challenges efficiently.
And when compared to other notable humanoid robots like Tesla’s Optimus, the contrasts are striking. To start, unlike Optimus, which has demonstrated its capabilities by performing preprogrammed tasks and interacting in controlled environments, 1X’s androids showcase a deeper level of autonomy through their ability to learn and adapt end-to-end from raw data. This approach allows them to perform a broader range of tasks without direct human oversight or the need for specific programming for each new task.
Tesla’s Optimus, for instance, has been highlighted for its potential to perform useful tasks in industrial settings, leveraging Tesla’s expertise in automation and battery technology. However, its autonomy has been showcased primarily through demonstrations of preset tasks. Similarly, Ameca, developed by Engineered Arts, is a leap in humanoid robotics, but its expressive facial movements and interaction capabilities are designed more for social interaction and entertainment than for complex task execution. In contrast, 1X’s androids are directed by a single vision-based neural network that processes visual inputs to control movements and actions in real time, a methodology that marks a significant departure from the reliance on teleoperation or scripted movements seen in other humanoid robots.
This use of end-to-end learning is a paradigm shift: robots learn a mapping from raw data directly to action, bypassing the traditional, more segmented processing stages such as feature extraction, interpretation, and action planning in favor of a unified model that handles everything from perception to action. End-to-end learning is also advantageous in complex, unstructured environments where predefined rules fall short. Overall, the ongoing development of 1X’s autonomous robots marks a significant step forward in robotics, emphasizing the role of advanced machine learning in achieving practical and adaptable robotic solutions.
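To make the pipeline-versus-end-to-end contrast above concrete, here is an illustrative Python/PyTorch sketch. The stage boundaries, tensor shapes, and network are invented for the example: a hand-designed pipeline passes intermediate representations between separate modules, while an end-to-end policy learns the whole mapping from pixels to commands.

```python
import torch
from torch import nn

frame = torch.rand(1, 3, 64, 64)   # one camera frame

# Traditional pipeline: hand-designed stages chained together.
def extract_features(img: torch.Tensor) -> torch.Tensor:
    """Stand-in for e.g. edge/keypoint/object-detection features."""
    return img.flatten(1)

def interpret(features: torch.Tensor) -> torch.Tensor:
    """Stand-in for a symbolic scene description ('cup at (x, y), gripper open')."""
    return features[:, :128]

def plan_action(state: torch.Tensor) -> torch.Tensor:
    """Stand-in for a scripted planner that emits motor commands."""
    return state[:, :20]

action_pipeline = plan_action(interpret(extract_features(frame)))

# End-to-end: a single network maps pixels directly to motor commands,
# and every intermediate stage is learned from data rather than hand-coded.
end_to_end_policy = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 64 * 64, 256),
    nn.ReLU(),
    nn.Linear(256, 20),
)
action_learned = end_to_end_policy(frame)
```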
This progress not only accelerates the pace at which robots can be deployed across various sectors but also opens up new possibilities for personalized, context-aware robotics applications. Moving away from rigid, task-specific programming promises a future with a more flexible, learning-based approach that turns robots into adaptable partners capable of evolving alongside their human counterparts. And it’s not just autonomous robots shocking the tech world, but also a series of new autonomous web agents, leading to Carnegie Mellon University’s introduction of VisualWebArena to test the capabilities of these budding multimodal AIs.
This new benchmark measures what AI can achieve when navigating and performing tasks on the web, focusing on realistic and visually demanding challenges. The new wave of web-based autonomous agents has opened up possibilities for automating a wide array of computer operations: equipped with the ability to reason, plan, and execute tasks, these agents could significantly streamline how we interact with digital environments. Yet their development has been hindered by the complexity of integrating visual and textual understanding with the execution of nuanced, goal-oriented actions.
Previous benchmarks have largely focused on text-based challenges, overlooking the critical component of visual information processing. VisualWebArena addresses this gap by presenting agents with tasks that require understanding both image and text inputs as well as executing actions based on natural-language instructions. The benchmark encompasses 910 tasks spread across three distinct web environments, Reddit, shopping, and classifieds, with the last being a novel addition that introduces real-world complexity to the challenges.
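As a loose illustration of what such image-and-text-grounded tasks look like to an agent harness, here is a small Python sketch. The WebTask fields, the dummy agent, and the success check are assumptions made for this example, not VisualWebArena’s actual schema or evaluation code.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class WebTask:
    environment: str                      # "reddit", "shopping", or "classifieds"
    instruction: str                      # natural-language goal, possibly about an image
    input_image: Optional[str]            # path/URL of an image the agent must ground on
    check_success: Callable[[str], bool]  # functional check over the final page state

def evaluate(agent, tasks: List[WebTask]) -> float:
    """Run the agent on every task and report the overall success rate."""
    successes = 0
    for task in tasks:
        final_state = agent.run(task.environment, task.instruction, task.input_image)
        successes += bool(task.check_success(final_state))
    return successes / len(tasks)

class DummyAgent:
    """Trivial agent that does nothing, used only to show the interface."""
    def run(self, environment: str, instruction: str, input_image: Optional[str]) -> str:
        return "<html>no action taken</html>"

demo_tasks = [
    WebTask("classifieds", "Message the seller of the red bicycle shown in this photo.",
            "bike.jpg", lambda html: "message sent" in html),
]
print(evaluate(DummyAgent(), demo_tasks))  # 0.0 -- the dummy agent never succeeds
```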
The evaluation conducted by the CMU team sheds light on the limitations of current large language model-based agents, particularly those that do not incorporate multimodal data processing. Through both quantitative and qualitative analysis, the researchers identified areas where these agents fall short, especially in tasks that demand a deep understanding of visual content. Interestingly, though, the study reveals that vision-language models, which are designed to process and interpret both visual and textual information, perform better on the VisualWebArena tasks than their text-only counterparts.
However, even the best-performing VLMs achieved a success rate of only 16.4%, a stark contrast to the 88.7% success rate demonstrated by human participants. This gap underscores the challenges that lie ahead in achieving humanlike performance on web-based tasks. The research also highlights a significant disparity between open-source VLMs and those available through APIs, pointing to the need for comprehensive metrics that can accurately assess the performance of these agents.
A promising development is the introduction of a new VLM agent inspired by the Set-of-Marks prompting strategy, which has shown potential for improving performance on visually complex web pages by simplifying the action space. As AI continues to evolve, benchmarks like VisualWebArena will play a pivotal role in shaping the next generation of web-based AI applications, driving innovation and expanding the horizons of what AI can achieve in complex, real-world environments.
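For intuition, here is a minimal sketch of Set-of-Marks-style prompting for a web agent: interactive elements in a screenshot are tagged with numbered marks, so the model only has to name a mark ID and an action instead of producing free-form coordinates. The element list, prompt wording, and action format are placeholders, not the paper’s code.

```python
from dataclasses import dataclass

@dataclass
class MarkedElement:
    mark_id: int
    role: str       # "button", "link", "textbox", ...
    text: str       # visible label scraped from the page

def build_som_prompt(instruction: str, elements: list) -> str:
    """Build a text prompt that pairs the instruction with the numbered marks."""
    legend = "\n".join(f"[{e.mark_id}] {e.role}: {e.text}" for e in elements)
    return (
        f"Task: {instruction}\n"
        f"The screenshot has numbered marks on its interactive elements:\n{legend}\n"
        "Reply with one action, e.g. CLICK [3] or TYPE [1] 'search text'."
    )

elements = [
    MarkedElement(1, "textbox", "Search listings"),
    MarkedElement(2, "button", "Search"),
    MarkedElement(3, "link", "Red mountain bike - $120"),
]
print(build_som_prompt("Open the cheapest red bicycle listing.", elements))
# The prompt (plus the mark-annotated screenshot) would then be sent to a vision-language model,
# which picks an action over the small, discrete set of marks rather than raw pixel coordinates.
```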
Due to all of this, it’s expected that these advances will lead to robots and computers that don’t need humans for anything, within just the next couple of years.