Transcript
As for specs, the fabrication and assembly of Adam involves a comprehensive 317-hour process spanning from 3D modelling to manufacturing. Adam features 37 degrees of freedom and is powered by a total of 20 actuators plus a PMC control system, while its 163 CNC-machined components and 397 3D-printed parts enable its physical operation. Altogether, seven types of composite materials and 13 different processes were used, all produced at the PND manufacturing centre. But while these humanoid robots have advanced significantly, challenges in designing such complex systems have remained. This is why PNDbotics developed an imitation learning framework that uses human locomotion data, marking a first for full-size humanoid robots.
Importantly, this framework gives Adam its human-like characteristics when executing locomotion tasks. And while it’s unclear how much Adam will cost, the robot is built for real-world applications and human interaction, with the potential to automate new tasks as time goes on. As for the future, PNDbotics remains focused on advancing its robot’s capabilities so that Adam can compete with other AI-powered humanoids moving forward. Meanwhile, a breakthrough vision-language AI model named Qwen2-VL was just released with several new record-setting abilities. But how smart is it? In terms of intelligence, Qwen2-VL achieves state-of-the-art performance in visual comprehension tests like MathVista, DocVQA and MTVQA.
It excels in interpreting images of various resolutions and aspect ratios and offers deep understanding of videos over 20 minutes long, ideal for tasks such as video-based question answering, dialogue and content creation. And beyond visual analysis, Qwen2-VL also integrates with mobile devices and robots, enabling autonomous operations. Its sophisticated reasoning and decision-making allow seamless interaction with environments through visual and text instructions. Plus, in order to serve a global audience, it supports multiple languages, including European languages, Japanese, Korean, Arabic and Vietnamese, as well as English and Chinese. Furthermore, when it comes to collaboration, the team has also open-sourced both the Qwen2-VL-2B and 7B models under the Apache 2.0 license, plus they’ve released an API for Qwen2-VL-72B.
Importantly, these resources are integrated into platforms like Hugging Face Transformers and vLLM, inviting global developers to explore and enhance the model’s capabilities. And its performance is stunning, as Qwen2-VL is evaluated across six dimensions: problem solving, mathematical abilities, document comprehension, multilingual text-image understanding, general question answering, video comprehension and agent interactions. Specifically, the 72B model consistently excels, often even outperforming renowned models like GPT-4o and Claude 3.5 Sonnet, particularly in document understanding. But the model’s recognition capabilities extend beyond object identification, as it even understands complex relationships between multiple objects and recognizes handwritten text across languages.
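As a concrete example of that Hugging Face Transformers integration, here is a minimal sketch of running the open-weight 7B instruct model on a single image. It assumes a recent transformers release with Qwen2-VL support, the qwen-vl-utils helper package, and a placeholder image path; treat it as a sketch of the published usage pattern, not a definitive recipe.

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the open-weight 7B instruct model and its processor (weights download from the Hub).
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# One user turn mixing an image with a text question; "example.jpg" is a placeholder path.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "example.jpg"},
        {"type": "text", "text": "Describe this image."},
    ],
}]

# Build the chat prompt, extract the vision inputs, and generate an answer.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:],
                                skip_special_tokens=True)[0]
print(answer)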
These visual reasoning skills enable it to solve real-world problems and interpret complex mathematical problems using chart analysis. On top of this, its abilities extend beyond static images, as it excels in video content analysis too. It can summarize videos, answer related questions, and engage in real-time conversations, offering live chat support. This could even position it to become a valuable personal assistant, providing insights drawn directly from video content. Furthermore, as a visual agent, Qwen2-VL excels in function calling, using external tools for real-time data retrieval, like flight statuses or weather forecasts, through visual cues.
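To make the function-calling idea concrete, here is a minimal sketch of the agent loop that sits around such a model: the model emits a structured tool call, the host code executes the tool, and the result is fed back as an observation. The tool name, JSON call format, and stubbed data below are illustrative assumptions, not Qwen2-VL's documented tool interface.

import json

# Hypothetical tool the agent may call; name, signature, and data are illustrative only.
def get_flight_status(flight_number: str) -> dict:
    return {"flight": flight_number, "status": "on time"}  # stubbed result

TOOLS = {"get_flight_status": get_flight_status}

def run_agent_step(model_reply: str) -> str:
    """Parse a model reply that may contain a JSON tool call, execute the tool,
    and return the observation to feed back into the conversation."""
    try:
        call = json.loads(model_reply)  # e.g. {"tool": "get_flight_status", "args": {...}}
    except json.JSONDecodeError:
        return model_reply              # plain-text answer, no tool needed
    tool = TOOLS.get(call.get("tool", ""))
    if tool is None:
        return f"Unknown tool: {call.get('tool')}"
    return json.dumps(tool(**call.get("args", {})))

# Example: after reading a boarding pass image, the model emits a tool call like this.
print(run_agent_step('{"tool": "get_flight_status", "args": {"flight_number": "MU583"}}'))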
This integration enhances its utility, transforming it into a powerful tool for information management and decision-making, with capabilities in visual interactions similar to human perception. But despite its advanced features, Qwen2-VL can’t extract audio from videos, and its knowledge is current only as of June 2023. It also struggles with tasks involving counting, character recognition, and 3D spatial awareness. The model uses the Qwen-VL architecture, combining a vision transformer with Qwen2 language models. With a ViT of about 600 million parameters, it handles image and video inputs seamlessly. Innovations like M-ROPE (Multimodal Rotary Position Embedding) enable it to integrate textual, visual, and video positional information, ensuring consistent input-output dynamics.
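As a rough illustration of the M-ROPE idea, the sketch below builds separate temporal, height, and width position indices for vision tokens, while text tokens share one index across all three axes. The exact offsets and ordering in the real model differ, so this is a simplified assumption for intuition only.

import numpy as np

def mrope_position_ids(num_text_tokens: int, frames: int, h_patches: int, w_patches: int):
    """Illustrative 3D (temporal, height, width) position ids in the spirit of M-ROPE."""
    # Text tokens: identical ids across the three axes, like ordinary 1D RoPE.
    text_ids = np.arange(num_text_tokens)
    text_pos = np.stack([text_ids, text_ids, text_ids], axis=0)        # (3, num_text_tokens)

    # Vision tokens: one (frame, row, column) triple per patch of each frame.
    t, r, c = np.meshgrid(np.arange(frames), np.arange(h_patches), np.arange(w_patches),
                          indexing="ij")
    vision_pos = np.stack([t.ravel(), r.ravel(), c.ravel()], axis=0)   # (3, frames*h*w)
    vision_pos += num_text_tokens                                      # place vision after the text span

    return np.concatenate([text_pos, vision_pos], axis=1)             # (3, total_tokens)

# 4 text tokens followed by a 2-frame video of 2x2 patches -> 12 position triples.
print(mrope_position_ids(num_text_tokens=4, frames=2, h_patches=2, w_patches=2).shape)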
Because of this, Qwen2-VL is a breakthrough among AI vision-language models, with its capabilities in visual understanding, multilingual support, and intelligent agent functionality set to transform AI guidance and interactions. And as it integrates into various platforms, Qwen2-VL will likely help unlock new possibilities for users worldwide. And finally, Chinese AI startup MiniMax has introduced video-01, an innovative AI model that creates high-resolution videos from text prompts. This marks the company’s first major step into the competitive AI video generation market, challenging established players like OpenAI’s Sora. During a company event, MiniMax highlighted video-01’s ability to produce videos at a resolution of 1280 by 720 pixels and 25 frames per second, with the model also offering virtual camera control to add an extra layer of creative flexibility for users.
And while the videos are capped at 6 seconds for now, MiniMax plans to extend this limit to 10 seconds soon. As for specs, technical details around the model’s architecture and training data remain undisclosed. In the future, planned enhancements include support for image inputs alongside text prompts for more granular control. For those interested in using video-01 today, it’s currently available for free on the company’s website, requiring users to register with a mobile number. MiniMax also offers an API for developers, emphasizing their focus on accessibility rather than immediate commercialization.
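For developers curious what calling such a text-to-video API might look like, here is a hypothetical sketch using Python's requests library. The endpoint URL, field names, and response shape are placeholders assumed for illustration and are not MiniMax's documented API.

import requests

# Illustrative only: placeholder endpoint, key, and payload fields, not the documented API.
API_URL = "https://api.minimax.example/v1/video_generation"
API_KEY = "YOUR_API_KEY"

payload = {
    "model": "video-01",
    "prompt": "A red panda walking through a snowy bamboo forest at dusk",
}

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json=payload,
    timeout=60,
)
resp.raise_for_status()
# A real service would typically return a task id or a URL to the finished clip.
print(resp.json())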
As for styles, users can choose from anime, CGI, and video game graphics. Impressively, the model produces relatively few image errors and can even display text within the videos. To use the service, MiniMax requires users to accept terms of use prohibiting illegal content creation. The company is vigilant about preventing the spread of rumors, privacy violations, and illegal information. However, the AI is versatile, capable of generating videos featuring well-known personalities, including political figures like Donald Trump and Vladimir Putin. It does, however, block explicit content and discreetly watermarks all videos. Founded in late 2021, MiniMax has quickly expanded its AI offerings, which include a large language model and a text-to-speech model as well.
And beyond OpenAI’s Sora, competitors like Kling, Vidu, and Jimeng AI have also made their tools accessible to users, leveling the playing field in text-to-video generation.