
Researchers from the University of California and NVIDIA recently released a new vision-language model called "NaVILA". Its highlight is that NaVILA offers a new approach to robot navigation.
Image: the NaVILA paper
A vision-language model (VLM) is a multimodal generative AI model that can reason over text, image, and video prompts. It combines a large language model (LLM) with a visual encoder, giving the LLM the ability to "see".
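To make the "LLM plus visual encoder" idea concrete, here is a minimal, self-contained sketch of that architecture: image features are projected into the language model's token space and read alongside the text prompt. It is purely illustrative; the toy modules, sizes, and names below are assumptions, not the actual NaVILA/NVILA implementation.

```python
# Illustrative toy VLM: vision encoder -> projector -> language model.
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, d_vision=768):
        super().__init__()
        # Stand-in vision encoder; a real system would use a pretrained ViT.
        self.vision_encoder = nn.Sequential(
            nn.Linear(d_vision, d_vision), nn.GELU(), nn.Linear(d_vision, d_vision)
        )
        # Projector maps visual features into the LLM's token-embedding space.
        self.projector = nn.Linear(d_vision, d_model)
        # Stand-in "LLM": a small Transformer stack over token embeddings.
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image_patches, text_ids):
        # image_patches: (B, n_patches, d_vision); text_ids: (B, seq_len)
        vis_feats = self.vision_encoder(image_patches)   # encode the image
        vis_tokens = self.projector(vis_feats)           # -> LLM embedding space
        txt_tokens = self.token_emb(text_ids)            # embed the text prompt
        # Prepend visual tokens so the LLM "sees" the image while reading the text.
        seq = torch.cat([vis_tokens, txt_tokens], dim=1)
        hidden = self.llm(seq)
        # Predict next tokens only over the text positions.
        return self.lm_head(hidden[:, vis_tokens.size(1):])

# Usage: 196 patch features (one image) plus a 16-token text prompt.
model = ToyVLM()
logits = model(torch.randn(1, 196, 768), torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # torch.Size([1, 16, 32000])
```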
Traditional robot navigation usually depends on pre-built maps and complex sensor systems. The NaVILA model needs no prior map: the robot only has to "understand" a natural-language instruction from a human and, combining it with real-time camera images and LiDAR data, can perceive paths, obstacles, and moving targets in its surroundings in real time and navigate autonomously to the designated location.
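The following is a hedged, self-contained sketch of what such a map-free, instruction-driven navigation loop could look like. Every function below (get_camera_frame, get_lidar_scan, query_vlm_for_action, execute_low_level) is a hypothetical stand-in for illustration, not NaVILA's actual interface.

```python
# Sketch of a closed-loop, map-free navigation cycle driven by language + vision.
import random

def get_camera_frame():
    return "rgb_frame"                                   # stand-in for a real RGB image

def get_lidar_scan():
    return [random.random() for _ in range(360)]         # fake 360-degree range readings

def query_vlm_for_action(instruction, recent_frames):
    # A real system would prompt the vision-language model here; we fake its output.
    return random.choice(["move forward 0.5 meters", "turn left 30 degrees", "stop"])

def execute_low_level(action, scan):
    # A real locomotion controller would turn the language action into joint
    # commands while steering around obstacles seen in the LiDAR scan.
    print(f"executing: {action} (closest obstacle {min(scan):.2f})")

def navigate(instruction, max_steps=100):
    history = []                                         # recent frames give short-term visual memory
    for _ in range(max_steps):
        frame, scan = get_camera_frame(), get_lidar_scan()
        history.append(frame)
        action = query_vlm_for_action(instruction, history[-8:])
        if action == "stop":
            return True                                  # the model believes the goal is reached
        execute_low_level(action, scan)
    return False

navigate("turn left a little and walk towards the poster, then stop at the open door")
```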
NaVILA not only drops the dependence on maps but also extends navigation technology from wheeled robots to legged ones, with the aim of letting robots handle more complex scenarios and giving them obstacle-crossing and adaptive path-planning abilities.
In the paper, the University of California researchers ran experiments with the Unitree Go2 robot dog and the Unitree G1 humanoid robot. According to the team's tests, NaVILA reaches a navigation success rate of up to 88% in real environments such as homes, outdoor spaces, and workplaces, and 75% on complex tasks.
The Go2 robot dog receives the action command: "Turn left a little and walk towards the portrait poster; you will see an open door."
The G1 humanoid robot receives the action command: "Turn left immediately and go straight, step onto the mat and keep moving forward until you approach the trash can, then stop."
It is reported that the NaVILA model has the following characteristics:
Optimized accuracy and efficiency: the NVILA model cuts training costs by a factor of 4.5 and fine-tuning memory requirements by a factor of 3.4, and nearly doubles pre-filling and decoding speed (compared with another large vision model, LLaVA-OneVision).
High-resolution input: rather than optimizing the input by shrinking photos and videos, the NVILA model works from high-resolution images and multiple video frames so that no details are lost.
Compression technology: NVIDIA points out that training vision-language models is very expensive and fine-tuning them is very memory-intensive; a 7B-parameter model needs over 64GB of GPU memory. NVIDIA therefore adopts a technique it calls "expand first, compress later", which shrinks the input by compressing visual information into fewer tokens and grouping pixels so that important information is preserved, balancing the model's accuracy and efficiency (see the sketch after this list).
Multimodal reasoning ability: the NVILA model can answer multiple queries about a single image or video, demonstrating strong multimodal reasoning.
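Here is a minimal sketch of the "expand first, compress later" idea: first scale up the visual input (higher resolution, more frames), then merge neighbouring visual tokens so far fewer of them reach the language model. The 2x2 spatial grouping below is an illustrative assumption, not NVILA's exact recipe.

```python
# Merge neighbouring patch tokens to cut the number of visual tokens fed to the LLM.
import torch

def compress_visual_tokens(tokens: torch.Tensor, grid: int, group: int = 2) -> torch.Tensor:
    """Merge each group x group block of patch tokens into one token.

    tokens: (B, grid*grid, D) patch features laid out on a grid x grid grid.
    Returns: (B, (grid//group)**2, D*group*group) compressed tokens.
    """
    b, n, d = tokens.shape
    assert n == grid * grid and grid % group == 0
    x = tokens.view(b, grid, grid, d)
    # Split each spatial axis into (blocks, within-block) so that every 2x2
    # neighbourhood of patches ends up concatenated into a single token,
    # keeping its features instead of discarding them.
    x = x.view(b, grid // group, group, grid // group, group, d)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (grid // group) ** 2, group * group * d)
    return x

# Usage: 1024 high-resolution patch tokens shrink to 256 tokens (4x fewer),
# which is what reduces the LLM-side compute and memory.
tokens = torch.randn(1, 32 * 32, 768)
compressed = compress_visual_tokens(tokens, grid=32)
print(tokens.shape, "->", compressed.shape)  # (1, 1024, 768) -> (1, 256, 3072)
```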
In video benchmark tests, NVILA outperforms GPT-4o mini and also holds up well against GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. It likewise edges out Llama 3.2.
NVIDIA has not yet released the model on the Hugging Face platform, but it has promised to publish the code and model soon to support reproducibility.