Vision-Language-Action (VLA) models have recently achieved significant advances in robotic manipulation. However, vision-only VLA models face fundamental limitations, particularly in perceiving contact events and the dynamics of manipulation processes. This paper proposes Audio-VLA, a multimodal manipulation policy that leverages contact audio to perceive contact events and obtain feedback on dynamic processes, overcoming the vision-only constraints of existing VLA models. Additionally, this paper introduces the Task Completion Rate (TCR) metric to systematically evaluate dynamic operational processes. Audio-VLA employs pre-trained DINOv2 and SigLIP as visual encoders, AudioCLIP as the audio encoder, and Llama2 as the large language model backbone. We apply LoRA fine-tuning to these pre-trained modules to achieve robust cross-modal understanding of both visual and acoustic inputs, and a multimodal projection layer aligns features from the different modalities into a shared feature space. Moreover, the RLBench and LIBERO simulation environments are augmented with collision-based audio generation to provide realistic sound feedback during object interactions. Since current robotic manipulation evaluations focus on final outcomes rather than systematically assessing dynamic operational processes, the proposed TCR metric measures how well robots perceive dynamic processes during manipulation, providing a more comprehensive evaluation. Extensive experiments on LIBERO, RLBench, and two real-world tasks demonstrate Audio-VLA's superior performance over vision-only baselines, while the TCR metric effectively quantifies dynamic process perception capabilities.
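As a rough illustration of the architecture described above, the following minimal PyTorch sketch shows how per-modality encoder features could be projected into a shared LLM embedding space before being concatenated with text tokens. The class name MultimodalProjector, the feature dimensions, and the stand-in random tensors are assumptions for illustration only, not the paper's exact implementation; in the real model the DINOv2, SigLIP, and AudioCLIP encoders are pre-trained and adapted with LoRA.

```python
import torch
import torch.nn as nn


class MultimodalProjector(nn.Module):
    """Hypothetical sketch: project visual and audio encoder outputs into the
    LLM (Llama2) embedding space so they can be fed alongside text tokens."""

    def __init__(self, dino_dim=1024, siglip_dim=1152, audio_dim=1024, llm_dim=4096):
        super().__init__()
        # One linear projection per modality, mapping into the LLM hidden size.
        self.proj_dino = nn.Linear(dino_dim, llm_dim)
        self.proj_siglip = nn.Linear(siglip_dim, llm_dim)
        self.proj_audio = nn.Linear(audio_dim, llm_dim)

    def forward(self, dino_tokens, siglip_tokens, audio_tokens):
        # Each input: (batch, num_tokens, encoder_dim). After projection all
        # modalities share llm_dim, so they concatenate along the token axis.
        fused = torch.cat(
            [
                self.proj_dino(dino_tokens),
                self.proj_siglip(siglip_tokens),
                self.proj_audio(audio_tokens),
            ],
            dim=1,
        )
        return fused  # (batch, total_tokens, llm_dim)


if __name__ == "__main__":
    projector = MultimodalProjector()
    dino = torch.randn(1, 256, 1024)    # placeholder DINOv2 patch tokens
    siglip = torch.randn(1, 196, 1152)  # placeholder SigLIP patch tokens
    audio = torch.randn(1, 16, 1024)    # placeholder AudioCLIP clip tokens
    print(projector(dino, siglip, audio).shape)  # torch.Size([1, 468, 4096])
```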
To train and evaluate our proposed Audio-VLA, we augment the LIBERO and RLBench simulation environments with realistic acoustic feedback by integrating real-world audio recordings triggered through physics-based collision detection. Our approach systematically collects contact sounds from physical manipulation: for each simulated task, we identify target objects by material properties and dimensions, then perform equivalent manipulations on similar real-world objects while recording contact audio at 48 kHz using gripper-mounted microphones. The collected recordings are organized into a structured library indexed by material pairs, interaction types, and force magnitudes. During simulation, two types of collision events are monitored: direct gripper–object contacts and interactions between grasped objects and the environment. When a collision is detected, the participating objects, impact velocity, and force magnitude are identified, and these collision parameters are used to query the audio library and retrieve appropriate sound samples. The retrieved audio is dynamically modulated, with amplitude scaled by collision force, pitch shifted according to object size, and duration adjusted for continuous contacts.
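To make the retrieval-and-modulation step concrete, the sketch below shows one plausible way a collision event could be mapped to a playable sound clip. The library schema, the modulation heuristics, and the function names (query_library, modulate) are illustrative assumptions under the description above, not the actual implementation.

```python
import numpy as np

# Hypothetical audio library: (material_pair, interaction_type) -> list of
# (labelled_force, waveform) entries recorded at 48 kHz.
SAMPLE_RATE = 48_000
audio_library = {
    (("steel", "wood"), "impact"): [(5.0, np.random.randn(SAMPLE_RATE // 2))],
    (("steel", "ceramic"), "slide"): [(2.0, np.random.randn(SAMPLE_RATE))],
}


def query_library(material_pair, interaction_type, force):
    """Return the recording whose labelled force is closest to the query force."""
    entries = audio_library[(material_pair, interaction_type)]
    return min(entries, key=lambda e: abs(e[0] - force))[1]


def modulate(waveform, force, object_size, contact_duration_s=None):
    """Scale amplitude by collision force, shift pitch by object size (via naive
    resampling), and stretch/trim for continuous contacts. Heuristics are
    illustrative placeholders, not the paper's exact parameters."""
    out = waveform * min(force / 5.0, 2.0)           # amplitude ~ collision force
    rate = np.clip(1.0 / object_size, 0.5, 2.0)      # smaller object -> higher pitch
    idx = np.arange(0, len(out), rate)               # resample to shift pitch
    out = np.interp(idx, np.arange(len(out)), out)
    if contact_duration_s is not None:               # loop/trim for sustained contact
        out = np.resize(out, int(contact_duration_s * SAMPLE_RATE))
    return out


# Example: a detected gripper-object impact with force 4.2 N on a smaller object.
clip = modulate(query_library(("steel", "wood"), "impact", force=4.2),
                force=4.2, object_size=0.8)
print(clip.shape)
```

In this sketch the physics engine's collision callback would supply the material pair, interaction type, force, and object size; the returned clip is then played back at the moment of contact.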
In standard settings, Audio-VLA achieves the highest success rates on LIBERO and RLBench, surpassing all baselines. The integration of acoustic perception effectively complements visual input, allowing precise recognition of contact events and improving decision-making in long-horizon and contact-intensive tasks. When lighting, color, and texture vary, Audio-VLA maintains stable performance with minimal degradation compared to vision-only models. Its use of domain-invariant acoustic cues enables consistent contact perception and task execution even under visual disturbances.
Under seen conditions, Audio-VLA achieves up to threefold higher success and completion rates than vision-only baselines on EAWM and S5GO. By capturing acoustic cues of friction and depth-weight dynamics, it effectively handles contact-dependent transitions that visual models fail to perceive. When visual appearances change, Audio-VLA maintains stable task performance while vision-only methods collapse. The domain-invariant nature of contact audio preserves physical interaction cues, enabling consistent understanding and generalization across unseen environments.
coming soon