The AVNet model supports audio-only input, visual-only input, or combined audio-visual input. Missing modalities are handled via an added null embedding.
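One way to picture the null-embedding idea is the minimal sketch below. It is not AVNet's actual implementation. The names `fuse`, `NULL_AUDIO`, `NULL_VISUAL`, and `EMBED_DIM` are hypothetical, the zero-valued null vectors stand in for what would presumably be learned parameters, and fusion by concatenation is an assumption made only for illustration.

```python
from typing import List, Optional, Sequence

EMBED_DIM = 4  # hypothetical embedding width, chosen only for illustration

# Hypothetical null embeddings. Zeros are placeholders; a trained model
# would presumably learn these vectors.
NULL_AUDIO: List[float] = [0.0] * EMBED_DIM
NULL_VISUAL: List[float] = [0.0] * EMBED_DIM


def fuse(
    audio: Optional[Sequence[float]],
    visual: Optional[Sequence[float]],
) -> List[float]:
    """Concatenate audio and visual embeddings, substituting a null
    embedding for whichever modality is missing (passed as None)."""
    if audio is None and visual is None:
        raise ValueError("at least one modality must be present")
    a = list(audio) if audio is not None else NULL_AUDIO
    v = list(visual) if visual is not None else NULL_VISUAL
    if len(a) != EMBED_DIM or len(v) != EMBED_DIM:
        raise ValueError("embeddings must have length EMBED_DIM")
    # The fused vector always has length 2 * EMBED_DIM, so a downstream
    # classifier sees the same input shape in all three input modes.
    return a + v


audio_emb = [0.1, 0.2, 0.3, 0.4]
visual_emb = [0.5, 0.6, 0.7, 0.8]

both = fuse(audio_emb, visual_emb)    # combined audio-visual input
audio_only = fuse(audio_emb, None)    # visual slot filled by NULL_VISUAL
visual_only = fuse(None, visual_emb)  # audio slot filled by NULL_AUDIO
```

Because the fused vector keeps a fixed length, the same downstream layers can serve audio-only, visual-only, and combined input without separate code paths.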
As autonomous driving advances, vehicles must reliably detect and respond to critical road events to operate safely.
Recognizing emergency vehicles is essential: the vehicle must yield promptly, both to comply with traffic laws and to protect first responders and the public.
Current methods detect emergency vehicles from either visual data or audio recordings, while existing bimodal sensor-fusion models require both modalities to be present and cannot be used for unimodal perception.
Robust detection requires fusing audio and visual cues to handle challenging conditions, such as occlusions and poor lighting that block line of sight, or elevated noise levels that mask sirens.