New Audio-Interaction Model Enables Real-Time Streaming Responses

• Researchers developed Audio-Interaction, a unified streaming model for real-time audio instruction following and offline task execution.
• The SoundFlow framework enables an end-to-end "perceive-decide-respond" loop through asynchronous low-latency inference and comprehension-aware training.
• The team released the 2.6M-item StreamAudio-2M dataset and Proactive-Sound-Bench to evaluate proactive audio intervention capabilities.

Researchers from the National University of Singapore have introduced Audio-Interaction, a unified streaming model designed for real-time audio interaction. While existing Large Audio Language Models (LALMs) typically operate offline or handle isolated tasks like speech recognition, the new model performs continuous, real-time audio instruction following. It operates via a "perceive-decide-respond" loop, processing environmental sounds and user instructions simultaneously to provide immediate, context-aware reactions.

To facilitate this capability, the team developed the SoundFlow framework, which manages the entire lifecycle from data construction to training and deployment. SoundFlow integrates streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference to maintain system stability during live usage. These techniques allow the model to autonomously determine when to generate a response based on the semantics of an incoming audio stream.
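
As a rough illustration of the "perceive-decide-respond" loop and asynchronous inference described above, here is a minimal Python sketch. It is not the SoundFlow implementation: the names `StreamState`, `perceive`, `decide`, `respond`, and `run_stream` are hypothetical, audio chunks are stood in for by text labels, and the decision rule is a toy trigger set rather than a learned model. It only shows the control flow, in which one task feeds the stream while another consumes it and chooses when to answer.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class StreamState:
    """Running context accumulated from the incoming stream (hypothetical)."""
    events: list = field(default_factory=list)


def perceive(state, chunk):
    # Stand-in for an audio encoder: each chunk is already a text label here.
    state.events.append(chunk)


def decide(state):
    # Toy rule (an assumption): respond only when the newest event is a trigger.
    return state.events[-1] in {"doorbell", "user: what was that?"}


def respond(state):
    # Stand-in for response generation.
    return f"response after {len(state.events)} chunks: heard '{state.events[-1]}'"


async def producer(queue, chunks):
    # Simulates a live audio feed pushing chunks as they arrive.
    for chunk in chunks:
        await queue.put(chunk)
        await asyncio.sleep(0)  # yield control to the consumer task
    await queue.put(None)  # sentinel marking the end of the stream


async def interaction_loop(queue):
    # Consumes chunks as they arrive and decides, per chunk, whether to respond.
    state = StreamState()
    responses = []
    while True:
        chunk = await queue.get()
        if chunk is None:
            break
        perceive(state, chunk)
        if decide(state):
            responses.append(respond(state))
    return responses


async def run_stream(chunks):
    queue = asyncio.Queue()
    # Producer and consumer run concurrently, so ingestion never waits on responses.
    _, responses = await asyncio.gather(
        producer(queue, chunks), interaction_loop(queue)
    )
    return responses


def demo():
    chunks = ["silence", "doorbell", "silence", "user: what was that?"]
    return asyncio.run(run_stream(chunks))


if __name__ == "__main__":
    for line in demo():
        print(line)
```

In this sketch the loop stays silent on the "silence" chunks and answers only the two trigger events, which mirrors the idea of a model that determines for itself when a response is warranted.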

To support training and evaluation, the researchers constructed StreamAudio-2M, a corpus containing 2.6M items that cover 7 fundamental audio abilities and 28 sub-tasks. They also introduced the Proactive-Sound-Bench to assess proactive audio intervention capabilities. Testing across 8 benchmarks indicates that Audio-Interaction maintains performance on traditional audio tasks while enabling new functionalities, such as real-time Automatic Speech Recognition (ASR) and proactive assistant features that are not available in standard offline LALMs.
