Wan-Streamer v0.1 Launches for Real-time Multimodal Interaction
- Wan-Streamer v0.1 released as an end-to-end model for real-time audio-visual interaction.
- Unified Transformer architecture delivers approximately 200 ms model-side latency and approximately 550 ms total interaction latency.
- System supports full-duplex operation, perceiving user input while generating synchronized video and audio at 25 fps.
Lianghua Huang and a team of researchers introduced Wan-Streamer v0.1 on June 23, 2026, as a native-streaming, end-to-end foundation model built for real-time, low-latency audio-visual interaction. The model integrates language, audio, and video into a single Transformer architecture and handles perception, reasoning, generation, and turn management jointly, without relying on external modules such as ASR (Automatic Speech Recognition) or TTS (Text-to-Speech). The design uses block-causal attention over interleaved input and output tokens, which allows streaming in increments as short as 160 ms (four frames at 25 fps).
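The block-causal attention mentioned above can be illustrated with a minimal sketch. It is not the Wan-Streamer implementation: the function names, the fixed `block_size`, and the choice to let tokens attend bidirectionally inside their own block are all assumptions made for illustration, since the article does not describe the token layout. The helper `frames_per_chunk` only reproduces the arithmetic behind the 160 ms figure.

```python
def block_causal_mask(num_tokens: int, block_size: int) -> list[list[bool]]:
    """Build a block-causal attention mask (illustrative, hypothetical layout).

    mask[i][j] is True when query token i may attend to key token j.
    Tokens are grouped into consecutive blocks of `block_size`. A token can
    see every token in its own block and in all earlier blocks, but nothing
    from later blocks, so each new block can be processed as it arrives.
    """
    if num_tokens < 0 or block_size <= 0:
        raise ValueError("num_tokens must be >= 0 and block_size must be > 0")
    return [
        [j // block_size <= i // block_size for j in range(num_tokens)]
        for i in range(num_tokens)
    ]


def frames_per_chunk(chunk_ms: float, fps: float) -> float:
    """Number of video frames covered by one streaming increment."""
    return chunk_ms * fps / 1000.0


if __name__ == "__main__":
    # Six tokens in blocks of two: rows are queries, columns are keys.
    for row in block_causal_mask(6, 2):
        print("".join("x" if allowed else "." for allowed in row))
    # A 160 ms increment at 25 fps spans four frames.
    print(frames_per_chunk(160, 25))
```

In this sketch the first block sees only itself, the second block sees the first two blocks, and so on. That property is what makes incremental streaming possible: earlier blocks never need to be recomputed when a later block arrives.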
The system achieves approximately 200 ms of model-side response latency and a total interaction latency of approximately 550 ms once 350 ms of bidirectional network delay is included. This enables full-duplex communication, in which the system continuously perceives user inputs while generating synchronized audio-visual responses. Unlike cascaded pipelines that connect separate models for vision, speech, and rendering, Wan-Streamer performs these tasks within one unified framework, which reduces pipeline errors and overall synchronization latency.
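The latency figures above add up as a simple budget. The sketch below only restates that arithmetic using the reported approximate values. The `LatencyBudget` class and its field names are hypothetical and are not part of any Wan-Streamer API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyBudget:
    """Hypothetical container for the two latency components in the article."""

    model_ms: float    # model-side response latency
    network_ms: float  # bidirectional (round-trip) network delay

    @property
    def total_ms(self) -> float:
        """Total interaction latency as the sum of both components."""
        return self.model_ms + self.network_ms

    @property
    def is_sub_second(self) -> bool:
        """True when the total stays under one second."""
        return self.total_ms < 1000.0


# Approximate values as reported: 200 ms model-side, 350 ms network.
reported = LatencyBudget(model_ms=200.0, network_ms=350.0)

if __name__ == "__main__":
    print(reported.total_ms, reported.is_sub_second)
```

Separating the two components makes it clear that only the model-side share depends on the architecture. The network share varies with the deployment.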
According to the researchers, Wan-Streamer is the only end-to-end interactive model that outputs synchronized video while keeping latency under one second. They note that latency figures for other systems often exclude the language-model or speech-processing components those systems depend on, whereas Wan-Streamer operates as a single unified model with no such external dependencies. The development team said this architecture lets the system both understand and respond to complex human inputs efficiently, and described it as a shift toward unified models for streaming interaction.