Wan-Streamer v0.1 Launches for Real-time Multimodal Interaction

• Wan-Streamer v0.1 released as an end-to-end model for real-time audio-visual interaction.
• Unified Transformer architecture delivers 200 ms model-side latency and 550 ms total interaction latency.
• System supports full-duplex operation, simultaneously perceiving and generating synchronized video and audio at 25 fps.

Lianghua Huang and a team of researchers introduced Wan-Streamer v0.1 on June 23, 2026, as a native-streaming, end-to-end foundation model built for real-time, low-latency audio-visual interaction. By integrating language, audio, and video into a single Transformer architecture, the model handles perception, reasoning, generation, and turn management jointly, without relying on external modules such as ASR (Automatic Speech Recognition) or TTS (Text-to-Speech). The design uses block-causal attention over interleaved input and output tokens, which allows incremental streaming in units as short as 160 ms; at 25 fps, that corresponds to four video frames.
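To make the idea of block-causal attention concrete, here is a minimal sketch of the general technique, not the authors' implementation. It assumes tokens are grouped into consecutive blocks (for example, one block per 160 ms streaming unit), that tokens attend freely within their own block, and that they attend only to earlier blocks otherwise. The function name `block_causal_mask` and the block sizes are hypothetical; the article does not state how many tokens each unit contains.

```python
FPS = 25          # frame rate reported in the article
CHUNK_MS = 160    # shortest streaming unit reported in the article
FRAMES_PER_CHUNK = CHUNK_MS * FPS // 1000  # 160 ms at 25 fps is 4 frames


def block_causal_mask(block_sizes):
    """Return mask[i][j] == True when token i may attend to token j.

    Generic block-causal rule (an assumption about the layout, not the
    published design): a token sees its own block and all earlier blocks.
    """
    # Map every token position to the index of the block that contains it.
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]


# Hypothetical layout: three streaming units with two tokens each.
mask = block_causal_mask([2, 2, 2])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each printed row marks with `x` the positions a token may attend to. Because no token ever depends on a later block, a new block can be processed as soon as it arrives, which is what makes incremental streaming possible under this kind of mask.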

The system achieves a model-side response latency of approximately 200 ms. Adding roughly 350 ms of bidirectional network delay gives a total interaction latency of approximately 550 ms. This enables full-duplex communication, in which the system continuously perceives user input while simultaneously generating synchronized audio-visual responses. Unlike cascaded pipelines that chain separate models for vision, speech, and rendering, Wan-Streamer performs these tasks within one unified framework, which reduces pipeline errors and overall synchronization latency.
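The latency figures above combine by simple addition. The short sketch below reproduces that arithmetic and expresses the total in video frames at 25 fps. The constant and function names are illustrative, and the values are the approximate figures quoted in the article rather than measurements.

```python
MODEL_LATENCY_MS = 200   # approximate model-side response latency (from the article)
NETWORK_DELAY_MS = 350   # approximate bidirectional network delay (from the article)
FRAME_RATE = 25          # output frame rate (from the article)


def total_interaction_latency_ms(model_ms, network_ms):
    """Total latency the user experiences: model time plus network round trip."""
    return model_ms + network_ms


total_ms = total_interaction_latency_ms(MODEL_LATENCY_MS, NETWORK_DELAY_MS)
frame_ms = 1000 / FRAME_RATE          # duration of one frame in milliseconds
frames_elapsed = total_ms / frame_ms  # total latency measured in frame durations

print(f"total latency: {total_ms} ms")
print(f"frames at {FRAME_RATE} fps: {frames_elapsed}")
```

Under these figures, network delay accounts for more than half of the end-to-end total, so the experienced latency depends heavily on the connection as well as on the model.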

According to the researchers, Wan-Streamer is the only end-to-end interactive model that outputs synchronized video while maintaining sub-second latency. Other systems often report latency figures that exclude the language-model or speech-processing components they depend on, whereas Wan-Streamer runs as a single unified model that delivers real-time feedback. The development team said this architecture lets the system both understand and respond to complex human inputs efficiently, and described it as a shift toward unified models for streaming interaction.
