ERNIE 4.5 VL 424B A47B is a multimodal vision-language MoE model developed by Baidu. It accepts text and image inputs and produces text output, activating 47B of its 424B total parameters per token and supporting a 131K-token context window. Built on a heterogeneous MoE architecture jointly pre-trained on text and vision, it applies modality-isolated routing so that one modality does not hinder the learning of another. The model supports both thinking and non-thinking modes: in non-thinking mode it excels at visual perception, document and chart understanding, and visual knowledge, while in thinking mode it retains those perception strengths and adds stronger multimodal reasoning, narrowing the gap to OpenAI-o1, or even surpassing it, on reasoning-centric benchmarks such as MathVista, MMMU, and VisualPuzzle. Post-trained with SFT, DPO, UPO, and RLVR, it supports English and Chinese and is released under the Apache 2.0 license.
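The description above names modality-isolated routing but does not say how the router is built. The following is only a minimal sketch of the general idea, assuming each token carries a modality tag and may be dispatched only to experts in that modality's pool. The pool sizes, the `EXPERT_POOLS` table, the `route` function, and the top-2 selection are all hypothetical choices made for illustration, not details of the released model.

```python
import math
import random

# Hypothetical expert pools; the indices and sizes are illustrative only.
EXPERT_POOLS = {
    "text": [0, 1, 2, 3],
    "vision": [4, 5, 6, 7],
}


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(modality, router_logits, top_k=2):
    """Pick top_k experts for one token, restricted to its modality's pool.

    router_logits holds one score per expert across all pools.
    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    pool = EXPERT_POOLS[modality]
    # Rank only the experts this modality is allowed to use.
    ranked = sorted(pool, key=lambda e: router_logits[e], reverse=True)[:top_k]
    weights = softmax([router_logits[e] for e in ranked])
    return list(zip(ranked, weights))


# Assumed toy router scores for a single token (seeded for repeatability).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(8)]

text_choice = route("text", logits)
vision_choice = route("vision", logits)
print("text token   ->", text_choice)
print("vision token ->", vision_choice)
```

Under these assumptions, a text token can never select a vision expert and vice versa, even when an expert in the other pool has a higher router score. That is the sense in which the routing is isolated by modality.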