Guava Framework Enhances Embodied Manipulation Capabilities
- Researchers introduced Guava, a harness framework for embodied manipulation that combines high-level reasoning with external modules.
- A 4B-parameter model trained on fewer than 2,000 simulated trajectories achieved performance comparable to frontier proprietary models.
- The study identifies iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations as key ingredients for effective embodied agents.
Researchers introduced Guava, a harness framework that aims to improve embodied manipulation by integrating high-level reasoning with specialized external modules for perception, planning, and control. Published on June 16, 2026, the study explores the design space of agent workflows, action spaces, and observation spaces to determine what effective embodied systems require. The team identified three key ingredients for performance: iterative perception-reasoning-action loops (continuous cycles of updating world state and planning), semantic action abstractions (grouping low-level motor movements into higher-level instructions), and multimodal observations.
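To make the three ingredients concrete, the following is a minimal toy sketch of a perception-reasoning-action loop over semantic actions. It is not Guava's code: every name here (`ToyWorld`, `Observation`, `toy_reasoner`, `run_loop`) is hypothetical, the "image" is a text caption, and a hand-written rule stands in for the reasoning model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Illustrative only: all names below are hypothetical, not Guava's API.

@dataclass
class Observation:
    """Toy multimodal observation: a scene caption plus gripper state."""
    scene_caption: str               # stand-in for camera input
    gripper_holding: Optional[str]   # stand-in for proprioception


@dataclass
class ToyWorld:
    """Tiny simulated tabletop mapping object name -> location."""
    locations: Dict[str, str] = field(default_factory=lambda: {"cube": "table"})
    holding: Optional[str] = None

    def observe(self) -> Observation:
        caption = ", ".join(
            f"{obj} on {loc}" for obj, loc in sorted(self.locations.items())
        )
        return Observation(caption, self.holding)

    # Semantic actions: each hides whatever low-level motion would be needed.
    def pick(self, obj: str) -> None:
        if self.holding is None and obj in self.locations:
            del self.locations[obj]
            self.holding = obj

    def place(self, target: str) -> None:
        if self.holding is not None:
            self.locations[self.holding] = target
            self.holding = None


Action = Tuple[str, str]


def toy_reasoner(obs: Observation, goal: Tuple[str, str]) -> Optional[Action]:
    """Rule-based stand-in for the reasoning model; returns None when done."""
    obj, target = goal
    if f"{obj} on {target}" in obs.scene_caption:
        return None
    if obs.gripper_holding == obj:
        return ("place", target)
    return ("pick", obj)


def run_loop(world: ToyWorld, goal: Tuple[str, str], max_steps: int = 5) -> List[Action]:
    """Re-observe before every decision instead of planning once up front."""
    dispatch = {"pick": world.pick, "place": world.place}
    trace: List[Action] = []
    for _ in range(max_steps):
        obs = world.observe()               # perception
        action = toy_reasoner(obs, goal)    # reasoning
        if action is None:
            break
        name, arg = action
        dispatch[name](arg)                 # action
        trace.append(action)
    return trace
```

Calling `run_loop(ToyWorld(), ("cube", "bin"))` walks the loop through a pick and a place. In a real system the reasoner would be a language model and each semantic action would call a planning or control module, but the control flow would keep the same shape.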
To test the universality of these principles, the authors developed an end-to-end training pipeline that distills embodied capabilities into a 4B-parameter open-source model. This model was trained using fewer than 2,000 trajectories gathered entirely in simulation. Experimental results across both simulated and real-world environments show that the compact 4B model achieves performance comparable to frontier proprietary models. The system also generalizes well to unseen objects, novel instructions, and long-horizon tasks (those that require sequences of multiple steps).
The findings indicate that a well-architected harness functions as a scalable, model-agnostic interface, allowing smaller language models to exhibit emergent embodied capabilities. This approach offers a practical alternative to end-to-end vision-language-action systems by requiring significantly less training data while maintaining high effectiveness in complex manipulation scenarios.
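One way to picture a model-agnostic harness is as an interface that fixes the action vocabulary and validation while leaving the reasoning model swappable. The sketch below is an assumption about what such an interface could look like, not Guava's actual design; `ReasoningModel`, `Harness`, and the two stub models are hypothetical names.

```python
from typing import Protocol

# Illustrative only: hypothetical interface, not taken from Guava.

class ReasoningModel(Protocol):
    """Anything with this method can be plugged into the harness."""
    def next_action(self, observation: str, instruction: str) -> str: ...


class Harness:
    """Owns the action vocabulary and validation; the model is a plug-in."""
    ACTIONS = ("pick", "place", "done")

    def __init__(self, model: ReasoningModel) -> None:
        self.model = model

    def step(self, observation: str, instruction: str) -> str:
        action = self.model.next_action(observation, instruction)
        verb = action.split()[0] if action.strip() else ""
        if verb not in self.ACTIONS:
            raise ValueError(f"unsupported action: {action!r}")
        return action


class SmallModelStub:
    """Stand-in for a compact open model."""
    def next_action(self, observation: str, instruction: str) -> str:
        return "pick cube"


class LargeModelStub:
    """Stand-in for a larger proprietary model behind the same interface."""
    def next_action(self, observation: str, instruction: str) -> str:
        return "pick cube"


class BrokenModelStub:
    """Emits an action outside the vocabulary; the harness rejects it."""
    def next_action(self, observation: str, instruction: str) -> str:
        return "teleport cube"
```

Because `Harness` depends only on the `next_action` signature, swapping `SmallModelStub` for `LargeModelStub` requires no change to the harness code, which is the property the paragraph above describes.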