GateMem Benchmarks Shared-Memory AI Agent Governance

• GateMem evaluates shared-memory AI agents across medical, office, educational, and household domains.
• The benchmark includes 91 multi-party episodes and 2,218 hidden evaluation checkpoints for rigorous testing.
• Current models fail to balance utility, secure access control, and reliable forgetting of deleted information.

Researchers introduced GateMem, a benchmark designed to evaluate how AI agents manage shared memory in multi-user settings. Unlike standard benchmarks that assume a single-user environment, GateMem examines agents deployed in institutional contexts such as hospitals, workplaces, schools, and households. In these environments, multiple users interact with a common memory pool, so an agent must not only recall information but also enforce strict governance over access rights and privacy.

The benchmark assesses three core competencies: utility for long-horizon requests, access control based on user authorization, and active forgetting (the ability to securely remove information after explicit deletion requests). It comprises 91 long-form multi-party episodes and 2,218 hidden evaluation checkpoints across four domains. Results across seven memory-agent baselines and six backbone models indicate that current systems struggle to balance these requirements. Long-context prompting delivers better governance than the alternatives but at high cost. Retrieval-based and external-memory methods are cheaper but remain prone to leaking deleted or unauthorized information.
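
To make the three competencies concrete, here is a minimal Python sketch of a shared memory pool that checks authorization on recall and removes entries on deletion. It is an illustration only, not GateMem's code or evaluation harness: the `SharedMemory` class, its method names, and the example users are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    owner: str                                       # user who wrote the entry
    text: str                                        # remembered content
    readers: set[str] = field(default_factory=set)   # users allowed to read it


class SharedMemory:
    """Toy shared memory pool (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._entries: dict[int, MemoryEntry] = {}
        self._next_id = 0

    def write(self, owner: str, text: str, readers: set[str] | None = None) -> int:
        entry_id = self._next_id
        self._next_id += 1
        # The owner can always read their own entry.
        allowed = set(readers or ()) | {owner}
        self._entries[entry_id] = MemoryEntry(owner, text, allowed)
        return entry_id

    def recall(self, user: str, keyword: str) -> list[str]:
        # Utility and access control: return matching entries the user may read.
        return [
            entry.text
            for entry in self._entries.values()
            if user in entry.readers and keyword.lower() in entry.text.lower()
        ]

    def forget(self, user: str, entry_id: int) -> bool:
        # Active forgetting: only the owner may delete, and the entry is removed outright.
        entry = self._entries.get(entry_id)
        if entry is None or entry.owner != user:
            return False
        del self._entries[entry_id]
        return True


memory = SharedMemory()
note = memory.write("dr_lee", "Patient A is allergic to penicillin", readers={"nurse_kim"})
authorized = memory.recall("nurse_kim", "allergic")    # nurse_kim is a listed reader
unauthorized = memory.recall("visitor", "allergic")    # visitor is not a listed reader
memory.forget("dr_lee", note)
after_delete = memory.recall("nurse_kim", "allergic")  # entry no longer exists
```

In a toy store like this, governance is easy because every read passes through one check. The leaks described above arise in real agents, where deleted or restricted content can persist in retrieval indexes, summaries, or the model's context.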
