Salesforce Debuts New Research on Autonomous AI Agents
- Salesforce presents 21 papers at ICLR 2026 focused on reliable enterprise AI agents.
- A new study identifies an 'echoing' phenomenon in which autonomous agents mirror their partners, ignoring assigned roles.
- The CoAct-1 system achieves a 60.76% success rate on complex OSWorld tasks using multi-agent coordination.
At the Fourteenth International Conference on Learning Representations (ICLR 2026), Salesforce AI Research has unveiled a substantial body of work aimed at maturing enterprise artificial intelligence. The collection of 21 accepted papers pivots away from simple chatbot functionality, moving instead toward the difficult engineering problem of reliability in autonomous systems. For students and researchers alike, this signals a shift in the field: the focus is no longer just on what an AI can generate, but on how it can behave predictably in uncontrolled, real-world digital environments.
One of the most compelling insights comes from the 'ECHOING' paper, which examines identity stability in multi-agent environments. When LLM agents interact autonomously, they can slip into mirroring their conversational partners, effectively abandoning the specific instructions or roles they were assigned. The researchers found that in over 70% of conversations, agents echoed their partner's responses even when the agent was meant to be the authority or the specialist. This finding is a wake-up call for designers of agentic systems: standard performance metrics often fail to catch these subtle yet significant failures of role adherence.
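To make the failure mode concrete, here is a minimal sketch of how one might screen conversation logs for echoing. The paper's actual methodology is not specified here; this toy version (the `echo_score` and `echo_rate` helpers, the 0.6 threshold, and the use of plain lexical similarity are all illustrative assumptions) simply flags agent turns that closely mirror the partner's preceding turn.

```python
from difflib import SequenceMatcher

def echo_score(partner_turn: str, agent_turn: str) -> float:
    """Lexical similarity between the partner's last message and the
    agent's reply; values near 1.0 suggest the agent is echoing.
    (A real evaluation would likely use semantic similarity instead.)"""
    return SequenceMatcher(None, partner_turn.lower(), agent_turn.lower()).ratio()

def echo_rate(conversation, threshold=0.6):
    """Fraction of agent turns that closely mirror the preceding partner
    turn. `conversation` is a list of (speaker, text) tuples; the
    threshold is an arbitrary cutoff for this sketch."""
    echoes = total = 0
    prev_partner = None
    for speaker, text in conversation:
        if speaker == "agent":
            if prev_partner is not None:
                total += 1
                if echo_score(prev_partner, text) >= threshold:
                    echoes += 1
        else:
            prev_partner = text
    return echoes / total if total else 0.0
```

A log in which the agent parrots the partner on one turn but asserts its assigned specialist role on the next would yield an echo rate of 0.5, making the role-adherence failure visible where aggregate task metrics would not.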
The research also addresses how agents interact with computer interfaces through a suite of new frameworks, most notably GTA1 and CoAct-1. GTA1 introduces the concept of test-time scaling to the domain of GUI agents, allowing the system to propose and evaluate multiple potential actions before committing to a click or keystroke. This effectively gives the AI a moment to 'think' before it acts, which is a significant departure from standard, single-shot generation methods. When combined with the multi-agent coordination found in CoAct-1—where specialized agents take on different roles like 'Operator' or 'Programmer'—these systems are reaching unprecedented levels of proficiency, achieving over 60% success on the rigorous OSWorld benchmark.
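The 'propose, evaluate, then commit' loop described above amounts to best-of-n selection at inference time. The sketch below is an assumption-laden miniature, not GTA1's actual implementation: `best_of_n`, `toy_propose`, and `toy_score` are hypothetical names, and the toy policy and verifier stand in for what would be model calls in a real GUI agent.

```python
from itertools import cycle

def best_of_n(propose, score, state, n=4):
    """Best-of-n action selection, the core idea of test-time scaling:
    draw n candidate actions from the policy, score each with a
    verifier, and commit only to the highest-scoring one."""
    candidates = [propose(state) for _ in range(n)]
    return max(candidates, key=lambda a: score(state, a))

# Toy stand-ins for the policy and verifier (illustrative only; a real
# GUI agent would sample actions from an LLM and score them with a
# learned judge or value model).
_actions = cycle(["click(cancel)", "click(submit)",
                  "scroll(down)", "type(query)"])

def toy_propose(state):
    """Deterministically cycles through a fixed action set."""
    return next(_actions)

_scores = {"click(cancel)": 0.1, "click(submit)": 0.9,
           "scroll(down)": 0.3, "type(query)": 0.5}

def toy_score(state, action):
    """Looks up a fixed score per action."""
    return _scores[action]
```

Calling `best_of_n(toy_propose, toy_score, "form_page")` returns `"click(submit)"`, the candidate the verifier rates highest. The design point is that extra compute is spent per step on evaluation rather than on a larger model, which is what distinguishes this from standard single-shot action generation.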
Perhaps most importantly for the industry, the release of the SCUBA benchmark marks a dedicated effort to test agents against the gritty reality of enterprise software, such as CRM (Customer Relationship Management) workflows. By demonstrating that open-source agents often struggle compared to closed-source alternatives in zero-shot settings, the research highlights a clear gap in current capability. However, the data also shows that with proper instruction and demonstration, these agents can significantly reduce the time and operational costs associated with administrative tasks. This body of work underscores that while the potential for autonomous enterprise assistance is immense, the path forward requires rigorous evaluation of failure modes and a commitment to building more reliable, controllable agentic architectures.