Since Anthropic launched the "Computer Use" feature for Claude in October, the ability of AI agents to operate computers the way humans do has drawn widespread attention. Recently, the Show Lab at the National University of Singapore published a new study that offers a comprehensive look at what the current generation of graphical user interface (GUI) agents can actually be expected to do.
As the first frontier model able to interact with a device through the same interfaces humans use, Claude operates solely from desktop screenshots and acts by issuing keyboard and mouse events. The feature promises to let users automate tasks with simple instructions, without needing access to application APIs.
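For context, the interaction model looks roughly like the following. This is a minimal sketch based on the beta API shape Anthropic documented at launch (the computer_20241022 tool and the computer-use-2024-10-22 beta flag); treat the exact parameter names as assumptions that may have changed since.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Ask Claude to act on a virtual 1024x768 display. The model cannot touch the
# machine itself: it only returns tool_use blocks describing the next action.
response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    betas=["computer-use-2024-10-22"],
    tools=[{
        "type": "computer_20241022",
        "name": "computer",
        "display_width_px": 1024,
        "display_height_px": 768,
    }],
    messages=[{"role": "user", "content": "Open a browser and search for GUI agents."}],
)

# The caller's loop executes each requested action (screenshot, left_click,
# type, key, ...) locally and sends the result back as a tool_result message.
for block in response.content:
    if block.type == "tool_use":
        print(block.input)  # e.g. {"action": "screenshot"}
```

The key design point is that the model never executes anything itself: the host program performs each requested action and returns a fresh screenshot, closing the observe-act loop.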
Researchers tested Claude across a range of tasks, including web search, workflow completion, office productivity, and video games. In the web search tasks, Claude had to browse and interact with websites, for example searching for and purchasing products or subscribing to a news service. The workflow tasks involved coordinating across multiple applications, such as extracting information from a website and inserting it into a spreadsheet. The office productivity tasks assessed the agent's ability to perform common operations like formatting documents, sending emails, and creating presentations. The video game tasks evaluated its ability to carry out multi-step tasks that require understanding game logic and planning actions.
The testing evaluated the model along three dimensions: planning, execution, and evaluation. First, the model must produce a coherent plan for completing the task. Next, it needs to translate each step of that plan into concrete actions, such as opening a browser, clicking elements, and entering text. Finally, the evaluation component determines whether the model can assess its own progress and success as it works. If the model makes an error, it should be able to correct it; if the task cannot be completed, it should give a reasonable explanation. The researchers built a framework around these three components and had human reviewers rate every test.
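To make the three components concrete, here is an illustrative skeleton of a plan-execute-evaluate agent loop. This is not the study's actual code; every name here (GUIAgent, make_plan, next_action, capture_screen, perform, and so on) is a hypothetical placeholder for whatever the agent backend and host environment provide.

```python
from typing import Protocol

class GUIAgent(Protocol):
    """Hypothetical interface standing in for whatever model drives the GUI."""
    def make_plan(self, task: str) -> list[str]: ...
    def next_action(self, step: str, screenshot: bytes) -> dict: ...
    def evaluate(self, step: str, screenshot: bytes) -> str: ...  # "done" | "retry" | "failed"
    def explain_failure(self, step: str) -> str: ...

def run_task(agent: GUIAgent, task: str, max_tries: int = 10) -> str:
    # capture_screen() and perform() are placeholders for OS-level
    # screenshot and keyboard/mouse helpers, not real library calls.
    plan = agent.make_plan(task)                       # 1. planning: decompose the task
    for step in plan:
        for _ in range(max_tries):
            shot = capture_screen()                    # observe the current GUI state
            perform(agent.next_action(step, shot))     # 2. execution: click, type, scroll...
            verdict = agent.evaluate(step, capture_screen())  # 3. evaluation: did it work?
            if verdict == "done":
                break
            if verdict == "failed":
                return agent.explain_failure(step)     # justify why the task cannot finish
        else:
            return agent.explain_failure(step)         # gave up after max_tries attempts
    return "task completed"
```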
Overall, Claude performed well on complex tasks. It was able to reason about and plan the multiple steps a task required, carry out the corresponding actions, and evaluate its progress at each step. It could also coordinate across applications, for example copying information from a webpage and pasting it into a spreadsheet. In some cases it even rechecked its results at the end of a task to make sure everything matched the objective. The model's reasoning traces suggested a general understanding of how different tools and applications work and an ability to coordinate effectively between them.
However, Claude also made small mistakes that an ordinary user would easily avoid. In one task, for example, the model failed to complete a subscription because it never scrolled the page to find the relevant button. In other cases it failed at very simple, straightforward tasks, such as selecting and replacing text or converting bullet points to a numbered list. Worse, the model often either did not recognize its mistakes or made incorrect assumptions about why it had fallen short of the goal.
The researchers note that the model's misjudgment of its own progress highlights the "insufficiency of the model's self-assessment mechanism" and suggest that "fully addressing this issue may still require improvements to the GUI agent framework, such as an internal rigorous critique module." Judging from these results, GUI agents cannot yet replicate all the basic nuances of human computer use.
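The paper does not spell out what such a critique module would look like; one common pattern, sketched below under that assumption, is a second "critic" pass that reviews each proposed action against the goal before it is executed. All names here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Review:
    verdict: str        # "approve" or "revise"
    feedback: str = ""  # the critic's objection, if any

# Hypothetical sketch: a critic vets each proposed action before execution.
# agent, critic, and perform() are placeholders, not a real framework's API.
def critiqued_step(agent, critic, goal: str, screenshot: bytes) -> None:
    proposal = agent.propose_action(goal, screenshot)
    review: Review = critic.review(goal, screenshot, proposal)
    if review.verdict == "approve":
        perform(proposal)
    else:
        # Feed the objection back so the agent can correct itself before acting
        perform(agent.revise_action(goal, screenshot, review.feedback))
```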
The promise of automating tasks from plain text descriptions is highly attractive to businesses. For now, however, the technology is not ready for large-scale deployment. The model's behavior is unstable and can produce unpredictable outcomes, with potentially serious consequences in sensitive applications. And driving interfaces designed for humans is rarely the fastest way to complete a task that an API could handle directly.
There is also still much to learn about the security risks of granting large language models (LLMs) control over mice and keyboards. For example, one study has shown that web agents are highly susceptible to adversarial attacks that humans could easily overlook.
Nevertheless, tools like Claude's "Computer Use" are still valuable. They let product teams explore ideas and iterate on different solutions without spending the time and money to build new features or services just to automate a task. Once a viable solution is found, a team can focus on developing the code and components needed to deliver it efficiently and reliably. Large-scale task automation will still require robust infrastructure, including APIs and microservices that connect securely and operate at scale. As the technology continues to advance, there is good reason to believe GUI agents will play an important role in many more areas.