Despite OpenAI's ongoing efforts to strengthen the security of its Atlas AI browser against network attacks, the company acknowledges that prompt injection—attacks where malicious instructions hidden in web pages or emails manipulate AI agents into harmful actions—will remain a persistent threat in the near term. This reality raises significant concerns about the safety of AI agents operating across open networks.
In a blog post published Monday, OpenAI stated, “Prompt injections, much like online scams and social engineering, are unlikely to be completely ‘solved.’” The company is actively enhancing Atlas’s defense mechanisms to counter continuous threats, admitting that the “agent mode” in ChatGPT Atlas inherently expands the attack surface.
Launched in October, ChatGPT Atlas quickly drew attention from security researchers who demonstrated how inserting just a few words into a Google Doc could alter the browser agent’s behavior. On the same day, Brave released a report highlighting indirect prompt injection as a systemic vulnerability affecting AI-powered browsers, including Perplexity’s Comet.
OpenAI is not alone in recognizing the enduring nature of prompt-based threats. Earlier this month, the UK’s National Cyber Security Centre (NCSC) warned that prompt injection attacks targeting generative AI applications “may never be fully mitigated,” leaving websites vulnerable to data breaches. The government body advises cybersecurity professionals to focus on reducing the impact and likelihood of such attacks rather than assuming they can be entirely blocked.
“We view prompt injection as a long-term AI safety challenge requiring sustained defensive improvements,” OpenAI emphasized.
How is the company tackling this Sisyphean task? OpenAI claims its proactive, rapid-response testing cycle shows early promise in identifying novel attack vectors internally before they emerge in real-world scenarios.
This approach aligns with strategies described by rivals like Anthropic and Google: defending against prompt-based attacks demands layered safeguards and continuous stress-testing. For instance, Google has recently focused on architectural design and policy-level controls within agent systems.
However, OpenAI has adopted a distinct tactic—an LLM-driven automated attacker. This internal “red-team” agent, trained via reinforcement learning, simulates adversarial behavior by discovering covert ways to feed malicious prompts to AI agents.
The system enables simulated pre-deployment attacks, modeling how the target AI interprets inputs and reacts to potential exploits. By analyzing these responses, the attacker agent iteratively refines its methods. Since external attackers lack insight into the AI’s internal reasoning, OpenAI’s simulator theoretically uncovers vulnerabilities faster than real adversaries could.
This method reflects a growing trend in AI security: deploying autonomous agents to identify edge cases and accelerate testing in controlled environments.
“Our reinforcement learning-trained attackers can induce agents to execute complex, multi-step harmful workflows spanning dozens or even hundreds of actions,” OpenAI reported. “We’ve also uncovered novel attack strategies not observed in human-led red teaming or external reports.”
In one demonstration (partially illustrated above), OpenAI showed how its automated attacker inserted a malicious email into a user’s inbox. When the AI agent later scanned messages, it followed concealed instructions to send a resignation letter instead of drafting an out-of-office reply. However, after security updates, the agent mode successfully detected and flagged the injection attempt.
While acknowledging that prompt injection cannot be perfectly prevented, OpenAI relies on large-scale simulation and accelerated patching cycles to fortify systems before real attacks occur.
An OpenAI spokesperson declined to confirm whether Atlas security updates have led to a measurable drop in successful injections but noted that the company has collaborated with third parties since development began to reinforce defenses against such threats.
Rami McCarthy, Chief Security Researcher at cybersecurity firm Wiz, noted that while reinforcement learning helps adapt to evolving attacker tactics, it represents only part of the broader security picture.
“A useful framework for assessing AI risk is autonomy multiplied by accessibility,” McCarthy explained.
“Agent-based browsers often occupy a high-risk quadrant: moderate autonomy combined with extremely high access levels,” he added. Current mitigation strategies reflect this trade-off—limiting login permissions reduces exposure, while requiring user approvals curtails autonomy.
These are precisely the types of safeguards OpenAI recommends. A spokesperson confirmed that Atlas is designed to seek user confirmation before sending messages or making payments. The company also advises users to provide specific instructions to agents rather than granting broad authority like “take whatever actions are needed” with full mailbox access.
“Excessive freedom increases the likelihood that concealed or malicious content can influence the agent, even with protective measures in place,” OpenAI cautioned.
While protecting Atlas users from prompt injection remains a top priority, McCarthy questions the return on investment for high-access browser agents given their current risk profile.
“For most everyday use cases, agent-based browsers haven’t yet delivered enough value to justify their inherent risks,” McCarthy said. With access to sensitive data like emails and payment details, the stakes are high—yet that same access is what makes them powerful. The balance is constantly shifting, but for now, the trade-offs remain very real.”