OpenAI and Anthropic, two of the world's leading AI research labs, briefly opened access to their typically closely guarded AI models for joint safety testing—a rare collaboration during a period of intense competition. The initiative aimed to uncover blind spots in each company's internal evaluations and demonstrate how leading AI firms can collaborate on safety and alignment moving forward.
In an interview with TechCrunch, OpenAI co-founder Wojciech Zaremba stated that such collaboration has become increasingly crucial as AI enters a phase of development marked by "profound implications," with millions using AI models daily.
"Even with billions of dollars invested and fierce competition for talent, users, and the best products, the broader question remains: how will the industry set standards for safety and cooperation?" Zaremba remarked.
Released on Wednesday, the joint safety study was conducted amid an arms race among top AI labs like OpenAI and Anthropic, where investments in data centers reaching billions of dollars and researcher compensation packages worth tens of millions have become standard. Some experts caution that the intensity of product competition might lead companies to overlook safety during the development of increasingly powerful systems.
To make this research possible, both companies granted each other special API access to certain versions of their AI models with reduced safety safeguards (OpenAI clarified that GPT-5 was not tested, as it has not yet been released). However, shortly after the study, Anthropic revoked API access for another OpenAI team. At the time, Anthropic alleged that OpenAI had violated its terms of service, which prohibit using Claude to enhance competing products.
Zaremba said the API revocation was unrelated to the joint safety study, and that he expects competition to remain fierce even as safety teams attempt to collaborate. Nicholas Carlini, a safety researcher at Anthropic, told TechCrunch that he hopes future collaborations will continue to give OpenAI's safety researchers access to Claude models.
"We want to increase collaboration as much as possible at the safety frontier and work toward making this kind of cooperation more routine," Carlini explained.
One of the most notable findings from the study involved hallucination testing. Anthropic’s Claude Opus 4 and Sonnet 4 models declined to answer up to 70% of questions when uncertain of the correct response, instead replying with statements like “I don’t have reliable information.” Meanwhile, OpenAI’s o3 and o4-mini models refused to answer far less frequently but exhibited higher hallucination rates, attempting to answer even when information was insufficient.
Zaremba noted that the ideal balance may lie somewhere in the middle—OpenAI's models might benefit from declining more often, while Anthropic's could consider offering more answers in certain cases.
Another key area of concern was sycophancy—the tendency of AI models to validate harmful behaviors in order to please users—which has emerged as one of the most pressing safety issues in AI systems.
In Anthropic's research report, the company identified "extreme" examples of sycophancy in GPT-4.1 and Claude Opus 4—models that initially pushed back on psychotic or manic behavior but later validated concerning decisions. Researchers observed lower levels of sycophancy in other AI models developed by OpenAI and Anthropic.
On Tuesday, the parents of a 16-year-old boy named Adam Raine filed a lawsuit against OpenAI, alleging that ChatGPT—particularly the GPT-4o-powered version—provided their son with advice that aided his suicide rather than pushing back on his suicidal thoughts. The case highlights a potentially tragic consequence of AI chatbot sycophancy.
"It's hard to imagine how difficult this must be for their family," Zaremba said when asked about the incident. "If we build AI that solves all these complex PhD-level problems, invents new science, but at the same time causes people to experience mental health issues as a result of interacting with it, that's a dystopian future I'm not interested in."
In a blog post, OpenAI stated that its AI chatbots using GPT-5 show significantly reduced sycophancy compared to GPT-4o, claiming that the newer model performs better at handling mental health emergencies.
Looking ahead, Zaremba and Carlini expressed hope that OpenAI and Anthropic will continue to collaborate on safety testing, exploring additional topics and evaluating future models, and that other AI labs will follow suit in adopting this collaborative approach.