Large language models (LLMs), such as those powering ChatGPT, are increasingly used for information retrieval and for text editing, analysis, and generation. As these models become more advanced and widespread, some computer scientists are exploring their limitations and vulnerabilities to inform future improvements.
Two researchers at Saint Louis University, Zhen Guo and Reza Tourani, have developed and demonstrated a new type of backdoor attack that can covertly manipulate the text generated by LLMs. The attack, dubbed DarkMind, is described in a paper posted to the arXiv preprint server and highlights vulnerabilities in existing LLMs.
"Our research stems from the growing popularity of personalized AI models, such as OpenAI’s GPT Store, Google’s Gemini 2.0, and HuggingChat, which now host over 4,000 customized LLMs," senior author Tourani told Tech Xplore.
"These platforms represent a significant shift toward agent-based AI and inference-driven applications, making AI models more autonomous, adaptable, and widely accessible. However, despite their transformative potential, their security against emerging attack vectors—especially vulnerabilities embedded in the reasoning process—has not been thoroughly examined."
The primary goal of Tourani and Guo's recent study was to probe the security of LLMs by revealing vulnerabilities in the chain-of-thought (CoT) reasoning paradigm, a widely used technique that enables LLM-based dialogue agents, such as ChatGPT, to break complex tasks down into sequential intermediate steps.
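For illustration, the sketch below contrasts a direct prompt with a CoT-style prompt. The arithmetic task and the prompt wording are hypothetical examples chosen for this article, not taken from the DarkMind paper.

```python
# Minimal illustration of chain-of-thought (CoT) prompting.
# The task and wording are invented examples, not from the paper.

task = "A store sells pens at $3 each. How much do 4 pens and a $5 notebook cost?"

# Standard prompting asks the model for the answer directly.
standard_prompt = f"{task}\nAnswer:"

# CoT prompting asks the model to work through intermediate steps first;
# this step-by-step trace is the process DarkMind later exploits.
cot_prompt = (
    f"{task}\n"
    "Let's think step by step:\n"
    "Step 1: Compute the cost of the pens.\n"
    "Step 2: Add the cost of the notebook.\n"
    "Step 3: State the final total.\n"
)

print(standard_prompt)
print(cot_prompt)
```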
"We identified a significant blind spot: reasoning-based vulnerabilities that do not surface in traditional static prompt injection or adversarial attacks," said Tourani. "This led us to develop DarkMind, a backdoor attack where embedded adversarial behaviors remain dormant until activated during specific reasoning steps of the LLM."
The stealthy backdoor attack developed by Tourani and Guo leverages the step-by-step reasoning process through which LLMs handle and generate text. Unlike traditional backdoor attacks, which require manipulating user queries or retraining the model to alter its responses, DarkMind embeds "hidden triggers" within custom LLM applications, such as those found on OpenAI's GPT Store.
"These triggers are invisible in the initial prompts but activate during intermediate reasoning steps, subtly altering the final output," explained Guo, the first author of the paper and a doctoral student. "As a result, the attack remains latent and undetectable, allowing the LLM to function normally under standard conditions until a specific reasoning pattern triggers the backdoor."
In preliminary tests, the researchers found that DarkMind offers several advantages from an attacker's perspective, making it a highly effective backdoor attack. Since it operates within the model's reasoning process rather than through manipulated user queries, it is difficult to detect, and the changes it causes may evade standard security filters.
Because it dynamically modifies the model's intermediate reasoning rather than directly tampering with its inputs or final responses, the attack remains effective and persistent across a range of language tasks. In other words, it could compromise the reliability and security of LLMs performing tasks in many different domains.
"DarkMind has broad implications since it applies to various reasoning domains, including mathematics, common sense, and symbolic reasoning, and remains effective on state-of-the-art LLMs like GPT-4o, O1, and LLaMA-3," said Tourani. "Moreover, attacks like DarkMind can be easily designed with simple instructions, even by users without expertise in language models, increasing the risk of widespread misuse."
GPT-4 by OpenAI and other LLMs are now integrated into a wide range of websites and applications, including critical services such