Recent research has identified significant problems with OpenAI's audio transcription tool, Whisper: the application generates content that was never actually spoken, potentially leading to hazardous consequences. According to the Associated Press, the model sometimes perceives patterns that are not present in the audio it is given and produces fabricated or nonsensical output, a behavior commonly referred to as "hallucination."
Researchers in the United States have found that Whisper's fabrications can include racial commentary, violent rhetoric, and invented medical treatments. Although Whisper has been integrated into some versions of ChatGPT and is built into Microsoft's and Oracle's cloud computing platforms, multiple studies have led researchers to caution against relying on it.
In a study of public meetings, a University of Michigan researcher found that eight of the ten audio transcriptions examined contained Whisper hallucinations. A machine learning engineer found such errors in roughly half of more than 100 hours of transcriptions analyzed, and another developer reported hallucinations in nearly every one of the 26,000 transcripts created with Whisper.
Whisper was downloaded more than four million times last month from the open-source AI platform HuggingFace, making it the site's most popular speech recognition model. Yet when researchers analyzed material from TalkBank, a repository hosted by Carnegie Mellon University, they determined that 40% of Whisper's hallucinations could be harmful because speakers were misinterpreted or misrepresented.
The Associated Press report highlighted several examples. In one case, a speaker described "two girls and a lady," but Whisper added fabricated racial commentary, rendering it as "two girls and a lady, uh, they are Black." In another, the tool invented a nonexistent drug called "super-activated antibiotics."
Alondra Nelson, a professor at the Institute for Advanced Study in Princeton, New Jersey, told the Associated Press that such errors could have "extremely serious consequences," particularly in medical settings, where "no one wants misdiagnoses." Calls have been made for OpenAI to address the problem. Former OpenAI employee William Saunders told the AP, "If you release this tool and people become overly confident in it, integrating it into all other systems, that's a problem."
Hallucinations are a common problem across AI tools. While many users expect transcription software to make occasional errors or misspellings, researchers have found that other programs exhibit issues similar to Whisper's. Google's AI Overviews feature, for example, was criticized earlier this year for suggesting non-toxic glue to keep cheese from sliding off pizza, citing a satirical Reddit comment as its source.
Apple CEO Tim Cook has also acknowledged that AI hallucinations could be an issue in future products, including the Apple Intelligence suite. Cook told The Washington Post that he is not 100% confident the tools will be free of hallucinations. Nevertheless, the company is pressing ahead with its AI tools and features.
As for OpenAI's response to the hallucinations, the company advises against using Whisper in "decision-making contexts," where flaws in accuracy could lead to pronounced flaws in outcomes. The episode has once again raised concerns about the accuracy and reliability of AI technologies, underscoring the need for companies to treat potential issues and risks with greater caution as they advance AI development.