Researchers Expose GPT-5 Jailbreak Flaws as Hardened GPT-4o Outperforms It


Amid the massive AI explosion of the past several years, AI companies and developers have been racing to improve the performance and capabilities of their products. OpenAI recently announced the release of its newest LLM, GPT-5, billed as a major leap forward from previous models and designed to improve the speed and accuracy of responses.

With scrutiny of AI safety and guardrail robustness on the rise, it is vital to assess the security and integrity of new releases. Tests performed by two independent research teams have revealed significant vulnerabilities in the model, highlighting some of the most pressing security concerns in generative AI environments.

SPLX Red-Teaming Results

AI security startup SPLX conducted extensive tests on the new model, running more than 1,000 adversarial prompts across varied configurations. With no system prompt, attacks against GPT-5 succeeded 89% of the time, leaving it with an overall score of 11; GPT-4o scored 29 under the same conditions. Adding a basic system prompt raised GPT-5’s score to 57 and GPT-4o’s to 81. With a hardened system prompt, GPT-5 scored 55, while GPT-4o scored 97.

According to these tests, GPT-4o demonstrates significantly better resilience across almost all prompt layers in the areas of security, safety, and business alignment. The one exception is the business alignment measure with no system prompt, where GPT-5 scored 1.74 and GPT-4o scored 0, a far narrower margin than in the many categories where GPT-4o comes out ahead.

NeuralTrust’s “Echo Chamber” & “Storytelling” Attacks

NeuralTrust, a generative AI security platform provider and research company, also tested the safety of GPT-5. Using a context-poisoning method dubbed the Echo Chamber attack, researchers repeatedly reinforced a narrative context to persuade GPT-5 to violate safety protocols and provide dangerous information.

In this manner, iterative reinforcement of the storytelling context acts as narrative camouflage for malicious intent: GPT-5 is led to treat the unsafe content as “continuity-preserving elaborations” rather than direct replies to malicious requests. Combining the Echo Chamber technique with storytelling allowed researchers to bypass built-in safety features that rely on analyzing intent and detecting keywords.
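The mechanics are straightforward to reproduce in a red-team harness: each turn feeds the model’s own prior output back into the conversation and asks for an in-story elaboration, so context accumulates instead of arriving as one suspicious request. Below is a minimal sketch assuming the OpenAI Python SDK’s chat.completions interface; the seed story and follow-up turns are hypothetical, benign placeholders rather than NeuralTrust’s actual prompts, and a real exercise would pair this loop with its own unsafe-content evaluation.

```python
# Minimal multi-turn red-team harness sketch. Assumes the OpenAI Python SDK
# (pip install openai) and an OPENAI_API_KEY in the environment. The prompts
# below are benign placeholders; the technique being illustrated is the
# accumulation of conversational context across turns, not any specific text.
from openai import OpenAI

client = OpenAI()


def run_multi_turn(model: str, seed_prompt: str, follow_ups: list[str]) -> list[str]:
    """Drive a conversation in which each turn builds on the model's prior reply."""
    messages: list[dict] = []
    transcript: list[str] = []
    for user_turn in [seed_prompt, *follow_ups]:
        messages.append({"role": "user", "content": user_turn})
        response = client.chat.completions.create(model=model, messages=messages)
        reply = response.choices[0].message.content
        # Echo the model's own output back into the context before the next turn,
        # so later requests read as "continuity-preserving elaborations."
        messages.append({"role": "assistant", "content": reply})
        transcript.append(reply)
    return transcript


if __name__ == "__main__":
    transcript = run_multi_turn(
        model="gpt-5",
        seed_prompt="Tell a short story about an apprentice learning a trade.",
        follow_ups=[
            "Continue the story, keeping every detail from the previous scene.",
            "Stay in-story and expand on the apprentice's technique in more depth.",
        ],
    )
    for i, reply in enumerate(transcript, 1):
        print(f"--- turn {i} ---\n{reply}\n")
```

No single turn in a transcript like this looks alarming to a per-message keyword filter, which is why the defenses discussed later in the article operate on accumulated context rather than individual prompts.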

Why GPT-5 Fared Worse Than GPT-4o

With the new model still fresh, it is difficult to know exactly why GPT-5 performed so much worse than GPT-4o in these tests. Differences in architecture or safety layers may affect the model’s ability to adhere to established security parameters, both under normal circumstances and in the face of prompt manipulation designed to evade detection.

Some also speculate that GPT-5’s advances in performance and creativity may come with trade-offs in control and security. “Models will get stronger in some areas, and will probably see loss of progress in other ways,” according to Trey Ford, Chief Strategy and Trust Officer at Bugcrowd, highlighting the difficulty of testing new models across the many areas in which businesses, consumers, and stakeholders expect to see growth.

Implications for AI Safety and Deployment

The critical flaws researchers found in GPT-5 have far-reaching implications for organizations and individuals alike. Multi-turn vulnerabilities like the context-poisoning Echo Chamber attack represent an overlooked attack surface that could pose serious dangers if left unaddressed. Enterprise and consumer applications that depend on built-in LLM safety are at risk when adopting GPT-5. Mitigating vulnerabilities like these demands adaptive, context-aware red-teaming that accounts for threats beyond simple keyword filtering.

The research into GPT-5’s security reveals significant gaps in built-in model safeguards, even as users and organizations continue to rely heavily on those safeguards for protection. “Enterprises can’t assume model-level alignment will protect them,” says Satyam Sinha, CEO and founder at Acuvity. “They need layered, context-aware controls and continuous red-teaming to detect when a model’s behavior is drifting toward unsafe territory.”

Path Forward – Building Context-Resilient LLMs

Moving forward means taking concrete steps to mitigate the vulnerabilities identified in GPT-5. That requires a layered approach that defends against security incidents, especially insidious threats designed to evade simple protections. Recommended measures include integrating real-time context auditing and poisoning detection, running continuous adversarial testing in production environments, and balancing model utility with robust safety measures, as sketched below.
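As one illustration of the first recommendation, the sketch below shows what conversation-level auditing might look like in practice. It is a hedged example rather than any vendor’s product: classify_risk is a hypothetical stand-in for whatever moderation model or in-house classifier an organization deploys, and the threshold and window size are placeholder values to be tuned against red-team data.

```python
# Sketch of context-aware auditing: score the accumulated conversation window,
# not each message in isolation, so slow multi-turn drift still adds up.
# classify_risk() is a hypothetical placeholder for a real moderation model
# or fine-tuned classifier; RISK_THRESHOLD and window_turns are assumptions.
from dataclasses import dataclass, field

RISK_THRESHOLD = 0.7  # placeholder value; tune against your own red-team findings


def classify_risk(text: str) -> float:
    """Return a 0.0-1.0 risk score for the given text.

    Placeholder: swap in a call to a moderation endpoint or an in-house
    classifier trained on multi-turn jailbreak transcripts.
    """
    raise NotImplementedError("plug in a real classifier here")


@dataclass
class ConversationAuditor:
    """Tracks a rolling window of the conversation and flags risky drift."""

    history: list[str] = field(default_factory=list)
    window_turns: int = 20  # roughly the last ten user/assistant exchanges

    def check_turn(self, user_msg: str, model_reply: str) -> bool:
        """Record the latest exchange; return True if it should be blocked or escalated."""
        self.history.extend([user_msg, model_reply])
        window = "\n".join(self.history[-self.window_turns:])
        # Scoring the whole window catches intent spread thinly across many turns,
        # which a per-message keyword filter would miss.
        return classify_risk(window) >= RISK_THRESHOLD
```

A wrapper like this sits alongside model-level alignment rather than replacing it, in line with the layered, context-aware controls recommended above.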

Author
  • PJ Bradley, Contributing Writer, Security Buzz
    PJ Bradley is a writer from southeast Michigan with a Bachelor's degree in history from Oakland University. She has a background in school-age care and experience tutoring college history students.