OpenAI uses automated red teaming powered by reinforcement learning to discover and patch prompt injection vulnerabilities in ChatGPT Atlas's browser agent before attackers exploit them. The system trains an AI attacker to find novel exploits that could trick the agent into performing unauthorized actions like forwarding sensitive emails or sending money. When new attack patterns are discovered, they're used to adversarially train updated agent models and strengthen system-level safeguards. This proactive rapid response loop has already led to security updates deployed to all Atlas users, helping defend against attacks where malicious instructions embedded in web content could hijack the agent's behavior.

11m read time · From openai.com