Researchers have developed a machine learning technique to improve red-teaming for large language models. By training a red-team model to generate diverse prompts that elicit toxic responses from a chatbot, they achieved better coverage and effectiveness than human testers and other automated methods. The method provides…
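The summary says the red-team model is trained to produce prompts that are both effective (they elicit toxic responses) and diverse, but it does not say how diversity is rewarded. The sketch below is one plausible way to express that objective as a scalar reward, assuming a toxicity score for the chatbot's response plus a novelty bonus for prompts that differ from earlier ones. The names `jaccard`, `novelty`, `red_team_reward`, and the toy toxicity function are hypothetical illustrations, not the researchers' implementation. Token-set Jaccard similarity is a deliberately simple stand-in for whatever similarity measure a real system would use.

```python
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two prompts (0.0 to 1.0)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def novelty(prompt: str, history: List[str]) -> float:
    """1 minus the highest similarity to any previously generated prompt."""
    if not history:
        return 1.0
    return 1.0 - max(jaccard(prompt, h) for h in history)


def red_team_reward(
    prompt: str,
    response: str,
    history: List[str],
    toxicity: Callable[[str], float],  # hypothetical classifier: text -> [0, 1]
    novelty_weight: float = 0.5,       # assumed trade-off weight
) -> float:
    """Reward = toxicity of the chatbot's response + weighted novelty of the prompt."""
    return toxicity(response) + novelty_weight * novelty(prompt, history)


if __name__ == "__main__":
    # Toy stand-in for a real toxicity classifier.
    toy_toxicity = lambda text: 1.0 if "bad" in text else 0.0
    history = ["tell me a joke"]
    repeated = red_team_reward("tell me a joke", "bad", history, toy_toxicity)
    fresh = red_team_reward("describe a storm", "bad", history, toy_toxicity)
    print(repeated, fresh)
```

With equal toxicity, the repeated prompt earns no novelty bonus while the new one does, so a policy optimized on this reward (for example with reinforcement learning) is pushed toward covering more distinct prompts rather than reusing one that already works.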

5m read time From news.mit.edu