How AI summaries can be steered: bilingual prompts, hidden policies, and why multilingual LLM guardrails fail in human rights contexts.

Hacker News is a community-driven platform for sharing and discussing technology news, startups, and programming-related topics. Through user submissions and comments, Hacker News offers insights into emerging technology trends, industry developments, and entrepreneurial ventures. Readers can participate in discussions, share their insights, and stay informed about the latest advancements in technology and innovation.

Hacker News

A researcher at Taraaz presents three interconnected projects exposing critical weaknesses in LLM safety systems across non-English languages. The 'Bilingual Shadow Reasoning' technique demonstrates how customized non-English system prompts can steer a model's hidden chain-of-thought to bypass safety guardrails while producing seemingly neutral outputs — tested on GPT-OSS-20B using a Farsi policy mirroring authoritarian framing of human rights. A Multilingual AI Safety Evaluation Lab (built at Mozilla Foundation) benchmarked GPT-4o, Gemini 2.5 Flash, and Mistral Small on refugee/asylum scenarios in Arabic, Farsi, Pashto, and Kurdish, finding significant quality drops versus English, inconsistent safety disclaimers, and LLM-as-a-Judge inflating scores and hallucinating disclaimers. A third project tested whether guardrail tools (FlowJudge, Glider, AnyLLM) actually enforce multilingual policies, finding Glider produced 36–53% score discrepancies based solely on policy language. The core argument: evaluation insights must flow directly into guardrail design, and the safety tools themselves are failing in non-English contexts.

Don't Trust the Salt: AI Summarization, Multilingual Safety, and the LLM Guardrails That Need Guarding

Project 2: Multilingual AI Safety Evaluation Lab

Project 3: Evaluating Multilingual, Context-Aware LLM Guardrails