Fully automatic censorship removal for language models - p-e-w/heretic


Hacker News

Heretic is an automated tool that removes safety alignment from transformer-based language models using directional ablation ("abliteration"), without expensive retraining. It pairs abliteration with a TPE-based (Tree-structured Parzen Estimator) optimizer that searches for parameters minimizing both refusals and KL divergence from the original model. The tool supports most dense and MoE (mixture-of-experts) architectures, requires minimal configuration, and can produce decensored models with lower KL divergence than manually created alternatives at the same level of refusal suppression.
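
To make the two quantities in that summary concrete, here is a minimal pure-Python sketch of the underlying arithmetic: projecting a single direction out of a weight matrix, and measuring KL divergence between two output distributions. It is a toy illustration under stated assumptions, not Heretic's code or API. The function names (`ablate_direction`, `kl_divergence`), the `scale` parameter, and the tiny matrices are all hypothetical, and the TPE parameter search is not shown.

```python
import math


def normalize(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def ablate_direction(weight, direction, scale=1.0):
    """Compute W' = W - scale * r r^T W for the unit vector r.

    weight is a list of rows (d_out x d_in) and direction lives in the
    d_out-dimensional output space. With scale=1.0 the outputs of W'
    have no component along r. (Hypothetical helper for illustration.)
    """
    r = normalize(direction)
    d_out, d_in = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along r.
    proj = [sum(r[i] * weight[i][j] for i in range(d_out)) for j in range(d_in)]
    return [
        [weight[i][j] - scale * r[i] * proj[j] for j in range(d_in)]
        for i in range(d_out)
    ]


def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy example: a 2x2 identity "layer" and an assumed direction to remove.
W = [[1.0, 0.0], [0.0, 1.0]]
r = [3.0, 4.0]
W_ablated = ablate_direction(W, r)
y = matvec(W_ablated, [2.0, 1.0])
residual = dot(y, normalize(r))  # component of the output along r
```

In this toy case `residual` is zero up to floating-point error, which is the property directional ablation relies on. A search over values such as `scale` would then trade off how much of the direction is removed against how far the modified model's output distribution drifts from the original, as measured by `kl_divergence`.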
