OpenAI has released Privacy Filter, a bidirectional token-classification model for detecting and redacting personally identifiable information (PII) in text. It runs locally in a browser or on a laptop (1.5B total parameters, 50M active), processes up to 128K tokens in a single pass, and achieves a 96% F1 score on the PII-Masking-300k benchmark. It covers eight PII categories: names, addresses, emails, phone numbers, URLs, dates, account numbers, and secrets. Unlike regex-based or rule-based NLP tools, it uses context-aware language understanding to handle nuanced cases, such as distinguishing a public business address from a private home address. The model is available on Hugging Face and GitHub under the Apache 2.0 license and can be fine-tuned with as little as 10% of a dataset. OpenAI positions it as one component of a broader privacy-by-design system rather than a full anonymization solution, and recommends human review in high-sensitivity domains.
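
To make the detect-then-redact idea concrete, here is a minimal sketch of the redaction step only. It assumes a token-classification model has already produced character-level spans with category labels; the `Span` type, the `redact` helper, and the label names are hypothetical illustrations and are not taken from Privacy Filter's actual API or output format.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A detected PII span (hypothetical shape, not the model's real output)."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # illustrative category name, e.g. "EMAIL"


def redact(text: str, spans: list[Span]) -> str:
    """Replace each detected span with a bracketed category placeholder."""
    out = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        if span.start < cursor:
            # Skip any span that overlaps one already redacted.
            continue
        out.append(text[cursor:span.start])  # keep the text before the span
        out.append(f"[{span.label}]")        # substitute the placeholder
        cursor = span.end
    out.append(text[cursor:])                # keep the remaining tail
    return "".join(out)


text = "Contact Jane Doe at jane@example.com."
spans = [Span(8, 16, "NAME"), Span(20, 36, "EMAIL")]
print(redact(text, spans))  # prints "Contact [NAME] at [EMAIL]."
```

The spans here are hard-coded; in practice they would come from the model, which is where the context-aware judgment (for example, business versus home address) would happen.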

5m read time. From thenewstack.io