To protect private information stored in text embeddings, de-identify the text before embedding it and storing it in a vector database. In this article, we demonstrate how to de-identify and chunk text using Tonic Textual, then embed the chunks and store them in a Pinecone vector database for semantic search in RAG or other LLM applications.

Security Boulevard

Text embeddings can leak nearly as much private information as raw text, making de-identification essential before storing data in vector databases. This guide walks through using Tonic Textual to extract, chunk, and de-identify text from a PDF (an American Express 10-K filing). Named entity recognition replaces PII with redacted placeholders, and the sanitized chunks are then embedded with OpenAI's text-embedding-3-small model and stored in Pinecone. A sample RAG query shows that semantic search still works on de-identified embeddings: the relevant chunk is retrieved despite redaction. The approach is recommended for RAG pipelines that handle sensitive documents.
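The shape of that pipeline (chunk, de-identify, embed, build records for the vector store) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Tonic Textual's API: the regex patterns stand in for a real named entity recognition model, the `[LABEL]` placeholder format is made up for the example, and `embed` is a hypothetical callable that you would back with an embedding model such as text-embedding-3-small. The output dictionaries use the `id` / `values` / `metadata` layout that Pinecone accepts for upserts.

```python
import re
from typing import Callable, Dict, List

# Stand-in detectors (assumption): a real NER model covers many more
# entity types and does not rely on regular expressions.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each detected entity with a bracketed placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def chunk(text: str, max_words: int = 50) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def build_records(
    text: str,
    embed: Callable[[str], List[float]],
    max_words: int = 50,
) -> List[Dict]:
    """Chunk, redact, and embed text into upsert-ready records.

    Only the redacted chunk is embedded and stored, so the raw PII
    never reaches the embedding model or the vector database.
    """
    records = []
    for i, piece in enumerate(chunk(text, max_words)):
        clean = redact(piece)
        records.append({
            "id": f"chunk-{i}",
            "values": embed(clean),
            "metadata": {"text": clean},
        })
    return records
```

The key design point is the ordering: redaction happens before `embed` is called, so neither the vector nor the stored metadata is derived from the original sensitive text.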

How to create de-identified embeddings with Tonic Textual & Pinecone

The need for de-identified data in Pinecone