Text embeddings can leak nearly as much private information as raw text, making de-identification essential before storing data in vector databases. This guide walks through using Tonic Textual to extract, chunk, and de-identify text from a PDF (an American Express 10-K filing) via named entity recognition—replacing PII with

8m read time
From securityboulevard.com
Table of contents
The need for de-identified data in Pinecone
Setting up Tonic Textual
De-identifying text
Chunking text
Embedding and storing with Pinecone
Querying the database
Conclusion
