#VoxxedDaysCERN26

We all love real-time data — clicks, payments, rides, messages — but most of it comes with a catch: it contains personal information we’re not supposed to leak, such as names, emails, locations, or even small clues that can identify someone. The challenge: how do we keep streaming data useful and safe at the same time?

In this talk, we’ll explore practical ways to protect privacy in streaming systems using Apache Kafka, Apache Flink, and Apache Iceberg. We’ll cover:
- simple tricks like masking and tokenizing PII;
- why “anonymous” data often isn’t anonymous (the re-identification problem);
- techniques like bucketing, k-anonymity, and adding noise;
- how to balance privacy with data utility (too much hiding makes data useless).
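
As a rough illustration of the techniques in the list above, here is a minimal Python sketch of masking, bucketing, a k-anonymity check, and noise addition. It is not taken from the talk's demos: the function names, the record fields (`age`, `city`), and the choice of Laplace noise are assumptions made for this example.

```python
import random
from collections import Counter


def mask_email(email: str) -> str:
    """Masking: keep the first character and the domain, hide the rest."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}"


def bucket_age(age: int, width: int = 10) -> str:
    """Bucketing: replace an exact age with a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def smallest_group(records, quasi_identifiers) -> int:
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values(), default=0)


def is_k_anonymous(records, quasi_identifiers, k: int) -> bool:
    """k-anonymity: every combination of quasi-identifiers occurs at least k times."""
    return smallest_group(records, quasi_identifiers) >= k


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Noise: the difference of two exponentials with mean `scale` is Laplace(0, scale)."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


if __name__ == "__main__":
    # Hypothetical records; ages are bucketed before the k-anonymity check.
    raw = [("Geneva", 31), ("Geneva", 34), ("Geneva", 38), ("Geneva", 52)]
    rows = [{"city": city, "age": bucket_age(age)} for city, age in raw]
    print(mask_email("alice@example.com"))
    print(smallest_group(rows, ["age", "city"]))
    print(is_k_anonymous(rows, ["age", "city"], 2))
    print(round(120 + laplace_noise(2.0, random.Random()), 1))
```

In this toy data the single 52-year-old forms a group of size one, so the rows are not 2-anonymous even after bucketing. Widening the buckets would fix that at the cost of precision, which is the privacy versus utility trade-off the last bullet refers to.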

Along the way, we’ll look at real-world stories, from public data leaks to surprising deanonymization attacks, and show live demos of pipelines that anonymize data before it’s written to storage.
If you’ve ever wondered how to build privacy-aware pipelines, this talk will give you practical patterns you can use right away.

Devoxx

A conference talk covering privacy challenges in real-time data pipelines. Using examples like the Netflix Prize dataset, AOL search logs, the Strava military base exposure, and NYC taxi data, the speaker illustrates how removing obvious identifiers is insufficient to protect privacy. Techniques covered include data masking, tokenization (with a clear distinction from hashing), k-anonymity, bucketing, noise addition, and synthetic data generation. The talk uses Apache Kafka and Apache Flink as concrete examples and emphasizes that privacy controls should be applied as early as possible in the pipeline to prevent sensitive data from propagating to unknown consumers.
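
One common way to draw the line between hashing and tokenization can be sketched in Python as follows. This is an assumption-laden illustration, not necessarily the speaker's framing: `TokenVault` and `hash_value` are hypothetical names, and the vault is an in-memory dictionary standing in for a secured token store.

```python
import hashlib
import secrets


def hash_value(value: str) -> str:
    """Unsalted hash: deterministic, so anyone can hash a guessed value and compare."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


class TokenVault:
    """Hypothetical in-memory vault mapping values to random tokens and back."""

    def __init__(self) -> None:
        self._token_by_value: dict[str, str] = {}
        self._value_by_token: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        # The token is random, so it carries no information about the value.
        if value not in self._token_by_value:
            token = "tok_" + secrets.token_hex(8)
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def detokenize(self, token: str) -> str:
        # Only a party with access to the vault can recover the original value.
        return self._value_by_token[token]


if __name__ == "__main__":
    vault = TokenVault()
    email = "alice@example.com"
    token = vault.tokenize(email)
    print(hash_value(email) == hash_value("alice@example.com"))  # guessable by anyone
    print(token == vault.tokenize(email))  # stable inside one vault
    print(vault.detokenize(token) == email)  # reversible only via the vault
```

The hash of a known or guessable email can be recomputed by any consumer, whereas the token is meaningless without the vault. In a streaming setup the tokenization step would sit as early as possible, so that downstream topics and tables only ever see tokens.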

Keeping data private in real-time pipelines by Olena Kutsenko