This conference talk covers privacy challenges in real-time data pipelines. Using examples such as the Netflix Prize dataset, the AOL search logs, Strava's exposure of military bases, and the NYC taxi data, the speaker shows that removing obvious identifiers is not enough to protect privacy. Techniques covered include data masking, tokenization (clearly distinguished from hashing), k-anonymity, bucketing, noise addition, and synthetic data generation. The talk uses Apache Kafka and Apache Flink as concrete examples and emphasizes that privacy controls should be applied as early as possible in the pipeline, so that sensitive data does not propagate to unknown consumers.
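
To make the tokenization-versus-hashing distinction concrete, here is a minimal Python sketch. It is not taken from the talk: `TokenVault`, `hash_identifier`, and `bucket_age` are hypothetical names, the vault is a plain in-memory dictionary standing in for a secured store, and the 10-year bucket width is an arbitrary assumption.

```python
import hashlib
import secrets


class TokenVault:
    """Hypothetical in-memory token vault (assumption: a real deployment
    would use a secured, access-controlled store instead of dicts)."""

    def __init__(self) -> None:
        self._value_to_token: dict[str, str] = {}
        self._token_to_value: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        # The token is random, so it carries no information about the value.
        # The same value always maps to the same token, which keeps joins working.
        if value not in self._value_to_token:
            token = secrets.token_hex(8)
            self._value_to_token[value] = token
            self._token_to_value[token] = value
        return self._value_to_token[value]

    def detokenize(self, token: str) -> str:
        # Reversal is only possible with access to the vault.
        return self._token_to_value[token]


def hash_identifier(value: str) -> str:
    # An unsalted hash is deterministic: anyone who can guess the input
    # can recompute the digest and confirm the match without any vault.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def bucket_age(age: int, width: int = 10) -> str:
    # Bucketing replaces an exact value with a coarser range.
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


vault = TokenVault()
token = vault.tokenize("alice@example.com")
digest = hash_identifier("alice@example.com")
age_range = bucket_age(37)
```

The design point is that the token depends only on the vault's random mapping, while the digest depends only on the input, so an attacker with a list of candidate emails can test each one against the digest but not against the token.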

32m watch time
