Safety
Phantoms and Disclosures: a Causal Framework for Auditing Synthetic Data
The article presents a customizable empirical auditing framework for detecting and explaining data disclosures in synthetic data generated by LLMs. It distinguishes "true disclosures" from "phantom disclosures" by applying statistical hypothesis tests to partitioned input data, and it requires neither model access nor additional training. The approach is model-agnostic and yields tighter empirical lower bounds on privacy leakage than existing methods, which makes it useful for practitioners who need to assess privacy risk in synthetic data generation.
synthetic data, privacy, auditing
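
To make the idea of hypothesis testing on partitioned input data concrete, here is a minimal sketch. It is not the article's method. It assumes that one partition of the records is given to the generator and the other is held out, that a "hit" is an exact match between an input record and a synthetic record, and that hits on held-out records are phantom disclosures (coincidental matches). The function names `count_hits`, `fisher_one_sided`, and `audit` are hypothetical, and the one-sided Fisher exact test stands in for whatever test the article actually uses.

```python
from math import comb


def count_hits(records, synthetic):
    """Count input records that appear verbatim in the synthetic data (assumed match rule)."""
    synth = set(synthetic)
    return sum(1 for r in records if r in synth)


def fisher_one_sided(hits_in, n_in, hits_out, n_out):
    """One-sided Fisher exact p-value.

    Null hypothesis: a record is equally likely to be matched whether or not
    the generator saw it. Returns P(X >= hits_in) under the hypergeometric null.
    """
    total = n_in + n_out
    k = hits_in + hits_out  # total matched records across both partitions
    upper = min(k, n_in)
    tail = sum(comb(n_in, x) * comb(n_out, k - x) for x in range(hits_in, upper + 1))
    return tail / comb(total, k)


def audit(included, held_out, synthetic, alpha=0.05):
    """Compare match counts on the seen partition against the held-out baseline."""
    hits_in = count_hits(included, synthetic)
    hits_out = count_hits(held_out, synthetic)  # phantom baseline (assumption)
    p = fisher_one_sided(hits_in, len(included), hits_out, len(held_out))
    return {
        "hits_included": hits_in,
        "hits_held_out": hits_out,
        "p_value": p,
        "reject_null": p < alpha,
    }


if __name__ == "__main__":
    included = [f"in-{i}" for i in range(10)]    # records the generator saw
    held_out = [f"out-{i}" for i in range(10)]   # records it never saw
    synthetic = included[:8] + held_out[:1] + ["novel-1", "novel-2"]
    print(audit(included, held_out, synthetic))
```

In this sketch, only the input partitions and the synthetic output are needed, which mirrors the summary's point about requiring no model access. A small p-value suggests that matches on the seen partition exceed what coincidence alone would explain under the stated assumptions.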