Research
REDACT: A Systematically Controlled Multilingual Benchmark for Personal Information Detection
REDACT is a new multilingual benchmark for detecting personally identifiable information (PII). It contains 13,427 records and 324,078 entity annotations, covering 51 entity types across 25 languages. A strength-2 covering-array sampler controls nine generation axes, so every pairing of values from any two axes is represented in the sampled configurations without enumerating the full combinatorial space. This control supports a fine-grained evaluation of PII detection models, including Presidio, GLiNER, OpenAI Privacy Filter, GPT-4.1, and Claude Sonnet 4.6. For practitioners, the benchmark offers a structured way to assess model performance on high-stakes data and across sensitivity tiers, and it highlights the limitations of rule-based detectors relative to LLMs.
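To make the sampling idea concrete, here is a minimal sketch of a greedy strength-2 (pairwise) covering-array sampler. It is not REDACT's actual sampler: the axis names and values in `AXES`, the function name `pairwise_cover`, and the greedy random-candidate strategy are all assumptions chosen for illustration, and only four axes are shown rather than nine.

```python
from itertools import combinations, product
import random

# Hypothetical generation axes; the benchmark's real nine axes are not listed here.
AXES = {
    "language": ["en", "de", "ja"],
    "domain": ["medical", "finance", "legal"],
    "format": ["email", "chat", "form"],
    "pii_density": ["low", "medium", "high"],
}

def pairwise_cover(axes, seed=0, candidates=50):
    """Greedily build rows so every value pair of every two axes appears at least once."""
    rng = random.Random(seed)
    names = list(axes)
    # Ordered collection of (axis_a, value_a, axis_b, value_b) pairs still to cover.
    uncovered = dict.fromkeys(
        (a, va, b, vb)
        for a, b in combinations(names, 2)
        for va, vb in product(axes[a], axes[b])
    )
    rows = []
    while uncovered:
        # Pin one uncovered pair so each new row is guaranteed to make progress.
        a, va, b, vb = next(iter(uncovered))
        best, best_gain = None, -1
        for _ in range(candidates):
            row = {n: rng.choice(axes[n]) for n in names}
            row[a], row[b] = va, vb
            gain = sum(
                (x, row[x], y, row[y]) in uncovered
                for x, y in combinations(names, 2)
            )
            if gain > best_gain:
                best, best_gain = row, gain
        rows.append(best)
        # Mark every pair in the chosen row as covered.
        for x, y in combinations(names, 2):
            uncovered.pop((x, best[x], y, best[y]), None)
    return rows

rows = pairwise_cover(AXES)
```

Each row in `rows` is one generation configuration. The full Cartesian product of these four axes has 81 combinations, while the pairwise cover needs at most 54 rows (one per value pair) and typically far fewer, which is the property that keeps a nine-axis design tractable.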
pii, benchmark, multilingual