Safety
Estimating Tail Risks in Language Model Output Distributions
The paper presents a method for estimating tail risks in language model outputs, specifically the probability that a model produces a harmful output. Instead of brute-force sampling, it uses importance sampling, which makes the estimate sample-efficient: it achieves results comparable to Monte Carlo methods with 10-20 times fewer samples, and it can estimate harmful-output probabilities as low as 10^-4 with just 500 samples. For practitioners, this strengthens safety evaluations of language models by giving insight into model sensitivity and potential deployment risks.
language models, risk, estimation
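
To make the contrast between brute-force sampling and importance sampling concrete, here is a minimal sketch on a toy problem. It is not the paper's method: the "model" is assumed to be a standard normal variable, the "harmful output" is assumed to be the event that it exceeds a threshold chosen so the true probability is about 10^-4, and the proposal distribution (a normal shifted to the threshold) is an illustrative choice. All function names are hypothetical.

```python
import math
import random


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and std sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def exact_tail(threshold):
    """Exact P(X > threshold) for X ~ N(0, 1), used only as a reference."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


def naive_estimate(threshold, n, rng):
    """Brute-force Monte Carlo: sample from N(0, 1) and count exceedances."""
    hits = sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) > threshold)
    return hits / n


def importance_estimate(threshold, n, rng):
    """Importance sampling: sample from a proposal centred on the rare
    region (assumed here to be N(threshold, 1)) and reweight each hit by
    the density ratio target / proposal."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(threshold, 1.0)
        if x > threshold:
            total += normal_pdf(x) / normal_pdf(x, mu=threshold)
    return total / n


if __name__ == "__main__":
    threshold = 3.719  # toy choice: P(X > 3.719) is roughly 1e-4
    rng = random.Random(0)
    print("exact     :", exact_tail(threshold))
    print("naive     :", naive_estimate(threshold, 500, rng))
    print("importance:", importance_estimate(threshold, 500, rng))
```

With only 500 draws, the brute-force estimator will usually see no exceedances at all and report zero, while the reweighted estimator places about half of its samples in the rare region. How a suitable proposal is constructed for a real language model is the substantive question, and this sketch does not address it.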