The Impossibility of Eliciting Latent Knowledge
The paper introduces Eliciting Latent Knowledge (ELK): the challenge of training AI systems to accurately report their beliefs about latent variables that humans cannot observe. Using Causal Influence Diagrams (CIDs), the authors formalize the relationship between an agent's training environment and its subjective understanding, and define the conditions under which an agent is incentivized to answer honestly. They then present an impossibility theorem showing that no feedback-based training strategy can guarantee an honest agent, a result with significant implications for the development of trustworthy AI systems.
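As a loose illustration of the kind of obstacle such a result points to, the sketch below builds a toy world with one latent bit and one observed bit and compares two hypothetical reporters. It is not the paper's CID formalism or its theorem. Every name in it (`honest_reporter`, `human_simulator`, `human_feedback`, the two-bit world) is an assumption made for this example. The only point it makes is that feedback computed from what the human observes can score an honest reporter and an observation-mimicking reporter identically on training data, even though they disagree elsewhere.

```python
from itertools import product

# Hypothetical toy world: each state is (latent, observation).
# The human evaluator sees only the observation, never the latent bit.
WORLDS = list(product([0, 1], repeat=2))

# Assumed training set: states where the observation happens to match the latent bit.
TRAIN = [(lat, obs) for lat, obs in WORLDS if lat == obs]


def honest_reporter(latent, observation):
    """Reports the latent bit itself."""
    return latent


def human_simulator(latent, observation):
    """Reports what a human would infer from the observation alone."""
    return observation


def human_feedback(observation, answer):
    """Feedback may depend only on the observation and the given answer."""
    return 1.0 if answer == observation else 0.0


def total_feedback(reporter, worlds):
    """Sum of feedback a reporter earns over a collection of states."""
    return sum(human_feedback(obs, reporter(lat, obs)) for lat, obs in worlds)


# States on which the two reporters give different answers.
disagreements = [
    (lat, obs)
    for lat, obs in WORLDS
    if honest_reporter(lat, obs) != human_simulator(lat, obs)
]

if __name__ == "__main__":
    print("honest, train feedback:   ", total_feedback(honest_reporter, TRAIN))
    print("simulator, train feedback:", total_feedback(human_simulator, TRAIN))
    print("states where they differ: ", disagreements)
```

In this toy setup the two reporters earn the same training feedback, so a procedure that selects on that feedback alone has no basis for preferring the honest one. The paper's actual claim is stated over CIDs and is more general than this example.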
latent-knowledge, honesty, causal-influence