Agents
EpiBench: Verifiable Evaluation of AI Agents on Epigenomics Analysis
EpiBench is a new benchmark for evaluating AI agents on short-horizon epigenomics analysis. It comprises 106 evaluations spanning workflows that include CUT&Tag/CUT&RUN, ATAC-seq, ChIP-seq, and DNA methylation. Across 5,088 trajectories from 16 model-harness pairs, the top-performing pair, GPT-5.5 / Pi, reached a success rate of 45.0%. The results suggest that agents can retrieve relevant data but struggle with tasks that require deeper scientific judgment. For practitioners building AI tools for epigenomics, the benchmark exposes how far current models remain from handling complex, domain-specific analyses.
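A success rate over trajectories is simply the fraction of runs that pass, grouped by model-harness pair. The reported counts also line up arithmetically: 16 pairs × 106 evaluations × 3 runs = 5,088. The source does not give the number of runs per evaluation, so the value of 3 below is an assumption. The sketch uses a hypothetical record layout `(pair, evaluation_id, passed)` and a hypothetical helper `success_rates`. Neither is EpiBench's actual data format or scoring code.

```python
from collections import defaultdict

# Assumed counts: 16 pairs and 106 evaluations come from the summary;
# 3 runs per evaluation is an ASSUMPTION that reproduces the 5,088 total.
N_PAIRS, N_EVALS, RUNS_PER_EVAL = 16, 106, 3
TOTAL_TRAJECTORIES = N_PAIRS * N_EVALS * RUNS_PER_EVAL  # 5088


def success_rates(trajectories):
    """Return the fraction of passing trajectories per model-harness pair.

    Each record is a hypothetical (pair, evaluation_id, passed) tuple.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for pair, _eval_id, passed in trajectories:
        totals[pair] += 1
        passes[pair] += int(passed)
    return {pair: passes[pair] / totals[pair] for pair in totals}


# Toy, made-up outcomes purely to show the aggregation.
toy = [
    ("pair-A", "eval-1", True),
    ("pair-A", "eval-1", False),
    ("pair-B", "eval-1", True),
    ("pair-B", "eval-2", True),
]
print(success_rates(toy))  # {'pair-A': 0.5, 'pair-B': 1.0}
```

A plain per-trajectory mean weights every run equally. A benchmark might instead average per evaluation first, and the summary does not say which convention EpiBench uses.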
benchmark, epigenomics, evaluation