Agents
GeoNatureAgent Benchmark: Benchmarking LLM Agents for Environmental Geospatial Analysis Across Frontier and Open-Weight Foundation Models
The GeoNatureAgent Benchmark is presented as the first evaluation framework for LLM agents that perform environmental geospatial analysis through structured tool calls to a production-style API. It comprises 93 tasks across 18 categories and was run against seven LLMs. Claude Sonnet 4 achieves the highest accuracy at 60.8%, while open-weight models such as DeepSeek V3.2 reach 93% of Claude's capability at a fraction of the cost. For practitioners, the benchmark measures how LLMs perform on realistic geospatial tasks, and the results point to limits in reasoning capability and a need for better structured tool integration.
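As a rough reading of the "93% of Claude's capability" figure, the arithmetic can be sketched as below. This assumes that "capability" means task accuracy, which the summary does not state. The function names are hypothetical, and the implied accuracy is a back-of-the-envelope estimate, not a number reported by the benchmark.

```python
def relative_capability(model_accuracy: float, reference_accuracy: float) -> float:
    """Return a model's accuracy as a fraction of a reference model's accuracy."""
    if reference_accuracy <= 0:
        raise ValueError("reference_accuracy must be positive")
    return model_accuracy / reference_accuracy


def implied_accuracy(relative: float, reference_accuracy: float) -> float:
    """Invert the ratio: estimate absolute accuracy from a relative figure."""
    return relative * reference_accuracy


# Figures taken from the summary above.
CLAUDE_SONNET_4_ACCURACY = 0.608   # 60.8% accuracy
DEEPSEEK_RELATIVE = 0.93           # "93% of Claude's capability"

# ASSUMPTION: "capability" is interpreted as task accuracy.
implied = implied_accuracy(DEEPSEEK_RELATIVE, CLAUDE_SONNET_4_ACCURACY)
print(f"Implied DeepSeek V3.2 accuracy: {implied:.1%}")
```

Under that assumption, the open-weight model would land in the mid-50% range on the same 93 tasks. The underlying paper is the place to check the exact metric.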
geospatial analysis, benchmark, environmental science