Multimodal · arXiv cs.AI · 21 h ago

Earth-OneVision: Extending Remote Sensing Multimodal Large Language Models to More Sensor Modalities and Tasks

Earth-OneVision is a 2-billion-parameter remote sensing multimodal large language model (RS-MLLM) that handles six sensor modalities (optical, SAR, infrared, multispectral, temporal, and video) within a single autoregressive framework, broadening the range of tasks one model can address. It introduces three mechanisms: Full-Granularity Vision-Language Alignment (FGVLA), Spatial-Linguistic Isomorphic Serialization (SLIS), and Progressive Cross-Modality Adaptation (PCMA). Together these are intended to improve performance and support joint training on a dataset of approximately 34 million QA pairs. The model is reported to surpass larger RS-MLLMs (4B to 72B parameters) on multiple benchmarks, which makes it relevant to practitioners working on cross-modal geoscientific applications.

remote sensing · LLM · sensor modalities
