Multimodal — AI news — AI News Digest

SDXL running locally in the browser on WebGPU, open-source

An open-source browser extension enables local image generation with the SDXL model via WebGPU, eliminating the need for complex installations. It supports two model versions: SDXL-Lightning fp16 (approximately 7 GB storage, requiring around 8 GB VRAM) and a 4-bit variant for lower-spec hardware (about 3.6 GB storage, needing 4-5 GB VRAM). Practitioners can run image generation directly in the browser without extensive setup, though performance is limited by synchronous WebGPU shader compilation.

Reddit r/LocalLLaMA | 32 d ago | found 12 d ago | #sdxl #webgpu #image generation

Social Structure Matters in 3D Human-Human Interaction Generation

The paper introduces a novel framework for generating 3D human-human interactions (HHI) by addressing the modeling of social structures that dictate interaction dynamics. It presents a planner-executor paradigm, "Think with LLM, Move with Motion Skill," where a large language model (LLM) is utilized to decompose interactions into phases and assign roles, while a motion executor, enhanced with LoRA and conditioning techniques, translates this structure into coordinated motion. This approach improves the generation of physically plausible and interaction-aware 3D motions, which is critical for practitioners aiming to create realistic simulations in AI-driven environments.

arXiv cs.AI | 33 d ago | found 10 d ago | #3d #human interaction #text-to-motion

DramaDirector: Geometry-Guided Short Drama Generation

DramaDirector is a geometry-guided framework for generating short dramas by transforming global plots and local contexts into visually grounded multi-shot videos. It uses schema-constrained supervised fine-tuning (SFT) and geometry-reinforced planning optimization (GRPO) to decouple static visual and dynamic narrative conditions, enhancing first-frame generation and image-to-video synthesis through depth-pose references. The framework is evaluated on a newly introduced benchmark, DramaBoard, consisting of 35 live-action dramas and 81K shots, where it improves faithfulness, consistency, and controllability over existing multi-agent and video generation baselines.

arXiv cs.AI | 33 d ago | found 10 d ago | #drama generation #geometry-guided #video synthesis

OrbitForge: Text-to-3D Scene Generation via Reconstruction-Anchored Video Synthesis

OrbitForge is a novel adapter designed for text-to-3D scene generation, utilizing frozen video priors and Gaussian Splatting reconstruction optimization to convert text-generated videos into consistent 3D Gaussian Splatting scenes. It leverages Deformable Gaussian Splatting for initial reconstruction and completes missing views using the text-to-video model, achieving a median span of 359.0 degrees on the T3Bench-derived audit and significantly improving the ImageReward metric from 8.07 to 16.36. This approach streamlines the process without requiring task-specific fine-tuning, making it a valuable tool for practitioners aiming to enhance 3D consistency in generated scenes.

arXiv cs.AI | 33 d ago | found 10 d ago | #text-to-3d #video-synthesis

Beyond U-Net: A Latent-Representation-Aligned Skip-Free Backbone for Flow-Matching Speech Enhancement

The article presents a novel skip-free encoder-decoder backbone for flow-matching speech enhancement that utilizes Latent Representation Alignment (LRA) to improve the efficiency of the process. By avoiding U-Net skip connections, the model aligns its representations with clean latent features from a Descript Audio Codec, enabling compact clean-speech representation and real-time inference with only five function evaluations. Benchmark results demonstrate enhanced PESQ and perceptual quality on datasets like WSJ0-CHiME3 and VoiceBank-DEMAND, making it a significant advancement for practitioners focused on efficient speech enhancement techniques.

arXiv cs.AI | 33 d ago | found 10 d ago | #speech-enhancement #generative-models #flow-matching

Listening makes Vision Clear for VLMs

The paper introduces Prompt-Vision Token Activation Map (PV-TAM), a novel approach for evaluating vision-language model (VLM) consistency by addressing issues of decoding drift and bias from structural tokens. PV-TAM enhances alignment measurement by incorporating peak attention distribution rather than solely relying on overlap masks, leading to improved performance in localization metrics across multiple datasets. This method is significant for practitioners as it provides a more reliable evaluation of VLMs, potentially leading to better model training and deployment strategies.

arXiv cs.AI | 33 d ago | found 10 d ago | #vision_language_models #attention #semantic_evaluation

The Professor: Multi-Teacher Unsupervised Prompt Distillation for Vision-Language Models

The article introduces TheProfessor, a multi-teacher unsupervised prompt distillation method for compressing vision-language models (VLMs) with a two-teacher ensemble. The teachers are a domain-finetuned PromptSRC ViT-L/14 and a zero-shot EVA-CLIP-L/14, and confidence-weighted ensembling of the two raises the average harmonic mean (HM) from 87.52 to 89.28 across various datasets. For practitioners, the result highlights the effectiveness of multi-teacher strategies in adapting models to domain shifts, potentially leading to better generalization in real-world applications.

arXiv cs.AI | 33 d ago | found 10 d ago | #prompt_distillation #vision_language_models
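
To make the idea of confidence-weighted ensembling in the item above concrete, here is a minimal sketch. It assumes each teacher's confidence is its maximum class probability and that the ensemble is a weighted average of the two distributions; the summary does not state the paper's exact weighting rule, and the function name and example numbers are hypothetical.

```python
def confidence_weighted_ensemble(teacher_probs):
    """Combine per-teacher class probabilities into one soft target.

    Assumption (not from the paper): a teacher's confidence is the
    probability it assigns to its top class.
    """
    confidences = [max(p) for p in teacher_probs]
    total = sum(confidences)
    weights = [c / total for c in confidences]
    num_classes = len(teacher_probs[0])
    # Weighted average of the teachers' distributions, class by class.
    return [
        sum(w * p[k] for w, p in zip(weights, teacher_probs))
        for k in range(num_classes)
    ]


# Hypothetical teacher outputs for one unlabeled image with 3 classes.
finetuned_teacher = [0.70, 0.20, 0.10]
zero_shot_teacher = [0.40, 0.50, 0.10]

soft_target = confidence_weighted_ensemble([finetuned_teacher, zero_shot_teacher])
pseudo_label = max(range(len(soft_target)), key=soft_target.__getitem__)
```

The more confident teacher pulls the soft target toward its own prediction, and the result could then serve as the distillation target for a smaller student.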

G$^3$VLA: Geometric inductive bias for Vision-Language-Action Models

The G$^3$VLA model introduces a camera-aware geometric module for Vision-Language-Action (VLA) systems, enhancing visual-token processing by incorporating calibrated geometric information without modifying the action space. It employs intrinsic-conditioned ray embeddings, projective positional encoding (PRoPE), and bidirectional cross-view fusion, achieving significant performance improvements across various benchmark suites, including LIBERO and RoboCasa24, particularly in spatially sensitive tasks. This development is crucial for practitioners as it addresses the limitations of traditional VLA models in multi-camera environments, enabling more accurate robot manipulation through better alignment of visual information with physical geometry.

arXiv cs.AI | 33 d ago | found 10 d ago | #vla #robotics #geometry

Mind the Heads: Topological Representation Alignment for Multimodal LLMs

The paper introduces Head-Wise Representation Alignment (HeRA), a novel technique for improving Multimodal Large Language Models (MLLMs) by enforcing alignment at the individual attention head level rather than a fixed layer. HeRA utilizes the Mutual K-Nearest Neighbor (MKNN) alignment metric and a contrastive objective to enhance cross-modal representation alignment, leading to improved performance on vision-centric tasks and reducing visual hallucinations. This method is significant for practitioners as it offers a more granular approach to multimodal training, potentially leading to better model robustness and accuracy in vision-related applications.

arXiv cs.AI | 33 d ago | found 10 d ago | #mllm #representation_alignment #transformers
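
As a rough illustration of the mutual k-nearest-neighbor alignment metric mentioned in the item above, the sketch below scores how often the same samples are neighbors in two feature spaces. It compares two generic feature sets; how HeRA extracts per-head features, and its contrastive objective, are not covered by the summary and are not shown. All names and data are hypothetical.

```python
import math


def knn_indices(features, i, k):
    """Indices of the k nearest neighbours of sample i (Euclidean), excluding i."""
    dists = [
        (math.dist(features[i], features[j]), j)
        for j in range(len(features))
        if j != i
    ]
    return {j for _, j in sorted(dists)[:k]}


def mutual_knn_alignment(feats_a, feats_b, k):
    """Average fraction of k-nearest neighbours shared by two feature spaces."""
    n = len(feats_a)
    overlap = 0.0
    for i in range(n):
        shared = knn_indices(feats_a, i, k) & knn_indices(feats_b, i, k)
        overlap += len(shared) / k
    return overlap / n


# Hypothetical features for the same four samples in two spaces.
vision_feats = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
scaled_feats = [[2 * x, 2 * y] for x, y in vision_feats]  # same geometry
reversed_feats = list(reversed(vision_feats))             # scrambled geometry

aligned_score = mutual_knn_alignment(vision_feats, scaled_feats, k=2)
scrambled_score = mutual_knn_alignment(vision_feats, reversed_feats, k=2)
```

A score of 1 means every sample has the same neighbors in both spaces, and lower values indicate weaker alignment.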

CineCap: Structured Reasoning with Spatio-Temporal Anchors for Cinematographic Video Captioning

CineCap is a newly proposed framework for cinematographic video captioning that integrates structured reasoning with spatio-temporal anchors and employs reinforcement learning to enhance caption comprehensiveness and accuracy. It addresses the challenges of inferring professional cinematographic concepts from visual cues and generating precise descriptions across multiple dimensions. The framework is evaluated using CineCap Bench, a new benchmark of 472 annotated video-caption pairs, demonstrating superior performance over existing models and setting a new state of the art in this domain. The code and model checkpoint are publicly accessible, facilitating further research and development in video understanding and generation.

arXiv cs.AI | 33 d ago | found 10 d ago | #video captioning #cinematography #structured reasoning

Real-Time Interactive Music Generation via Data-Free Streaming Consistency Distillation

The paper presents a novel framework for real-time interactive music generation that leverages a streaming autoregressive latent space, allowing for low-latency performance without the need for paired audio-latent datasets. It introduces music-aware consistency objectives to maintain acoustic fidelity, achieving a low real-time factor through parameter-efficient adaptation. This approach transforms generative music models into responsive instruments capable of integrating dynamic human inputs seamlessly, enhancing the potential for live human-AI collaboration in music creation.

arXiv cs.AI | 33 d ago | found 10 d ago | #music generation #interactive #real-time

Inclusive Interactive Collisions for Multi-View Consistent Compositional 3D Generation

The article introduces I2C-3D, an optimization-based method aimed at generating multi-view consistent compositional 3D assets that address the challenges of interaction modeling among Gaussian primitives and cross-view inconsistency. Key innovations include the Inclusive Interactive Collisions strategy for physically plausible interactions and a Multi-View Adaptive Score Distillation Sampling technique that enhances multi-view consistency by modulating attention maps across viewpoints. This advancement is significant for practitioners as it allows for the creation of high-fidelity 3D scenes with improved interaction realism and flexibility in 3D editing.

arXiv cs.AI | 33 d ago | found 10 d ago | #3d generation #text-to-image #compositional

Audio-visual Contrastive Alignment for Diffusion-based Visual-conditioned Speech Enhancement

The paper presents a novel approach to audio-visual speech enhancement (AVSE) by integrating a contrastive audio-visual loss into a diffusion-based model that utilizes cross-attention for visual conditioning. This method enhances the model's ability to leverage visual cues, resulting in improved interference suppression and signal reconstruction, particularly in low signal-to-noise ratio (SNR) scenarios. The findings are significant for practitioners as they demonstrate a method to enhance speech recovery in challenging auditory environments, with the code made available for further exploration.

arXiv cs.AI | 33 d ago | found 10 d ago | #audio-visual #speech enhancement #diffusion

UniDrive: A Unified Vision-Language and Grounding Framework for Interpretable Risk Understanding in Autonomous Driving

UniDrive is a novel unified vision-language and grounding framework designed for interpretable risk understanding in autonomous driving, addressing the limitations of existing multimodal large language models (MLLMs) in temporal reasoning and spatial precision. The architecture features a dual-branch system: a temporal reasoning branch for multi-frame scene dynamics and a high-resolution perception branch for fine-grained spatial details, integrated via a gated cross-attention fusion module. Benchmark results on the DRAMA-Reasoning dataset indicate that UniDrive surpasses image-based and video-based baselines in risk-object localization and interpretability, highlighting its potential for enhancing safety in autonomous driving systems.

arXiv cs.AI | 33 d ago | found 10 d ago | #autonomous-driving #risk-understanding #vision-language

Something’s off with Midjourney’s pivot to body scanners

Midjourney announced a pivot from image generation to medical imaging with the introduction of a novel ultrasound scanner designed to create high-quality images comparable to MRI technology. The scanner operates by submerging users in water, aiming to make medical imaging more accessible and user-friendly. This shift could impact practitioners by providing a new approach to non-invasive imaging, potentially integrating AI-driven analysis in medical diagnostics.

The Verge — AI | 33 d ago | found 21 d ago | #midjourney #medical imaging #ultrasound

Lift4D: Harmonizing Single-View 3D Estimation for 4D Reconstruction In-the-Wild

The article discusses the release of Lift4D, a new framework designed for harmonizing single-view 3D estimation to enable 4D reconstruction in dynamic, real-world environments. It leverages advanced techniques in neural networks to improve the accuracy and robustness of 3D reconstructions from single images, addressing challenges such as occlusions and varying scene complexities. This framework is significant for practitioners as it enhances the capabilities of existing models in 3D perception tasks, allowing for more reliable integration of 3D data in applications like robotics and augmented reality.

Hacker News | 33 d ago | found 12 d ago | #3d #reconstruction

ByteDance's Seedance 2.5 breaks the 30-second barrier for AI video generation

At the FORCE conference, ByteDance unveiled Seedance 2.5, a new AI video generation model capable of producing videos longer than 30 seconds. The longer clip length could benefit creative applications and content production workflows for practitioners in AI and multimedia domains.

The Decoder | 34 d ago | found 21 d ago | #bytedance #video_generation #ai_models

VideoAgent: All-in-One Framework for Video Understanding and Editing

VideoAgent is a newly proposed framework for video understanding and editing that addresses limitations in existing automated systems by enabling coherent narrative creation and diverse editing operations. It features automated video shot creation through shot planning agents and a multi-agent orchestration framework that integrates over thirty specialized editing agents, achieving an orchestration success rate of 87-95% and reducing API costs by 60%. This framework outperforms current multimodal LLMs and offers human-like video quality, making it a significant advancement for practitioners in video editing and AI-driven content creation.

arXiv cs.AI | 34 d ago | found 15 d ago | #video understanding #editing

The Impact of VAE Design on Latent Pose Representations for Diffusion-based Sign Language Production

This study examines the impact of variational autoencoder (VAE) design on latent pose representations for diffusion-based sign language production. The authors analyze how architectural choices and training objectives influence the latent space structure and subsequently affect the performance of a latent diffusion model, revealing that variations in generative performance, assessed via back-translation BLEU scores, are often more closely linked to latent space properties than to VAE reconstruction accuracy. This insight is crucial for practitioners as it suggests that optimizing latent space characteristics may enhance the efficacy of text-to-sign generation models.

arXiv cs.AI | 34 d ago | found 20 d ago | #sign language #vae #diffusion

Where Should Knowledge Enter? A Layered Framework for Knowledge Infusion in Multimodal Iterative Generative Model

The article presents a layered framework for knowledge infusion in multimodal iterative generative models, identifying four distinct intervention layers: surface, trajectory, latent, and parametric. It applies this framework to diffusion models, demonstrating that implementing multiple layers cumulatively can significantly reduce knowledge-violating outputs, achieving a 70.97% reduction compared to standard generation methods. This framework is crucial for practitioners as it provides a structured approach to enhance the reliability of generative models in safety-critical applications.

arXiv cs.AI | 34 d ago | found 13 d ago | #knowledge #generative #models

Human and AI collaboration for pulmonary nodule segmentation

The article presents Hi-Seg, a human-in-the-loop segmentation framework for pulmonary nodules that integrates the Segment Anything Model (SAM) with human collaboration. In a study involving chest CT scans from 1,179 patients, Hi-Seg achieved a mean Dice score of nearly 85%, surpassing five leading deep learning models by 10-22% and 13 SAM variants by 1-29%. This approach demonstrates the potential to enhance segmentation accuracy while decreasing annotation time, suggesting a transformative impact on clinical workflows and the integration of AI in medical practices.

arXiv cs.AI | 34 d ago | found 15 d ago | #ai #segmentation #medical #collaboration
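
For readers unfamiliar with the Dice score reported in the item above, the following sketch computes the standard definition, twice the overlap divided by the total size of the two masks, on small binary masks. The masks are made-up toy data, not results from the study, and returning 1.0 for two empty masks is a convention chosen here.

```python
def dice_score(pred_mask, true_mask):
    """Dice = 2 * |A and B| / (|A| + |B|) for two binary masks of equal shape."""
    pred = [v for row in pred_mask for v in row]
    true = [v for row in true_mask for v in row]
    intersection = sum(p * t for p, t in zip(pred, true))
    total = sum(pred) + sum(true)
    if total == 0:
        return 1.0  # both masks empty: treated as a perfect match
    return 2.0 * intersection / total


# Toy 2x3 masks: 1 marks a nodule pixel, 0 marks background.
predicted = [[0, 1, 1],
             [0, 1, 0]]
reference = [[0, 1, 0],
             [0, 1, 1]]

score = dice_score(predicted, reference)
```

Here two of the three predicted pixels overlap the reference, so the score is 4/6, roughly 0.67.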

Balancing Performance and Diversity in GRPO Autoregressive Text-to-Image Post-Training

The paper presents a novel approach to autoregressive text-to-image generation by employing a GRPO-style online reinforcement learning framework that dynamically addresses reference-policy divergence using a unified f-divergence framework. Key findings indicate that using Jensen-Shannon (JS) divergence for policy optimization enhances both performance and diversity in generated outputs, outperforming existing methods in experiments conducted on LlamaGen and Janus-7B. This work is significant for practitioners as it provides a theoretically grounded method to improve alignment with human preferences while maintaining generation diversity, which is crucial for developing more robust T2I models.

arXiv cs.AI | 34 d ago | found 20 d ago | #text-to-image #autoregressive #alignment
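
The Jensen-Shannon divergence named in the item above can be sketched for two discrete distributions as follows. This only shows the divergence itself, for example between a policy's and a reference model's next-token probabilities; how the paper incorporates it into the GRPO objective is not described in the summary and is not shown. The example distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical next-token distributions over a 3-token vocabulary.
policy = [0.6, 0.3, 0.1]
reference = [0.4, 0.4, 0.2]

divergence = js_divergence(policy, reference)
```

Unlike the KL divergence, this quantity stays finite even when the two distributions put mass on different tokens, which is one reason it is a common choice for a bounded regularizer.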

SteerVTE: Seamless Video Text Editing with Style and Glyph Control

SteerVTE is a novel framework for video text editing that utilizes a frozen video diffusion model to achieve precise modifications while maintaining stylistic consistency. It incorporates a lightweight text context adapter with a style encoder and dual-granularity glyph encoders, along with a glyph-aware spatial-focal loss and a three-stage training curriculum. This approach, supported by the SteerVTE-1M dataset of one million triplets, significantly enhances text accuracy, style consistency, and temporal coherence compared to existing baselines, making it a valuable tool for practitioners in video editing and AI-driven content creation.

arXiv cs.AI | 34 d ago | found 15 d ago | #video editing #text editing

RS-Gen: A Multi-Stage Agentic Framework for Reasoning and Search-Augmented Image Generation

RS-Gen is a novel multi-stage agentic framework designed for reasoning and search-augmented image generation, addressing limitations in existing models when faced with ambiguous intentions and Out-of-Distribution knowledge. It features a "Questioning-and-Solving" closed-loop mechanism that enhances logical reasoning and fills knowledge gaps without the need for additional training. Experimental results show RS-Gen achieves significant performance improvements on the WISE Verified and RISEBench benchmarks, elevating the Qwen-Image and Qwen-Image-Edit-2511 models to state-of-the-art status among open-source solutions, which is crucial for practitioners seeking to enhance image generation capabilities in dynamic contexts.

arXiv cs.AI | 34 d ago | found 15 d ago | #image generation #agents

Bagpiper: Solving Open-Ended Audio Tasks via Rich Captions

Bagpiper is an 8 billion parameter audio foundation model designed to address open-ended audio tasks using rich captions, which are detailed natural language descriptions that capture cognitive concepts from audio signals. Pre-trained on a dataset of 600 billion tokens, Bagpiper employs a caption-then-process approach during fine-tuning, enabling it to outperform existing models like Qwen-2.5-Omni, CosyVoice3, and TangoFlux in audio understanding and generation tasks. This model's holistic approach to audio processing represents a significant advancement for practitioners, facilitating the synthesis and understanding of complex audio compositions without relying on task-specific supervision.

arXiv cs.CL | 34 d ago | found 12 d ago | #audio foundation models #rich captions

Pocket-Dentist: On-Device Dental Image Understanding via Efficient Multimodal Large Language Models

Pocket-Dentist introduces an efficient benchmark for dental multimodal question answering, utilizing three datasets from BRAR and MetaDent to evaluate 14 vision-language models (VLMs). Notably, a compact 2B-parameter model, Pocket-Dentist-2B, demonstrates competitive performance with larger models while achieving a 4.9x reduction in latency and 2.3x lower memory usage when deployed on an iPhone 17 Pro. This development is significant for practitioners as it enables practical, privacy-preserving dental screening on consumer devices, enhancing accessibility and efficiency in clinical settings.

arXiv cs.AI | 34 d ago | found 13 d ago | #vision-language #dental #llm

LK Jam: System Architecture and Implementation of a Real-Time Human-AI Interactive Music Generation System using Role-Aware GRU

The LK_Jam system is a real-time, bidirectional human-computer interactive music generation framework utilizing a lightweight Gated Recurrent Unit (GRU) architecture. It features a multi-dimensional sparse event stream for dynamic music interaction, a strict multithreaded lock-free communication bridge, and employs the RTNeural inference engine to maintain low-latency performance with $O(1)$ complexity in autoregressive decoding. This approach enables high-quality musical coherence and role-aware interaction, making it a significant advancement for practitioners developing AI systems in live music settings.

arXiv cs.AI | 34 d ago | found 15 d ago | #music generation #interactive AI

Integrating Facial Generation into Full-Duplex Spoken Dialogue Systems

Moshi-Face is introduced as the first full-duplex dialogue model that integrates audio and facial expression processing, enhancing natural communication in voice conversations. It employs a vector-quantized variational autoencoder (VQ-VAE) for encoding 3D head meshes into discrete face tokens and incorporates a Face Transformer module for non-autoregressive generation of these tokens. This advancement allows for real-time synchronization of speech and facial motion, achieving low-latency audiovisual alignment while maintaining the dialogue quality of the original Moshi model.

arXiv cs.CL | 34 d ago | found 12 d ago | #dialogue systems #facial generation #audio

Efficient Multimodal Clinical Question Answering for Pulmonary Embolism Risk Assessment

The study presents a benchmark for multimodal large language models (MLLMs) applied to pulmonary embolism (PE) risk assessment, utilizing the INSPECT dataset comprising 23,248 CTPA studies. The research evaluates models like Gemma4 E4B and Gemma4 E2B across various input modalities (CTPA only, EHR only, and combined) using zero-shot and few-shot prompting, revealing better performance in diagnostic tasks compared to prognostic ones. This work highlights the potential of compact multimodal models in enhancing early-stage PE risk detection and clinical decision-making.

arXiv cs.AI | 34 d ago | found 20 d ago | #clinical question answering #pulmonary embolism #MLLM

Extraction and Analysis of Multimodal Concepts in Vision Language Models through Sparse Autoencoders

The article presents a framework utilizing Sparse Autoencoders (SAEs) to extract and analyze visual, textual, and multimodal concepts from Vision Language Models (VLMs), addressing the limitations of existing methods that treat these modalities separately. Experiments conducted on the LLaVA-NeXT VQA dataset show an improvement in visual concept quality by up to 45% compared to previous SAE-based approaches, while maintaining high quality for textual concepts. This work enhances understanding of VLMs' internal processing, facilitating better interpretation and utilization of multimodal concepts for practitioners in AI.

arXiv cs.AI | 34 d ago | found 16 d ago | #vlm #sparse_autoencoders #concept_analysis

Deep Learning-Based Sign Language Recognition from Videos and Cross-Lingual Translation to Indian Vernaculars

A new two-stage deep learning pipeline for sign language recognition and translation has been developed, utilizing a fine-tuned VideoMAE video transformer for classifying Indian sign language videos into English labels, followed by translation into Hindi, Telugu, and Bengali using Meta AI's NLLB-200 model. The classification model achieved 99% training accuracy and 78% validation accuracy on a 13-class subset of the AI4Bharat Indian Sign Language corpus, processing 16-frame clips at 224 x 224 resolution. This work addresses the lack of automated tools for low-resource Indian languages, highlighting significant implications for accessibility and communication within the deaf and hard-of-hearing community while acknowledging its limitations and future development needs.

arXiv cs.AI | 34 d ago | found 20 d ago | #sign_language #video #translation #deep_learning

Render-FM: Feedforward Model for Real-time Photorealistic Volumetric Rendering

Render-FM is a novel feedforward model for photorealistic volumetric rendering of CT scans, achieving a 500x speedup by regressing 6D Gaussian Splatting (6DGS) parameters in just 2.8 seconds per scan, compared to hours for traditional methods like NeRF. It incorporates Anatomy-Guided Priming (AGP) to leverage segmentation masks and transfer functions, enhancing its ability to generalize across different anatomies and support real-time rendering without extensive preparation. This advancement facilitates clinical workflows by providing immediate, high-quality visualizations, significantly improving the efficiency of medical imaging applications.

arXiv cs.AI | 34 d ago | found 14 d ago | #volumetric rendering #neural networks #medical imaging

HaineiFRDM: Structure-Preserving Diffusion for Film Restoration under Fast Motion and Diverse Defects

The HaineiFRDM model has been introduced for film restoration, addressing challenges in fast motion and diverse defects. It employs a patch-wise strategy with position-aware global fusion modules to preserve scene structure and enhance texture consistency through a frequency-based module, achieving high-resolution restoration on a single 24GB-VRAM GPU. This model is significant for practitioners as it improves restoration quality and reduces memory requirements, facilitating better handling of film defects in dynamic scenes.

arXiv cs.AI | 34 d ago | found 14 d ago | #film-restoration #diffusion-models

Beyond 'One Language, One Script': Quantifying Orthographic Bias in Multilingual VLMs with PuMVR

The paper introduces PuMVR (Punjabi Multimodal Visual Reasoning), a benchmark designed to assess orthographic bias in Vision-Language Models (VLMs) by evaluating 375 image-reasoning tasks across three scripts of Punjabi. The study reveals significant discrepancies in model performance, with accuracy differences of up to 16% between scripts and Script Consistency Rates (SCR) as low as 24.8%, highlighting the limited transferability of reasoning across scripts. This work emphasizes the need for script-agnostic evaluation metrics, challenging existing multilingual assessment methods and advocating for more equitable AI solutions.

arXiv cs.AI | 34 d ago | found 16 d ago | #vlm #bias #multilingual

Improving Text-to-Music Generation with Human Preference Rewards

The article presents an entry to the Academic Text-to-Music (ATTM) Grand Challenge, introducing a system that integrates a learned human-preference reward from TuneJury into a 120M-parameter FluxAudio-S model. Key innovations include a training-time reward conditioning method, a variety of score-conditioning architectures, and a preference-tuning pass for improved audio-text alignment. This approach enhances text-to-music generation by leveraging human preferences, which could lead to more refined outputs in practical applications of AI music generation.

arXiv cs.AI | 34 d ago | found 16 d ago | #text-to-music #human-preference #audio

Kamera: Unified Position-Invariant Multimodal KV Cache for Training-Free Reuse

The paper introduces Kamera, a novel approach for enhancing position-invariant multimodal key-value (KV) caching, which enables the reuse of cached data without the need for re-encoding. By implementing a low-rank conditioning patch alongside each cached chunk, Kamera addresses the loss of cross-chunk conditioning that traditional KV caches suffer from, significantly improving multi-hop reasoning accuracy while reducing computational overhead. This method demonstrates substantial performance gains on benchmarks like MM-NIAH and two-page doc-QA, making it particularly valuable for practitioners working on multimodal AI systems that require efficient resource utilization during processing.

arXiv cs.AI | 34 d ago | found 15 d ago | #multimodal #kv cache

Mitigating Cross-Image Information Leakage in Multi-Image Understanding with Large Vision-Language Models

The article presents FOCUS, a training-free and architecture-agnostic method designed to mitigate cross-image information leakage in Large Vision-Language Models (LVLMs) when processing multi-image inputs. By masking all but one image with random noise, FOCUS enables the model to concentrate on a single clear image, leading to improved performance on various multi-image benchmarks and demonstrating generalization to video understanding. This approach is significant for practitioners as it enhances multi-image reasoning capabilities without requiring additional training or changes to the model architecture.

arXiv cs.AI | 34 d ago | found 14 d ago | #vision-language-models #information-leakage
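
The masking step described in the item above can be sketched as follows: for each image in a multi-image input, build a variant in which every other image is replaced by random noise. Images are plain nested lists here, and the helper names are hypothetical. How FOCUS queries the model with these variants and combines the outputs is not described in the summary, so that part is left out.

```python
import random


def mask_all_but_one(images, keep_index, rng):
    """Copy `images`, replacing every image except `keep_index` with
    uniform random noise of the same height and width."""
    masked = []
    for idx, image in enumerate(images):
        if idx == keep_index:
            masked.append([row[:] for row in image])  # keep this image clear
        else:
            masked.append([[rng.random() for _ in row] for row in image])
    return masked


def focused_variants(images, seed=0):
    """One masked input per image, each keeping a different image clear."""
    rng = random.Random(seed)
    return [mask_all_but_one(images, i, rng) for i in range(len(images))]


# Three hypothetical 2x2 grayscale images.
images = [
    [[0.0, 1.0], [1.0, 0.0]],
    [[1.0, 1.0], [0.0, 0.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
variants = focused_variants(images)
```

Each variant preserves the number and position of images in the input, so the prompt layout stays unchanged while only one image carries real content.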

One Image is All You Need: Agentic One-Shot Image Generation via Text-Based World Models for Long-Tail Spatial Perception

The article introduces WMGen-v1, a novel agentic text-based world model framework designed for generating long-tail spatial data, crucial for applications like autonomous driving. This framework leverages a Large Vision-Language Model (LVLM) to create structured scene representations from a single reference image, while a Large Language Model (LLM) guides the scene expansion under physical and commonsense constraints. Benchmark results indicate that detectors trained on WMGen-v1 synthetic data can achieve performance comparable to those trained on real-world data, addressing the challenges posed by data scarcity in safety-critical scenarios.

arXiv cs.AI | 34 d ago | found 16 d ago | #image-generation #spatial-perception #one-shot

Semantic Browsing: Controllable Diversity for Image Generation

The article introduces a method called Semantic Browsing, which enhances diversity in image generation by allowing users to navigate structured image galleries through meaningful axes of variation. This approach leverages Vision Language Models (VLMs) to decouple semantic decision-making from pixel generation, enabling controlled diversity directly at the text level rather than relying on stochastic variations. This innovation is significant for practitioners as it facilitates more interpretable and user-driven exploration of generated images, addressing the common issue of output collapse in traditional text-to-image models.

arXiv cs.AI | 34 d ago | found 15 d ago | #image generation #semantic browsing #diversity

Text-to-Image Generative AI for Modeling and Simulation: Methods, Opportunities, and Applications

The article introduces a tutorial on text-to-image generative AI, aimed at the modeling and simulation (M&S) community, highlighting its potential applications such as visualizing simulation outcomes and generating educational materials. It provides practical workflows and conceptual guidance on integrating text-to-image generation into M&S tasks, emphasizing the translation of prompts and outputs into visual scenes. This resource is significant for practitioners as it equips them with the knowledge to effectively evaluate and incorporate image generation techniques into their simulation processes.

arXiv cs.AI | 34 d ago | found 16 d ago | #text-to-image #generative ai #modeling

MIRCaps: A Large-Scale Mixed-Domain Dataset with Image-Level and Region-Level Captions for Fine-Grained Vision-Language Learning

The article introduces MIRCaps, a large-scale multimodal dataset designed for fine-grained vision-language learning, featuring 141,364 images, 981,947 image-level captions, and 1,742,264 region-level captions with 1,391,779 bounding box annotations. This dataset aims to improve Vision-Language Models (VLMs) by providing diverse caption types that enhance the learning of visual attributes. Experimental results indicate that lightweight VLMs such as SmolVLM-256M-Instruct, BLIP, and Qwen2.5-VL-3B-Instruct can be effectively fine-tuned using MIRCaps, making it a valuable resource for practitioners in the field.

arXiv cs.AI 34 d ago found 16 d ago #vision-language #dataset #fine-grained learning

EnTrust: Modeling Inter-Modal Conflict for Trustworthy Multimodal Medical Image Analysis

EnTrust is a novel framework for multimodal medical image analysis that addresses inter-modal conflict to enhance predictive reliability. It features an EnFuse module that disentangles multimodal features into anatomical consensus, modality-specific cues, and conflict signals, and employs a diffusion-based generative segmentation model called SegDiff. Achieving state-of-the-art segmentation accuracy across four medical benchmarks while reducing calibration error by 40% compared to leading methods, EnTrust offers a more efficient alternative to deep ensembles, operating with a single model at approximately half the memory usage, which is crucial for practitioners seeking reliable and interpretable AI solutions in clinical settings.

arXiv cs.AI 34 d ago found 16 d ago #multimodal #medical #image analysis

Text Dictates, Music Decorates: Energy-based Attention for Editable Dance Motion Generation

The article introduces STREAM (Structural-Temporal Rhythmic Energy-based Attention for Motion), a novel diffusion transformer designed for choreographic motion generation that effectively separates conditioning pathways for text and music. It utilizes Adaptive Layer Normalization (AdaLN) for kinematic structure control and a Bimodal Energy-Based Attention Module (BEAM) to align musical beats without compromising semantic integrity. The accompanying Motorica++ dataset enhances training with domain-specific vocabulary and annotations, while the Exchange Evaluation Protocol and Editable Dance Score (EDS) provide metrics for evaluating zero-shot editability, making STREAM a significant advancement for practitioners focused on controllable AI in artistic applications.

arXiv cs.AI 34 d ago found 20 d ago #motion_generation #dance #attention

Hierarchical Concept-to-Appearance Guidance for Multi-Subject Image Generation

The paper introduces Hierarchical Concept-to-Appearance Guidance (CAG) for multi-subject image generation, addressing identity inconsistency and compositional control in existing diffusion models. The framework employs a VAE dropout training strategy to enhance semantic signal reliance and integrates a correspondence-aware masked attention module within the Diffusion Transformer (DiT) to ensure precise attribute binding. This approach achieves state-of-the-art results in multi-subject image generation, improving prompt adherence and subject consistency, which is crucial for practitioners aiming to enhance image synthesis quality in AI applications.

arXiv cs.AI 34 d ago found 14 d ago #image-generation #guidance #multi-subject

Skeleton-to-Image Encoding: Enabling Skeleton Representation Learning via Vision-Pretrained Models

The article introduces Skeleton-to-Image Encoding (S2I), a new method that converts 3D human skeleton sequences into image-like representations, facilitating the application of vision-pretrained models for self-supervised learning of skeleton data. By organizing joints based on body-part semantics and standardizing image dimensions, S2I addresses the challenges of heterogeneous skeleton formats and the lack of large-scale datasets. Experimental results on NTU-60, NTU-120, and PKU-MMD datasets show that S2I effectively enhances skeleton representation learning and supports cross-modal action recognition, making it a significant advancement for practitioners in multi-modal AI applications.

arXiv cs.AI 34 d ago found 14 d ago #skeleton representation #vision models #action recognition

Happy Young Women, Grumpy Old Men? Emotion-Driven Demographic Biases in Synthetic Face Generation

This study audits demographic biases in synthetic faces generated by text-to-image (T2I) models, analyzing 56,000 images produced by eight models (four Western and four Chinese) under various emotional prompts. It finds a significant overrepresentation of young, White-coded faces, and reports that negatively valenced emotional prompts skew outputs towards male, middle-aged individuals and reduce perceived attractiveness for certain demographic combinations, such as young Black females. These findings highlight the necessity for comprehensive, intersectional audits of T2I models that consider emotional context to address pervasive biases prior to deployment.

arXiv cs.AI 34 d ago found 14 d ago #t2i #demographic-biases #synthetic-faces

PIVOTSBench: Evaluating Fine-Grained Interpersonal Relationship Reasoning in Multimodal Large Language Models

PIVOTSBench is a newly introduced benchmark designed to evaluate the fine-grained interpersonal relationship reasoning capabilities of multimodal large language models (MLLMs), utilizing data from Social-IQ 2.0 and YouTube. The benchmark includes auxiliary tasks that assess models' ability to identify visual cues critical for predicting interpersonal dimensions, and it features evaluations on both proprietary and open-source MLLMs through ablation studies focusing on visual modalities and social role information. This development is significant for practitioners as it addresses a gap in understanding how MLLMs can leverage multimodal inputs for nuanced social reasoning, potentially enhancing applications in social AI.

arXiv cs.CL 34 d ago found 13 d ago #benchmark #reasoning #mllm

Synthesizing the Lombard Effect: Multi-Level Control of Speech Clarity and Vocal Effort in TTS

The article presents a flow-matching based text-to-speech (TTS) model designed to simulate the Lombard effect, enabling continuous control over vocal effort and articulation. The model incorporates pseudo-labels for these parameters and allows for word-level emphasis, which enhances clarity in challenging acoustic environments. Experimental results indicate that the model effectively improves speech intelligibility in noisy conditions, making it significant for practitioners focused on developing more adaptive and human-like TTS systems.

arXiv cs.CL 34 d ago found 13 d ago #tts #speech synthesis #vocal effort

Beyond Templates: Revisiting Zero-Shot Remote Sensing through Meta-Prompting

The article presents a comprehensive evaluation of 17 vision-language model (VLM) variants applied to zero-shot Earth Observation tasks using Meta-Prompting for Visual Recognition (MPVR) across 12 remote sensing datasets. It highlights the sensitivity of zero-shot performance to the design of textual prompts and class descriptions, demonstrating that while LLM-generated descriptions are semantically richer, they can introduce noise that undermines robustness. The study emphasizes the effectiveness of lightweight query embedding calibration in enhancing zero-shot classification and retrieval, providing valuable insights for practitioners in optimizing model performance in remote sensing applications.

arXiv cs.AI 34 d ago found 16 d ago #vision-language #zero-shot #remote-sensing

GroundShot: Visually Consistent Multi-Shot Long Video Generation via Entity-Grounded Shot Scheduling

GroundShot is a novel framework for generating visually consistent multi-shot videos, addressing the challenge of entity drift across shots without requiring additional training or model modifications. It employs an entity-level visual memory system to schedule shot generation based on the reliability of entities, thereby enhancing consistency in their appearances. Additionally, the introduction of GroundBench provides a diagnostic benchmark to evaluate consistency at the entity level, demonstrating that GroundShot significantly outperforms existing methods in maintaining visual coherence across multi-shot sequences.

arXiv cs.AI 34 d ago found 16 d ago #video #generation #entities

TailorMind: Towards Preference-Aligned Multimodal Content Generation

TailorMind is a novel framework for personalized multimodal content generation that integrates collaborative preference modeling with controllable generation techniques. It employs hypergraph collaborative filtering to enhance user profiles and utilizes retrieval-augmented style control to align outputs with user-generated content patterns, achieving improved coherence, novelty, and aesthetic quality compared to existing generation baselines. The accompanying TailorBench benchmark evaluates performance across five dimensions, with TailorMind demonstrating up to 29% gains in recall, making it a significant advancement for practitioners focused on generating user-tailored content without relying on existing datasets.

arXiv cs.AI 34 d ago found 20 d ago #multimodal #content-generation #preference-alignment

TriMotion: Modality-Agnostic Camera Control for Video Generation

TriMotion is a newly proposed modality-agnostic framework for camera-controlled video generation that integrates video, pose, and text inputs into a unified motion embedding space. It utilizes a Motion Triplet Dataset for synchronized supervision and introduces a latent motion consistency objective to ensure generated videos adhere to target camera trajectories without pixel-space decoding. This approach enhances flexibility in applications such as sequential motion composition and cross-modal motion interpolation, making it a significant advancement for practitioners in generative systems.

arXiv cs.AI 34 d ago found 16 d ago #video #generation #camera

Scaling Audio Models Efficiently: A Joint Study of Compute Constraints and Optimization Behavior

This study presents a unified framework for optimizing compute allocation in speech processing tasks, specifically Automatic Speech Recognition (ASR) and Speech Emotion Recognition (SER), by analyzing model size, input length, and representation resolution. Experiments on the LibriSpeech and CREMA-D datasets reveal that scaling model size from Tiny (39M) to Small (244M) significantly reduces word error rate (WER) by 8.22%, but further scaling to Medium (769M) yields diminishing returns of only 2.35%. The findings indicate an optimal audio duration for SER and suggest that reducing encoder token resolution can effectively lower inference costs with minimal impact on performance, providing valuable guidelines for practitioners designing efficient speech models.

arXiv cs.AI 34 d ago found 15 d ago #speech synthesis #tts #natural language
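
The item above reports its ASR results as changes in word error rate (WER). As a rough illustration of what that metric measures, here is a minimal Python sketch of the usual definition: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The function name and the example sentences are assumptions for illustration and are not taken from the study, which may apply text normalization that this sketch omits.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only one row of the DP table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution ("sat" -> "sit") and one deletion ("the").
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {wer:.3f}")
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words.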

VideoLatent: Video-Language Learning via Latent Self-Forcing

VideoLatent is a novel multimodal large language model (MLLM) designed for video understanding and reasoning, introducing a latent injection module that employs a latent self-forcing training paradigm. This model achieves significant computational efficiency, reducing training and inference overhead by approximately 6x and 68x, respectively, while outperforming existing MLLMs across 14 benchmarks in both general video understanding and complex reasoning tasks. Its reliance on standard video-question-answer triplets enhances scalability and transferability, making it a valuable tool for practitioners in the field of AI who require efficient video processing capabilities.

arXiv cs.AI 34 d ago found 14 d ago #video-language #reasoning #ml

DreamUV: Unwrap Artist-like UV by End-to-End Flow Matching

DreamUV is an end-to-end learning framework designed for UV parameterization in 3D content creation, addressing the gap between geometric distortion objectives and artistic preferences. It formulates UV unwrapping as a generative Flow Matching problem, employing a boundary-aware training strategy and a Model-in-the-Loop Finetuning scheme to enhance seam geometry and account for discretization errors. Evaluated on a large-scale dataset of artist-authored UVs, DreamUV achieves superior boundary straightness and axis-aligned island tightness compared to classical and learning-based methods, making it a significant advancement for practitioners seeking to produce artist-like UV layouts efficiently.

arXiv cs.AI 34 d ago found 15 d ago #uv mapping #3d content #generative models

Streaming T5-based Text-to-Speech Synthesis with Limited Lookahead

S5-TTS, a streaming variant of T5-TTS, has been introduced to address latency issues in text-to-speech synthesis by enabling word-by-word incremental speech generation. It utilizes encoder-decoder language modeling and a lookahead-causal masking mechanism with Conv-based auxiliary attention, achieving comparable quality to full-context T5-TTS while significantly reducing end-to-end response latency. This development is crucial for practitioners in conversational AI, as it allows for immediate speech generation and maintains intelligibility and speaker similarity, enhancing user experience in real-time applications.

arXiv cs.AI 34 d ago found 16 d ago #text-to-speech #llm #streaming #synthesis

Gold Points Sniper: Self-guided Visual Reasoning in VLM for Fine-grained Action Understanding

The article introduces Gold Points Sniper (GPS), a novel framework designed to enhance lightweight vision-language models (VLMs) with self-guided multimodal reasoning for fine-grained human action understanding. It includes three modules: Gold Points Extractor for identifying critical action-relevant details, Selective Socratic Questioner for refining these details, and Semantic Entailment Evaluator for assessing factual consistency. Experimental results show that GPS significantly improves performance on a curated dataset, achieving results comparable to proprietary models like GPT-4o while ensuring higher factual accuracy, which is crucial for reliable human-robot interaction in everyday environments.

arXiv cs.AI 34 d ago found 15 d ago #visual reasoning #action understanding #vlm

ThermoLLM: Thermodynamics-Aware HVAC Control with Spatial-Semantic Knowledge Graph

The paper introduces ThermoLLM, a thermodynamics-aware HVAC control framework that leverages a physics-informed spatial knowledge graph for managing a five-zone EnergyPlus building simulation. By integrating building semantics with recent interaction history, the model enhances decision-making regarding thermal dynamics and zone coupling, outperforming standard control baselines and LLM-based alternatives in energy-comfort trade-offs and PMV violations. This framework is significant for practitioners as it demonstrates a novel approach to HVAC control that incorporates structured spatial reasoning, potentially leading to more efficient and responsive building management systems.

arXiv cs.AI 34 d ago found 20 d ago #hvac #control #llm

CulMind: Benchmarking Multimodal Understanding and Reasoning in Chinese Cultural Heritage

CulMind and CulMind-R are newly introduced benchmarks for evaluating Multimodal Large Language Models (MLLMs) in the context of Chinese Cultural Heritage (CCH), encompassing 50 tasks from over 100 museums and a 24-task reasoning subset. The benchmarks utilize ReaScore, a task-adaptive metric for assessing reasoning quality by weighting task-specific dimensions, revealing significant discrepancies between model answers and reasoning quality, particularly on complex tasks. This resource enables practitioners to conduct more nuanced evaluations of MLLMs' understanding of cultural heritage, fostering advancements in multimodal reasoning capabilities.

arXiv cs.CL 34 d ago found 13 d ago #llm #benchmark #cultural heritage

Bagpiper-TTS: Natural Language Guided Universal Speech Synthesis

Bagpiper-TTS is a newly introduced universal speech synthesis system that interprets natural language prompts to generate comprehensive captions, which then guide speech synthesis. The model supports a wide range of applications beyond traditional TTS, such as multi-talker synthesis and singing voice synthesis. It achieves a 1.7% Word Error Rate (WER) on the Seed-TTS-Eval benchmark and matches the performance of specialized models in evaluations. Its flexibility in handling diverse user requests is significant for practitioners seeking to develop more adaptive and capable TTS systems.

arXiv cs.AI 34 d ago found 15 d ago #video generation #language models #video understanding

2D Versus 3D Diffusion for In Silico Training of Interventional X-ray AI Models

This study introduces two methods for synthesizing training data for interventional X-ray AI models: a 3D conditional latent diffusion model that generates CT volumes and a 2D diffusion model that produces synthetic X-ray images. Experiments reveal that models trained on synthetic 2D X-rays can achieve performance comparable to those trained on real X-ray data for anatomical landmark detection. This approach potentially alleviates the bottleneck of obtaining annotated high-resolution anatomical models, offering a scalable solution for generating diverse datasets necessary for robust AI model development in medical imaging.

arXiv cs.AI 34 d ago found 16 d ago #diffusion #x-ray #training data

MMOU: A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos

The MMOU benchmark has been introduced to evaluate multimodal understanding and reasoning in long, complex videos, consisting of 20,000 questions and 11,877 curated videos across diverse domains. It assesses 13 fundamental skill categories requiring integration of visual, audio, and textual signals, with evaluations revealing significant performance gaps: the best closed-source model achieves 64.2% accuracy, while the top open-source model only reaches 46.8%. This benchmark underscores the limitations of current multimodal models in handling omni-modal reasoning over extended content, providing insights into failure modes that practitioners can address in future model development.

arXiv cs.CL 34 d ago found 12 d ago #benchmark #reasoning #videos

EmoInstruct-TTS: Dual-Path Instruction-Guided Emotional Speech Synthesis

EmoInstruct-TTS is a newly proposed dual-path framework for emotional speech synthesis that allows users to specify emotions through natural language instructions. It features Emotion2embed, a supervised semantic-acoustic embedding covering 48 emotional states with fine-grained intensity levels, and utilizes the Instruction-Conditioned Emotion Flow Model (ICE-Flow) to generate emotion representations. This system enhances emotional controllability and the naturalness of synthesized speech, making it a significant advancement for practitioners looking to implement nuanced emotional expression in AI-generated speech.

arXiv cs.AI 34 d ago found 15 d ago #tts #emotional synthesis #llm

Scaling Diverse Language Generation for 3D Visual Grounding

The article introduces ViGiL3D++, a scalable method for 3D visual grounding (3DVG) that enhances the generation of diverse grounding queries by integrating constraint sampling from scene graphs with language generation from large language models (LLMs). This approach demonstrates improved diversity compared to existing datasets and enhances performance on multiple 3DVG benchmarks, while also highlighting the limitations of vision-language models (VLMs). This development is significant for practitioners as it addresses the challenge of generalizing spatial language understanding in 3D environments, which is crucial for the deployment of AI agents in real-world applications.

arXiv cs.CL 34 d ago found 13 d ago #3D visual grounding #language generation

Multimodal Image Colorization: Quantifying the Impact of Text-Conditioned Guidance on Grayscale-to-Color Translation

This study evaluates the impact of text conditioning on grayscale-to-color image translation using two architectures: U-Net and Stable Diffusion 1.5. The introduction of CLIP text conditioning resulted in significant improvements in performance metrics, with U-Net showing a 5.6% increase in PSNR and a 36.6% increase in colorfulness, while Stable Diffusion 1.5 achieved a 5.8% increase in PSNR. These findings underscore the effectiveness of integrating text guidance in enhancing the quality of automated colorization, which is critical for applications in historical restoration and medical imaging.

arXiv cs.CL 34 d ago found 13 d ago #image colorization #text conditioning
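
The colorization item above quotes its gains in PSNR (peak signal-to-noise ratio). As a rough illustration of that metric, here is a minimal Python sketch of the standard definition, 10 * log10(MAX^2 / MSE), measured in decibels. The function name, the flat-list image representation, and the 8-bit peak value of 255 are assumptions for illustration; the study's actual evaluation pipeline is not described in the summary.

```python
import math


def psnr(reference, test, max_value=255.0):
    """PSNR in dB between two equal-sized images given as flat sequences of pixel values."""
    if len(reference) != len(test) or len(reference) == 0:
        raise ValueError("images must be non-empty and the same size")

    # Mean squared error over all pixel values.
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical two-pixel example: each pixel is off by 10 intensity levels.
print(f"{psnr([100, 100], [110, 90]):.2f} dB")
```

Since PSNR is logarithmic, a percentage increase in the dB value corresponds to a larger proportional drop in mean squared error.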

Porting the Moebius 0.2B image inpainting model to run in the browser with Claude Code

The Moebius 0.2B image inpainting model has been successfully ported to run in the browser using WebGPU, allowing users to interactively remove regions from images and generate inpainted results. This lightweight model, capable of delivering 10B-level performance, originally required PyTorch and NVIDIA CUDA for operation. The browser-based implementation expands accessibility for practitioners, enabling real-time inpainting capabilities without the need for specialized hardware or software environments.

Simon Willison 34 d ago found 12 d ago #image-inpainting #webgpu #browser

Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance

Moebius is a lightweight image inpainting framework featuring a model size of 0.2 billion parameters while achieving performance levels comparable to models with 10 billion parameters. This framework offers significant efficiency in image restoration tasks, making it an important tool for practitioners focusing on resource-constrained environments or applications requiring fast inference times. The architecture optimizes performance without the overhead of larger models, providing a practical solution for real-time image inpainting.

Reddit r/LocalLLaMA 34 d ago found 21 d ago #Moebius #image_inpainting

Local text to image model comparison: The ultimate test.

A comparative evaluation of local text-to-image models was conducted using 192 prompts to assess their capabilities in generating images based on text input, focusing on aspects such as text accuracy, facial representation, human anatomy, and spatial composition. The results, which include generated images and performance assessments against various visual language models (VLMs), are accessible via links to both the image gallery and the GitHub repository for prompts. This analysis is significant for practitioners as it provides insights into the performance of local models relative to leading APIs, aiding in the selection and optimization of models for specific applications in text-to-image generation.

Reddit r/LocalLLaMA 35 d ago found 21 d ago #text_to_image #benchmark

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

GeoVR is a novel framework designed to enhance the 3D awareness of Multimodal Large Language Models (MLLMs) by learning geometric representations from 2D video sequences, addressing the limitations of existing models in maintaining geometric and spatial consistency. It employs a multi-objective learning strategy focused on estimating camera poses, regressing depth maps, predicting scale factors, and distilling multi-scale 3D features, leading to significant improvements in spatial reasoning benchmarks. This advancement is crucial for practitioners as it establishes a new paradigm for integrating spatial intelligence into foundation models, enabling more robust applications in AI systems that require a deeper understanding of spatial relationships.

arXiv cs.AI 38 d ago found 22 d ago #mlm #3d-representations #video

BrainG3N: A Dual-Purpose Tokenizer for Controllable 3D Brain MRI Generation

The article introduces BrainG3N, a dual-purpose tokenizer for controllable 3D brain MRI generation built on a volumetric masked-autoencoder (MAE) architecture. The approach decouples the encoder, which is pretrained on 35,309 volumes and produces clinically informative embeddings, from a CNN decoder used for voxel reconstruction, and it achieves superior performance on a 23-task linear-probing benchmark compared to state-of-the-art models. This development is significant for practitioners as it enables both enhanced clinical task performance and the capability for conditional generation and patient-specific forecasting in neuroimaging applications.

arXiv cs.AI 38 d ago found 24 d ago #3d-mri #generation #tokenizer

Bring My Cup! Personalizing Vision-Language-Action Models with Visual Attentive Prompting

The article introduces Visual Attentive Prompting (VAP), a novel training-free perceptual adapter designed to enhance Vision-Language-Action (VLA) models for personalized commands by enabling top-down selective attention. VAP utilizes reference images as a non-parametric visual memory to ground user-specific objects through open-vocabulary detection, significantly improving performance in personalized manipulation tasks as demonstrated by new benchmarks, Personalized-SIMPLER and Personalized-VLABench. This advancement is crucial for practitioners as it enhances the ability of robotic systems to accurately identify and manipulate specific objects in real-world scenarios, thereby improving usability in personalized applications.

arXiv cs.AI 38 d ago found 23 d ago #vision_language_action #personalization #robotics

Confidence Calibration for Multimodal LLMs: An Empirical Study through Medical VQA

This study presents a novel method for improving confidence calibration in Multimodal Large Language Models (MLLMs) applied to Medical Visual Question Answering (VQA) by integrating Multi-Strategy Fusion-Based Interrogation (MS-FBI) with expert LLM assessments. The proposed approach achieved a 40% reduction in Expected Calibration Error (ECE) across three Medical VQA datasets, underscoring the need for domain-specific calibration to enhance the reliability of MLLMs in healthcare applications. This advancement is critical for practitioners as it aims to mitigate the risks of misdiagnosis and improve the trustworthiness of AI-assisted medical decisions.

arXiv cs.AI 38 d ago found 23 d ago #multimodal LLMs #medical VQA #confidence calibration
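
The calibration item above reports its result as a reduction in Expected Calibration Error (ECE). As a rough illustration of what that metric measures, here is a minimal Python sketch of the common equal-width-bin estimate: predictions are grouped by confidence, and the gap between each bin's accuracy and its mean confidence is averaged, weighted by bin size. The function name, the default of 10 bins, and the toy inputs are assumptions for illustration; the paper's own binning scheme is not described in the summary.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("need matching, non-empty confidence and correctness lists")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Map a confidence in [0, 1] to a bin index; 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Hypothetical toy case: the model claims 90% confidence but is right 3 times out of 4.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

A perfectly calibrated model scores 0, so a lower ECE means stated confidences track observed accuracy more closely.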

NEST: Narrative Event Structures in Time for Long Video Understanding

The article introduces NEST (Narrative Event Structures in Time), a dataset for long video understanding comprising 1,005 full-length movies, each annotated with 102 multimodal narrative events that integrate visual content, dialogue, and audio. The benchmark establishes baselines for event trigger detection (ETD), event localization (EL), event argument extraction (EAE), and event relation extraction (ERE), and the reported scores show how difficult the tasks are, with ETD below 8% and EL under 6%. This dataset is significant for practitioners as it enhances the understanding of narrative structures in long videos, facilitating advancements in vision-language models and their ability to process complex temporal relationships in multimedia content.

arXiv cs.CL 38 d ago found 22 d ago #video-understanding #narrative #vision-language #dataset

FreeStyle: Free Control of Style-Content Dual-Reference Generation from Community LoRA Mining

FreeStyle is a new framework for style-content dual-reference image generation that leverages community LoRA mining to create large-scale triplet datasets for training. It features a two-stage curriculum with mechanisms to prevent style leakage, including an attention-level enrichment constraint and a frequency-aware RoPE modulation strategy. This framework introduces a benchmark for evaluating style similarity, content preservation, and leakage rejection, demonstrating improved performance in balancing these aspects, which is critical for practitioners aiming to enhance the fidelity and reliability of generative models in AI applications.

arXiv cs.AI 38 d ago found 23 d ago #style-content #dual-reference #generation #lora

TeleMorpher: Toward Robust Simultaneous Motion-Location Editing

TeleMorpher is a novel one-shot framework for simultaneous motion-location editing in video, addressing a gap in existing techniques. It utilizes motion priors and a training-free pose warping method to enhance control and precision in editing, while also introducing two new LPIPS-based metrics for evaluating background consistency and motion fidelity. This advancement is significant for practitioners as it enables more robust and reliable editing of dynamic video content, improving the quality of generated outputs in applications such as film and animation.

arXiv cs.AI 38 d ago found 23 d ago #motion editing #diffusion models #video generation

The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation

The study evaluates 12 open-weight vision-language models (VLMs) in binary classification tasks across two clinical neuroimaging datasets, FOR2107 and OASIS-3. It finds that smaller models can achieve up to 58% F1 score improvements when neuroimaging context is introduced, largely due to prompt framing rather than actual data integration, indicating a phenomenon termed the "scaffold effect." These results highlight the potential pitfalls of relying on surface-level performance metrics in clinical AI applications, emphasizing the need for deeper evaluation of multimodal reasoning capabilities.

arXiv cs.AI 38 d ago found 23 d ago #clinical #vlm #neuroimaging

Hybrid Diffusion Transformer for Instruction-Guided Audio Editing via Rectified Flow

The article presents a hybrid two-stage diffusion transformer architecture for instruction-guided audio editing, utilizing rectified flow matching to enhance performance while maintaining efficiency. This model leverages joint attention mechanisms over audio and text tokens, initially establishing coarse semantic alignment at a low-resolution stage before refining details at high resolution, thus addressing the limitations of existing convolutional U-Net approaches. The proposed framework demonstrates significant improvements in editing tasks with overlapping audio events and complex instructions, making it a valuable tool for practitioners focused on efficient and precise audio content manipulation.

arXiv cs.AI 38 d ago found 23 d ago #audio editing #diffusion models #instruction-guided

How Do Instructions Shape Speech? Cross-Attention Attribution for Style-Captioned Text-to-Speech

The paper introduces a novel approach for analyzing the influence of style captions on speech generation in text-to-speech (TTS) systems, specifically using cross-attention attribution applied to speech diffusion models, including CapSpeech-TTS. By adapting the DAAM framework, the authors provide insights through per-token heatmaps across 25 layers and 24 ODE steps, revealing that style tokens exhibit lower temporal variance and peak attention in early diffusion steps, which is critical for enhancing controllability in expressive TTS. This research is significant for practitioners as it elucidates the interaction between natural language input and acoustic output, potentially guiding improvements in TTS model design and performance.

arXiv cs.AI 38 d ago found 23 d ago #text-to-speech #style-captioned #cross-attention

PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

PerceptionDLM is a newly proposed multimodal diffusion language model designed for efficient parallel region perception, overcoming limitations of existing autoregressive models in handling multiple region captioning tasks. Built on PerceptionDLM-Base, it employs efficient prompting and structured attention masking to enable simultaneous perception of multiple masked regions, resulting in significant improvements in inference speed. The introduction of the Parallel Detailed Localized Captioning Benchmark (ParaDLC-Bench) allows for comprehensive evaluation of caption quality and efficiency, demonstrating PerceptionDLM's competitive performance in multi-region tasks and underscoring its potential for practitioners in AI visual perception applications.

arXiv cs.AI · 38 d ago · found 23 d ago · #multimodal-llm #diffusion-models #perception
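
One plausible reading of the "structured attention masking" in the PerceptionDLM entry above is a block mask in which every region's caption tokens see the shared image and prompt tokens plus their own region, but not other regions. The sketch below illustrates that assumption only; the function name and the exact masking rule are hypothetical and are not taken from the paper.

```python
def build_region_mask(n_shared, region_lengths):
    """Return mask[q][k] == True where query position q may attend to key k.

    Positions 0..n_shared-1 are shared context; each region's tokens follow.
    """
    groups = [-1] * n_shared  # -1 marks shared context
    for region, length in enumerate(region_lengths):
        groups += [region] * length
    total = len(groups)
    return [
        [groups[k] == -1 or groups[k] == groups[q] for k in range(total)]
        for q in range(total)
    ]
```

Because regions never attend to each other under this rule, their captions can be decoded in a single batched pass rather than one region at a time.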

MakeupMirror: Improving Facial Attribute Preservation in Diffusion Models for Makeup Transfer

MakeupMirror is a newly proposed diffusion-based model for makeup transfer that enhances facial attribute preservation, addressing limitations in identity and skin color retention seen in prior models like Stable-Makeup. Key innovations include facial geometry conditioning with ControlNets, region-specific makeup application, skin tone modulation, and a Levenberg-Marquardt Langevin sampler for faster inference, achieving a 60% improvement in facial recognition similarity and a 50% reduction in skin tone differences. This model is significant for practitioners as it enables more realistic virtual try-on experiences in makeup shopping, enhancing user satisfaction and accuracy in augmented reality applications.

arXiv cs.AI · 38 d ago · found 23 d ago · #makeup transfer #diffusion models #facial attributes

QC-GAN: A Parameter-Efficient Quaternion Conformer GAN for High-Fidelity Speech Enhancement

The QC-GAN framework takes a parameter-efficient approach to speech enhancement, combining a Quaternion Conformer generator with MetricGAN-based training to reach high fidelity with only 0.89 million parameters. It attains a PESQ score of 3.48 on the VoiceBank+DEMAND dataset, comparable to state-of-the-art models at a fraction of the parameter count. A smaller variant with 35K parameters, about 4% of the full model's size, reaches a PESQ score of 3.23, which makes the approach relevant to resource-constrained speech processing.

arXiv cs.AI · 38 d ago · found 22 d ago · #speech enhancement #GAN #audio
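
Quaternion layers such as those in the QC-GAN entry above are built on the Hamilton product, which shares one set of four real weights across four output components. The sketch below shows the product and a generic parameter-count comparison for a dense layer; the helper names are hypothetical and the counts illustrate quaternion layers in general, not QC-GAN's actual architecture.

```python
def hamilton_product(p, q):
    """Multiply quaternions given as (real, i, j, k) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )


def real_dense_params(n_in, n_out):
    """Weights in an ordinary real-valued dense layer (bias ignored)."""
    return n_in * n_out


def quaternion_dense_params(n_in, n_out):
    """Weights when inputs and outputs are grouped into quaternions.

    Each of the (n_in / 4) * (n_out / 4) quaternion weights stores 4 reals.
    """
    return (n_in // 4) * (n_out // 4) * 4
```

For a 64-to-64 layer this gives 4,096 real weights versus 1,024 quaternion-layer weights, a fourfold reduction at the same input and output width.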

See-and-Reach: Precise Vision-Language Navigation for UAVs within the Field of View

The article introduces UAV-VLN-FOV, a new target-visible navigation task that isolates the see-and-reach stage for UAVs, facilitating a more precise evaluation of their ability to ground visible targets and execute 3D motion. It presents 3DG-VLN, a vision-language waypoint prediction framework that utilizes dynamic 3D direction cues and processes high-resolution front and downward views to enhance visual grounding and spatial alignment, resulting in a 13.82% improvement in success rate over existing UAV-VLN baselines. This advancement is significant for practitioners as it provides a dedicated benchmark and source code for developing more accurate navigation systems in UAV applications.

arXiv cs.AI · 38 d ago · found 23 d ago · #uav #vision-language #navigation

PhysDrift: Bridging the Embodiment Gap in Humanoid Co-Speech Motion Generation

PhysDrift is a new embodiment-aware co-speech motion generation framework designed for humanoid robots, which directly predicts executable joint trajectories from speech, bypassing the traditional human-centric retargeting methods. The framework introduces IK-EER, optimizing kinematic feasibility and speech-motion alignment, and shows improvements in speech-motion synchronization, physical plausibility, and real-time interaction capabilities. This advancement addresses the embodiment gap in humanoid motion generation, enhancing expressive behaviors and motion diversity, which is crucial for practitioners developing more capable humanoid systems.

arXiv cs.AI · 38 d ago · found 24 d ago · #humanoid #motion #generation

MedRLM: Recursive Multimodal Health Intelligence for Long-Context Clinical Reasoning, Sensor-Guided Screening, Evidence-Grounded Decision Support, and Community-to-Tertiary Referral Optimization

MedRLM is introduced as a Recursive Multimodal Health Intelligence framework designed for long-context clinical reasoning and decision support, addressing the limitations of existing medical large language models that rely on single-step prompts. The framework utilizes a Clinical Evidence Graph Memory to integrate diverse patient data, including electronic health records, medical images, and sensor signals, enabling recursive inspection and synthesis of information. This approach enhances clinical decision-making by facilitating deeper reasoning in response to abnormal patterns and supporting clinician review through uncertainty-gated refinement, thereby advancing the capabilities of AI in real-world clinical settings.

arXiv cs.AI · 38 d ago · found 23 d ago · #health #llm #clinical #agents

FlowEdit: Associative Memory for Lifelong Pronunciation Adaptation in Flow-Matching TTS

FlowEdit is a lifelong adaptation framework designed for frozen flow-matching text-to-speech (TTS) systems, enabling them to learn pronunciation corrections without retraining. It utilizes a Modern Hopfield Network for content-addressable episodic memory, optimizing token-level perturbations in the text embedding space based on corrective feedback. In benchmarks involving 312 multilingual proper nouns, FlowEdit achieved a 92.7% reduction in target-word Phoneme Error Rate compared to the zero-shot baseline, while maintaining general-speech quality, with corrections processed in about 15 seconds on a single GPU.

arXiv cs.AI · 38 d ago · found 24 d ago · #tts #pronunciation #adaptation
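
The content-addressable memory in the FlowEdit entry above is a Modern Hopfield Network, whose retrieval step is a softmax over query-key similarities followed by a weighted sum of stored values. The sketch below shows that generic retrieval rule only; the function name, the `beta` value, and the toy keys and values are hypothetical, and how FlowEdit builds its keys and perturbations is not shown.

```python
import math


def hopfield_retrieve(query, keys, values, beta=8.0):
    """Return (retrieved_value, weights) for one query.

    weights = softmax(beta * <query, key_i>); larger beta sharpens retrieval.
    """
    scores = [beta * sum(a * b for a, b in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    retrieved = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(dim)
    ]
    return retrieved, weights
```

A query close to one stored key retrieves almost exactly that key's value, which is what lets a stored correction be recalled for a specific word while leaving unrelated inputs largely untouched.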

TerraMind: Large-Scale Generative Multimodality for Earth Observation

TerraMind is introduced as the first generative multimodal foundation model specifically designed for Earth observation, utilizing a dual-scale representation that integrates both token-level and pixel-level data across nine geospatial modalities. The dual-scale early fusion approach supports zero-shot and few-shot applications, and a "Thinking-in-Modalities" (TiM) capability lets the model generate additional artificial data to improve its outputs. TerraMind is reported to outperform existing models on community benchmarks such as PANGAEA, and its pretraining dataset, model weights, and code are openly available.

arXiv cs.AI · 38 d ago · found 23 d ago · #earth observation #generative models

DeepSeek Introduces Vision

DeepSeek has announced the release of Vision, a new model designed for enhanced visual understanding tasks. While specific model size and architecture details are not provided, Vision is expected to improve performance on benchmark datasets relevant to image recognition and processing. This development is significant for practitioners as it may enable more effective integration of visual data into existing AI workflows.

Hacker News · 39 d ago · found 24 d ago · #vision #deepseek

Would you still call this Dax? Novel Visual References in VLMs and Humans

The article introduces the Novel Visual References Dataset (NVRD), comprising 19,176 images and 90 visual concepts designed to investigate how vision-language models (VLMs) and humans map novel visual references to language. The dataset includes progressively perturbed versions of objects to assess generalization capabilities, revealing that models struggle with in-context learning of novel concepts that contradict prior knowledge, and tend to overgeneralize compared to human judgments. This work provides a new benchmark for understanding visual concept learning, which is crucial for improving VLM performance in real-world applications.

arXiv cs.CL · 39 d ago · found 24 d ago · #vision-language-models #novel-concepts

SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction

The article introduces Semantic Anchor-aligned Multimodal Augmentation (SAMA), a novel framework designed to enhance Multimodal Information Extraction (MIE) tasks like Multimodal Named Entity Recognition, Relation Extraction, and Event Extraction, particularly in low-resource settings. SAMA employs a Collaborative Multi-Experts Multimodal Large Language Model (CME-MLLM) with a Universal Adapter and Task-Specific Adapters for generating high-fidelity synthetic data, alongside an Anchor-Preserving Diffusion mechanism for image synthesis and a Dual-Constraint Filtering module for sample selection. The framework demonstrates superior performance over existing augmentation methods across multiple benchmark datasets, highlighting its potential for improving data scarcity challenges in multimodal AI applications.

arXiv cs.CL · 39 d ago · found 24 d ago · #multimodal #information extraction #data augmentation

TurnGuide: Enhancing Meaningful Full Duplex Spoken Interactions via Dynamic Turn-Level Text-Speech Interleaving

TurnGuide is a novel approach for enhancing Full-Duplex Speech Language Models (FD-SLMs) by implementing dynamic turn-level text-speech interleaving, which allows for improved modeling of conversational turn-taking. This method addresses the challenges of integrating discrete text tokens into continuous audio streams, enabling more natural and coherent spoken interactions while maintaining the acoustic flow. Experimental results indicate that TurnGuide achieves state-of-the-art performance in generating semantically meaningful speech across various turn-taking scenarios, making it a valuable advancement for practitioners working on real-time spoken dialogue systems.

arXiv cs.CL · 39 d ago · found 24 d ago · #speech #dialogue #llm

Midjourney goes from generating cat images to full-body ultrasound scans

Midjourney has announced its first hardware product, the Midjourney Scanner, an ultrasound-based full-body scanner that employs a ring of sensors for image capture. This shift from generating images to medical imaging represents a significant expansion of the company's capabilities and could influence the integration of AI in healthcare diagnostics. The move highlights the potential for AI models to transition from creative applications to practical, real-world health solutions.

The Verge — AI · 39 d ago · found 25 d ago · #midjourney #ultrasound #image generation

Midjourney Medical

The linked item provides no specific details about a release, publication, or announcement related to Midjourney Medical, so there are no technical details or implications for practitioners to summarize.

Hacker News · 39 d ago · found 24 d ago · #midjourney #medical #ai

TRELLIS.2 now runs natively on MLX (Image to 3d object model)

TRELLIS.2 has been ported natively to MLX for Apple Silicon, enabling image-to-3D object model generation. The implementation supports resolution outputs of 512x512 and 1024x1024, with generation times of approximately 70 seconds for 512x512 and 300-700 seconds for 1024x1024 on an M4 Max with 128GB unified memory. This native support enhances usability in real workflows, making it a valuable tool for practitioners in 3D modeling and AI-driven design.

Reddit r/LocalLLaMA · 39 d ago · found 29 d ago · #trellis-2 #image-to-3d #mlx

DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation

DeMaVLA is a newly introduced Vision-Language-Action (VLA) foundation model designed for generalizable deformable manipulation tasks, specifically targeting the folding of clothing items across various conditions. It utilizes a Vision-Language Model (VLM) backbone augmented with an action expert that employs flow matching for continuous action generation, while optimizing efficiency by pruning transformer layers. Pre-trained on 5,000 hours of dual-arm demonstrations and fine-tuned using a human-in-the-loop Data Aggregation pipeline, DeMaVLA demonstrates competitive performance on RoboTwin 2.0 and strong results in real-world household folding tasks, underscoring its potential for scalable manipulation capabilities in robotics.

arXiv cs.AI · 40 d ago · found 25 d ago · #vision-language #robotics

Enhancing Pathological VLMs with Cross-scale Reasoning

The paper introduces a novel cross-scale reasoning paradigm for vision-language models (VLMs) in pathology, addressing the need for multi-magnification reasoning in image interpretation. It presents Scale-VQA, a benchmark comprising 4,685 questions based on 2,537 pathology images, and introduces ScaleReasoner-R1, a model trained via reinforcement learning that achieves state-of-the-art results on this new benchmark and established single-scale benchmarks. This advancement is significant for practitioners as it enhances the ability of VLMs to integrate multi-scale evidence, improving diagnostic accuracy in pathological assessments.

arXiv cs.AI · 40 d ago · found 28 d ago · #pathology #vision-language model #cross-scale reasoning

Co-PLNet: A Collaborative Point-Line Network for Prompt-Guided Wireframe Parsing

Co-PLNet is introduced as a novel framework for wireframe parsing that integrates point and line detection tasks through a Point-Line Prompt Encoder (PLP-Encoder) and a Cross-Guidance Line Decoder (CGL-Decoder). This architecture enables spatial cue exchange to improve prediction consistency and robustness, achieving enhanced accuracy and real-time efficiency in structured geometry perception, as demonstrated in experiments on the Wireframe and YorkUrban datasets. This approach is significant for practitioners as it addresses the limitations of existing methods by providing a more integrated solution for geometric representation in applications like SLAM.

arXiv cs.AI · 40 d ago · found 28 d ago · #wireframe #parsing #geometry

m2sv: A Scalable Benchmark for Map-to-Street-View Spatial Reasoning

The article introduces m2sv, a scalable benchmark for map-to-street-view spatial reasoning, consisting of m2sv-20k and m2sv-sft-11k datasets aimed at improving vision-language models (VLMs) in aligning overhead maps with Street View images. The benchmark reveals that the best VLM achieves only 65.2% accuracy, significantly lower than human annotators' average of 72.0%, indicating persistent challenges in geometric alignment and reasoning consistency. This work emphasizes the need for advancements in grounded spatial reasoning, making it crucial for practitioners focusing on multimodal AI applications.

arXiv cs.AI · 40 d ago · found 28 d ago · #spatial reasoning #vision-language

Geometry-Consistent Endoscopic Representations for Image-Guided Navigation via Structured Foundation Model Adaptation

The article presents a unified framework for enhancing geometry-consistent representations in monocular endoscopy, addressing challenges in pose estimation and depth prediction. It introduces Hierarchy-Aware Geometry-Semantic Adaptation, which employs selective low-rank adapters within the transformer architecture and integrates geometric supervision from synthetic data. Experimental results demonstrate improved representation quality and performance in navigation tasks, indicating that this approach could significantly aid practitioners in developing more reliable endoscopic navigation systems.

arXiv cs.AI · 40 d ago · found 28 d ago · #endoscopy #navigation #image representation
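
The "selective low-rank adapters" in the endoscopy entry above follow the general low-rank adaptation pattern: a frozen weight matrix W plus a trainable update B·A of small rank r. The sketch below shows that generic pattern and its parameter count; the function names, the `alpha` scaling convention, and the example dimensions are assumptions, and the paper's rule for selecting which layers receive adapters is not modeled.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=1.0):
    """Compute y = W x + (alpha / r) * B (A x).

    W is d_out x d_in (frozen); A is r x d_in and B is d_out x r (trainable).
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def adapter_params(d_in, d_out, r):
    """Trainable weights added by one rank-r adapter."""
    return r * (d_in + d_out)
```

For a 768-by-768 projection, a rank-8 adapter adds 12,288 trainable weights against 589,824 in the frozen matrix, which is why adapting only selected layers keeps fine-tuning cheap.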

Detail++: Training-Free Detail Enhancer for Text-to-Image Diffusion Models

Detail++ is a training-free framework designed to enhance text-to-image diffusion models by introducing a Progressive Detail Injection (PDI) strategy, which decomposes complex prompts into simplified sub-prompts for staged generation. The method leverages self-attention for global composition and employs cross-attention mechanisms along with a Centroid Alignment Loss to improve attribute consistency and reduce binding noise. Extensive experiments show that Detail++ outperforms existing methods on T2I-CompBench and a new style composition benchmark, making it particularly valuable for practitioners dealing with complex multi-object scenarios in T2I generation.

arXiv cs.AI · 40 d ago · found 28 d ago · #text-to-image #diffusion models

PearlVLA: Progressive Embodied Action-Plan Refinement in Latent Space

PearlVLA is a newly proposed Vision-Language-Action framework that enhances action planning by conducting deliberation in the latent space of a vision-language model. It employs a dual-branch architecture that separates visual grounding from iterative plan refinement, utilizing a lightweight frozen latent world model to optimize action generation with low latency. Empirical results on the LIBERO benchmark indicate that PearlVLA achieves state-of-the-art performance, making it significant for practitioners seeking efficient planning mechanisms in AI systems.

arXiv cs.AI · 40 d ago · found 28 d ago · #vision-language #action-planning