Multimodal
TextHOI-3D: Text-to-3D Hand-Object Interaction via Discrete Multi-View Generation and Joint Mesh Optimization
TextHOI-3D is a text-to-3D generation framework for hand-object interactions that targets two difficulties: preserving the semantics of the text prompt and producing geometrically accurate results. A CLIP-conditioned visual autoregressive model generates multi-view observations in a compact VQ token space, and a joint optimization then recovers a unified hand-object mesh from those views. Reported benchmark results show improved object contact accuracy and reduced penetration volume, supporting multi-view visual tokens as an effective intermediate representation for 3D generative models.
3D generation, text-to-3D, hand-object interaction
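
For readers unfamiliar with the "compact VQ token space" mentioned in the summary, the sketch below shows generic vector quantization: each continuous feature vector is replaced by the index of its nearest codebook entry, and those indices are the discrete tokens an autoregressive model can predict. This is not the TextHOI-3D tokenizer. The function names (`quantize`, `dequantize`), the 2-D toy codebook, and the feature values are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    tokens = []
    for f in features:
        nearest = min(range(len(codebook)),
                      key=lambda k: squared_distance(f, codebook[k]))
        tokens.append(nearest)
    return tokens


def dequantize(tokens: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Look up the codebook vector for each token index."""
    return [list(codebook[t]) for t in tokens]


# Toy codebook with four 2-D entries (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Toy "image patch" features (hypothetical values).
features = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

tokens = quantize(features, codebook)
print(tokens)                        # [0, 3, 2]
print(dequantize(tokens, codebook))  # [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
```

In a learned tokenizer the codebook is trained jointly with an encoder and decoder; here it is fixed by hand so the lookup step can be read in isolation.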