Multimodal
Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling
Scone is a new unified understanding-generation model for subject-driven image generation. It integrates composition and distinction, addressing the limitations of existing approaches in handling multiple subjects. Training proceeds in two stages: the first focuses on composition, and the second enhances distinction through semantic alignment and attention-based masking. The accompanying SconeEval benchmark evaluates both composition and distinction. In the reported experiments, Scone surpasses existing open-source models on multiple benchmarks, which makes it relevant for practitioners who need to preserve subject identity in complex visual tasks.
image generation, subject-driven, modeling