Resources
🎬  Project Page
Abstract
Generating visual instructions is essential for developing interactive AI agents that support in-context learning and human skill transfer. However, existing work remains limited to creating "static" instructions, i.e., single images that solely depict action completion or final object states. In this paper, we take the first step towards shifting instructional image generation to video generation. While general image-to-video (I2V) models can animate images based on text prompts, they primarily focus on artistic creation, overlooking the evolution of object states and action transitions in instructional scenarios. To address this challenge, we propose ShowMe, a novel framework that enables plausible action-object state manipulation and coherent state prediction. Additionally, we introduce structure and motion reward tuning to improve structural fidelity and spatiotemporal coherence. Notably, our findings suggest that video diffusion models can inherently serve as action-object state transformers, showing great potential for performing state manipulation while ensuring contextual consistency. Both qualitative and quantitative experiments demonstrate the effectiveness and superiority of the proposed method.
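To give a rough intuition for the reward-tuning idea mentioned above, the following is a minimal, purely illustrative sketch: it assumes a placeholder video generator and hypothetical differentiable structure and motion rewards (`ToyVideoGenerator`, `structure_reward`, `motion_reward` are all invented names), and does not reflect the actual ShowMe architecture, reward definitions, or training procedure.

```python
# Illustrative sketch of reward tuning for a video generator.
# All modules and reward terms here are placeholder assumptions,
# not the paper's actual formulation.
import torch
import torch.nn as nn

class ToyVideoGenerator(nn.Module):
    """Stand-in for a video backbone: maps a conditioning image
    (B, C, H, W) to T predicted frames (B, T, C, H, W)."""
    def __init__(self, channels=3, frames=8):
        super().__init__()
        self.frames = frames
        self.net = nn.Conv2d(channels, channels * frames, kernel_size=3, padding=1)

    def forward(self, image):
        b, c, h, w = image.shape
        out = self.net(image)                      # (B, C*T, H, W)
        return out.view(b, self.frames, c, h, w)   # (B, T, C, H, W)

def structure_reward(frames, reference):
    # Hypothetical structure term: negative L1 distance of each
    # predicted frame to the conditioning image.
    return -(frames - reference.unsqueeze(1)).abs().mean()

def motion_reward(frames):
    # Hypothetical motion term: penalize abrupt frame-to-frame
    # changes to encourage spatiotemporal coherence.
    diffs = frames[:, 1:] - frames[:, :-1]
    return -diffs.pow(2).mean()

generator = ToyVideoGenerator()
optimizer = torch.optim.AdamW(generator.parameters(), lr=1e-4)

image = torch.rand(2, 3, 64, 64)       # dummy conditioning image
frames = generator(image)
reward = structure_reward(frames, image) + 0.1 * motion_reward(frames)
loss = -reward                         # maximize reward by gradient ascent
loss.backward()
optimizer.step()
```

The sketch only shows the generic pattern of differentiating a combined structure-plus-motion reward through generated frames; the actual rewards, weighting, and diffusion training loop used in the paper may differ substantially.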