StoryMaker

Image from [1]

StoryMaker: A Holistic Approach to Character Consistency

Text-to-image generation has advanced rapidly, with models like DALL-E and Stable Diffusion capturing the imagination of creators. Yet while these models excel at generating striking single images from textual descriptions, keeping characters consistent across a series of images remains a challenge. StoryMaker [1] is designed to address this by producing holistically consistent characters in text-to-image generation.

Introducing StoryMaker

StoryMaker tackles this issue head-on by not only preserving facial identities but also ensuring that clothing, hairstyles, and body features remain consistent across a series of images. It allows for variations in background, character poses, and styles while maintaining a cohesive narrative.

How It Works

StoryMaker conditions generation on both facial identity and cropped character images that cover clothing, hairstyle, and body. A Positional-aware Perceiver Resampler (PPR) merges these two sources of information from the reference images into distinct character features, which keeps each character visually consistent throughout a series of images. To decouple pose from the reference images, the model uses ControlNet, so poses can follow the text prompt or a specific pose reference. StoryMaker also employs LoRA (Low-Rank Adaptation) to improve image fidelity. To keep multiple characters and the background from blending into one another, it constrains the cross-attention regions of each character and of the background using an MSE loss with segmentation masks [1].
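The cross-attention constraint can be illustrated with a toy example. The paper [1] describes it as an MSE loss between cross-attention maps and segmentation masks; the sketch below is a minimal illustration under that reading, not the authors' implementation. The function names, the 2x2 maps, and the plain averaging over regions are all assumptions made for this example.

```python
from typing import List, Sequence

Map2D = List[List[float]]


def region_mse(attn: Map2D, mask: Map2D) -> float:
    """Mean squared error between an attention map and a binary mask."""
    total = 0.0
    count = 0
    for attn_row, mask_row in zip(attn, mask):
        for a, m in zip(attn_row, mask_row):
            total += (a - m) ** 2
            count += 1
    return total / count


def attention_region_loss(attn_maps: Sequence[Map2D], masks: Sequence[Map2D]) -> float:
    """Average the per-region MSE over every character and the background."""
    losses = [region_mse(a, m) for a, m in zip(attn_maps, masks)]
    return sum(losses) / len(losses)


# Toy 2x2 layout (hypothetical): the character fills the left column,
# the background fills the right column.
character_mask = [[1.0, 0.0], [1.0, 0.0]]
background_mask = [[0.0, 1.0], [0.0, 1.0]]

# Character attention that leaks onto the background ...
leaky = [[0.6, 0.4], [0.6, 0.4]]
# ... versus character attention that stays inside its own region.
focused = [[0.9, 0.1], [0.9, 0.1]]
# Background attention, identical in both cases.
bg_attn = [[0.1, 0.9], [0.1, 0.9]]

masks = [character_mask, background_mask]
loss_leaky = attention_region_loss([leaky, bg_attn], masks)
loss_focused = attention_region_loss([focused, bg_attn], masks)
```

In this toy case the leaky map yields a loss of about 0.085 and the focused map about 0.01, so minimising the loss pushes each character's attention into its own region.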

Results

In the quantitative evaluations reported in [1], StoryMaker scored well on image-text similarity and identity preservation, keeping faces, clothing, and hairstyles consistent.
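Similarity scores of this kind are commonly computed as the cosine similarity between embedding vectors, for instance a text embedding and an image embedding from a model such as CLIP. The sketch below assumes the embeddings are already available as plain lists of floats. The vectors and function names are hypothetical, and the exact metrics used in [1] may differ.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def mean_similarity(reference: Sequence[float],
                    generated: Sequence[Sequence[float]]) -> float:
    """Average similarity between one reference and several generated embeddings."""
    scores = [cosine_similarity(reference, g) for g in generated]
    return sum(scores) / len(scores)


# Made-up embeddings: the first generated image matches the reference exactly,
# the second is orthogonal to it.
reference_embedding = [1.0, 0.0, 0.0]
generated_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
consistency = mean_similarity(reference_embedding, generated_embeddings)
```

A score near 1 means the generated images stay close to the reference in embedding space; a score near 0 means they have drifted away from it.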

Visual comparisons show that it produces images where both single and multiple characters remain visually consistent, even when poses or backgrounds vary. StoryMaker excels at preserving the full character, including clothing and body features.

Conclusion

StoryMaker is a significant step forward in text-to-image generation, offering a solution to one of the biggest challenges in this field: character consistency. By ensuring that characters retain their identity, clothing, hairstyle, and body features across a series of images, StoryMaker enables creators to tell richer, more coherent stories. Whether it’s for comics, fashion, or personalized digital art, StoryMaker opens up new possibilities for creators looking to push the limits of what text-to-image models can achieve.

References

[1] Zhengguang Zhou, Jing Li, Huaxia Li, Nemo Chen, and Xu Tang, “StoryMaker: Towards Holistic Consistent Characters in Text-to-image Generation,” arXiv preprint arXiv:2409.12576, 2024. [Online]. Available: https://arxiv.org/abs/2409.12576.