A research approach for expressive text-to-image generation that takes rich text formatting (font styles, colors, sizes, footnotes, hyperlinks) as input. The system extracts token maps from the cross-attention and self-attention maps of a diffusion model, which lets it apply localized edits tied to specific text tokens. Its techniques include spectral clustering to obtain precise spatial masks, per-token weighted averaging of noise predictions, region-based gradient guidance for RGB color control, and feature map injection to keep the structure consistent with the plain-text generation. The approach is demonstrated on Stable Diffusion XL, with examples such as changing hair color, applying artistic styles to specific objects, and referencing external images via hyperlinks.
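Two of the techniques named above, the per-token weighted averaging of noise predictions and the region-based color guidance, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `blend_noise` and `color_guidance_grad`, the array shapes, and the mean-color squared-error loss are all assumptions made for illustration, and the sketch works on plain arrays rather than on a real diffusion model.

```python
import numpy as np


def blend_noise(noise_preds, token_maps, eps=1e-8):
    """Blend per-region noise predictions with soft token maps.

    noise_preds: (R, C, H, W), one noise prediction per region prompt.
    token_maps:  (R, H, W), non-negative spatial weights per region.
    Both names and shapes are hypothetical, chosen for this sketch.
    """
    noise_preds = np.asarray(noise_preds, dtype=np.float64)
    token_maps = np.asarray(token_maps, dtype=np.float64)
    # Normalize so the weights at each pixel sum to (about) 1.
    weights = token_maps / (token_maps.sum(axis=0, keepdims=True) + eps)
    # Broadcast the weights over the channel axis and sum over regions.
    return (weights[:, None, :, :] * noise_preds).sum(axis=0)


def color_guidance_grad(image, mask, target_rgb, eps=1e-8):
    """Loss and gradient pulling a region's mean color toward a target.

    image: (3, H, W) predicted image, mask: (H, W), target_rgb: (3,).
    Assumed loss: squared distance between the masked mean color and
    the target color. The gradient is taken with respect to the image.
    """
    image = np.asarray(image, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)
    target_rgb = np.asarray(target_rgb, dtype=np.float64)
    area = mask.sum() + eps
    mean_rgb = (image * mask[None]).sum(axis=(1, 2)) / area
    diff = mean_rgb - target_rgb
    loss = float((diff ** 2).sum())
    # d(loss)/d(image[c, h, w]) = 2 * diff[c] * mask[h, w] / area
    grad = 2.0 * diff[:, None, None] * mask[None] / area
    return loss, grad
```

In a sampler, `blend_noise` would be called once per denoising step with one prediction per formatted text span, and the gradient from `color_guidance_grad` would be subtracted (scaled by a step size) to nudge the masked region toward the requested RGB value.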
