Thread
🚀Thrilled to release #TIP (Dual Text-Image Prompting), an #LLM enhanced with #DALLE2 / #StableDiffusion-2 that can generate coherent and authentic multimodal procedural plans toward a high-level goal. 🧵8

📜paper: arxiv.org/abs/2305.01795
🔗data & code: github.com/YujieLu10/TIP
🧵1/8 We introduce the multimodal procedural planning task and evaluate model performance on our newly collected WIKIPLAN and RECIPEPLAN datasets.
🧵2/8 We propose Text-Image Prompting (TIP), a dual-modality prompting approach that elicits procedural knowledge jointly from LLMs and T2I models, enabling visual-grounded text plans and textual-grounded image plans.
🧵3/8 The Text-to-Image Bridge (T2I-B) elicits visual imagination from the LLM: conditioned on the vanilla text plan, the LLM generates an explicit scene description (the imagination prompt) that guides the text-to-image model.
🧵4/8 The Image-to-Text Bridge (I2T-B) verbalizes the image plans and injects that description back into the LLM, so the revised text plan is visually grounded and aware of the multimodal context (rough sketch below).
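Putting 2/8-4/8 together, here is a minimal, hypothetical sketch of the dual-prompting loop. This is not the released TIP code: the model names, prompts, and helpers (`llm`, `tip_plan`) are assumptions, standing in for the paper's actual LLM, T2I, and captioning choices with an OpenAI-style chat API, a diffusers Stable Diffusion pipeline, and a BLIP captioner.

```python
# Illustrative sketch only; see github.com/YujieLu10/TIP for the real pipeline.
import torch
from openai import OpenAI
from diffusers import StableDiffusionPipeline
from transformers import pipeline

client = OpenAI()
t2i = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def llm(prompt: str) -> str:
    # Placeholder LLM call; any instruction-following model would do here.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def tip_plan(goal: str, num_steps: int = 5):
    # Vanilla text plan generated by the LLM alone.
    steps = llm(f"List {num_steps} short steps to: {goal}").splitlines()

    # T2I-Bridge: for each step, the LLM "imagines" an explicit scene
    # description, which conditions the text-to-image model.
    image_plan = []
    for step in steps:
        scene = llm(f"Describe, in one sentence, the visual scene for this step: {step}")
        image_plan.append(t2i(scene).images[0])

    # I2T-Bridge: verbalize the image plan and inject it back into the LLM
    # so the revised text plan stays grounded in the generated images.
    captions = [captioner(img)[0]["generated_text"] for img in image_plan]
    revised_text_plan = llm(
        f"Goal: {goal}\nDraft steps:\n" + "\n".join(steps)
        + "\nWhat the step images show:\n" + "\n".join(captions)
        + "\nRevise the steps so they stay consistent with the images."
    )
    return revised_text_plan, image_plan
```

For example, `tip_plan("make a paper airplane")` would return a revised step-by-step text plan plus one generated image per step.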
🧵5/8 Our TIP outperforms the baselines, generating image plans grounded in the textual context and supporting complex reasoning across text & visuals. Check out our results & more in the Appendix.
🧵6/8 Our WIKIPLAN and RECIPEPLAN datasets will also be released; check out examples in the paper's Appendix.
🧵7/8 We show that TIP substantially improves textual and visual informativeness, temporal coherence, and plan accuracy under both human and automatic evaluation. Check out extensive comparisons in our paper!
🧵8/8 A big shoutout to our fantastic team of authors: @lupantech, @ZhiyuChen4, @ZhuWanrong, @xwang_lk, @WilliamWangNLP.