Thread
🚀Thrilled to release #TIP (Dual Text-Image Prompting), an #LLM enhanced with #DALLE2 / #StableDiffusion-2 that can generate coherent and authentic multimodal procedural plans toward a high-level goal. 🧵8

📜paper: arxiv.org/abs/2305.01795
🔗data & code: github.com/YujieLu10/TIP
🧵1/8 We introduce the multimodal procedural planning task and evaluate model performance on our newly collected WIKIPLAN and RECIPEPLAN datasets.
🧵2/8 We propose Text-Image Prompting (TIP), a dual-modality prompting approach that elicits procedural knowledge jointly from LLMs and T2I models, enabling visual-grounded text plans and textual-grounded image plans.
🧵3/8 The Text-to-Image Bridge (T2I-B) elicits visual imagination from the LLM: conditioned on the vanilla text plan, the LLM generates an explicit scene description (the imagination prompt) that guides the text-to-image model.
🧵4/8 The Image-to-Text Bridge (I2T-B) verbalizes the image plans and injects that description back into the LLM, so the revised text plan is visually grounded and aware of the multimodal context (rough sketch below).
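Putting 2/8-4/8 together, here is a minimal, hypothetical sketch of the dual-prompting loop. This is not the released TIP code: the model names, prompts, and helpers (`llm`, `tip_plan`) are assumptions, standing in for the paper's actual LLM, T2I, and captioning choices with an OpenAI-style chat API, a diffusers Stable Diffusion pipeline, and a BLIP captioner.

```python
# Illustrative sketch only; see github.com/YujieLu10/TIP for the real pipeline.
import torch
from openai import OpenAI
from diffusers import StableDiffusionPipeline
from transformers import pipeline

client = OpenAI()
t2i = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def llm(prompt: str) -> str:
    # Placeholder LLM call; any instruction-following model would do here.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def tip_plan(goal: str, num_steps: int = 5):
    # Vanilla text plan generated by the LLM alone.
    steps = llm(f"List {num_steps} short steps to: {goal}").splitlines()

    # T2I-Bridge: for each step, the LLM "imagines" an explicit scene
    # description, which conditions the text-to-image model.
    image_plan = []
    for step in steps:
        scene = llm(f"Describe, in one sentence, the visual scene for this step: {step}")
        image_plan.append(t2i(scene).images[0])

    # I2T-Bridge: verbalize the image plan and inject it back into the LLM
    # so the revised text plan stays grounded in the generated images.
    captions = [captioner(img)[0]["generated_text"] for img in image_plan]
    revised_text_plan = llm(
        f"Goal: {goal}\nDraft steps:\n" + "\n".join(steps)
        + "\nWhat the step images show:\n" + "\n".join(captions)
        + "\nRevise the steps so they stay consistent with the images."
    )
    return revised_text_plan, image_plan
```

For example, `tip_plan("make a paper airplane")` would return a revised step-by-step text plan plus one generated image per step.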
🧵5/8 Our TIP outperforms the baselines, generating image plans grounded in the textual context and supporting complex reasoning across text & visuals. Check out our results & more in the Appendix.
🧵6/8 Our WIKIPLAN and RECIPEPLAN datasets will also be released; check out examples in the paper's Appendix.
🧵7/8 We show that TIP substantially improves textual and visual informativeness, temporal coherence, and plan accuracy under both human and automatic evaluation. Check out extensive comparisons in our paper!
🧵8/8 A big shoutout to our fantastic team of authors: @lupantech, @ZhiyuChen4, @ZhuWanrong, @xwang_lk, @WilliamWangNLP.