LLMGA: Multimodal Large Language Model based Generation Assistant

The Chinese University of Hong Kong · ByteDance Inc.

Abstract

In this paper, we introduce a Multimodal Large Language Model-based Generation Assistant (LLMGA), leveraging the vast reservoir of knowledge and proficiency in reasoning, comprehension, and response inherent in Large Language Models (LLMs) to assist users in image generation and editing. Diverging from existing approaches where Multimodal Large Language Models (MLLMs) generate fixed-size embeddings to control Stable Diffusion (SD), our LLMGA provides a detailed language generation prompt for precise control over SD. This not only augments LLM context understanding but also reduces noise in generation prompts, yields images with more intricate and precise content, and elevates the interpretability of the network. To this end, we curate a comprehensive dataset comprising prompt refinement, similar image generation, inpainting & outpainting, and instruction-based editing. Moreover, we propose a two-stage training scheme. In the first stage, we train the MLLM to grasp the properties of image generation and editing, enabling it to generate detailed prompts. In the second stage, we optimize SD to align with the MLLM's generation prompts. Additionally, we propose a reference-based restoration network to alleviate texture, brightness, and contrast disparities between generated and preserved regions during inpainting and outpainting. Extensive results show that LLMGA has promising generation and editing capabilities and can enable more flexible and expansive applications in an interactive manner.
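To make this control flow concrete, here is a minimal sketch under stated assumptions: the MLLM emits a detailed text prompt rather than a fixed-size embedding, and Stable Diffusion consumes that prompt as ordinary conditioning text via the standard diffusers API. The refine_prompt stub and the SD checkpoint ID are illustrative assumptions, not the released LLMGA models.

import torch
from diffusers import StableDiffusionPipeline

def refine_prompt(user_prompt: str) -> str:
    # Hypothetical stand-in for the stage-one MLLM, which would expand the
    # user's request into a detailed generation prompt.
    return (f"{user_prompt}, wearing a weathered trench coat, standing in a "
            "rain-soaked neon-lit street at dusk, cinematic lighting, "
            "highly detailed, sharp focus")

# Stage two finetunes SD to follow the MLLM's detailed prompts; a stock
# checkpoint is used here purely for illustration.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(refine_prompt("a detective")).images[0]
image.save("detective.png")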

Method

Examples of LLMGA assisting in image generation and editing:

Interactive image generation and editing exemplify the comprehensive capabilities of LLMGA. Users can design satisfactory images by engaging in interactions with LLMGA, leveraging its vast knowledge and ideas.

Amazing and vivid generation results for T2I.

For T2I generation, LLMGA can refine the user's generation prompt to produce more vivid and vibrant images. For similar image generation, LLMGA can understand the components and layout of the input image and generate a similar image (a minimal sketch follows below).

For inpainting & outpainting, LLMGA can provide detailed generation prompts based on user preferences and the input image. For instruction-based editing, LLMGA can understand user instructions and perform accurate edits.
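As a rough illustration of the similar-image-generation capability mentioned above: an off-the-shelf BLIP captioner stands in for LLMGA's MLLM (which would produce a far more detailed description of components and layout), and Stable Diffusion regenerates from that description. Model IDs and the file name are assumptions.

import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils import load_image
from transformers import pipeline as hf_pipeline

# Describe the reference image; LLMGA's MLLM would give a much richer account
# of its components and layout than this generic captioner.
captioner = hf_pipeline("image-to-text",
                        model="Salesforce/blip-image-captioning-base")
description = captioner(load_image("reference.png"))[0]["generated_text"]

# Regenerate a similar image from the description.
sd = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
similar = sd(description).images[0]
similar.save("similar.png")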

Performance

Visual comparison on T2I

LLMGA can refine short prompts by adding details, such as clothing, background, and actions.
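A minimal sketch of this refinement step, with an arbitrary open instruction-tuned LLM standing in for the LLMGA MLLM; the model ID and system instruction are assumptions, chosen to elicit exactly the kinds of detail named above.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # assumption: any instruct LLM works as a stand-in
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [
    {"role": "system", "content": (
        "Expand the user's image prompt with concrete details about "
        "clothing, background, and actions. Reply with the expanded "
        "prompt only.")},
    {"role": "user", "content": "a girl walking her dog"},
]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_tensors="pt")
out = model.generate(inputs, max_new_tokens=120)
# Decode only the newly generated continuation (the refined prompt).
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))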


Visual comparison on T2I plus ControlNet

LLMGA can enhance the details in generated images, producing visually pleasing images.
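For reference, a standard diffusers ControlNet setup like the sketch below matches this comparison, with the MLLM-refined prompt supplied as the text condition; the checkpoint IDs, input file, and prompt are assumptions.

import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

# Extract a Canny edge map from the layout image to condition generation.
source = load_image("layout.png")  # hypothetical input image
edges = cv2.Canny(np.array(source), 100, 200)
edge_map = Image.fromarray(np.stack([edges] * 3, axis=-1))

# The detailed prompt would come from the MLLM; hard-coded here.
prompt = "an ornate victorian house at golden hour, intricate woodwork"
image = pipe(prompt, image=edge_map).images[0]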


Visual comparison on instruction-based editing

LLMGA exhibits strong instruction-based editing ability and produces high-quality image results.
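As an illustrative stand-in (LLMGA trains its own editing model on its instruction-based editing data), the public InstructPix2Pix pipeline performs the same kind of instructed edit:

import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = load_image("input.png")  # hypothetical source image
edited = pipe("make it look like winter", image=image).images[0]
edited.save("edited.png")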


Visual comparison on inpainting and outpainting

LLMGA can infer complete images based on input masked images.
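A minimal sketch of this path, assuming a stock SD inpainting checkpoint; in LLMGA the detailed prompt would come from the MLLM after it inspects the masked image, whereas here it is hard-coded.

import torch
from diffusers import StableDiffusionInpaintPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = load_image("masked_scene.png")  # hypothetical masked input
mask = load_image("mask.png")           # white = region to fill

# In LLMGA this prompt is produced by the MLLM from the masked image.
prompt = ("a wooden cabin by an alpine lake, pine trees, morning mist, "
          "soft golden light")
result = pipe(prompt, image=image, mask_image=mask).images[0]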


Visual comparison of image restoration methods

DiffRIR can alleviate the texture, contrast, and brightness disparities between generated and preserved regions in inpainting & outpainting results.
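DiffRIR itself is a learned reference-based restoration network; the sketch below is only a classical approximation of the disparity it targets, matching the per-channel mean and standard deviation of the generated region to the preserved (reference) region. File names are assumptions.

import numpy as np
from PIL import Image

def match_mean_std(generated: np.ndarray, reference: np.ndarray) -> np.ndarray:
    # Shift and scale each channel of the generated pixels so their mean and
    # std match those of the preserved reference pixels.
    out = generated.astype(np.float64)
    for c in range(out.shape[-1]):
        g = out[..., c]
        r = reference[..., c].astype(np.float64)
        out[..., c] = (g - g.mean()) / (g.std() + 1e-6) * r.std() + r.mean()
    return np.clip(out, 0, 255).astype(np.uint8)

img = np.array(Image.open("inpainted.png").convert("RGB"))
mask = np.array(Image.open("mask.png").convert("L")) > 127  # True = generated region

harmonized = img.copy()
harmonized[mask] = match_mean_std(img[mask], img[~mask])
Image.fromarray(harmonized).save("harmonized.png")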

BibTeX

      
@article{xia2023llmga,
  title={LLMGA: Multimodal Large Language Model based Generation Assistant},
  author={Xia, Bin and Wang, Shiyin and Tao, Yingfan and Wang, Yitong and Jia, Jiaya},
  journal={arXiv preprint arXiv:2311.16500},
  year={2023}
}