
HOMEWORK 4: MULTI-MODAL FOUNDATION MODELS*
10-423/10-623 GENERATIVE AI
http://423.mlcourse.org

OUT: Mar. 13, 2024
DUE: Mar. 22, 2024
TAs: Asmita, Haoyang, Tiancheng

Instructions

Collaboration Policy: Please read the collaboration policy in the syllabus.

Late Submission Policy: See the late submission policy in the syllabus.

Submitting your work: You will use Gradescope to submit answers to all questions and code.
– Written: You will submit your completed homework as a PDF to Gradescope. Please use the provided template. Submissions can be handwritten, but must be clearly legible; otherwise, you will not be awarded marks. Alternatively, submissions can be written in LaTeX. Each answer should be within the box provided. If you do not follow the template, your assignment may not be graded correctly by our AI-assisted grader and there will be a 2% penalty (e.g., if the homework is out of 100 points, 2 points will be deducted from your final score).
– Programming: You will submit your code for programming questions to Gradescope. There is no autograder. We will examine your code by hand and may award marks for its submission.

Materials: The data that you will need in order to complete this assignment is posted along with the writeup and template on the course website.

Question                          Points
Instruction Fine-Tuning & RLHF         9
Latent Diffusion Model (LDM)           6
Programming: Prompt2Prompt            31
Code Upload                            0
Collaboration Questions                2
Total:                                48

*Compiled on Wednesday 13th March, 2024 at 17:17
1 Instruction Fine-Tuning & RLHF (9 points)

1.1. (6 points) Short answer: Highlight the differences between in-context learning, unsupervised pre-training, supervised fine-tuning, and instruction fine-tuning by defining each one. Assume we are interested specifically in autoregressive large language models (LLMs) over text. Each definition must mention properties of the training examples and how they are used, and how learning affects the parameters of the model.

Definition: in-context learning

Definition: unsupervised pre-training

Definition: supervised fine-tuning

Definition: instruction fine-tuning
1.2. (3 points) Ordering: Consider a correctly defined reinforcement learning with human feedback (RLHF) pipeline. Select the correct ordering of the items below to define such a pipeline by numbering them from 1 to N. If two items can occur simultaneously, number them identically. To exclude an item from the ordering, number it as 0.

– Repeat the previous step many times.
– Repeat the following steps many times.
– From human labelers, collect rankings of samples from the language model.
– Collect instruction fine-tuning training examples from human labelers.
– Take a (stochastic) gradient step for a reinforcement learning objective.
– Sample a prompt/response pair from the language model.
– Collect prompt/response/reward tuples from human labelers.
– Perform supervised fine-tuning of the language model.
– Query the regression model for its score of an input.
– Perform supervised training of the regression model.
– Pre-train the language model.
2 Latent Diffusion Model (LDM) (6 points)

2.1. (2 points) Short answer: Why does a latent diffusion model run diffusion in a latent space instead of pixel space?

2.2. Short answer: Standard cross-attention for a diffusion-based text-to-image model defines the queries $Q$ as a function of the pixels (or latent space) $Y \in \mathbb{R}^{m \times d_y}$, and the keys $K$ and values $V$ as a function of the text encoder output $X \in \mathbb{R}^{n \times d_x}$:
$$Q = Y W_q, \quad K = X W_k, \quad V = X W_v$$
(where $W_q \in \mathbb{R}^{d_y \times d}$ and $W_k, W_v \in \mathbb{R}^{d_x \times d}$), and then applies standard attention:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(Q K^\top / \sqrt{d}\right) V$$

Now, suppose you instead defined a new formulation where the values are a function of the pixels (or latent space): $V = Y W_v$ where $W_v \in \mathbb{R}^{d_y \times d}$.

2.2.a. (2 points) What goes wrong mathematically in the new formulation?

2.2.b. (2 points) Intuitively, why doesn't the new formulation make sense? Briefly begin with an explanation of what the original formulation of cross-attention is trying to accomplish for a single query vector, and why this new formulation fails to accomplish that.
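For concreteness, here is a minimal PyTorch sketch of the standard formulation above (not part of the assignment; the function name, shapes, and dimensions are illustrative only):

```python
import torch
import torch.nn.functional as F

def cross_attention(Y, X, W_q, W_k, W_v):
    """Standard text-to-image cross-attention.

    Y: (m, d_y) pixel/latent features    -> queries
    X: (n, d_x) text-encoder outputs     -> keys and values
    W_q: (d_y, d), W_k: (d_x, d), W_v: (d_x, d)
    Returns: (m, d), one output vector per pixel/latent position.
    """
    Q = Y @ W_q                                   # (m, d)
    K = X @ W_k                                   # (n, d)
    V = X @ W_v                                   # (n, d)
    d = Q.shape[-1]
    # Attention map: each pixel/latent query attends over the n text tokens.
    attn = F.softmax(Q @ K.T / d**0.5, dim=-1)    # (m, n)
    return attn @ V                               # (m, d)

# Tiny smoke test with made-up dimensions.
m, n, d_y, d_x, d = 4, 3, 8, 6, 5
Y, X = torch.randn(m, d_y), torch.randn(n, d_x)
W_q, W_k, W_v = torch.randn(d_y, d), torch.randn(d_x, d), torch.randn(d_x, d)
out = cross_attention(Y, X, W_q, W_k, W_v)
print(out.shape)  # torch.Size([4, 5])
```

Note that the (m, n) attention map here is exactly the object Prompt2Prompt (Section 3) manipulates: it records how strongly each pixel/latent position attends to each text token.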
3 Programming: Prompt2Prompt (31 points)

Introduction

In this section, we explore an innovative approach to image editing. Editing techniques aim to retain the majority of the original image's content while making certain changes. However, current text-to-image models often produce completely different images when only a minor change to the prompt is made. State-of-the-art methods typically require a spatial mask to indicate the modification area, which ignores the original image's structure and content in that region, resulting in significant information loss. In contrast, the Prompt2Prompt framework by Hertz et al. (2022) facilitates edits using only text, striving to preserve original image elements while allowing for changes in specific areas.

Cross-attention maps, which are high-dimensional tensors binding pixels with prompt text tokens, hold rich semantic relationships crucial to image generation. The key idea is to edit the image by injecting these maps into the diffusion process. This method controls which pixels relate to which particular prompt text tokens throughout the diffusion steps, allowing for targeted image modifications. You'll explore modifying token values to change scene elements (e.g., a "dog" riding a bicycle → a "cat" riding a bicycle) while maintaining the original cross-attention maps to keep the scene's layout intact.

HuggingFace Diffusers

In this assignment, we will be using HuggingFace's diffusers, a library created for easily using well-known state-of-the-art diffusion models, including creating the model classes, loading pre-trained weights, and calling specific parts of the models for inference. Specifically, we will be using the API for the class DiffusionPipeline and methods from its subclass StableDiffusionPipeline for loading the pre-trained LDM model. You are required to read the API for StableDiffusionPipeline:
https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/text2img

You will be implementing the model loading and calling individual components of StableDiffusionPipeline in this assignment.

Starter Code

The files are organized as follows:

hw4/
    run_in_colab.ipynb
    prompt2prompt.py
    ptp_utils.py
    seq_aligner.py
    requirements.txt

Here is what you will find in each file:

1. run_in_colab.ipynb: This is where you can run inference and see the visualization of your implemented methods.
2. prompt2prompt.py: Contains the text2image_ldm(...) method that generates images from text prompts by controlling the diffusion process with attention mechanisms in HuggingFace's latent diffusion model, and contains the AttentionReplace class. The class contains the forward process and methods to replace attention. You will implement all of these. (Note: Locations in the code where changes ought to be made are marked with a TODO.)

3. ptp_utils.py: Contains a set of helper functions that will be useful to you for filling in the text2image_ldm(...) method. Carefully read through the file to understand what these functions are.

4. seq_aligner.py: Contains a set of helper functions that are used to initialize AttentionReplace's class variables. You will need to implement get_replacement_mapper_(...). (Note: Locations in the code where changes ought to be made are marked with a TODO.)

5. requirements.txt: A list of packages that need to be installed for this homework.

Command Line

We recommend conducting your final experiments for this homework on Colab. Colab provides a free T4 GPU for code execution. (Run run_in_colab.ipynb for visualization.) You may find it easier to implement/debug locally. We have also included a very simple example of visualization that you can run on the command line:

python prompt2prompt.py

Prompt2Prompt

In this problem, you will implement Prompt2Prompt in the file prompt2prompt.py.

Figure 1: Visual and textual embeddings are fused using cross-attention layers that produce attention maps for each textual token. Figure source: Hertz et al. (2022)

Latent Diffusion Model: You will implement the text2image_ldm method. In that method, we provided some suggested structure by giving you the left-hand side of the initializations. Implementing this method requires you to have already read the HuggingFace Diffusers API. See above. You will be working with the DiffusionPipeline type, but the line
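Although the writeup is cut off above, a minimal sketch of loading a pre-trained pipeline and accessing its components with diffusers follows. The model ID and usage are illustrative assumptions, not the assignment's required solution; the point is only that text2image_ldm-style code calls the pipeline's parts individually rather than invoking pipe(prompt) end to end:

```python
import torch
from diffusers import DiffusionPipeline

# Illustrative checkpoint; the assignment may specify a different one.
model_id = "CompVis/stable-diffusion-v1-4"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a pre-trained Stable Diffusion pipeline; the returned object is a
# StableDiffusionPipeline, a subclass of DiffusionPipeline.
pipe = DiffusionPipeline.from_pretrained(model_id).to(device)

# Individual components that a hand-rolled generation loop would use:
tokenizer = pipe.tokenizer        # text tokenizer
text_encoder = pipe.text_encoder  # CLIP text encoder -> conditioning embeddings
unet = pipe.unet                  # denoising U-Net (hosts the cross-attention layers)
vae = pipe.vae                    # decodes final latents back to pixel space
scheduler = pipe.scheduler        # controls the denoising timesteps

# Example: encode a prompt into the embeddings the U-Net is conditioned on.
tokens = tokenizer("a dog riding a bicycle", padding="max_length",
                   max_length=tokenizer.model_max_length, return_tensors="pt")
with torch.no_grad():
    text_embeddings = text_encoder(tokens.input_ids.to(device))[0]
print(text_embeddings.shape)  # e.g. (1, 77, 768) for Stable Diffusion v1
```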