Computer Science 401
St. George Campus, University of Toronto
10 February 2024

Homework Assignment #2: Neural Machine Translation (MT) using Transformers
Due: Friday, 8 March 2024 at 23h59m (11:59 PM)

Email: csc401-2024-01-a2@cs.toronto.edu. Please prefix your subject with [CSC401 W24-A2]. All accessibility and extension requests should be directed to csc401-2024-01@cs.toronto.edu.
Lead TAs: Julia Watson and Arvid Frydenlund.
Instructors: Gerald Penn, Sean Robertson and Raeid Saqur.

Building the Transformer from Scratch

It is quite a cliche to say that the transformer architecture has revolutionized technology. It is the building block that fuelled recent headline innovations such as ChatGPT. Yet despite all the people who claim to work on large language models, only a select and brilliant few truly understand and are familiar with its internal workings. And you, who come from the prestigious and northern place called the University of Toronto, must carry on the century-old tradition of being able to create a transformer-based language model using only pen and paper.

Organization

You will build the transformer model from beginning to end and train it to do some basic but effective machine translation tasks using the Canadian Hansards data. In § 1, we guide you through implementing all the building blocks of a transformer model. In § 2, you will use these building blocks to put together the transformer architecture. § 3 discusses greedy and beam search for generating decoded target sentences. In § 4, you will train and evaluate the model. Finally, in § 5, you will use the trained model to do some real machine translation and write up your analysis report.

Goal

By the end of this assignment, you will have acquired a low-level understanding of the transformer architecture and of the implementation techniques for the entire data-processing, training and evaluation pipeline of a functioning AI application.

Starter Code

Unlike A1, the starter code for this assignment is distributed through MarkUs. The training data is located at /u/cs401/A2/data/Hansard on teach.cs. We use Python version 3.10.13 on teach.cs; that is, just run everything with the default python3 command. You may need to add srun commands to request computational resources; please follow the instructions in the following sections to proceed. You shouldn't need to set up a new virtual environment or install any packages on teach.cs. You can work on the assignment on your local machine, but you must make sure that your code works on teach.cs. Any test cases that fail due to incompatibility issues will not receive partial marks.

Marking Scheme

Please see the A2 rubric.pdf file for a detailed breakdown of the marking scheme.

Copyright © 2024 University of Toronto. All rights reserved.
1 Transformer Building Blocks and Components [12 Marks]

Let's start with the three types of building blocks of a transformer: the layer norm, multi-head attention and the feed-forward modules (a.k.a. the MLP weights).

LayerNorm

The normalization layer computes the following. Given an input representation h, the normalization layer computes its mean µ and standard deviation σ. It then outputs the normalized features:

    h \leftarrow \gamma \frac{h - \mu}{\sigma + \epsilon} + \beta    (1)

Using the instructions, please complete the LayerNorm.forward method.

FeedForwardLayer

The feed-forward layer is a two-layer fully connected feed-forward network. As shown in the following equations, the input representation h is fed through two fully connected layers. Dropout is applied after each layer, and ReLU is the activation function:

    h \leftarrow \mathrm{dropout}(\mathrm{ReLU}(W_1 h + b_1))
    h \leftarrow \mathrm{dropout}(W_2 h + b_2)    (2)

Using the instructions, please complete the FeedForwardLayer.forward method.

MultiHeadAttention

Finally, you need to implement the most complicated but most important component of the transformer architecture: the multi-head attention module. For the base case where there is only H = 1 head, the attention is calculated using the regular cross-attention algorithm:

    \mathrm{dropout}\left(\mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)\right) V    (3)

Using the instructions, please complete the MultiHeadAttention.attention method.

Then, you need to implement the part where the query, key and value are split into H heads and passed through the regular cross-attention algorithm you have just implemented. Next, you should combine the results. Don't forget to apply the linear combination and dropout when you output the final attended representations.

Using the instructions, please complete the MultiHeadAttention.forward method.
2 Putting the Architecture (Enc-Dec) Together [20 Marks]

OK, now that we have the building blocks, let's put everything together and create the full transformer model. We will start from a single transformer encoder layer and a single decoder layer. Next, we build the complete encoder and the complete decoder by stacking the layers together. Finally, we connect the encoder with the decoder to complete the final transformer encoder–decoder model.

TransformerEncoderLayer

You need to implement two types of encoder layers. Pre-layer normalization (Figure 1a), as its name suggests, applies layer normalization before the representation is fed into the next module. Post-layer normalization (Figure 1b), on the other hand, applies layer normalization after the representation has passed through the module. (An illustrative sketch of both orderings appears at the end of this section.) Using the instructions, please complete the pre_layer_norm_forward and post_layer_norm_forward methods of the TransformerEncoderLayer class.

[Figure 1: Two types of TransformerEncoderLayer. (a) Pre-layer normalization for encoders: input h -> LayerNorm -> Multi-Head Attention -> residual add -> LayerNorm -> Feed-Forward -> residual add -> output h. (b) Post-layer normalization for encoders: input h -> Multi-Head Attention -> residual add -> LayerNorm -> Feed-Forward -> residual add -> LayerNorm -> output h.]

TransformerEncoder

You don't need to implement the encoder class; the starter code contains the implementation. Nonetheless, it will be helpful to read it, as it will be a good reference for the following tasks.

TransformerDecoderLayer

Again, you need to implement both pre- and post-layer normalization. Recall from lecture that there are two multi-head attention blocks in a decoder layer: the first is a self-attention block and the second is a cross-attention block. Using the instructions, please complete the pre_layer_norm_forward and post_layer_norm_forward methods of the TransformerDecoderLayer class.

[Figure 2: Two types of TransformerDecoderLayer. (a) Pre-layer normalization for decoders: input h -> LayerNorm -> Self-Attention -> residual add -> LayerNorm -> Cross-Attention -> residual add -> LayerNorm -> Feed-Forward -> residual add -> output h. (b) Post-layer normalization for decoders: input h -> Self-Attention -> residual add -> LayerNorm -> Cross-Attention -> residual add -> LayerNorm -> Feed-Forward -> residual add -> LayerNorm -> output h.]

TransformerDecoder

Similar to TransformerEncoder, you should pass the input through all the decoder layers. Make sure to add the LayerNorm module in the correct place, depending on whether the model uses pre- or post-layer normalization. Finally, don't forget to use the logit projection module on the final output. Using the instructions, please complete the TransformerDecoder.forward method.
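To make the two orderings in Figure 1 concrete, here is a sketch of a pre-LN and a post-LN encoder-layer forward pass. The submodule names (norm1, norm2, self_attn, feed_forward, dropout) and the self_attn(query, key, value, mask) call signature are assumptions for illustration; the starter code's TransformerEncoderLayer defines its own attributes and where dropout is applied.

```python
import torch

def pre_layer_norm_forward_sketch(layer, h: torch.Tensor, mask=None) -> torch.Tensor:
    """Pre-LN (Figure 1a): normalize first, apply the sublayer, then add the residual."""
    a = layer.norm1(h)
    h = h + layer.dropout(layer.self_attn(a, a, a, mask))    # attention sublayer + residual
    a = layer.norm2(h)
    return h + layer.dropout(layer.feed_forward(a))           # feed-forward sublayer + residual

def post_layer_norm_forward_sketch(layer, h: torch.Tensor, mask=None) -> torch.Tensor:
    """Post-LN (Figure 1b): apply the sublayer, add the residual, then normalize."""
    h = layer.norm1(h + layer.dropout(layer.self_attn(h, h, h, mask)))
    return layer.norm2(h + layer.dropout(layer.feed_forward(h)))
```

The decoder-layer versions follow the same pattern, with an extra cross-attention sublayer (and its own LayerNorm) between the self-attention and feed-forward blocks, as in Figure 2.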
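For the decoder stack, the overall shape of TransformerDecoder.forward is a loop over the layers followed by the logit projection; a final LayerNorm is typically needed only in the pre-LN configuration. The attribute names used below (layers, pre_layer_norm, norm, logit_projection) and the per-layer call signature are assumptions, not the starter code's.

```python
import torch

def decoder_forward_sketch(decoder, tgt, encoder_out, tgt_mask=None, memory_mask=None):
    """Run the target representation through every decoder layer, then project to logits."""
    h = tgt
    for layer in decoder.layers:
        h = layer(h, encoder_out, tgt_mask, memory_mask)   # self-attn, cross-attn, feed-forward
    if decoder.pre_layer_norm:
        h = decoder.norm(h)                                # final normalization for pre-LN stacks
    return decoder.logit_projection(h)                     # (batch, tgt_len, vocab_size)
```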
TransformerEncoderDecoder

After making sure that the encoder and the decoder are both built properly, it's time to put them together. You need to implement the following methods. The method create_pad_mask is the helper function that builds the padding mask for a sequence padded to a specific length. The method create_causal_mask is the helper function that creates a causal (upper) triangular mask. (Generic sketches of such helpers appear at the end of § 3.) After finishing the two helper methods, you can implement the forward method that connects all the dots. In particular, you first create all the appropriate masks for the inputs, then feed them through the encoder, and finally obtain the result by feeding everything through the decoder.

3 MT with Transformers: Greedy and Beam-Search [20 Marks]

The a2_transformer_model.py file contains all the required functions to complete, along with detailed instructions (and hints). Here we list the high-level methods/functions that you need to complete.

3.1 Greedy Decode

Let's warm up by implementing the greedy algorithm. At each decoding step, compute the (log) probability over all the possible tokens. Then, choose the output with the highest probability and repeat the process until all the sequences in the current mini-batch terminate. (A generic sketch of such a loop appears at the end of this section.) Using the instructions, please complete the TransformerEncoderDecoder.greedy_decode method.

3.2 Beam Search

Now it's time for perhaps the hardest part of the assignment: beam search. But don't worry, we have broken everything down into smaller and much simpler chunks, and we will guide you step by step through the entire algorithm.

Beam search is initiated by a call to the TransformerEncoderDecoder.beam_search_decode method. Recall from lecture that its job is to generate partial translations (or hypotheses) from the source tokens during the decoding phase. So beam_search_decode gets called whenever you are trying to generate decoded translations (e.g. from TransformerRunner.translate, TransformerRunner.compute_average_bleu_over_dataset, etc.).

Complete the following functions in the TransformerEncoderDecoder class:

1. initialize_beams_for_beam_search: This function initializes the beam search by taking the first decoder step and using the top-k outputs to initialize the beams.

2. expand_encoder_for_beam_search: Beam search processes "batches" of size batch_size * k, so we need to expand the encoder outputs in order to process the beams in parallel. (Tip: you should call this from within the preceding function. A generic sketch of the idea appears after this list.)

3. repeat_and_reshape_for_beam_search: A relatively simple expand-and-reshape function. See how it is called from the beam_search_decode method and read the instructions in the function comments. (Tip: review torch.Tensor.expand.)

4. pad_and_score_sequence_for_beam_search: This function pads the sequences with eos and the seq_log_probs with corresponding zeros. It then gets the score of each sequence by summing
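To round out this section, here is a sketch of the two mask helpers described under TransformerEncoderDecoder. The shapes used here, a boolean (batch, 1, 1, seq_len) padding mask and a (seq_len, seq_len) causal mask with True marking allowed positions, are assumptions; the starter code's create_pad_mask and create_causal_mask document their own conventions (the "upper triangular" description corresponds to the same mask with True marking the disallowed future positions instead).

```python
import torch

def create_pad_mask_sketch(seq: torch.Tensor, pad_idx: int) -> torch.Tensor:
    """True for real tokens, False for padding; shaped to broadcast over attention scores."""
    return (seq != pad_idx).unsqueeze(1).unsqueeze(2)       # (batch, 1, 1, seq_len)

def create_causal_mask_sketch(seq_len: int, device=None) -> torch.Tensor:
    """Position i may attend only to positions <= i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=device))
```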
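Next, a sketch of the greedy decoding loop described in § 3.1, assuming (purely for illustration) that the model exposes encode and decode methods and that generation starts from a start-of-sequence token; the actual TransformerEncoderDecoder.greedy_decode interface is specified in the starter code.

```python
import torch

@torch.no_grad()
def greedy_decode_sketch(model, src, src_mask, max_len, sos_idx, eos_idx):
    """At every step, append the highest-probability token to each sequence in the batch."""
    memory = model.encode(src, src_mask)                      # encoder outputs, reused every step
    batch_size = src.size(0)
    ys = torch.full((batch_size, 1), sos_idx, dtype=torch.long, device=src.device)
    finished = torch.zeros(batch_size, dtype=torch.bool, device=src.device)
    for _ in range(max_len - 1):
        logits = model.decode(ys, memory, src_mask)           # (batch, t, vocab)
        log_probs = torch.log_softmax(logits[:, -1], dim=-1)  # next-token log-probabilities
        next_token = log_probs.argmax(dim=-1, keepdim=True)   # greedy choice
        next_token[finished] = eos_idx                        # finished sequences keep emitting eos
        ys = torch.cat([ys, next_token], dim=1)
        finished |= next_token.squeeze(1) == eos_idx
        if finished.all():                                    # stop once every sequence terminates
            break
    return ys
```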
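Finally, the encoder-expansion step behind items 2 and 3 of the list above amounts to repeating each source sentence k times along the batch axis so the k beams can be decoded in parallel. The helper below is a generic sketch of that repeat-and-reshape idea using torch.Tensor.expand; the starter code's expand_encoder_for_beam_search and repeat_and_reshape_for_beam_search specify the exact shapes they expect.

```python
import torch

def expand_for_beams_sketch(encoder_out: torch.Tensor, src_mask: torch.Tensor, k: int):
    """(batch, src_len, d_model) -> (batch * k, src_len, d_model), beams kept adjacent."""
    batch, src_len, d_model = encoder_out.shape
    encoder_out = (encoder_out.unsqueeze(1)                   # (batch, 1, src_len, d_model)
                   .expand(batch, k, src_len, d_model)        # repeated view along the beam axis
                   .reshape(batch * k, src_len, d_model))     # fold beams into the batch axis
    src_mask = (src_mask.unsqueeze(1)
                .expand(batch, k, *src_mask.shape[1:])
                .reshape(batch * k, *src_mask.shape[1:]))
    return encoder_out, src_mask
```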