Notes on Attention and the Transformer Model
Sham Kakade

1 Introduction

Two "core" tasks: machine translation and language modeling.

Many other tasks: part-of-speech tagging, named entity recognition, coreference resolution, semantic role labeling, question answering, textual entailment, sentiment analysis, semantic parsing, etc.

Goal today: build a language model. Why? "Representations" of the language may be helpful for many tasks.

Interesting questions: memory, question answering, reasoning/logic.

2 A Short Summary of Some Improvements

The field has moved from linguistics (grammars, parse trees) to statistical machine learning to "deep" models: Brown clustering, n-gram models, IBM translation models, and then lots of work on neural embeddings.

MT: rule-based machine translation, then statistical MT (the IBM translation models), and then a series of "deep learning" based approaches:

* One of the first end-to-end models, with an "encoder-decoder" architecture: "Recurrent Continuous Translation Models" (Kalchbrenner & Blunsom, 2013).

* Seq2Seq: using sequential neural models was a good first step. "Sequence to Sequence Learning with Neural Networks" (Sutskever et al., 2014).
* A series of papers started incorporating "attention", where one directly tries to utilize long-range dependencies in the representation. The idea is that these long-range dependencies help when translating a given word (the broader context is important). Now, all state-of-the-art methods use some form of "attention". The Transformer is one of the most popular: "Attention Is All You Need" (Vaswani et al., 2017).

Transfer learning: how can we make learning easier by transferring knowledge from one task to another? Recent exciting results show that representations extracted from a good language model can help with this.

* NAACL best paper: "Deep Contextualized Word Representations" (Peters et al., 2018).

* Another improvement with pretraining: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (Devlin et al., 2018).

3 Datasets, Tasks, and some (important) details

3.1 Datasets and Objectives for Language Modelling

Machine translation: translate one sentence to another sentence. The BLEU score is used.

Language modeling: the goal is to learn a model over documents/sequences, where given a document $d = w_{1:T}$ (or a sequence of words/characters) our model provides a probability $\widehat{\Pr}(d) = \widehat{\Pr}(w_{1:T})$. Note that we often specify this joint distribution by the conditional distributions $\widehat{\Pr}(w_{t+1} \mid w_{1:t})$, where the $w$'s are the words.

The performance measure: if $D$ is the true distribution, we measure the quality of our model by the cross entropy rate:

$$\mathrm{CrossEnt}(\widehat{\Pr} \,\|\, D) := \frac{1}{T}\, \mathbb{E}_{w_{1:T} \sim D}\Big[ -\log \widehat{\Pr}(w_{1:T}) \Big] = \frac{1}{T}\, \mathbb{E}_{w_{1:T} \sim D}\Big[ -\sum_t \log \widehat{\Pr}(w_{t+1} \mid w_{1:t}) \Big]$$

The perplexity is defined as $\exp\big(\mathrm{CrossEnt}(\widehat{\Pr} \,\|\, D)\big)$. Intuitively, think of this as the number of plausible candidate alternative words that our model is suggesting.
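To make the metric concrete, here is a minimal Python sketch (not from the notes) that computes the empirical cross-entropy rate and perplexity of a model on one sequence; `model_logprob` is a hypothetical callback assumed to return $\log \widehat{\Pr}(w_{t+1} \mid w_{1:t})$.

import math

def cross_entropy_rate(sequence, model_logprob):
    """Average negative log-probability (in nats) of `sequence` under the model.

    `model_logprob(prefix, next_word)` is a hypothetical callback assumed to
    return log Pr(next_word | prefix) under the model being evaluated.
    """
    total = 0.0
    for t in range(len(sequence)):
        total += -model_logprob(sequence[:t], sequence[t])
    return total / len(sequence)

def perplexity(sequence, model_logprob):
    # Perplexity is the exponentiated cross-entropy rate.
    return math.exp(cross_entropy_rate(sequence, model_logprob))

# Sanity check: a uniform model over a vocabulary of m words has perplexity m.
m = 10000
uniform_logprob = lambda prefix, word: -math.log(m)
print(perplexity(["the"] * 50, uniform_logprob))  # ~10000.0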
Examples:

* Using a uniform distribution over $m$ words gives a ppl of $m$. Using the (estimated) unigram distribution gives a ppl of about 1000.

* Shannon, in his paper "Prediction and Entropy of Printed English" ('51), estimated 0.6 to 1.3 bits/character (using human prediction of letters). This translates to 4.5 bits/word, using 1 bit/character and 4.5 characters/word, which gives a ppl of $2^{4.5} \approx 23$.

* On the PTB dataset, the best ppl is about 55-60 (on the validation set). The best character-level entropy rate is 1.2 bits/character. This translates to about 77 ppl in perplexity units per word (to see this, use $2^{1.175 \cdot 390000/74000}$, since there are 390000 characters and 74000 words in the validation set). A small conversion sketch appears at the end of this subsection.

* There are other 'codings' like BPE (byte pair encodings) and subwords. One can translate perplexities between different codings, provided they can faithfully represent the document/sequence.

* Concerns: memory and long-term dependencies may not be reflected in this metric? Other ideas: RL, logic, meaning?

Datasets used for language modeling:

* Penn Tree Bank (PTB): first collection. 1M words. 10K vocab size (based on standardization).

* WikiText-2 (2M words) and WikiText-103 (103M words): scraped from Wikipedia articles passing a certain quality/length threshold, on all topics. 300K vocab size, with each word appearing > 3 times.

* Google Billion Words: web crawl, assorted topics. 1B words. 800K vocab size.

* Books corpus: 11K public-domain novels. 1B words.

Training: GPUs/TPUs are needed. Books/Billion Words takes GPU-weeks to a month to train (for all standard models); a few days on a TPU.
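A small sketch (not from the notes) of the unit conversions used above: bits per character times characters per word gives bits per word, and $2^{\text{bits/word}}$ gives the per-word perplexity.

# Shannon's estimate: ~1 bit/character and ~4.5 characters/word.
bits_per_char = 1.0
chars_per_word = 4.5
print(2 ** (bits_per_char * chars_per_word))   # ~22.6, i.e. ppl ~ 23

# PTB character-level model: ~1.175 bits/character on the validation set,
# which has ~390000 characters and ~74000 words.
print(2 ** (1.175 * 390000 / 74000))           # ~73, roughly the quoted ~77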
3.2 The details matter: training and overfitting

The details do matter a lot for training language models. In contrast, in visual object recognition, once we move to "ResNet"-style architectures, training is relatively easy: overfitting, hyperparameter tuning, and regularization are not major concerns, and simply "early stopping" vanilla SGD training is often non-trivially competitive on any reasonable model.

* Overfitting is very real in language models, in contrast to vision. PTB with a trigram model (i.e. predict the next word from the previous two words) reaches 20 train ppl (in about 4 epochs) with 150 ppl on val. PTB with an LSTM+dropout reaches about 30 train ppl (in about 500 epochs) with 60 ppl on val.

* Dropout: this is needed; dropout is used everywhere in these networks. L2 regularization alone is not comparable (it is very brittle, and even highly tuned it is not as good).

* (Averaged) SGD or Adam? Sometimes one algorithm is much better than the other.

* Exploding gradients: these occur in practice.

* Vanishing gradients: lots of discussion on this; unclear what is going on.

* Dynamic evaluation: keep training at test time to handle topic drift. Squeezes out about 5-10 ppl on val (for PTB).

4 The Transformer

Let $x$ be the input sequence of size $T \times m$, where $T$ is the sequence length (often $512$) and $m$ is the vocabulary size (often in the range of $10^4$ to $10^5$). We will describe a one-hidden-layer transformer that predicts the next word.

Parameters:

$$E \in \mathbb{R}^{m \times d_{\text{embedding}}}, \quad W_V, W_Q, W_K \in \mathbb{R}^{d_{\text{embedding}} \times d_{\text{hidden}}}, \quad W_1 \in \mathbb{R}^{d_{\text{hidden}} \times d_1}, \quad W_2 \in \mathbb{R}^{d_1 \times d_{\text{embedding}}}$$
1. Embed the sequence: $x \leftarrow xE + P$, so now $x$ is of size $T \times d_{\text{embedding}}$. Here, $P$ is the positional encoding. One common choice is:

$$P_{t,2i} = \sin\!\big(t / 10000^{2i/d_{\text{embedding}}}\big), \qquad P_{t,2i+1} = \cos\!\big(t / 10000^{2i/d_{\text{embedding}}}\big)$$

where $i$ indexes the embedding dimension and $t$ the sequence time. Note that this is often a fixed choice (not a learned parameter).

2. Compute the 'values', 'query', and 'key': $V = xW_V$, $Q = xW_Q$, $K = xW_K$, which are of size $T \times d_{\text{hidden}}$.

3. Compute the attention "weighting" scheme:

$$h = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_{\text{hidden}}}}\right) V$$

so $h$ is of size $T \times d_{\text{hidden}}$. Importantly, note that $QK^\top$ is a $T \times T$ matrix. Here, abusing notation, the vector-valued $\mathrm{softmax}(\cdot)$ function is applied to every row of the matrix $QK^\top$. Recall that the vector-valued $\mathrm{softmax}(\cdot)$ function is defined so that the $i$-th component is:

$$[\mathrm{softmax}(v)]_i := \exp(v_i) \Big/ \sum_j \exp(v_j).$$

Note: each row of $\mathrm{softmax}(QK^\top)$ sums to $1$. The idea is that we want a convex combination of the rows of $V$.

4. The output after two transformations is then:

$$O = \mathrm{ReLU}\big(\mathrm{ReLU}(hW_1)\, W_2\big)$$

which is of size $T \times d_{\text{embedding}}$.

5. The prediction that the next word in the sequence, $X_{T+1}$, is the $j$-th word is then:

$$\widetilde{O} = OE^\top, \qquad \widehat{\Pr}(X_{T+1} = j) = [\mathrm{softmax}(\widetilde{O}_T)]_j,$$

i.e. only the last row $\widetilde{O}_T$ is used for prediction. Here, we have coupled the embedding weights and the prediction weights, where both use the matrix $E$.
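A minimal NumPy sketch (not from the notes) of this forward pass, following steps 1-5 above with toy dimensions and random parameters in place of trained ones; the $\sqrt{d_{\text{hidden}}}$ scaling is assumed.

import numpy as np

rng = np.random.default_rng(0)
T, m = 8, 50                        # sequence length and vocab size (toy values)
d_emb, d_hid, d_1 = 16, 16, 32

# Random stand-ins for trained parameters.
E = rng.normal(size=(m, d_emb))
W_V, W_Q, W_K = (rng.normal(size=(d_emb, d_hid)) for _ in range(3))
W_1 = rng.normal(size=(d_hid, d_1))
W_2 = rng.normal(size=(d_1, d_emb))

def softmax_rows(M):
    M = M - M.max(axis=-1, keepdims=True)    # subtract row max for stability
    e = np.exp(M)
    return e / e.sum(axis=-1, keepdims=True)

x = np.eye(m)[rng.integers(0, m, size=T)]    # one-hot input sequence, T x m

# Step 1: embed and add the sinusoidal positional encoding.
P = np.zeros((T, d_emb))
pos = np.arange(T)[:, None]
i = np.arange(0, d_emb, 2)[None, :]
P[:, 0::2] = np.sin(pos / 10000 ** (i / d_emb))
P[:, 1::2] = np.cos(pos / 10000 ** (i / d_emb))
x = x @ E + P                                # now T x d_emb

# Steps 2-3: values/query/key and the attention-weighted combination.
V, Q, K = x @ W_V, x @ W_Q, x @ W_K          # each T x d_hid
h = softmax_rows(Q @ K.T / np.sqrt(d_hid)) @ V

# Step 4: two ReLU transformations.
O = np.maximum(np.maximum(h @ W_1, 0) @ W_2, 0)   # T x d_emb

# Step 5: tie the output weights to E; only the last row predicts the next word.
O_tilde = O @ E.T                            # T x m
p_next = softmax_rows(O_tilde[-1:])[0]       # distribution over the m words
print(p_next.shape, p_next.sum())            # (50,) 1.0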
4.1 Invariances and other observations

Define $M = QK^\top$, which is of size $T \times T$.

Suppose $P = 0$ (no positional encoding). The matrix $M$ is shift invariant in the sense that if we translate the sequence by $\tau$, then the $(i, j)$ entry gets shifted to $(i + \tau, j + \tau)$ (provided these are in bounds). Similarly, $h_i \to h_{i+\tau}$.

Lemma. Let $Q, K, V \in \mathbb{R}^{T \times d_{\text{embed}}}$, and let $\Pi$ be a $T$-by-$T$ permutation matrix. Then

$$\sigma\big((\Pi Q)(\Pi K)^\top\big)\, \Pi V = \Pi\, \sigma(QK^\top)\, V,$$

where $\sigma(\cdot)$ is the row-wise softmax of a matrix. In particular, suppose $P = 0$; then if we permute the sequence, i.e. $x \mapsto \Pi x$, then $h \mapsto \Pi h$.

Proof. The LHS is $\sigma(\Pi QK^\top \Pi^\top)\, \Pi V$. It suffices to show that $\sigma(\Pi QK^\top \Pi^\top) = \Pi\, \sigma(QK^\top)\, \Pi^\top$, since then the LHS equals $\Pi\, \sigma(QK^\top)\, \Pi^\top \Pi V = \Pi\, \sigma(QK^\top)\, V$. Indeed, letting $\pi$ denote the permutation specified by $\Pi$, we have

$$\sigma(\Pi QK^\top \Pi^\top)_{i,j} = \frac{e^{(QK^\top)_{\pi(i), \pi(j)}}}{\sum_{j'=1}^{T} e^{(QK^\top)_{\pi(i), \pi(j')}}} = \frac{e^{(QK^\top)_{\pi(i), \pi(j)}}}{\sum_{j'=1}^{T} e^{(QK^\top)_{\pi(i), j'}}} = \big(\Pi\, \sigma(QK^\top)\, \Pi^\top\big)_{i,j}.$$

5 Further comments: Transformer

Observations:

* Hidden state interpretation and 'sequential' training/scoring: the above description is a model for $\widehat{\Pr}(w_{T+1} \mid w_{1:T})$. We may be interested in the model predicting $\widehat{\Pr}(w_{t+1} \mid w_{1:t})$ for $t < T$ (where often $T = 512$), i.e. we may want to make multiple predictions simultaneously (say for training). For this, there is a way to use 'masking' with an upper triangular matrix so that (for all $t \le T$):

$$\widehat{\Pr}(X_{t+1} = j \mid w_{1:t}) = [\mathrm{softmax}(\widetilde{O}_t)]_j$$

(a toy sketch of the masking is given after these observations).

* Computation: the transformer computations are very efficient due to the manner in which the matrix multiplications can be parallelized. In contrast, the LSTM fundamentally needs a for-loop over the history. (The LSTM is a circuit with greater depth.)
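As a concrete illustration of the masking trick mentioned above, here is a toy NumPy sketch (not from the notes, assuming the same $\sqrt{d_{\text{hidden}}}$ scaling): entries above the diagonal of $QK^\top$ are masked out before the row-wise softmax, so row $t$ of the output depends only on the prefix $w_{1:t}$, and all $T$ predictions can be trained in parallel from one forward pass.

import numpy as np

def causal_attention(Q, K, V):
    """Attention where position t only attends to positions 1..t.

    Entries strictly above the diagonal of QK^T are set to -inf before the
    row-wise softmax, so row t of the output depends only on the prefix.
    """
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                      # T x T
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # strictly upper triangular
    scores = np.where(mask, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                  # T x d

# Row t of the masked output matches unmasked attention run on the prefix alone.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
out = causal_attention(Q, K, V)
out_prefix = causal_attention(Q[:3], K[:3], V[:3])
print(np.allclose(out[:3], out_prefix))  # True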
Transformer vs. LSTMs: hidden states vs. no hidden states. Other models: TrellisRNN.

From the web:

* Generations: a qualitative discussion.

* ELMo and BERT improvements.

* Page on state-of-the-art MT: out of curiosity.

Training:

* "Big" models are best (unsurprisingly).

* Dropout is critical (perhaps surprisingly).

* BPE is now the norm. Does it handle the tails in a qualitatively (or quantitatively) different manner?

* Adam vs. SGD (unclear); learning rate decay (not done for SGD).

* training/dev (or val)/test: one oddity is that, often, the dev and test sets come from a different distribution than the training set.

Some other thoughts:

* Dynamic context evaluation: on the val/test set, update the model after each sample. Let $\theta_g$ be the global model and $\theta_l$ the local model:

$$\theta_l \leftarrow \theta_l - \nabla \ell(\theta_l, (x, y)) + \lambda(\theta_g - \theta_l)$$

(a sketch of this update follows the list).

* Lempel-Ziv vs. language modeling as a code?

* Model averaging is important. KN-smoothing.
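A minimal sketch (not from the notes) of the dynamic-evaluation update above: after scoring each validation sample, take a gradient step on the local parameters while pulling them back toward the global model. The step size `eta` and the gradient function are hypothetical stand-ins (the notes write the update without an explicit learning rate).

import numpy as np

def dynamic_eval_step(theta_l, theta_g, grad_fn, sample, lam=0.01, eta=0.1):
    """One local update: a gradient step on this sample plus a pull toward theta_g.

    `grad_fn(theta, sample)` is assumed to return the gradient of the loss
    l(theta, (x, y)) on this single sample; `eta` is an assumed step size.
    """
    g = grad_fn(theta_l, sample)
    return theta_l - eta * g + lam * (theta_g - theta_l)

# Usage sketch: score a held-out stream, updating the local model after each sample.
# theta_l = theta_g.copy()
# for sample in validation_stream:
#     loss = score(theta_l, sample)      # score first, then adapt
#     theta_l = dynamic_eval_step(theta_l, theta_g, grad_fn, sample)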
5.1 Logistic regression, model averaging: even the well-specified case is not that easy

For logistic regression, there is a big difference between estimation with model averaging and estimation under a prior. Suppose you are given $n$ samples $(x_i, y_i)$, with the $y_i$'s binary, and suppose we restrict to logistic classifiers:

$$\ell(w, (x, y)) = -\log \Pr(y \mid x, w), \quad \text{where } \Pr(y \mid x, w) = \exp(w^\top x)/Z_x,$$

and let:

$$\mathrm{LogLoss}(P(\cdot)) = \mathbb{E}_{(x, y) \sim D}\big[ -\log P(y \mid x) \big]$$

where $D$ is the truth and $P$ is the predictive distribution that we are scoring.

(Hazan, Koren, Levy '14). Suppose that we know $\|w^*\|_2 \le W$. Even if the true model is well specified, the best we can hope for, for any estimator $\widehat{w}_m$ based on $m$ random samples, has regret (with high probability):

$$\mathrm{LogLoss}(\widehat{w}_m) - \mathrm{LogLoss}(w^*) \gtrsim \frac{1}{\sqrt{m}}.$$

(Kakade, Ng '04; Foster et al. '18). In the agnostic model (without any data generating assumptions), model averaging is much more powerful. Suppose that our "posterior" is:

$$\Pr(w \mid \text{data}) = \Pr(w \mid (x_1, y_1), \ldots, (x_m, y_m)) \propto \prod_{i=1}^m \Pr(y_i \mid x_i, w)\, \Pr(w)$$

where $\Pr(w)$ is, say, a uniform prior (or Gaussian prior) over the ball of radius $\|w^*\|$. Now suppose our prediction on any $(x, y)$ is:

$$P_{\text{Bayes}}(y \mid x) = \int \Pr(y \mid x, w)\, \Pr(w \mid \text{data})\, dw;$$

then:

$$\mathrm{LogLoss}(P_{\text{Bayes}}) - \mathrm{LogLoss}(w^*) \le O\!\left(\frac{1}{m}\right).$$

6 Acknowledgements

These notes were based on discussions with Xinyi Chen, Karthik Narashiman, Cyril Zhang, and Yi Zhang.
