Notes on Attention and the Transformer Model

1 Introduction

Two "core" tasks: machine translation and language modeling. Many other tasks: part-of-speech tagging, named entity recognition, coreference resolution, semantic role labeling, question answering, textual entailment, sentiment analysis, semantic parsing, etc.

Goal today: build a language model. Why? "Representations" of the language may be helpful for many tasks.

Interesting questions: memory, question answering, reasoning/logic.

2 A Short Summary of Some Improvements

The field has moved through several phases: linguistics (grammars, parse trees), statistical machine learning, and "deep" models. Earlier statistical work includes Brown clustering, n-gram models, and the IBM translation models, followed by lots of work on neural embeddings.

MT: rule-based machine translation, then statistical MT (the IBM translation models), then a series of "deep learning" based approaches:

* One of the first end-to-end models, with an "encoder-decoder" architecture: "Recurrent Continuous Translation Models" (Kalchbrenner & Blunsom, 2013).

* Seq2Seq: using sequential neural models was a good first step. "Sequence to Sequence Learning with Neural Networks" (Sutskever et al., 2014).
* A series of papers started incorporating "attention", where one directly tries to utilize long-range dependencies in the representation. The idea is that these long-range dependencies help when translating a given word (the broader context is important). Now, all state-of-the-art methods use some form of "attention". The Transformer is one of the most popular: "Attention is All you Need" (Vaswani et al., 2017).

Transfer learning: how can we make learning easier by transferring knowledge from one task to another? Recent exciting results show that representations extracted from a good language model can help with this. NAACL best paper: "Deep Contextualized Word Representations" (Peters et al., 2018). Another improvement with pretraining: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (Devlin et al., 2018).

3 Datasets, Tasks, and some (important) details

3.1 Datasets and Objectives for Language Modelling

Machine translation: translate one sentence into another sentence. The BLEU score is used for evaluation.

Language modeling: the goal is to learn a model over documents/sequences, where given a document $d = w_{1:T}$ (or a sequence of words/characters) our model provides a probability $\widehat{\Pr}(d) = \widehat{\Pr}(w_{1:T})$. Note that we often specify this joint distribution by the conditional distributions $\widehat{\Pr}(w_{t+1} \mid w_{1:t})$, where the $w$'s are the words.

The performance measure: if $D$ is the true distribution, we measure the quality of our model by the cross-entropy rate:

$$\mathrm{CrossEnt}(\widehat{\Pr}\,\|\,D) := \frac{1}{T}\,\mathbb{E}_{w_{1:T}\sim D}\!\left[-\log \widehat{\Pr}(w_{1:T})\right] = \frac{1}{T}\,\mathbb{E}_{w_{1:T}\sim D}\!\left[-\sum_{t}\log \widehat{\Pr}(w_{t+1}\mid w_{1:t})\right]$$

The perplexity is defined as $\exp(\mathrm{CrossEnt}(\widehat{\Pr}\,\|\,D))$. Intuitively, think of this as the number of plausible candidate alternative words that our model is suggesting at each step.
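As a concrete illustration of these two definitions, here is a minimal sketch (not from the notes; the function names are made up, and it assumes we already have the model's conditional probability for each token of a held-out sequence as an empirical stand-in for the expectation over $D$):

```python
import math

def cross_entropy_rate(cond_probs):
    """Average negative log-probability per token (in nats).

    cond_probs: a list of model probabilities Pr(w_{t+1} | w_{1:t}),
    one per token of a held-out sequence.
    """
    T = len(cond_probs)
    return -sum(math.log(p) for p in cond_probs) / T

def perplexity(cond_probs):
    # ppl = exp(cross-entropy rate), matching the definition above.
    return math.exp(cross_entropy_rate(cond_probs))

# Sanity check: a uniform model over a vocabulary of m words assigns
# probability 1/m to every token, so its perplexity is exactly m.
m = 10_000
uniform_probs = [1.0 / m] * 50  # any sequence length gives the same rate
print(perplexity(uniform_probs))  # 10000.0, up to float rounding
```

The uniform-model check matches the first example below: a uniform distribution over $m$ words has perplexity exactly $m$.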
Examples:

* Using a uniform distribution over $m$ words gives a ppl of $m$.

* Using the (estimated) unigram distribution has a ppl of about 1000.

* Shannon, in his paper "Prediction and Entropy of Printed English" (1951), estimated 0.6 to 1.3 bits/character (using human prediction of letters). This translates to 4.5 bits/word, using 1 bit/character and 4.5 characters/word, which gives a ppl of $2^{4.5} \approx 23$.

* On the PTB dataset, the best ppl is about 55-60 (on the validation set). The best character-level entropy rate is about 1.2 bits/character. This translates to about 77 ppl in word-level perplexity units (to see this, use $2^{1.175 \cdot 390000/74000}$, since there are 390000 characters and 74000 words in the validation set).

* There are other codings, like BPE (byte pair encoding) and subwords. One can translate perplexities between different codings, provided they can faithfully represent the document/sequence (see the sketch after this section).

Concerns: memory and long-term dependencies may not be reflected in this metric? Other ideas: RL, logic, meaning?

Datasets used for language modeling:

* Penn Treebank (PTB): the first standard collection. 1M words; 10K vocabulary size (based on standardization).

* WikiText-2 (2M words) and WikiText-103 (103M words): scraped from Wikipedia articles passing a certain quality/length threshold, on all topics. 300K vocabulary size (each word appearing > 3 times).

* Google Billion Words: web crawl, assorted topics. 1B words; 800K vocabulary size.

* Books corpus: 11K public-domain novels. 1B words.

Training: GPUs/TPUs are needed. Books/Billion Words takes GPU-weeks to a month to train (for all standard models); a TPU takes a few days.
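To make the coding conversions above concrete, here is a small sketch of the arithmetic. It assumes only the counts quoted in the notes (390000 characters and 74000 words in the PTB validation set) and that both codings faithfully represent the same text, so the total code length in bits is preserved:

```python
def char_bits_to_word_ppl(bits_per_char, n_chars, n_words):
    """Convert a character-level entropy rate (bits/character) into a
    word-level perplexity.

    Total code length is preserved across codings, so
    bits/word = bits/char * (chars/word), and word ppl = 2 ** (bits/word).
    """
    bits_per_word = bits_per_char * (n_chars / n_words)
    return 2.0 ** bits_per_word

# PTB validation counts from the notes: ~390000 characters, ~74000 words.
print(char_bits_to_word_ppl(1.175, 390_000, 74_000))  # ~73 with these
# inputs, i.e. the same ballpark as the word-level PTB numbers above

# Shannon's estimate: 1 bit/char at 4.5 chars/word gives 2**4.5 ~ 23 ppl.
print(char_bits_to_word_ppl(1.0, 4.5, 1.0))  # ~22.6
```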