Notes on Attention and the Transformer Model

1 Introduction

Two "core" tasks: machine translation and language modeling. Many other tasks: part-of-speech tagging, named entity recognition, coreference resolution, semantic role labeling, question answering, textual entailment, sentiment analysis, semantic parsing, etc.

Goal today: build a language model. Why? "Representations" of the language may be helpful for many tasks.

Interesting questions: memory, question answering, reasoning/logic.

2 A Short Summary of Some Improvements

The field has moved through several phases: linguistics (grammars, parse trees), statistical machine learning, and "deep" models. Earlier statistical work includes Brown clustering, n-gram models, and the IBM translation models, followed by lots of work on neural embeddings.

MT: rule-based machine translation, then statistical MT (the IBM translation models), then a series of "deep learning" based approaches:

* One of the first end-to-end models, with an "encoder-decoder" architecture: "Recurrent Continuous Translation Models" (Kalchbrenner & Blunsom, 2013).

* Seq2Seq: using sequential neural models was a good first step. "Sequence to Sequence Learning with Neural Networks" (Sutskever et al., 2014).
* A series of papers started incorporating "attention", where one directly tries to utilize long-range dependencies in the representation. The idea is that these long-range dependencies help when translating a given word (the broader context is important). Now, all state-of-the-art methods use some form of "attention". The Transformer is one of the most popular: "Attention is All you Need" (Vaswani et al., 2017).

Transfer learning: how can we make learning easier by transferring knowledge from one task to another? Recent exciting results show that representations extracted from a good language model can help with this. NAACL best paper: "Deep Contextualized Word Representations" (Peters et al., 2018). Another improvement with pretraining: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (Devlin et al., 2018).

3 Datasets, Tasks, and some (important) details

3.1 Datasets and Objectives for Language Modelling

Machine translation: translate one sentence into another sentence. The BLEU score is used for evaluation.

Language modeling: the goal is to learn a model over documents/sequences, where given a document $d = w_{1:T}$ (or a sequence of words/characters) our model provides a probability $\widehat{\Pr}(d) = \widehat{\Pr}(w_{1:T})$. Note that we often specify this joint distribution by the conditional distributions $\widehat{\Pr}(w_{t+1} \mid w_{1:t})$, where the $w$'s are the words.

The performance measure: if $D$ is the true distribution, we measure the quality of our model by the cross-entropy rate:

$$\mathrm{CrossEnt}(\widehat{\Pr}\,\|\,D) := \frac{1}{T}\,\mathbb{E}_{w_{1:T}\sim D}\!\left[-\log \widehat{\Pr}(w_{1:T})\right] = \frac{1}{T}\,\mathbb{E}_{w_{1:T}\sim D}\!\left[-\sum_{t}\log \widehat{\Pr}(w_{t+1}\mid w_{1:t})\right]$$

The perplexity is defined as $\exp(\mathrm{CrossEnt}(\widehat{\Pr}\,\|\,D))$. Intuitively, think of this as the number of plausible candidate alternative words that our model is suggesting at each step.
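As a concrete illustration of these two definitions, here is a minimal sketch (not from the notes; the function names are made up, and it assumes we already have the model's conditional probability for each token of a held-out sequence as an empirical stand-in for the expectation over $D$):

```python
import math

def cross_entropy_rate(cond_probs):
    """Average negative log-probability per token (in nats).

    cond_probs: a list of model probabilities Pr(w_{t+1} | w_{1:t}),
    one per token of a held-out sequence.
    """
    T = len(cond_probs)
    return -sum(math.log(p) for p in cond_probs) / T

def perplexity(cond_probs):
    # ppl = exp(cross-entropy rate), matching the definition above.
    return math.exp(cross_entropy_rate(cond_probs))

# Sanity check: a uniform model over a vocabulary of m words assigns
# probability 1/m to every token, so its perplexity is exactly m.
m = 10_000
uniform_probs = [1.0 / m] * 50  # any sequence length gives the same rate
print(perplexity(uniform_probs))  # 10000.0, up to float rounding
```

The uniform-model check matches the first example below: a uniform distribution over $m$ words has perplexity exactly $m$.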
Examples:

* Using a uniform distribution over $m$ words gives a ppl of $m$.

* Using the (estimated) unigram distribution has a ppl of about 1000.

* Shannon, in his paper "Prediction and Entropy of Printed English" (1951), estimated 0.6 to 1.3 bits/character (using human prediction of letters). This translates to 4.5 bits/word, using 1 bit/character and 4.5 characters/word, which gives a ppl of $2^{4.5} \approx 23$.

* On the PTB dataset, the best ppl is about 55-60 (on the validation set). The best character-level entropy rate is about 1.2 bits/character. This translates to about 77 ppl in word-level perplexity units (to see this, use $2^{1.175 \cdot 390000/74000}$, since there are 390000 characters and 74000 words in the validation set).

* There are other codings, like BPE (byte pair encoding) and subwords. One can translate perplexities between different codings, provided they can faithfully represent the document/sequence (see the sketch after this section).

Concerns: memory and long-term dependencies may not be reflected in this metric? Other ideas: RL, logic, meaning?

Datasets used for language modeling:

* Penn Treebank (PTB): the first standard collection. 1M words; 10K vocabulary size (based on standardization).

* WikiText-2 (2M words) and WikiText-103 (103M words): scraped from Wikipedia articles passing a certain quality/length threshold, on all topics. 300K vocabulary size (each word appearing > 3 times).

* Google Billion Words: web crawl, assorted topics. 1B words; 800K vocabulary size.

* Books corpus: 11K public-domain novels. 1B words.

Training: GPUs/TPUs are needed. Books/Billion Words takes GPU-weeks to a month to train (for all standard models); a TPU takes a few days.
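To make the coding conversions above concrete, here is a small sketch of the arithmetic. It assumes only the counts quoted in the notes (390000 characters and 74000 words in the PTB validation set) and that both codings faithfully represent the same text, so the total code length in bits is preserved:

```python
def char_bits_to_word_ppl(bits_per_char, n_chars, n_words):
    """Convert a character-level entropy rate (bits/character) into a
    word-level perplexity.

    Total code length is preserved across codings, so
    bits/word = bits/char * (chars/word), and word ppl = 2 ** (bits/word).
    """
    bits_per_word = bits_per_char * (n_chars / n_words)
    return 2.0 ** bits_per_word

# PTB validation counts from the notes: ~390000 characters, ~74000 words.
print(char_bits_to_word_ppl(1.175, 390_000, 74_000))  # ~73 with these
# inputs, i.e. the same ballpark as the word-level PTB numbers above

# Shannon's estimate: 1 bit/char at 4.5 chars/word gives 2**4.5 ~ 23 ppl.
print(char_bits_to_word_ppl(1.0, 4.5, 1.0))  # ~22.6
```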