Lesson 14

School

University of Massachusetts, Amherst

Course

188

Subject

Computer Science

Date

Oct 30, 2023

Pages

27

ECE 568 Computer Architecture
Lesson 14: Tournament Predictors and Branch Target Buffer
© 2020 UMass Amherst UWW. All rights reserved.

Prior Learning
  • Patterson, Chapter 3 and Appendix C
  • Static predictions
  • Dynamic prediction (local and global)
  • 1-bit and 2-bit dynamic branch prediction schemes
  • Correlating branch prediction
  • Branch history table

Rationale
  • Using local and global predictors in parallel (tournament predictor)
  • Comparing the accuracy of branch predictors
  • Adding a branch target buffer or cache
  • Predicated execution and return from subroutine

Objectives
  • Analyze characteristics of tournament predictors
  • Evaluate accuracy of branch prediction
  • Explore branch target cache and buffer
  • Explore predicated execution
  • Analyze pitfalls of branch prediction

Tournament Predictors
  • Some branches need only local information, while others can benefit from global information.
  • Tournament predictors use two predictors, one based on global information and one based on local information, and combine them with a selector.
  • The selector aims to pick the right predictor for the right branch (or the right context of a branch).

Tournament Predictor in Alpha 21264
  • The system computes both a local prediction and a global one.
  • The selector then decides which predictor to use, based on which is currently doing better (gave the better result the last two times).
  • The selector uses one of 4096 finite state machines (FSMs), each with 4 states. The FSM transitions between states based on the result of the last branch decision; its current state indicates which predictor (global or local) to use.
  • The outcomes of the last 12 branches form a 12-bit pattern:
      ith bit is 0 => ith prior branch not taken
      ith bit is 1 => ith prior branch taken
  • This 12-bit pattern is used as an address into the 4K table of FSMs.

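The 12-bit pattern described above can be kept as a shift register. The following is a minimal sketch, not the Alpha 21264's actual circuit; the names `HISTORY_BITS`, `HISTORY_MASK`, and `update_history` are hypothetical, and placing the most recent branch in bit 0 is an assumption about bit ordering.

```python
HISTORY_BITS = 12
HISTORY_MASK = (1 << HISTORY_BITS) - 1  # 0xFFF, so indices span 0..4095


def update_history(history, taken):
    """Shift the newest outcome into bit 0 (1 = taken, 0 = not taken).

    Assumption: bit 0 holds the most recent branch, bit 11 the oldest.
    """
    return ((history << 1) | int(taken)) & HISTORY_MASK


history = 0
for outcome in [True, False, True, True]:
    history = update_history(history, outcome)

# The resulting value is the address into the 4K table of selector FSMs.
print(format(history, "012b"))  # 000000001011
```

Masking with `HISTORY_MASK` discards the 13th-oldest outcome, so the index always stays within the 4096-entry table.
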
The Selector’s Finite State Machines
  • States a and b suggest using predictor “1”; states c and d suggest using predictor “2”.
  • Each transition is labeled with 2 bits, i1 and i2: i1=1 if predictor “1” was correct, 0 if incorrect. The same holds for i2 and predictor “2”.
  • Start in state a, which suggests using “1”. If i1=i2, both predictors were right (or both wrong), so there is no reason to change the decision.
  • If i1,i2=10, the selected predictor (“1”) is correct and we clearly want to keep using it.
  • If i1,i2=01, “2” did better, so we start to doubt “1” and move to state b.
  • If i1,i2=01 again next time, we move to state c, which suggests using “2”. But if i1,i2=10 next time, we go back to a. If i1=i2, we stay in b.

[Figure: selector state diagram. Transitions as labeled in the diagram:]
  State a (use 1): stay on 00, 10, 11; 01 -> b
  State b (use 1): stay on 00, 11; 10 -> a; 01 -> c
  State c (use 2): stay on 00, 11; 10 -> b; 01 -> d
  State d (use 2): stay on 00, 01, 11; 10 -> c

The 4096-entry table is not addressed by 12 bits from the PC but by the 12 most recent branch outcomes. Experience has shown that such input works better than bits of the PC.

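The selector's state machine can be written as a small transition table. This is a sketch under assumptions: the text spells out the transitions for states a and b, while the transitions for c and d are read off the diagram's edge labels by symmetry. The names `TRANSITIONS`, `CHOICE`, `next_state`, and `run` are hypothetical.

```python
# States a and b select predictor 1; states c and d select predictor 2.
CHOICE = {"a": 1, "b": 1, "c": 2, "d": 2}

# Only state-changing inputs are listed; keys are (i1, i2), where
# i1 = 1 if predictor 1 was correct and i2 = 1 if predictor 2 was correct.
# The c and d rows are inferred from the diagram labels (assumption).
TRANSITIONS = {
    "a": {(0, 1): "b"},
    "b": {(1, 0): "a", (0, 1): "c"},
    "c": {(1, 0): "b", (0, 1): "d"},
    "d": {(1, 0): "c"},
}


def next_state(state, i1, i2):
    """Return the next selector state; unlisted inputs leave it unchanged."""
    return TRANSITIONS[state].get((i1, i2), state)


def run(state, results):
    """Apply a sequence of (i1, i2) results, starting from `state`."""
    for i1, i2 in results:
        state = next_state(state, i1, i2)
    return state


# Predictor 2 wins twice in a row: a -> b -> c, so the choice flips to 2.
final = run("a", [(0, 1), (0, 1)])
print(final, CHOICE[final])  # c 2
```

Two consecutive wins by the other predictor are needed before the selection changes, which gives the selector the same hysteresis as a 2-bit saturating counter.
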
Point to Ponder
Will all 4K entries of the Selector be used?
  a. Yes
  b. No

Global Predictor
  • Also has 4K entries (a different table from the selector’s 4K table).
  • The table is also indexed by the history of the last 12 branches.
  • Each entry in the global predictor is a standard 2-bit predictor.
  • As with the selector, the 12-bit pattern is used to index into the 4K table.

[Figure: global predictor, a 4K-entry table of 2-bit counters addressed by the 12-bit branch history]

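A global predictor of this shape can be sketched as a table of 2-bit saturating counters indexed by the branch history. The class name `GlobalPredictor`, the initial counter value, and the "counter >= 2 means taken" convention are assumptions for illustration, not details from the slides.

```python
class GlobalPredictor:
    """4K two-bit saturating counters indexed by a 12-bit global history."""

    def __init__(self, history_bits=12):
        self.mask = (1 << history_bits) - 1
        self.history = 0
        # Assumption: counters start at 1 (weakly not taken).
        self.counters = [1] * (1 << history_bits)

    def predict(self):
        """Predict taken when the indexed counter is 2 or 3."""
        return self.counters[self.history] >= 2

    def update(self, taken):
        """Train the indexed counter, then shift the outcome into history."""
        c = self.counters[self.history]
        self.counters[self.history] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & self.mask


gp = GlobalPredictor()
for _ in range(20):          # a branch stream that is always taken
    gp.update(True)
print(gp.predict())          # True
```

Because the index is the history alone, every branch that sees the same 12 preceding outcomes shares one counter.
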
Points to Ponder
1) Will all 4K entries of the Global Predictor be used?
  a. Yes
  b. No
2) Will there be aliasing in the Global Predictor?
  a. Yes
  b. No

Local Predictor
  • The input is the least significant 10 bits of the PC of the branch instruction. This is a way to give each branch its own table entry, although two branch instructions may share the same 10 address bits (aliasing is possible).
  • This 10-bit address indexes into a 1K table (left box). Each 10-bit entry holds the 10 most recent outcomes of that particular branch.
  • This 10-bit history pattern is then used to index into another 1K table (right box). Each entry in this table is a 3-bit saturating counter, which provides the local prediction.

[Figure: local predictor. PC (10 bits) -> 1K x 10-bit history table -> 1K x 3-bit saturating counter table]

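The two-level lookup described above can be sketched as follows. The class name `LocalPredictor`, the initial counter value, and the "counter >= 4 means taken" threshold are assumptions; the PC indexing takes the slide's "least significant 10 bits" literally.

```python
class LocalPredictor:
    """Two-level local predictor: 1K x 10-bit histories, 1K x 3-bit counters."""

    def __init__(self):
        self.histories = [0] * 1024  # per-branch 10-bit outcome history
        self.counters = [3] * 1024   # 3-bit counters (0..7); start value assumed

    @staticmethod
    def index(pc):
        """Low 10 bits of the PC, as stated on the slide."""
        return pc & 0x3FF

    def predict(self, pc):
        history = self.histories[self.index(pc)]
        return self.counters[history] >= 4  # upper half of range means taken

    def update(self, pc, taken):
        i = self.index(pc)
        history = self.histories[i]
        c = self.counters[history]
        self.counters[history] = min(c + 1, 7) if taken else max(c - 1, 0)
        self.histories[i] = ((history << 1) | int(taken)) & 0x3FF


lp = LocalPredictor()
pc = 0x1234
for n in range(40):                  # alternating taken / not taken
    lp.update(pc, n % 2 == 0)
print(lp.predict(pc))                # True: the next outcome in the pattern is taken
```

A simple 2-bit counter would mispredict an alternating branch about half the time; the per-branch history lets the second table learn the pattern.
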
The Complete Tournament Predictor in Alpha 21264
Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits! (~180K transistors)
1. Is aliasing possible?
  a. Yes
  b. No

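The size total can be checked term by term. In this sketch the table labels are my reading of the four terms (the first 4K*2 taken as the selector, whose four FSM states need 2 bits each); the slide itself only gives the numbers.

```python
K = 1024

# Bits per table: entries x bits per entry (labels are an interpretation).
tables = {
    "selector (4K x 2-bit FSM state)": 4 * K * 2,
    "global predictor (4K x 2 bits)": 4 * K * 2,
    "local history table (1K x 10 bits)": 1 * K * 10,
    "local counters (1K x 3 bits)": 1 * K * 3,
}

total_bits = sum(tables.values())
print(total_bits, total_bits // K)  # 29696 29
```
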
Percent of Predictions from Local Predictor in Tournament Prediction Scheme
  Benchmark    From local predictor
  nasa7        98%
  matrix300    100%
  tomcatv      94%
  doduc        90%
  spice        55%
  fpppp        76%
  gcc          72%
  espresso     63%
  eqntott      37%
  li           69%

Accuracy of Branch Prediction
  Benchmark    Profile-based    2-bit counter    Tournament
  gcc          88%              70%              94%
  espresso     86%              82%              96%
  li           88%              77%              98%
  fpppp        86%              82%              98%
  doduc        95%              84%              97%
  tomcatv      99%              99%              100%
Profile: branch profile from the last execution (static in that it is encoded in the instruction, but based on a profile).

Accuracy vs. Size (SPEC)
[Figure: conditional branch misprediction rate (0% to 10%) versus total predictor size (0 to 128 Kbits) for three schemes: local 2-bit counters, correlating (2,2) scheme, and tournament]

Destination Address
  • The next topic is determining the destination address before the instruction is fully decoded. We need the address at the same time as the prediction.
  • We can already predict in IF whether a branch is taken, but we cannot fetch the instruction at the target because the target address has not yet been calculated.
  • Solution: use a Branch Target Buffer (BTB) that holds the target address for branches that are predicted to be taken.

Need Address at Same Time as Prediction
  • Branch Target Buffer (BTB): the address of the branch is used as an index to get the prediction AND the branch target address (if taken).
  • Note: we must check for a branch match now, since we can’t use the wrong branch’s target address.

[Figure: BTB lookup. Each entry holds a Branch PC, a Predicted PC, and prediction state bits. The PC of the instruction in FETCH is compared (=?) against the Branch PC field.]
  • Yes: instruction is a branch; use the predicted PC as the next PC (if predicted taken).
  • No: branch not predicted; proceed normally (PC+4).

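The lookup in the figure can be sketched with a dictionary keyed by the full branch PC, which models the tag comparison (=?) of a fully associative buffer. The class name `BranchTargetBuffer` and its methods are hypothetical; a real BTB has a fixed number of entries and a replacement policy, both omitted here.

```python
class BranchTargetBuffer:
    """Maps a branch PC to (predicted target PC, predict-taken flag)."""

    def __init__(self):
        self.entries = {}

    def insert(self, branch_pc, target_pc, predict_taken=True):
        self.entries[branch_pc] = (target_pc, predict_taken)

    def next_pc(self, pc):
        """Return the next fetch address for the instruction at `pc`."""
        entry = self.entries.get(pc)      # the "=?" match on Branch PC
        if entry is not None:
            target_pc, predict_taken = entry
            if predict_taken:
                return target_pc          # hit and predicted taken
        return pc + 4                     # miss, or predicted not taken


btb = BranchTargetBuffer()
btb.insert(0x1000, 0x2000)
print(hex(btb.next_pc(0x1000)), hex(btb.next_pc(0x1004)))  # 0x2000 0x1008
```

Keying on the whole PC is what prevents one branch from picking up another branch's target.
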
Point to Ponder
Will the “Branch PC” field be 30 bits wide?

Branch Target “Cache”
  • Branch target cache: holds only predicted-taken branches.
  • “Cache”: Content Addressable Memory (CAM) or associative memory.

[Figure: standard RAM versus content addressable memory. In a standard RAM, an n-bit address passes through a decoder to select one of 2^n rows of the memory array. In a CAM, each entry holds (Key_i, Data_i), and the external KEY is compared (=?) against the stored keys.]
  • Yes: output data_i.
  • No: not found.

Branch Target “Cache”
  • Use a big Branch History Table and a small Branch Target Cache.

[Figure: branch target cache. Each entry holds a Branch PC, a Predicted PC, and (optional) prediction state bits. The 30-bit PC is compared (=?) against the Branch PC field.]
  • Yes: predicted taken branch found.
  • No: not found.

Dynamic Branch Prediction for 5-Stage MIPS
[Figure: flowchart across the IF, ID, and EX stages]
IF: Send PC to memory and branch-target buffer. Entry found in branch-target buffer?
  • No (left side), ID: Is instruction a taken branch?
      No: normal instruction execution.
      Yes: enter branch instruction address and next PC into branch-target buffer (EX).
  • Yes (right side), ID: Send out predicted PC. EX: Taken branch?
      No: mispredicted branch; kill fetched instruction, restart fetch at other target, delete entry from target buffer.
      Yes: branch correctly predicted; continue execution with no stalls.

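The four outcomes of the flowchart can be sketched as one function over a dictionary-based buffer. The function name `btb_step`, the outcome labels, and the 2-cycle penalty constant are assumptions for illustration (the penalty value is borrowed from the lesson's worked example); the flowchart itself only states that the correct-prediction path has no stalls.

```python
PENALTY_CYCLES = 2  # assumption, taken from the lesson's example slide


def btb_step(btb, pc, taken, target):
    """Process one instruction; return (outcome label, penalty cycles).

    btb: dict mapping branch PC -> predicted next PC.
    taken: True if the instruction turns out to be a taken branch.
    """
    if pc in btb:                      # IF: entry found, predicted PC sent out
        if taken:
            return ("correct", 0)      # EX: prediction confirmed, no stall
        del btb[pc]                    # EX: mispredicted, delete the entry
        return ("mispredict, entry deleted", PENALTY_CYCLES)
    if taken:                          # not in buffer but a taken branch
        btb[pc] = target               # enter branch address and next PC
        return ("entry added", PENALTY_CYCLES)
    return ("normal", 0)               # not in buffer, not a taken branch


buffer = {}
first = btb_step(buffer, 0x40, True, 0x80)    # first encounter: miss, taken
second = btb_step(buffer, 0x40, True, 0x80)   # second encounter: hit, taken
print(first, second)  # ('entry added', 2) ('correct', 0)
```
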
Points to Ponder regarding Branch Target Buffer for the 5-Stage MIPS
1. (Left side of the flowchart) If not in buffer, is it not a branch?
2. (Right side of the flowchart) Are we sure it is a branch?

Example
Assume:
  • BTB hit_rate = 11% (taken branches are 12% and most are already in the buffer)
  • P{incorrect prediction | entry found in buffer} = 10%
  • P{branch taken | not in BTB} = 6% (branch never seen, or was not taken and removed from the BTB)
  • Updating the BT buffer takes 1 cycle, so the total penalty is 2 cycles
Branch_CPI_Penalty = [BTB_hit_rate x P{incorrect prediction | in BTB}] x Penalty_Cycles
                   + [(1 - BTB_hit_rate) x P{branch taken | not in BTB}] x Penalty_Cycles
                   = 0.11 x 0.1 x 2 + 0.89 x 0.06 x 2
                   = 0.022 + 0.1068
                   ≈ 0.129
Contribution to CPI = 0.12 x 0.129 ≈ 0.015 (only of taken branches).

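The penalty expression can be evaluated directly as a check on the arithmetic. The function name `branch_cpi_penalty` and its parameter names are hypothetical; the inputs are the values assumed in the example. Evaluating the expression as written gives about 0.129.

```python
def branch_cpi_penalty(hit_rate, p_wrong_given_hit, p_taken_given_miss,
                       penalty_cycles):
    """Sum the two penalty terms of the example's formula."""
    hit_term = hit_rate * p_wrong_given_hit * penalty_cycles
    miss_term = (1 - hit_rate) * p_taken_given_miss * penalty_cycles
    return hit_term + miss_term


penalty = branch_cpi_penalty(0.11, 0.10, 0.06, 2)
contribution = 0.12 * penalty  # scaled by the 12% figure, as in the example

print(round(penalty, 4))       # 0.1288
print(round(contribution, 4))  # 0.0155
```
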
Predicated Execution
  • Avoid branch prediction by turning branches into conditionally executed instructions:
      if (x) then A = B op C else NOP
  • If the condition is false, the instruction neither stores a result nor causes interference.
  • The expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have a conditional move.
  • Drawbacks of conditional instructions:
      - Still takes a clock even if “annulled”
      - Stall if the condition is evaluated late: complex conditions reduce effectiveness since the condition becomes known late in the pipeline
Can these replace all branches?
  a. Yes
  b. No

Special Case: Return Addresses
  • Register indirect branch (JR r31): the target address is hard to predict.
  • An entry in the BTB would be useless.
  • In SPEC, 85% of such branches are procedure returns.
  • Since procedures follow a stack discipline, save the return address in a small buffer that acts like a stack for nested subroutines: 8 to 16 entries give a small miss rate.

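The small stack-like buffer can be sketched as below. The class name `ReturnAddressStack`, the default of 8 entries, and the overflow policy (discard the oldest entry) are assumptions for illustration; the slide only says the buffer has 8 to 16 entries and behaves like a stack.

```python
class ReturnAddressStack:
    """Fixed-size stack of predicted return addresses."""

    def __init__(self, size=8):
        self.size = size
        self.stack = []

    def on_call(self, return_pc):
        """Push the address of the instruction after the call."""
        if len(self.stack) == self.size:
            self.stack.pop(0)          # assumed policy: drop the oldest entry
        self.stack.append(return_pc)

    def predict_return(self):
        """Pop the predicted target for a return (JR r31); None if empty."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack()
ras.on_call(0x104)   # outer call returns to 0x104
ras.on_call(0x208)   # nested call returns to 0x208
inner = ras.predict_return()
outer = ras.predict_return()
print(hex(inner), hex(outer))  # 0x208 0x104
```

Returns are predicted in the reverse order of the calls, which is why a BTB entry keyed on the return instruction's PC does poorly when a procedure is called from several sites.
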
Pitfall: Sometimes dumber is better
  • The Alpha 21264 uses a tournament predictor (29 Kbits).
  • The earlier 21164 uses a simple 2-bit local predictor with 2K entries (a total of 4 Kbits).
  • On SPEC benchmarks, the 21264 outperforms the 21164:
      21264: avg. 11.5 mispredictions per 1000 instructions
      21164: avg. 16.5 mispredictions per 1000 instructions
  • Reversed for transaction processing (TP)!
      21264: avg. 17 mispredictions per 1000 instructions
      21164: avg. 15 mispredictions per 1000 instructions
  • TP code is much larger, and the 21164 holds 2X as many branch predictions based on local behavior (2K vs. the 1K local predictor in the 21264).

Lesson Summary
  • Tournament Predictor
  • Branch Target Address
  • Branch Target Buffer (BTB)/Cache
  • Dynamic Branch Prediction for the 5-Stage MIPS
  • Predicated Execution
  • Return Addresses issue in Branch Prediction
