-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Computer Architecture 2004 Notes

HEX LOOKUP TABLE        POWERS OF TWO
 0  0  0000              0      1
 1  1  0001              1      2
 2  2  0010              2      4
 3  3  0011              3      8
 4  4  0100              4     16
 5  5  0101              5     32
 6  6  0110              6     64
 7  7  0111              7    128
 8  8  1000              8    256
 9  9  1001              9    512
10  a  1010             10   1024
11  b  1011             11   2048
12  c  1100             12   4096
13  d  1101             13   8192
14  e  1110             14  16384
15  f  1111             15  32768

VIRTUAL ADDRESS BITS
    virtual address = virtual page number + virtual page offset
    virtual page number = TLB tag + TLB set index
    physical address = physical page number + physical page offset
    physical page offset = virtual page offset

CACHE BITS
    physical address = cache tag + cache set index + cache block offset

TIME - ARITHMETIC    NORMALIZED TIME - GEOMETRIC    RATES - HARMONIC

       Seconds   Instructions   Clock cycles    Seconds
Time = ------- = ------------ * ------------ * -----------
       Program     Program      Instruction   Clock cycle

Formulas...
    CPU-time = (CPU-execution-clock-cycles + Memory-stall-clock-cycles) * Clock-cycle-time
    Memory-stall-clock-cycles = Read-stall-cycles + Write-stall-cycles
    Read-stall-cycles = Reads/Program * Read-miss-rate * Read-miss-penalty
    Write-stall-cycles = (Writes/Program * Write-miss-rate * Write-miss-penalty) + Write-buffer-stalls
    AMAT = Time-for-a-hit + (Miss-rate * Miss-penalty)
    In a write-through cache, Read-miss-rate ~= Write-miss-rate, so
        Memory-stall-clock-cycles = Memory-accesses/Program * Miss-rate * Miss-penalty
        Memory-stall-clock-cycles = Instructions/Program * Misses/Instruction * Miss-penalty
    total = seek-time + rotational-delay + transfer-time + controller-time

------------------------------------------------------------------------
Notes on Patterson and Hennessy
------------------------------------------------------------------------
CHAPTER 1 - Computer Abstractions and Technology

1.1 Introduction

1.2 Below your program
    instructions (machine code)
    assembly language and assemblers
    high-level programming languages and compilers
        1 allow programmer to think in a more natural language (possibly
          domain specific)
        2 improved developer productivity
        3 machine independence
    subroutine libraries for I/O
    evolution in operating systems
    systems software
        operating systems
        assemblers
        compilers
    applications software

1.3 Under the covers
    5 classic components
        1 input devices
            keyboard
            mouse
        2 output devices
            screen
            printer
        1/2 I/O devices
            disk
            network
        3 memory
            main memory
            caches
                data cache
                instruction cache
        4 datapath
        5 control
        4+5 = processor (aka CPU)
    opening the box (other terms)
        motherboard
        integrated circuits or chips (ICs)
        memory
            DRAM - dynamic random access memory
            cache memory
    instruction set architecture (ISA) (aka architecture)
        architecture vs implementation
    a safe place for data
        primary memory
            main memory
            volatile
            typically DRAMs
        secondary memory
            nonvolatile
            typically magnetic disks
        magnetic disks
            arm
            read/write head
        secondary vs primary
            1 nonvolatile
            2 slower
            3 cheaper
    communicating to other computers
        1 communication - shared information
        2 resource sharing - shared resources such as I/O devices
        3 nonlocal access - remote access

1.4 Integrated Circuits: Fueling Innovation
    technology evolution
        generations
            1 vacuum tubes
            2 transistor
            3 integrated circuit
            4 large scale integrated circuit (LSI)
            5 very large scale integrated circuit (VLSI)
    manufacturing process
        semiconductor - silicon ingots sliced into wafers
        wafers naturally have defects
        wafer is diced into dies (aka chips)
        yield is the ratio of good dies to total dies
        good dies are bonded to a package and retested

1.5 Real Stuff: Manufacturing Pentium Chips

1.6 Fallacies and Pitfalls
    Fallacy: Computers have been built in the same, old-fashioned way for
        far too long and this antiquated model of computation is running
        out of steam.
    Pitfall: Ignoring the inexorable progress of hardware when planning a
        new machine.
1.7 Concluding Remarks

1.8 Historical Perspective and Further Reading

1.9 Key Terms

1.10 Exercises

------------------------------------------------------------------------
CHAPTER 2 - The Role of Performance

2.1 Introduction
    some measures of performance
        response time (execution time)
        throughput

                          1
    Performance[x] = ------------
                     Execution[x]

    Performance[x] > Performance[y]
         1                1
    ------------  >  ------------
    Execution[x]     Execution[y]
    Execution[x] < Execution[y]

    Performance[x]       Execution[y]
    -------------- = n = ------------
    Performance[y]       Execution[x]

    improve performance => increase performance => decrease execution time

2.2 Measuring Performance
    execution time
        wall-clock time, response time, elapsed time
            total time to complete a task
        CPU execution time (or CPU time)
            time spent by CPU excluding I/O, waiting for other programs, etc
            user CPU time vs system CPU time
        system performance - elapsed time on unloaded system
        CPU performance - user CPU time
    clock
        clock cycle
        clock period = 1/(clock rate)
    clock and CPU performance
        CPU-time = CPU-clock-cycles * Clock-cycle-time
        CPU-clock-cycles = Instruction-Count * Average-CPI
        CPU-time = Instruction-Count * Average-CPI * Clock-cycle-time
    CPI and IPC
        CPI = Clock cycles Per Instruction
        IPC = Instructions Per Clock cycle
        CPI = 1/IPC
    Big Picture
               Seconds   Instructions   Clock cycles    Seconds
        Time = ------- = ------------ * ------------ * -----------
               Program     Program      Instruction   Clock cycle
        the three components
            Instruction count
                - limited by architecture
                - measure with hardware counters or simulation
            CPI
                - the big variable in implementation
                - varies with memory system, processor structure, mix of
                  instructions
                - varies by application
            Clock period
                - fixed in a specific implementation
                - historically just a fixed documented value but...
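The CPU performance equation above can be exercised with made-up numbers (the instruction count, CPI, and clock rate below are illustrative, not from the book):

```python
instruction_count = 10_000_000     # dynamic instructions executed by the program
average_cpi = 1.5                  # average clock cycles per instruction
clock_rate = 2_000_000_000         # 2 GHz, so clock cycle time = 1/clock_rate

# Time/Program = (Instructions/Program) * (Cycles/Instruction) * (Seconds/Cycle)
cpu_time = instruction_count * average_cpi / clock_rate
assert cpu_time == 0.0075          # seconds
```

Note how any of the three factors can change the result: halving CPI or doubling the clock rate halves CPU time, exactly as the equation predicts.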
2.4 Choosing Programs to Evaluate Performance
    workload
    benchmarks
    reproducibility

2.5 Comparing and Summarizing Performance
    total execution time
    arithmetic mean of total time for average time
    WEIGHTED ARITHMETIC MEAN (vs GEOMETRIC MEAN)

2.6 Real Stuff: The SPEC95 Benchmarks and Performance on Recent Processors
    SPEC ratio (normalize by dividing by Sun SPARCstation 10/40 time)
    SPECfp95 and SPECint95 are the GEOMETRIC MEAN of SPEC ratios
    For a given ISA, three typical sources of improvement:
        1) increase clock rate
        2) improve processor organization to lower the CPI
        3) compiler enhancements that
            - lower instruction count
            - use lower-CPI instructions

2.7 Fallacies and Pitfalls
    Pitfall: Expecting the improvement of one aspect of a machine to
        increase performance by an amount proportional to the size of the
        improvement.
        Amdahl's law - the law of diminishing returns
            The performance enhancement possible with a given improvement
            is limited by the amount that the improved feature is used.
        Corollary: Make the common case fast
    Fallacy: Hardware-independent metrics predict performance.
        Example: using code size as a measure of speed
    Pitfall: Using MIPS as a performance metric.
                          Instruction count
        "native MIPS" = ---------------------
                        Execution time x 10^6
        Problems with MIPS:
            1) specifies instruction execution rate but not the
               capabilities of the instructions. different ISAs matter.
            2) MIPS varies between programs on the same computer
            3) MIPS can vary inversely with performance
    Fallacy: Synthetic benchmarks predict performance.
        Whetstone and Dhrystone have unrealistic patterns that are either
        easily optimized or exploitable by benchmark-specific optimizations.
    Pitfall: Using the arithmetic mean of normalized execution times to
        predict performance.
        Result will depend on choice of reference machine.
        Use geometric mean, not arithmetic mean, for normalized times.
    Fallacy: The geometric mean of execution time ratios is proportional
        to total execution time.
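A tiny numerical example demonstrates this fallacy (the execution times are invented purely for illustration):

```python
# Execution times (seconds) of two programs on machines A and B.
times_a = [2.0, 10.0]
times_b = [4.0, 5.0]

# Normalize B to A, then take the geometric mean of the ratios.
ratios = [b / a for a, b in zip(times_a, times_b)]   # [2.0, 0.5]
geometric_mean = (ratios[0] * ratios[1]) ** 0.5      # 1.0 -> suggests a tie

# ...yet B's total execution time is 25% lower than A's.
assert geometric_mean == 1.0
assert sum(times_b) < sum(times_a)                   # 9.0 < 12.0
```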
        Violates fundamental principle of performance measurement -
        they do not predict execution time.

2.8 Concluding Remarks
    high-performance design
    low-cost design
    cost/performance ratio
    2003 cheat sheet
        TIME - ARITHMETIC   NORMALIZED TIME - GEOMETRIC   RATES - HARMONIC

2.9 Historical Perspective and Further Reading
    "peak MIPS", MOPS, and other FLOPS
    "relative MIPS"
    Quest for the average program
        kernel benchmarks (instead of Whetstone...)
    Quest for a simple program
        using quicksort etc for benchmarking ... bad!
    SPEC
        SPECfp SPECint - CPU oriented
        SPEC SDM = Systems Development Multitasking
        SPEC SFS = System-level File Server
        SPEChpc = high end scientific workloads

2.10 Key Terms

2.11 Exercises

------------------------------------------------------------------------
CHAPTER 3 - Instructions: Language of the Machine

3.1 Introduction
    instructions
    instruction set

3.2 Operations of the Computer Hardware
    DESIGN PRINCIPLE 1: Simplicity favors regularity.
        rationale for a fixed number of arguments for all arithmetic ops

3.3 Operands of the Computer Hardware
    registers
        word = 32 bits, size of a MIPS register
        unlike variables at a higher level, # of registers is limited
        32 registers for MIPS
    DESIGN PRINCIPLE 2: Smaller is faster.
        rationale for a limited number of registers
    load/store
        indexed load "lw $t0, 8($s3)"
        address in bytes
        alignment restriction - can't load from any address, just aligned
        words
    spilling registers
        "store"ing values from registers back to memory to make room
    registers are important because
        1) less time to access
        2) higher throughput than memory (can operate on more than one at
           once)
        registers are faster to access and simpler to use

3.4 Representing Instructions in the Computer
    decimal vs binary
    registers referenced by number 0..31
        NAME       #       Usage            Preserved
        $zero      0       always zero      n/a
        $at        1       assembler only
        $v0..$v1   2..3    value return     no
        $a0..$a3   4..7    arg passing      no
        $t0..$t7   8..15   temporaries      no
        $s0..$s7   16..23  saved            yes
        $t8..$t9   24..25  more temps       no
        $k0..$k1   26..27  OS only
        $gp        28      global pointer   yes
        $sp        29      stack pointer    yes
        $fp        30      frame pointer    yes
        $ra        31      return address   yes
    machine language - numeric version of an instruction
    machine code - numeric version of assembly language
    instruction format - 32 bits for MIPS
        R-type (or R-format) R=Register
            +--+--+--+--+-----+-----+
            |op|rs|rt|rd|shamt|funct|
            +--+--+--+--+-----+-----+
            | 6| 5| 5| 5|  5  |  6  |
            +--+--+--+--+-----+-----+
            op    = opcode
            rs    = register source 1
            rt    = register source 2
            rd    = register destination
            shamt = shift amount (chapter 4)
            funct = function code, selects opcode variant
        DESIGN PRINCIPLE 3: Good design demands good compromises.
            rationale for multiple instruction formats within one 32-bit
            word
        I-type (or I-format) I=Immediate
            +--+--+--+-------+
            |op|rs|rt|address|
            +--+--+--+-------+
            | 6| 5| 5|  16   |
            +--+--+--+-------+
            op      = opcode
            rs      = register source
            rt      = register target
            address = 16-bit offset
    Big Picture
        stored-program concept
            1) instructions are represented by numbers
            2) programs can be stored in memory to be read or written just
               like numbers

3.5 Instructions for Making Decisions
    if-then branches
        beq/bne - conditional branches
        labels - symbolic assembler notation for instruction address
    if-then-else
        j - unconditional branch
    basic block
        sequence of instructions with two properties
            1 no branches except possibly at end
            2 no branch labels except possibly at beginning
        early phase of compilation is breaking a program into basic blocks
    while/for loop
    slt - set if less than
        for expressions like "i < j"

3.6 Supporting Procedures in Computer Hardware
    procedure frame on the stack
        $fp -> saved argument registers (if any)
               saved return address
               saved saved registers (if any)
               local arrays and structures (if any)
        $sp ->
    global pointer
        $gp register used for accessing globals
    PRESERVED          NOT PRESERVED
    $s0-$s7            $t0-$t9
    $sp                $a0-$a3
    $ra                $v0-$v1
    above $sp          below $sp
    $fp, $gp

3.7 Beyond Numbers
    characters
        ASCII encoding - one byte per character
    lb/sb - load/store byte (only uses lower 8 bits of register)

3.8 Other Styles of MIPS Addressing
    two more ways of addressing operands
        1) faster access to small constants
        2) make branches more efficient
    Constant or immediate operands
        addi $sp, $sp, 4    (add immediate)
        slti $t0, $s2, 10   (set less than immediate)
        without immediates, need to load constants with a separate
        instruction
        DESIGN PRINCIPLE 4: Make the common case fast.
            rationale for the immediate format
            52% of gcc arithmetic instructions, 69% of spice
    lui - load upper immediate
        immediate instructions take only a 16-bit immediate value
        to load the 32-bit constant 0xdeadbeef into $t0
            lui $t0, 0xdead
            ori $t0, $t0, 0xbeef    (not addi, it would sign extend)
        assembler uses the reserved $at register along with lui when using
        32-bit constants for load/store addresses etc
    Addressing in branches and jumps
        jump instruction can use a 26-bit word address (28-bit byte
        address):
            J-type (or J-format) J=Jump
            +--+---------+
            |op| address |
            +--+---------+
            | 6|   26    |
            +--+---------+
            op      = opcode
            address = 26-bit offset
        jump only changes the lower 28 bits of the PC, not the upper 4...
        see loader/linker in 3.9
        beq/bne use I-type, so only a 16-bit word address
            this address is actually a word offset from $pc+4
            known as "PC-relative addressing"
        note for both of these that the low-order two zero bits aren't
        stored, which gives the 26+2=28 and 16+2=18
        dealing with larger conditional branch offsets
                beq $s0, $s1, L1
            replace with
                bne $s0, $s1, L2
                j   L1
            L2:
            done automatically by assembler
    MIPS addressing mode summary
        1 register addressing                R-type  add
        2 base or displacement addressing    I-type  ld/st
        3 immediate addressing               I-type  addi
        4 PC-relative addressing             I-type  beq/bne
        5 pseudo-direct addressing           J-type  j/jal

3.9 Starting a Program
    Translation hierarchy
        C program
            COMPILER
        assembly language program
            ASSEMBLER
        object: machine language module + object: library routines
            LINKER
        executable: machine language program
            LOADER
        memory
    Compiler
    Assembler
        handle pseudo-instructions
            "move $t0, $t1" => "add $t0, $t1, $zero"
            "blt" => "slt/bne"
            handle long conditional branches as noted above
            handle loads of 32-bit constants
        handle numeric bases: binary, octal, decimal, hexadecimal
        generate object file (primary purpose of course)
        typical object file format (from unix)
            object file header
                size and position of other parts of file
            text segment
                machine language code
            data segment
                both static data and dynamic data
            relocation information
                list of absolute addresses
            symbol table
                list of undefined labels
            debugging information
                source and data structure information
    Linker
        aka link editor
        allows separate compilation of modules
        three steps
            1 place code and data modules symbolically in memory
            2 determine the address of data and instruction labels
            3 patch both the internal and external references
        much faster to patch (aka edit) code than recompile and reassemble
        produces an executable that is like an object file but typically has
            no unresolved references
            no relocation information
            no symbol table
            no debugging information
        partially linked files, used to create libraries of routines, are
        still object files
    Loader
        executes an executable file
        unix example
            1 Reads the executable file header to determine the size of
              the text and data segments
            2 Creates an address space large enough for the text and data
            3 Copies the instructions and data from the executable file
              into memory
            4 Copies the parameters (if any) to the main program onto the
              stack
            5 Initializes the machine registers and sets the stack pointer
              to the first free location
            6 Jumps to a start-up routine that copies the parameters into
              the argument registers and calls the main routine of the
              program. When the main routine returns, the start-up routine
              terminates the program with an exit system call.

3.10 An Example to Put It All Together
    inlining small functions like swap
    mips compiler always saves space on the stack for arguments
        in the case of varargs, it actually saves them there

3.11 Arrays versus Pointers
    optimizing compilers can make array code as efficient as pointer
    code...
    ...but let's compare the straightforward assembly for both
    memzero example
        array   9 instructions, 7 in loop
        pointer 8 instructions, 4 in loop !!!
3.12 Real Stuff: PowerPC and 80x86 Instructions
    PowerPC
        Extra addressing modes
            Indexed Addressing
                lw $t1, $a0+$s3
                add two registers to create the address
            Update Addressing
                lw $t0, 4($s3)
                after using the address, increment the base by the
                immediate value
        Extra instructions
            load multiple / store multiple
                transfer up to 32 words at a time
                fast copies of memory when used together
                also for save/restore of registers
            counter register
                bc "branch counter"
                decrements ctr register, only branches when ctr matches
                conditions
            condition codes (or flags) for conditional branch
    Intel 80x86
        8080
            8 bit
            registers are special purpose, not general purpose
        8086
            16 bit extension to 8080 (not entirely backward compatible)
        8087
            floating point co-processor for 8086
            60 instructions
            special FP stack, not registers
        80286
            24 bit address space
            memory mapping protection model
        80386
            32 bit extension (32 bit registers and 32 bit address space)
            new instructions and addressing modes - make it more like a
            general purpose register machine
            paging mode, not just segments
        80486, Pentium, Pentium Pro
            3 multiprocessing instructions
            conditional move
        MMX
            57 new SIMD-style instructions using the floating point stack
        registers and data addressing
            about 8 gprs
            arithmetic/logical mostly two address, not three address
            7 addressing modes
        integer operations
            1 data movement - move, push, pop
            2 arithmetic/logical - test and integer and decimal arithmetic
            3 control flow - conditional, unconditional, calls, returns
                control flow uses condition registers (like powerpc)
            4 string instructions - string move and string compare
                8080 heritage - faster not to use them - see fallacy
        instruction encoding
            1-17 bytes in length

3.13 Fallacies and Pitfalls
    Fallacy: More powerful instructions mean higher performance.
        moving data faster with FP registers than move-with-repeat on x86
    Fallacy: Write in assembly language to obtain the highest performance.
    Pitfall: Forgetting that sequential word addresses in machines with
        byte addressing do not differ by one.
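That pitfall is easy to check numerically: with byte addressing, consecutive 32-bit words are 4 bytes apart (the base address below is arbitrary):

```python
base = 0x1000                                  # arbitrary word-aligned base address
word_addresses = [base + 4 * i for i in range(4)]

# Sequential word addresses differ by 4 bytes, not by 1.
assert word_addresses == [0x1000, 0x1004, 0x1008, 0x100C]
assert all(b - a == 4 for a, b in zip(word_addresses, word_addresses[1:]))
```

Writing `lw $t0, 1($s0)` to fetch "the next word" is therefore a bug (and violates the alignment restriction); the offset must be 4.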
    Pitfall: Using a pointer to an automatic variable outside its defining
        function.

3.14 Concluding Remarks
    SUMMARY OF DESIGN PRINCIPLES
        1: Simplicity favors regularity.
        2: Smaller is faster.
        3: Good design demands good compromises.
        4: Make the common case fast.

3.15 Historical Perspective and Further Reading
    accumulator architectures
        just one register
    general purpose register architectures
        register-memory (CISCy)
            - one operand can be in memory
        load-store or register-register (RISCy)
            - both operands in registers
            - need to ld/st from memory
        memory-memory (VAXy)
            - all operands can be in memory (or registers)
    compact code and stack architectures
        memory scarcity encouraged variable width instruction formats
        stack machines have dense encoding
        no registers makes for an easy compiler target - Java VM
    High-Level-Language computer architecture
        doomed by more efficient programming languages and compilers
    RISC architecture
        fixed instruction length
        load-store instruction set
        limited addressing modes
        limited operations
        MIPS, SPARC, PA-RISC, PowerPC, Alpha

3.16 Key Terms

3.17 Exercises

------------------------------------------------------------------------
CHAPTER 4 - Arithmetic for Computers

4.1 Introduction

4.2 Signed and Unsigned Numbers
    binary representation of numbers
        bits are numbered right to left
        least significant bit on right
        most significant bit on left
        overflow - when there are not enough bits to represent a number
    negative representations
        sign and magnitude
            two zeros (+0 and -0)
        one's complement
            negation algorithm: invert all bits
            still two zeros (+0 and -0)
            hardware needs an extra step to subtract
        two's complement
            MSB acts as sign bit
            negation algorithm: invert all bits and add one (unsigned)
        biased notation
            number+bias is always non-negative
            used for floating point
            00..00 is most negative, 11..11 is most positive
    sign extension
        example: when loading a byte into a word register with "lb", the
        upper 24 bits match the sign bit to preserve two's complement for
        numbers.
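The sign-extension rule can be mimicked in a few lines; this is a sketch of the idea, not actual MIPS semantics:

```python
def sign_extend_byte(b):
    """Interpret the low 8 bits of b as a signed byte, replicating the
    sign bit through the upper bits (as MIPS lb does into a register)."""
    b &= 0xFF                          # keep only the byte
    return b - 0x100 if b & 0x80 else b

assert sign_extend_byte(0x7F) == 127   # sign bit clear: value unchanged
assert sign_extend_byte(0xFF) == -1    # sign bit set: upper bits become 1s
assert sign_extend_byte(0x80) == -128
```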
        "lbu" is used for characters, since they don't logically have a
        sign bit.
    addresses
        slt/slti need unsigned variants as well (sltu/sltiu) so we can
        compare addresses, which are logically unsigned
        (note that address offsets are usually signed)

4.3 Addition and Subtraction
    addition - just like decimal
    subtraction - two's complement, just negate and add
    overflow (addition) - only when both numbers have the same sign
        adding two positives and getting a negative
        adding two negatives and getting a positive
        - add/sub cause an exception on overflow
        - MIPS addu/subu variants never overflow
        - C never has overflow exceptions, so it always uses addu/subu
        - in order to recover from the exception, use mfc0 to recover the
          ExceptionPC. the exception handler uses the reserved registers
          $k0..$k1 to avoid disturbing "caller" registers.

4.4 Logical Operations
    shifts sll/srl (use shamt field)
    bitwise logic and/or/andi/ori

4.5 Constructing an Arithmetic Logic Unit
    A 1-Bit ALU
        logical operations
            A and B, A or B, !A, A, B given directly by hardware
            AND, OR, INVERTER, MULTIPLEXOR
        adder
            inputs:  a, b, CarryIn
            outputs: sum, CarryOut
        ALU
            inputs:  a, b, CarryIn, operation
            outputs: result, CarryOut
            operation: selects AND, OR, NOT, A, B, A+B
    A 32-Bit ALU
        at worst, connect 32 1-bit ALUs using "ripple carry"
        subtraction
            need to convert B to two's complement
                - negate B's bits using the inverter
                - add one (use CarryIn): A + !B + CarryIn, with CarryIn=1
            add a BInvert input to the ALU
        MIPS additions
            slt
                need extra output "set" from the adder
                need extra input "less"
                feed MSB "set" to LSB "less" for slt
            overflow bit
                use "set" from the adder plus extra logic
            BNegate
                since BInvert == CarryIn always, just combine them to make
                BNegate
            Zero
                perform A-B and make sure all output bits are zero
                used for conditional branch testing
        Finished ALU
            inputs:  a, b, operation
            outputs: zero, result, overflow, CarryOut
        Carry Lookahead
            our 32-bit ALU is slow, propagating CarryOut through all 32
            1-bit ALUs
            carry-lookahead adder
                generate and propagate - P & G - only scales to 4 inputs
            carry-lookahead unit
                combines the most significant P & G to produce Carry
                inputs for the adders
        Shift
            use a separate barrel shifter
            handles arbitrary rotation in the time of one add, without
            propagation delay

4.6 Multiplication
    multiplicand * multiplier = product
    overflow - addition had 1-bit overflow, mul overflow can be a whole
    word
    1st version
        algorithm
            walk multiplier from LSB to MSB
            if current multiplier bit is on, add multiplicand to product
            shift left multiplicand
        requires 64-bit multiplicand, 64-bit ALU, 64-bit product,
        32-bit multiplier
    2nd version
        algorithm
            walk multiplier from LSB to MSB
            if current multiplier bit is on, add multiplicand to product
            shift right product
        requires 32-bit multiplicand, 32-bit ALU, 64-bit product,
        32-bit multiplier
    final version
        algorithm
            walk multiplier from LSB to MSB
            if current multiplier bit is on, add multiplicand to product
            shift right product (also shifts multiplier, giving the "walk")
        requires 32-bit multiplicand, 32-bit ALU, 64-bit product (shared
        with 32-bit multiplier)
    signed numbers
        just sign extend as we shift right
    Booth's algorithm
        replace some arithmetic with shifts (assuming shifting is faster
        than arithmetic)
        strength reduction
            similar conceptually to using left shift for multiplying by 2
        unfortunately Booth's algorithm sometimes does more arithmetic -
        depends on the input
    MIPS: overflow checking must be done in software

4.7 Division
    Dividend = Quotient * Divisor + Remainder
    Remainder
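The division identity above can be sanity-checked with arbitrary operands (the numbers below are made up):

```python
dividend, divisor = 74, 21
quotient = dividend // divisor      # integer quotient: 3
remainder = dividend % divisor      # what's left over: 11

# Any correct divider must satisfy Dividend = Quotient * Divisor + Remainder.
assert dividend == quotient * divisor + remainder
assert 0 <= remainder < divisor
```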