See also: My Computer Architecture Notes from Comps 2004 http://carlstrom.com/stanford/comps/Computer-Architecture.txt -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- CS282 Computer Architecture: A Quantitative Approach, Third Edition by John L. Hennessy and David A. Patterson Foreword vii by Bill Joy - RISC pioneered by John Cocke IBM promoted and named by P&H commericalized in PowerPC, MIPS, and SPARC Preface xvii - embedded vs desktop vs server markets emeddded low power no active cooling game consoles, cameras, cell phones EEMBC benchmarks desktop latency SPEC2000 benchmarks server reliability, scalability, throughput TPC-C benchmarks SPEC FS (File System) Acknowledgments xxv Chapter 1 Fundamentals of Computer Design 1.1 Introduction 2 - moore's law improvements steady architecture improvements more bursty however, figure 1.1 shows that architecture improvements have major impact on relatively performance curve (15x in 2001 starting from 1984) - the move to microprocessors 35% baseline performance improvement per year without architecture improvements new microprocessor architectures easier to adopt because: 1.) easier to switch with mostly non-assembly code 2.) easier to switch with unix/linux to port - microprocessor have meant 1.) supercomputer power for the desktop user 2.) even supercomputers are built with microprocessors - huge transistor budgets have allowed CISC to adopt RISC internally 1.2 The Changing Face of Computing and the Task of the Computer Designer 4 - timeline 60s mainframe 70s mini-computer 80s microprocessors 90s handheld devices - computing markets Desktop - price performance Servers 1.) availability (not reliability) 2.) scalability 3.) throughput Embedded - widest power and cost ranges 8-bit to 32-bit low-end 8-bit 16-bit $00.50 middle 32-bit $10.00 high 64-bit $$$$$$ console, high end network switch , $$$ - real-time performance requirement soft real-time - minimizing 1.) memory can be significant portion of system cost sometimes want to stick to on processor memory code density can be an issue 2.) power battery live lower power can use plastic instead of more expensive ceramic packaging - custom logic 1.) custom logic circuts on chip 2.) custom software 3.) DSP processor with custom hw/sw - The Task of the Computer Designer computer architecture - not just instruction set architecture (ISA) design - also architecture implementation instruction set architecture - programmer visible instruction set - hw/sw interface architecture implementation - organization - hardware organization - memory system - bus structure - CPU design hardware - detailed logic design - packaging technology computer architecture 1.) ISA design 2.) organization 3.) hardware computer architect makes tradeoffs between functional requirements - price - power - performance today, different designs for different markets! 1.3 Technology Trends 11 - Growth Rates - Integrated circuit logic tecehnology 1.) transistor density 35%/yr 2.) die size 10-20%/yr 1/2 => transistor count 35%/yr - Transistor DRAM density 40-60%/yr cycle time groth slow 33% in 10 years bandwidth increases 2x as fast as latency decreases (future is high bandwidth, but relatively high latency) - Magnetic Disk Technology disk density 100%/yr access time growth slow also 33% in 10 years - Network Technology bandwith focus of growth latency room for improvement - Designer must plan ahead, anticipating these growth rates - Discrete leaps, not just continuous changes 32-bit processor on one chip... 
When cache could move on chip... When multiple-cores can move on chip... - Scaling of Transister Performance, Wires, and Power in ICs feature size - three dimensional devices shrink quadratically in horizontal devices also shrink in vertical affects operating volatage - growth transistor performance improves linearly with decreasing feature size transistor count improves quadratically with decreasing feature size big budget - move from 4,7,16,32,64 bit processors - pipelining - caches wires do not improve - signal delay proportional to R*C (resistance*capacitance) - wires get shorter but R*C per unit length increase - move to copper provided one-time improvement in wire-delay - wire delay is now a major issue, rivaling transistor switching delay - large ammount of cycle time for signal propagation Pentium 4 has 2 stages of 20+ stage pipleline just for signal propagation power scaling - CMOS processor power dominated by switching transistors - energy/transistor proportional to load capacitance * switching frequency * square of voltage - parameter trends don't favor power decreasing :) transistor count ++ switching frequency ++ load capacitance - voltage - - power likey to be limiting factor for chips, not raw transistor count 1.4 Cost, Price, and Their Trends 14 - Impact of Time, Volume, and Commodification key factors in cost 1.) learning curve - manufactuing costs decrease over time yield (measure of learning curve) - percentage of manufactured devices that pass testing - 2x the yield = 1/2 the cost DRAM price=~cost, as yield improves, price comes down directly CPU price~!cost, yield improvements, ..., profit!!! :) (to be fair, the book says that price decreases, just not as fast as for commodity DRAMS) 2.) volume - rule of thumb: cost decreases 10% each time volume doubles - ammortize development cost commodities - clear product definition - drives volume - pressure to lower price down to (or below) cost - drives low end growth, while sacrificing profits - Cost of an Integrated Circuit rest of system is so commodity oriented, IC cost can be differentiating factor wafers, chopped into dies simple math... cost of ic = (cost of die + cost of testing die + cost of packagine + cost of final test) / final test yield cost of die = cost of wafer / (dies per wafer * die yield) diers per wafer = area of wafer / area per die (approximately...) can use redundancy to raise yield example, DRAM includes redundant memory cells to accomidate flaws also in stand alone SRAMs also in processing embedded cache SRAM architect can control die size I/O pins redundancy and binning hacks mask set cost significant for low production runs $1 million for 4-6 high density metal masks cost going up use reconfigutable logic for flexibility use gate arrays which have fewer custom mask levels - Distribution of Costs in a system: An example $1000 PC overall costs 6% cabinet 37% processor board 37% I/O devices 20% software (OS+office) 22% for processor and 19% for monitor - Cost versus Price - Why they differ and by how much 1.) component costs 2.) direct costs = labor, purchasing components, scrap, warranty = 10-30% 3.) gross margin = indirect costs R&D, marketing, sales, manufacturing equipment maintenance, real estate, cost of financing, pretax profits, taxes 4.) average discount 1-3 = average selling priace 1-4 = list price R&D percentage 4% commodity PC business 12% high end server business 15%-20% is unsustainable expensive, large machine problem... 
sell few AND higher R&D costs need to have higher gross margin options high-performance design (supercomputer, high end servers) low-cost design (embedded) cost-performance (desktop, workstation, low end server) 1.5 Measuring and Reporting Performance 24 - Metrics response time (aka execution time) - how long does it take to get my job done - one task vs many tasks => latency throughput (vs response time) - how much work can get done in a certain time - many tasks vs one task - important measure for batch, server, supercomputer operations execution time is only completely reliable metric (see section 1.9) - Measuring Performance wall-clock time, response time, elapsed time latency to complete a task including disk access, memory access, I/O, OS overhead, everything CPU time does not include time waiting for I/O or other running programs divided into user CPU time (program) and sytem CPU time (OS) system performance vs CPU performance system performance: wall-clock time on unloaded system CPU performance: user CPU time - Choosing Program to Evalutate Performance workload typical mixture of programs and OS commands that users run on a machine bests to work approximations of workload 1.) real applications compilers, word processors, photoshop portability problems for OS dependant apps 2.) modified (or scripted) applications modified for portability modified to focus on one part of execution scripts to similate user response, esp in multi-user case 3.) kernels isolate performance of individual features of a machine Linpack and "Livermore Loops" 4.) toy benchmarks 10-100 line programs Sieve of Eratosthenes, Puzzle, Quicksort 5.) synthetic benchmarks kernels are at least from real programs synthetic are artificial either compiler optimizer can't make sense of synthetic program OR compiler can over optimize over simplified synthetic program Whetstone and Dhrystone - Benchmark suites SPEC - 1980's effort for better workstation benchmarks Ziff Davis does a lot of Windows machine benchmarks Business Winstone: browser and office product multitasking CC Winstone: content creation (browser, photoshop, premier, audio editing) Winbench: kernels for CPU, video, disk performance - Desktop Benchmarks for CPU and Graphics SPEC CPU benchmarks: SPEC89, SPEC92, SPEC95, SPEC CPU2000 CPU2000 CINT2000 11 integer benchmarks c-compiler, VLSI place-and-route, graphics app CFP2000 14 floating point benchmarks quantum chromodynamics, finent element modeling, fluid dynamics focus on portability and minimizing role of I/O C, C++, F77, F90 for desktops and single-CPU servers results based on machine, compiler, and OS configuration SPECviewperf for OpenGL 3D rendering SPECapc uses graphics applications Pro/Enginner - solid modeling of photocopier SolidWorks 2001 - 3D CAD/CAM of assembly line Unigraphics V15 - aircraft model assembly, drafting, numeric control machining, solid modeling, and optimization - Server Benchmarks SPECrate - run multiple copies of SPEC CPU2000, usually one copy per CPU SPECSFS - NFS file server benchmark - throughput measurement but with requirements on response time SPECWeb - web server benchmark - static and dynamic pages generatio - server responses to client posts TP (transaction processing) benchmarks measurement is transactions/second however, response time requirement slow reponding transactions don't count in total higher transactions rates means large systems that is more concurrent users and large data set system cost must be reported to allow cost-performance analysis TPC-A 1985 TPC-C 
      1992
      complex query environment
    TPC-H
      ad hoc decision support
      queries unrelated, past queries can't help optimize future
    TPC-R
      business decision support
      standard set of queries allows pre-optimization
    TPC-W
      web-based transaction benchmark
      covers database and web server
- Embedded Benchmarks
  EEMBC (pronounced "embassy")
  5 classes of kernels
  1.) 16 automotive/industrial - 6 microbenchmarks, 5 automotive, 5 filter/FFT
  2.) 5 consumer - multimedia
  3.) 3 networking - shortest path, routing, packet flow
  4.) 4 office automation - graphics and text
  5.) 6 telecommunication - filtering and DSP
- Reporting performance results
  reproducibility
    list everything another experimenter would need to duplicate the results
  SPEC requires
    fairly complete description of machine
    compiler flags
    baseline and optimized results (see software configuration below)
    actual performance times, in tabular form and graph
  TPC requires even more
    benchmarking audit
    cost information
  software configuration
    often run in single-user mode to reduce overhead
    OS enhancements help TPC
    compiler enhancements help CPU
      especially when source modification allowed
      benchmark-specific optimization flags!!!
        which is why SPEC requires both baseline and optimized reporting
        baseline optimizations have to work for ALL apps in same language
  modifications allowed?
  1.) no modification
      SPEC
  2.) modifications allowed but are difficult or impossible
      TPC-C typically uses closed-source databases from Oracle and Microsoft
  3.) modifications are allowed
      NAS Supercomputing benchmark
        as long as you produce same output on same input...
      EEMBC
        have to report results with original code as well as modified source
  4.) hand coding (assembly)
      EEMBC allows hand coding of all benchmarks
- Comparing and Summarizing Performance
  faster: is it clear what this means?
  lying with statistics
- Total Execution Time: A Consistent Summary Measure
  arithmetic mean tracks total execution time
- Weighted Execution Time
  weighted arithmetic mean for when some apps are more important than others...
    for example, if one does 4x of the work...
    but isn't that dependent on the apps scaling the same on new configurations?
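(Illustrative sketch, not from the book: a little C program comparing the
arithmetic mean with a weighted arithmetic mean for two machines running two
programs. All execution times and weights are made-up numbers, chosen so the
weighting reverses which machine looks faster.)

#include <stdio.h>

#define NPROGS 2

int main(void) {
    /* made-up seconds to run each program on machine A and machine B */
    double time_a[NPROGS] = {1.0, 1000.0};
    double time_b[NPROGS] = {10.0, 100.0};
    /* hypothetical weights: program 0 dominates the workload */
    double weight[NPROGS] = {0.995, 0.005};

    double total_a = 0.0, total_b = 0.0;   /* total execution time */
    double wmean_a = 0.0, wmean_b = 0.0;   /* weighted arithmetic mean */
    for (int i = 0; i < NPROGS; i++) {
        total_a += time_a[i];
        total_b += time_b[i];
        wmean_a += weight[i] * time_a[i];
        wmean_b += weight[i] * time_b[i];
    }

    /* the arithmetic mean tracks total execution time (total / NPROGS):
     * by this measure machine B looks roughly 9x faster */
    printf("arithmetic mean: A=%8.2f  B=%8.2f\n",
           total_a / NPROGS, total_b / NPROGS);
    /* with weights that favor program 0, machine A comes out ahead instead;
     * the summary depends entirely on the assumed workload mix */
    printf("weighted mean:   A=%8.2f  B=%8.2f\n", wmean_a, wmean_b);
    return 0;
}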
- Normalized Execution Time and the Pros and Cons of Geometric Means
  normalize execution time to reference machine
  take the average of the normalized execution times
  used by SPEC, where a SPARCstation is used for reference
  geometric mean is consistent no matter which machine is the reference
    also makes it independent of biasing by using larger inputs on your better apps
      of course, most benchmarks specify inputs
  HOWEVER
    violates fundamental rule of measuring performance by execution time
    encourages vendors to focus on what is easy, not what is hard
      especially on small, short-running programs, not large programs with multiple phases

1.6 Quantitative Principles of Computer Design 39
- Make the Common Case Fast
- Amdahl's Law
  the law of diminishing returns
  speedup_overall = 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)
- CPU Performance Equation

               Seconds    Instructions   Clock Cycles     Seconds
   CPU time = --------- = ------------ * ------------ * -----------
               Program      Program      Instruction    Clock Cycle

  Several parameters are interdependent:
    Clock cycle time  - hardware technology and organization
    CPI               - organization and instruction set architecture
    Instruction count - instruction set architecture and compiler technology
  specifically
    organization affects clock cycle time and CPI
    ISA affects CPI and instruction count
    hardware technology largely affects cycle time, but also CPI
    compiler technology affects instruction count
- Measuring and Modeling the Components of the CPU Performance Equation
  existing CPU
    measure execution time
    clock speed is known
    instruction count from hardware performance counters
    statistical CPI from hardware performance counters
    simulation can give more detail
  simulation
    profile based, static modeling
      get a program profile via
      1.) hardware sampling
      2.) instrumented execution
      3.) ISA simulator
    trace based
      particularly useful for memory system performance
      can be combined with static analysis of simple pipeline performance
        or more detailed dynamic analysis for complicated pipelines
      isolates model of memory from pipeline
    execution driven simulations
      detailed simulation of both memory and pipeline
- Principle of Locality
  temporal locality
  spatial locality
- Take Advantage of Parallelism
  system level
    SPECWeb or TPC
    multiple processors
    multiple disks
  processor level
    ILP
    pipelining
  digital design
    set associative caches
    ALU with carry-lookahead adders
  speculation
    evaluate multiple possible outcomes, late selection of result
    carry select adders
    set-associative caches
    handling branches in pipelines

1.7 Putting It All Together: Performance and Price-Performance 48
- Performance and Price-Performance for Desktop Systems
  Figure 1.19, integer performance roughly parallel to price-performance
    notably, there was not a significant premium for extra performance on high end systems,
    in other words you get what you pay for
  Figure 1.20, FP performance vs price-performance more varied
    Pentium 4 system price-performance was notably higher
    in other words, P4 a good deal if you need FP
- Performance and Price-Performance for Transaction-Processing Servers
  Measuring OLTP with TPC-C
  1.) reasonable approximation of real OLTP
  2.) total system performance
  3.) rules are very complete
  4.) vendors spend effort to make it run well
  5.)
requirement to report performance and price-performance TPM (transactions/minute) and TPM/$ Figure 1.22 shows 6 highest absolute performance OLTP systems, clusters of 48-280 cpus Figure 1.23 shows 6 highest price-performance OLTP systems, SMPs with 3-8 cpus 4x better in terms of price-performance than their big brothers - Performance and Price-Performance for Embedded Processors cost can be affected by integrated memory control and I/O controller not reflected in benchmarks x86, PowerPC, MIPS64 systems wide range in price makes slower processors more cost-effective (not true in Desktop systems, similar to high performance premium in servers) 1.8 Another View: Power Consumption and Efficiency as the Metric 56 - Performance/Watt Power consumption varies in embedded processors by a factor of 10x desktop processors pushing against air cooling limits server operating costs affected by power consumption 1.9 Fallacies and Pitfalls 57 - Fallacy The relative performance of two processors with the same instruction set architecture (ISA) can be judge by clock rate or by the performance of a single benchmark suite - Fallacy Benchmarks remain valid indefinitely maxtrix300 in SPEC89 99% of time in one line blocking optimization owned this benchmark adding compiler optimizations for specific benchmarks SPEC92 increased GCC input sizes to be more interesting half of benchmarks added in 1992 and 1995 were dropped in next release - Pitfall Comparing hand-coded assembly and compiler-generated, high-level language performance - Fallacy Peak performance tracks observed performance difference is usually 10x in supercomputer applications - Fallacy The best design for a computer is the one that optimizes the primary object without considering implementation - Pitfall Neglecting the cost of software in either evaluating a system or examining cost-performance - Pitfall Falling prey to Amdahl's Law - Fallacy Synthetic benchmarks predict performance for real programs - Fallacy MIPS is an accurate measure for comparing performance among computers 1.10 Concluding Remarks 65 1.11 Historical Perspective and References 67 - The First General-Purpose Electronic Computers ENIAC J. Presper Eckert, John Mauchly University of Pennsylvania EDVAC stored program computer proposal "von Neumann machine" John von Neumann, J. 
Presper Eckert, John Mauchly EDSAC Maurice Wilkes Cambridge University IAS machine Institute for Advanced Study, Princeton Burks, Holdstine, von Neumann basis for IBM 701 Mark-I, Mark-II, Mark-III, Mark-IB Howard Aiken Harvard "Harvard architecture" had seperate memories for instructions and data term sometimes used today for split L1 caches for instructions and data Whirlwind MIT Magnetic Core Memory - Important Special-Purposes Machines BOMB UK Alan Turing Electro-mechanical COLOSSUS UK Newman and Flowers Electronic ERA machines US acquired by Sperry-Rand (which also owned Eckert Mauchly Computer Corporation) Konrad Zuse Germany first floating point, deemed unnecessary by von Neumann John Atanasoff Iowa State University ABC computer, never completely operations binary representation however, Mauchly visited, probably was influenced by ABC - Commerical Developments BINAC Eckert-Mauchly Computer Corporation for Northrup UNIVAC I Eckert-Mauchly -> Remington-Rand -> Sperry-Rand -> UNIVAC division IBM 701, IBM 702 modest successes IBM 650, 704, 705 significant successes - Development of Quantitative Performance Measures: Successes and Failures Gibson mix average instruction execution time MIPS Whetstone Algol 60 -> FORTRAN Livermore FORTRAN Kernels Relative MIPS VAX-11/780 same speed relative to IBM 370/158 which was 1 MIPS MFLOPS usually only peak reported... SPEC see above Exercises 74 Chapter 2 Instruction Set Principles and Examples 2.1 Introduction 90 - MIPS architecture spans embedded, desktop, server - x86 shows more than purely aesthetic design is important - DSP instructions largely used by special purpose libraries never generated by compiler... - Vector instructions again, largely used by platform specific libraries including memcpy... note quite as special purpose as DSP instructions GCC 4 does some auto-vectorization... 
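(Illustrative sketch, not from the book: the kind of C loop an auto-vectorizer
can turn into SIMD/vector instructions - independent element-wise work with no
loop-carried dependence. Whether it actually gets vectorized depends on the
compiler version, target, and flags; on GCC something like -O3 or an explicit
-ftree-vectorize is the usual way to ask for it.)

#include <stdio.h>

/* y[i] = a*x[i] + y[i]; each iteration is independent, so the compiler may
 * process several floats per instruction (e.g., 4 with 128-bit SSE/AltiVec) */
static void saxpy(float *restrict y, const float *restrict x, float a, int n) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    float x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float y[8] = {0};
    saxpy(y, x, 2.0f, 8);
    printf("y[7] = %g\n", y[7]);   /* prints 16 */
    return 0;
}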
2.2 Classifying Instruction Set Architectures 92 - categories Stack implicit arguments from stack example: C=A+B PUSH A PUSH B ADD POP C Symbolics, Java Virtual Machine Accumulator implicit argument of accumulator example: C=A+B LOAD A ADD B STORE C General-purpose register architectures no implicit arguments Register-memory (CISC) operands from register and memory example: C=A+B LOAD R1, A ADD R3, R1,B STORE R3, C IBM 360/370, Intel 80x86, Morotola 68k, TI TMS320C45x pros data can be accessed w/o seperate load instruction format easy to encode good code density cons operands not equivalent since a source operatnd in binary operation destroyed encoding register and memory in each instruction can restrict number of registers clocks per instruction vary by operand location Register-register/load-store (RISC) example: C=A+B LOAD R1, A LOAD R2, B ADD R3, R1, R2 STORE R3, C operands from registers only, except load/store from/to memory Alpha, ARM, MIPS, PowerPC, SPARC, SuperH, Trimedia TM5200 pros simple fixed-length instruction coding simple code generation model instructions take similar number of clocks to execute cons higher instruction count because of extra memory reference instructions more instructions and lower density leads to larger programs Memory-Memory operands from memory only (not in use today) example: C=A+B ADD C, A, B VAX pros most compact doesn't waste registers on temporaries cons large variation in instruction size especially for 3 operand instructios large variation in work per instruction memory access create memory bottlenck extended accumulator or special purpose register computer x86 certain registers are implicit in some instructions - advantage of general-purpose register architectures 1.) registers are faster than memory 2.) registers are more efficient for a compiler to use allows easy reordering without maintaining stack discipline registers can hold variables reduces memory traffic register are faster as well (stated in 1.) 
code density improves (fewer bits to name register than memory) general-purpose registers best for compiler special purpose users constrain compiler DSP crowd still has a lot of special purpose registers ...and hand coding as well so that isn't as much of an issue calling conventions dictate some register usage register for argument passing volatile vs non-volatile registers across function calls SUMMARY NEW architectures are register-register/load-store 2.3 Memory Addressing 95 - Interpretting Memory Addresses endianness little endian 76543210 strings appear backwards (DRAWKCAB) in registers big endian 01234567 alignment byte, half word, word, double word (ie, 8, 16, 32, 64 bit) - Addressing Modes Effective Address the computed address actually used to access memory, regardless of mode PC-relative addressing primarily for code addresses in control transfer Figure 2.6 Register Add R4, R3 When a value is in a register Immediate Add R4, #3 For constants, embedded in instruction Displacement Add R4, 100(R1) constant offset from address in register Register indirect Add R4, (R1) special case of displacement with no offset Indexed Add R4, (R1+R2) useful for array indexing, base in R1, offset in R2 Direct or absolute Add R1, (1001) useful for static data, need large address Memory indirect Add R1, @(R3) double indirect Autoincrement Add R1, (R2)+ Useful for stepping through an array Autodecrement Add R1, -(R2) useful for push/pop with autoincrement Scaled Add R1, 100(R2)[R3] similar to indexed, but r3 is scaled by constant such as word size Figure 2.7 Frequently used addressing moves for 3 SPEC89 apps on VAX (which had the richest modes) Displacement, Immediate, Register Indirect, Scaled, Memory indirect (excluding PC-relative) - Displacement Addressing Mode affects code density Figure 2.8 On Alpha with 16-bit displacement field, running SPEC CPU2000 integer programs use smaller displacements on average (1-bit) floating point programs use large displacements on average (13 bits) - Immediate or Literal Addressing Mode Figure 2.9 1/4 of data transfers and ALU operations use immediate operand Figure 2.10 On Alpha with 16-bit immediate field, running SPEC CPU2000 integer programs use LARGE displacements on average (14-bit) floating point programs use SMALLER displacements on average (4 bits) On VAX with 32-bit immediates, 20%-25% of immediates were bigger than 16-bits 2.4 Addressing Modes for Signal Processing 101 - Circuluar buffers are common in DSP applications Special "modulo" or "circular" addressing mode - FFT needs a particular permution Special "bit reverse" addressing mode (just for FFT!) SUMMARY: Memory Addressing Expectations for new architecture: 1.) addressing modes: displacement, immediate, register indirect Figure 2.7 shows these cover 75-99% of SPEC CPU2000 apps 2.) displacement 12-16 bits Figure 2.8 shows these cover 75-99% of displacements 3.) 
immediate 8-16 bits Figure 2.10 shows these cover 50-80% of displacements Modes that cannot be targeted by compiler of questionable value even not frequently used ones are not usually worth implementation cost 2.5 Type and Size of Operands 104 - type implied by operation tagged in hardware only in museums - examples integer byte, half-word, word floating point single-precision, double-precision before 1980, vendor specific implementation IEEE standard 754 character (8-bit byte or 16-bit unicode) now rare "string" operations binary coded decimal (BCD) 4-bits per digit pack and unpack into string form exact arithmetic - registers contents typically store full word or double precision values e.g., byte becomes word, single becomes double VAX had operations for shorter values but only used 6% of operand accesses 2.6 Operands for Media and Signal Processing 105 - vertex x,y,z,w a 3d point with additional "w" information (such as color) 3 vertexes is triangle... - pixels render that triable to get r,g,b,a (color + alpha transparency) - fixed point fractions between -1 and +1 in binary SUMMARY: Type and Size Operands - 32-bit architecture 8,16,32 integers 32,64 doubles - 64-bit adds 64-bit integers - probably not decimal (BCD) support - DSP's have wider accumulator registers than memory for more accuracy often more that 2x more (x86 floating point similar, 80-bit precision or something like that -bdc) 2.7 Operations in the Instruction Set 108 - Categories of instructions: Figure 2.15 Arithmetic and logical Data transfer Control System Floating point Decimal String Graphics - Most frequently used are most simple Top 10 for x86 covering 96% (Figure 2.16) 1 22% load 2 20% conditional branch 3 16% compre 4 12% store 5 8% add 6 6% and 7 5% sub 8 4% move register-register 9 1% call 10 1% return 2.8 Operations for Media and Signal Processing 109 - Media processor output judged by human perception which can't perceieve 64-bit values :) therefore use less precision example graphics using single precision - partitioned add since using 4 16-bit values, let 64-bit ALU do 4 adds in parallel just need to prevent carries between partitions this is just special case of ... - vector instructions single-instruction multiple-data (SIMD) - commerical examples Alpha MAX HP PA-RISC MAX2 Intel Pentium MMX PowerPC AltiVec SPARC VIS - categories add/subtract saturating add/sub multiply compare shift right/left shift right arithmetic multiple and add shift and add (satuarying) absolute difference max/min pack (2n bits->n bits) unkac/merge permute/shuffle - AltiVec seems to be most complete uses 128-bit ALU AltiVec followed by MMX - DSP processor issues 1.) saturating artitmetic don't want exception on overflow, just go to max value of type 2.) special rounding modes for storing from wide accumulator to memory 3.) multiple-accumualte instruction for dot product other wacky DSP stuff - some memory mapped registers! 
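(Illustrative sketch, not from the book or any particular ISA: plain C versions
of two ideas from this section - a saturating 16-bit add, and a "partitioned"
add that does four independent 16-bit adds packed in one 64-bit word by keeping
carries from crossing lane boundaries.)

#include <stdint.h>
#include <stdio.h>

/* Saturating add: on overflow, clamp to the largest/smallest value
 * instead of wrapping around (the DSP behavior described above). */
static int16_t sat_add16(int16_t a, int16_t b) {
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Partitioned (SIMD-style) add: four 16-bit lanes at once, with no carry
 * allowed to spill into the neighboring lane.  Trick: clear the top bit of
 * each lane, add, then patch the top bits back in with XOR so no carry ever
 * crosses a 16-bit boundary. */
static uint64_t padd16x4(uint64_t a, uint64_t b) {
    const uint64_t MASK = 0x7FFF7FFF7FFF7FFFULL;   /* low 15 bits of each lane */
    uint64_t low  = (a & MASK) + (b & MASK);       /* these adds cannot cross lanes */
    uint64_t high = (a ^ b) & ~MASK;               /* top bit of each lane, mod 2 */
    return low ^ high;                             /* wrapping add per 16-bit lane */
}

int main(void) {
    printf("sat_add16(30000, 10000) = %d\n", sat_add16(30000, 10000)); /* 32767 */
    uint64_t r = padd16x4(0x0001000200030004ULL, 0x0010002000300040ULL);
    printf("padd16x4 -> 0x%016llx\n", (unsigned long long)r); /* 0x0011002200330044 */
    return 0;
}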
- usually no FP unit, all fixed point SUMMARY simple instructions are popular and important load, store, add, subtract, move register-register, shift DSP adds multiply and multiply-accumulate 2.9 Instructions for Control Flow 111 - many names for control flow transfers, branch, jump jump => unconditional branch => conditional - 4 major types of control flow instructions conditional branches jumps procedure calls procedure returns - frequencey in alpha runinng SPEC CPU2000 approx 80% condition branch approx 10% jump approx 10% call/return about the same for INT and FP, see Figure 2.19 - Addressing Modes for Control Flow Instructions Destination Address the computed address actually used as target for control flow mode examples PC-relative position independance position independant code (PIC) works for conditional branches jumps quantitative results from Alpha (Figure 2.20) 4-8 bits of displacement needed for integer and floating point respectively 75% of are forward relative RISC has lower code density, but knows code is word aligned, saving lower bits CISC has higher code density, but needs byte indexed offset taking more space register indirect works when target is not known at compile (really link) time case/switch statements virtual functions or methods high-order functions or function pointers dynamically shared libraries - Conditional Branch Options Figure 2.21 Condition code test special bits set by ALU operations, possibly under program control x86, ARM, PowerPC, SPARC, SuperH pros sometimes condition is set for free (w/o extra instruction) measurement shows this is rare cons CC is extra state (special register on PowerPC, many flags in x86 -bdc) constrain ordering of instructions especially when codes are set unconditional (not a bit in instruction) Condition register tests arbitrary register with result of comparison Alpha, MIPS pros simple cons uses up a register Compare and branch compre is part of the branch, often limited to subset PA-RISC, VAX pros one instructions rather than two for branch cons may be too much work per instruction for pipelined execution by limiting to subset, such as branch conditionally on zero, can overcome pipeline issue Figure 2.22 approximate comparison type frequency on alpha running SPEC CPU2000 <= 35% < 35% == 15% != 5% >= 10% > 0% DSP additional condition branch type repeat mode repeat instruction or block a fixed number of times say 256 used with autoincrememnt/autodecrement to reduce kernel loop overhead - Procedure Invocation Options Requirement transfer control save return address in special link register(LR) or general purpose register(GPR) Options save additional register state Conventions caller saved (volatile) save needed registers before function call callee saving (non-volatile) save changed before use, restore before return separate compilation makes it hard for compiler to optimize away caller saved application binary interface(ABI) convention defining register usage caller saved and callee saved also argument passing, frame pointer, stack pointer SUMMARY PC-relative branch with at least 8-bit displacement register indirect for returns, etc 2.10 Encoding an Instruction Set 117 - opcode - address specifier CISC way of identifying operand location RISC encodes as logically part of opcode - tradeoffs between 1.) want to have as many registers AND addressing modes as possible 2.) size of register and addressing mode fields affects size of program and therefore average program ize 3.) 
a desire for instruction lengths to allow efficient instruction decoding of for pipelining. fixed size preferable, multiple of bytes at the minimum - options (Figure 2.23) variable VAX, x86 opcode, number of operands, address specifier1, address field1, ... Fixed Alpha, ARM, MIPS, PowerPC, SPARC, SuperH opcode, address field 1, address field 3, address field 3 hybrid (small set of regular options, can be different lengths) IBM 360/70, MIPS16, Thumb, TI TMS320C54x opcode, address specifier, address field opcode, address specifier1, address specifier2, address field opcode, address specifier, address field1, address field2 - Reduced Code Size in RISCs RISC variants with 16-bit wide and 32-bit wide instructions examples ARM Thumb and MIPS 16 claim code size reduction of 40% IBM offers CodePack which uses code compression decompressed on fetch to i-cache downside: i-cache filled with 32-bit instructions hybrid's have 25% more logically instructions in i-cache CodePack uses RLE of PowerPC - specific to program - 2k table on chip for decode - hashtable to map aligned to unaligned addresses Performance? - 10% hit - 35%-40% code reduction SuperH - 16-bit wide instructions only - only 16 registers vs 32 to make it fit - otherwise looks like classic RISC SUMMARY - speed => fixed length - size => variable length 2.11 Crosscutting Issues: The Role of Compilers 120 - today almost all programming done in high level languages ISA is simply a compiler target - historically (and today's DSP market) focus on making it easy on assembly programmers - broken thinking trying to isolate ISA from compiler as bad as seperating ISA from hardware implementation tradeoffs between each interface - The Structure of Recent Compilers (Figure 2.34) Front end per language -> intermediate representation High-level optimizations -> loop transformations -> procedure inlining (aka procedure integration) Global optimizier -> global and local optimizations -> register allocation Code Generator -> detailed instruction selection -> machine-dependant optimizations -> (peep hole optimziations -bdc) - Goals 1.) correctness 2.) speed - Complexity #1 limits optimizations that can help with #2 phase ordering problem need to decide inlining before know size from code generator - compiler-ISA interaction example global common subexpression elimination big impact if shared temporary location is register allocated - Register Allocation probably the most important optimization graph coloring to try and achieve 100% allocation to registers NP-complete but heuristics work well However, need 16+ registers for heuristics to work well - Impact of Optimizations on Performance Figure 2.25 High-level (at or near the source level) Procedure inliming Local (within straightline code) 18% common subexpression elimination 22% constant propagation stack height reduction minimize resourced needed for expression evaluation Global (across a branch) 13% global common subexpression elimination 11% copy propagation 16% copy motion 2% induction variable elimination Processor-dependant strength reduction pipeline scheduling branch offset optimization Figure 2.26 optimizations eliminate 25%-90% of instructions - The Impact of Compiler Technology on the Architect's Decisions Question 1 of 2: How are variables allocated and addressed? Question 2 of 2: How many registers are needed to allocate variables appropraitely? 
Answer 1 stack local variables global data area static declared objects frequently arrays or other large data structures heap dynamically allocated objects that don't follow stack discipline accessed via pointers, typically not scalars Answer 2 registers are more effective for stack than global very hard for heap allocated objects "aliasing" problem for heap and some stack variables - How the Architect Can Help the Compiler Writer Make the frequent cases fast and the rare cases correct 1.) provide regularity orthogonality of operations, data types, addressing modes 2.) provide primitives, not solutions 3.) simplify trade-offs among alternatives 4.) provide instructions that bind the qualatites known at compile time as constants example of badness: VAX calls instruction dynamically interprets register save mask known at compiletime at runtime - Compiler Support (or Lack Thereof) for Multimedia Instructions SIMD problems - solutions not primitives - short of registers - data types don't match existing programming languages general vector architectures not bad compiler support for those limits on size MMX 64-bit total, altivec 128 SSE introduced for 128-bit support (MMX not extensible) better design? seperate length and size let processor determine scheduling of 128 8-bit adds old processor might do 8 at at time, new 16 (64=8*8, 128=8*16) general vector architectures good for hiding memory latency access pattern visible ahead of time options like - strided addressing - gather/scatter addressing allow more programs to be vectorized the "register indirect" mode of vector computers because of problems, SIMD typically only found in hand coded libraries (out of date: - GCC4 does target SIMD now... - but what is usage in SPEC CPU2000? ) SUMMARY of role of compilers - 16 general purpose registers to simplify register allocation not counting floating point - orthogonality: all addressing modes for all load/store instructions - KISS (keep it simply stupid) principple primitives not solutions simplify tradeoffs don't bind constants at runtime - current SIMD is good marketing, not good design 2.12 Putting It All Together: The MIPS Architecture 129 - based on rationalizations of previous sections we come up with our ISA design which happes to be MIPS-64 2.2 GPR registers with load-store architecture 2.3 Addressing modes displacement with offset of 12-16 bits immediates of 8-16 bits register indirect 2.5 data sizes and types 8,16,32,64-bit integers 64-bit IEEE 754 floating point 2.7 load,store,add,sub.mv,,shift 2.9 eq,ne,lt branch with PC-relative 8-bit offset jump, call, return 2.10 fixed instruction encoding for performance (or variable for size) 2.11 16+ GPRs, supported in all addressing modes separate set of FPRs - MIPS64 (aka MIPS) - simple load-store instruction set - design for pipelining efficiency, including a fixed instruction set encoding - efficiency as a compiler target - Registers for MIPS 32 64-bit GPRs R0..R31 32 64-bit FPRs F0..F31 (hold 1 float, 1 double, or 2 floats) R0 is always 0 SPRs (for example floating-point status) Can move between GPR and FPR - Data Types for MIPS 8,16,32,64-bit integers 32,64-bit floating point ALU operations on 64-bit integers, 32/64 floating point smaller integer types are zero or sign extended - Addressing Modes for MIPS Data Transfers immediate and displacement both with 16-bit offset fields register indirect: use 0 for displacement absolute addressing with 16 bit field using R0 as base register byte addressable aligned memory accessed based on 
8,16,32,64-bit types bit endian or little endian - MIPS Instruction Format 32-bit fixed size 6-bit for primary opcode I-type (immediate) 6-opcode 5-rs 5-rt 16-immediate loads and stores immediates conditional branch rs is register, rt unused jump register, junp and link register rt=0, rs=dest, immediate=0 R-type (register) 6-opcode 5-rs 5-rt 5-rd 5-shamt 6-function register-register ALU operations rd = rs funct rt (shamt for shift?) read/write special registers move registers J-type (jump) 6-opcode 28-offset jump and jump and lick trap and return from exception - MIPS Operations loads and stores ALU operations loading r0 always is 0 all register-to-register add,sub,and,or,xor,shifts with and without immediate operands lui (load upper immediate) allows two instruction sequence to load 32-bit immediate branches and jumps floating point operations explict conversion between single and double precision r0 LI: loading a constant is just adding immediate to r0 MOV: register-register move is just adding r0 to a register - MIPS Control Flow Instructions compare two registers, store result in third register set-equal, set-not-equal, set-less-than (really just ALU operation) control via jumps and branches 4 jumps, choose between PC-relative, absolute choose between plain or jump-and-link branches test register for zero or non-zero test register for negative test registers equal test floatint-point status register predicated instructions branches cause problems for pipeline change branches into conditional ALU operation instead MIPS example: MOVZ R1,R2,R3 (move r2 to r1 if r3 is zero) - MIPS Floating-Point Operations move without type or precision conversion move with precision conversion move with type conversion to/from integer add,sub,mul,div FP compare sets floating point status register FP branch uses FP status registeer, testing true or false paired single (for graphics) special ops for operating one two 32-bit flats in one register MADD (for DSP) multiple-add for fast dot product - MIPS Instruction Set Usage Figure 2.32 SPECint2000 MIPS instruction usage load 26% store 10% add 19% compare 5% cond branch 12% and 4% or (including mov) 9% Figure 2.33 SPECfp2000 MIPS instruction usage load 15% add 23% load imm 5% cond branch 4% load FP 15% store FP 7% add FP 7% mul FP 8% 2.13 Another View: The Trimedia TM32 CPU 136 - media processors Figure 2.35 example apps, benchmark data communication - Viterb deconding audio coding - AC3 decode video coding - mpeg2 encode, dvd decode video processing - layer natural motion, dynamic nouse, reduction, peaking graphics - 3d renderer library - Trimedia TM32 CPU set-top boxes, advanced TVs 1.) 128 32-bit registers, integer or fp 2.) partitioned ALU (or SIMD) instructions for operations on narrow types 3.) saturating DSP arithmetic, in addition to twos complement 5-way VLIW many NOPs, uses compressed instructions, decoded in I-cache 2-3x code size over MIPS, even with with compression hand coded EEMBC apps have many pack/merge instructions to align data for SIMD 2.14 Fallacies and Pitfalls 142 - Pitfall Designing a "high-level" instruction set feature specifically oriented to supporting a high-level language structure fixing the "semantic gap" often results in a "semantic clash" too much overhead in general case that compiler might be able to optimize away standarize calling convetion in ABI, not in ISA - Fallacy There is such a thing as a typical program. 
- Pitfall Innovating at the ISA to reduce code size with accounting for the compiler Compilers often target fastest code, not smallest code - Pitfall Expecting to get good performance from a compiler for DSPs - Fallacy A architecture with flaws cannot be successful 80x86 1.) PC compatible big marketplace factor 2.) Moore's law allowed internal RISC design 3.) high volulmes cover extra design cost - Fallacy You can design a flawless architecture Trade-offs in HW/SW VAX (1975) did not foresee need for lowcost decoding and pipeling RISC branch delay slots all architectures have problems with size of address space over time Planning for the future means losing efficieny now why not go from 32-bits to 128-bit addressing? 2.15 Concluding Remarks 147 - Supporting high level languages 60s stack architectures 70s reduce software costs by moving into hardware (CISC) 80s compilers bring RISC revolution 90s address size doubles optimization of conditional branches via conditional execution optimization of cache performance via prefetch support for multimedia faster floating-point operations 00s long instruction words increased conditional execution blending of general-purpose and DSP architectures 80x80 emulation - the role of architect 60s computer arithmetic 70-85 ISA design (until binary compatibility mattered) today design and evaluation of full computer systems 2.16 Historical Perspective and References 148 - accumulator univac I, edsac, IAS computers - general-purpose register computers first: pegasus by Ferranti, 1956, 8 registers with R0 always 0 - Stack Architectures Burroughs B5000 (1963) first real HW/SW tradeoffs support for ALGOL and OS (MCP) written in high-level language first US computer with VM Burroughs B6500 (1968) added hardware managed activation records top two stack elements in processor, rest in memory good code density but only two high speed locations IBM 360 (1964) and PDP-11 (1970) desginers argue against stack arch 1.) performance is derived from fast registers, not the way they are used 2.) the stack organization is too limiting and requires many swap and copy operations 3.) the stack has a bottom, and when placed in slower memory there is a performance loss Only remnants are 80x87 floating point stack Java bytecode ISA is a stack, but JIT compiled to regular architectures - Computer Architecture Defined IBM coined the term "computer architecture" (Amdahl et al) described the move to a singal ISA for the 360 line the begining of binary compatibility IBM/360 ISA: first widespread with byte addressing and general purpose registers CDC 6600 (Thorton, Cray et al) early pipelined architecture fist general-purpose, load-store architecture figured out in the 60s link between simple ISA and efficient pipelining - High-Level Language Computer Architecture Andrew Tanenbaum argued in 1978 for stack architecture :) VAX, the ultimate CISC, super-othogonal design for compiler writers HLLCA (high level language computer architecture) code size problem disappeared, allowing RISC... 
- Reduced Instruction Set Computing Ditzel and Patterson (1980) argued to lower the operations targeted by compiler Clark and Strecker (1980) of VAX argued the opposite CDC 6600 wins Berkely RISC, IBM 801, Stanford MIPS IBM 801 (John Cocke) 24-bit experimental machine Berkeley RISC (Patterson) supported hybrid 16/32-bit instruction encodings smalltalk and Lisp support Stanford MIPS (Hennessy) (Microprocessor with Interlocked Pipeline Stages) Hennessy explains performance with low CPI of 2 vs 10 for VAX Commerical MIPS R2000 1986 HP PA-RISC 1989 IBM RT-PC 1986 IBM RS 6000 1990 super-scalar Sun SPARC (Berkeley RISC-II) 1988 PowerPC (Apple, IBM, Motorola) RISC wins on designs, RISC wins on volume embedded processors are 90% RISC (embedded volume 2x PC volume) - A Brief History of Digital Signal Processors 80s first designs, most not commercial successful, including from Intel NEC uPD7710 1980 first in volumne, weak tools AT&T DSP1 1980 used within AT&T but later sold outside TI DSP TMS32010, strong tools, solid success mid80s-mid90s evolution of "conventional DSP" 1995+ TI TMS320C62xx family VLIW (1997) SIMD designs Today very commerical 32-bit microprocessor has some DSP meatures system-on-a-chip (90s) processor core, memory, app specific HW, peripherals, I/O interfaces example: second generation cellphones - Multimedia Support in Desktop Instruction Sets color 8-bit VGA 16-bit high color RGB = 5-bits, 6-bits, 5-bits 32-bit true color RGBA sound 16-bit samples, 24-bit for professional Intel i860 partitioned ALU for 8/16/32/64 bit operations useful for video, 3d graphics, digital photography, audio/image processing names partitioned ALU, subword parallelism, vector, SIMD - Summary CISC is dead (VAX and Intel iAPX 432) Long live CISC (80x86) embraced and extend RISC binary compatibility is king VLIW? Itanium? Transmeta? NO Exercises 161 Chapter 3 Instruction-Level Parallelism and Its Dynamic Exploitation 3.1 Instruction-Level Parallelism: Concepts and Challenges 172 - Intro all processors since 1985 have a pipeline to improve performance ILP = potential overlap among instructions limits due to data and control hazards dynamic (hardware) and static (compiler/vliw) scheduling dynamic examples Intel Pentium III, Pentium 4 AMD Athlon MIPS R10k R12K Sun UltraSPARC III PowerPC 603, G3, G4, Alpha 21264 Pipeline CPI = Ideal piple CPI + Structural stalls + Data hazard stalls + Control stalls Figure 3.1 major techniques for lower CPI TECHNIQUE REDUCES SECTION Forwarding and bypassing Potential data hazard stalls A.2 Delayed branches and simple branch scheduling Control hazard stalls A.2 Basic dynamic scheduling (scoreboarding) Data hazard stalls from true dependencies A.8 Dynamic scheduling with renaming Data hazard stalls and stalls from antidependences and output dependencies 3.2 Dynamic branch prediction Control stalls 3.4 Issuing multiple instructions per cycle Ideal CPI 3.6 Speculation Data hazard and control hazard stalls 3.7 Dynamic memory disambiguation Data hazard stalls with memory 3.2 3.7 Loop unrolling Control hazard stalls 4.1 Basic compiler pipeline scheduling Data hazard stalls A.2 4.1 Compiler dependence analysis Ideal CPI, data hazard stalls 4.4 Software pipelining, trace scheduling Ideal CPI, data hazard stalls 4.3 Compiler speculation Ideal CPI, data, control stalls 4.4 - What is Instruction-Level Parallelism? 
basic block - not much ILP convert loop-level parallelism to ILP statically via loop unrolling dynamically via hardware (or use vector instructions) - Data Dependencies and Hazards need to determine if two instructions are in fact parallel at ISA level (free from R/W dependencies) even if so, structural hazards may prevent parallel execution Data Dependencies instruction i is "data dependent" on instruction j if either: - instruction i produces a result that may be used by instruction j - instruction j is data dependent on instruction j, and instruction k is data dependent on instruction i hardware with interlocks can stall to avoid dependencies or software can schedule to avoid pipeline dependencies if part of ISA dependencies are part of programs but hazards are a property of pipeline organization dependencies indicate potential for hazard but actual hazard and length of stall are property of pipeline dependencies 1.) indicate potential hazard 2.) determine order of evaluation 3.) upper bound on potential parallelism overcoming dependencies 1.) maintaining the dependence but avoiding the hazard scheduling instructions to avoid hazards... 2.) eliminating the dependence by changing the code sometimes by software, but also by hardware register data dependencies are relatively easy to spot, but are complicated by branches memory dependencies caused by aliasing are much more difficult to disambiguate Name Dependencies name dependence is when two instructions used the same register or memory location (called a name) but there is no flow of data between the instructions associate witht that name 1.) antidependence - when j writes an register/memory that i reads - i needs old value 2.) output dependence - when i and j write the same register/memory - j needs to win register renaming, either by compiler or hardware, can eliminate name dependencies Data Hazards we must preserve program order but only where it affects the outcome of the program RAW j tries to read a source from i writes it, j incorrectly gets old value true data dependence example load followed by a ALU operation in 5 stage pipeline WAW j tries to write an operand before it is written by i output dependence present if write in multiple stages present if later instructions can proceed past earlier stalled instructions WAR j tries to rwite a destination before it is read by i, i incorrectly gets new value antidependence cannot occur in most static issues pipelines because reads are done early and writes are done late need to have instructions that write results realy and some that read late or when instructions are reordered RAR not a hazard - Control Dependencies "control dependence" determines the order of instruction i so that is in correct program order with respect to a branch 1.) an instruction that is control dependent on the branch cannot be moved before the branch so that its execution is no longer controlled by the branch (move "then" code before "if") 2.) an instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch (move code before "if" to "then") Simple pipeline 1.) instructions execute in program order 2.) 
pipeline prevents execution of dependant instructions until branch direction is known control dependence is not fundamental limit we can execute extra instructions so long as they do not affect program correctness however "exception behavior" and "data flow" must be preserved exception behavior reordering must not cause any new exceptions moving load before branch cannot cause exception if branch ends up skipping load data flow branches make data flow dynamic sometimes have to evaluate branch before we know "predecessor" for data flow input sometimes violating control dependence has no affect on correctness example move operation before branch if target location is dead after branch an example of compiler speculation delayed branches compiler preserves the data flow 3.2 Overcoming Data Hazards with Dynamic Scheduling 181 simple static pipeline if data dependence cannot be hidden via bypassing or forwarding, stall the pipeline dynamic scheduling rearrange ordering while preserving exception behavior and data flow handles dependencies unknown at compile time, such as memory aliasing compiler doesn't have to model pipeline and schedule accordingly and output code can run on different pipeline implementations... compiler static scheduleing separate dependant instructions to avoid hazards Dynamic Scheduling: The Idea simple pipeline in order issue and execution structural and data hazards all checked during instruction decode dynamic pipeline check for structural hazards during instruction decode dynamically wait for data hazards out-of-order execution implies out-of-order completion causes WAR and WAW hazards which simple pipeline avoids exception behavior allow imprecise execptions may have already complete later instructions or not started earlier instructions make it hard to restart execution after exception better to maintain precise exceptions (use in-order commit (aka in-order retirement)) converting the pipeline IF fetch into buffer or queue ID 1.) issue deocde instructions, check for structural hazards 2.) read operations - wait until no data hazards, then read operands EX potentially multiple cycles depending on operation "begins execution" -> "in execution" -> "completes execution" parallel execution needs 1.) multiple execution units OR 2.) pipelined execution units OR 3.) both scoreboarding allows out-of-order execution if sufficient resources and no data dependencies Tomasulo's algorithm several major enhancements over scoreboarding Dynamic Scheduling Using Tomasulo's Approach From IBM 360/91 floating-point unit tracks when operands are available to minimize RAW hazards introduces register renaming to minimize WAW and RAW hazards designed to overcome few (4) FP registers and long memory access and FP execution times (sounds like modern x86: few registers, big memory wall, long pipelines for many ALU operations) RAW hazards avoided by delaying for operands WAR and WAR are handled by register renaming register renaming effectively reintroduces compiler temporary names from before register allocation register renaming in Tomasulo's algorithm uses reservation stations - buffers for operands issue logic - picks correct reservation station for execution output sucessive writes to the same register go to different stations, only last need to to actual register removes name dependencies caused by lack of unique names for compiler to use (lack of free registers) centralized to decentralized control 1.) 
hazard detection and execution control are distributed information held in reservation stations at each function unit determine when execution can start at that unit 2.) results are passed directly to functional units from the reservation station rather than going through registers common data bus allows all units waiting for a result to be loaded simultaneously in pipelines with multiple execution units and issuing multiple instructions per cycles, more than one bus needed reservation station contains instruction to be issues either the values of the operand if already computed or names of the reservation stations that will provide the operand load and store buffers act like reservation stations store buffer takes values from common data base load does not take input but does output to common data bus stages of Tomasulo's algorithm 1.) Issue - get instruction from queue - if available reservation station at functional unit, send there, otherwise stall (structural hazard) - if operands available, fetch values from regisers from and send to selected reservation station otherwise note which other reservation station's results will provide operand value This renames registers, removing WAR and WAW hazards 2.) Execute - if operands not available, just keep watching bus - if result available, fill operand value - if all operands available, execute instruction This waiting removes RAW hazards NOTE: multiple instructions could become ready in the same cycles for floating point, the choice of which to execute can be arbitrary loads and stores are maintained in program order to prevent hazards through memory loads only have one operand, the address, and execute when memory unit is available stores need two operands, value and address exception behavior no instruction starts execution until all previous branches are executed wait until branch prediction is verified alternative is to note exception but not raise it until write result more on this in speculation below 3.) Write result - write result to common data bus (CDB) - and to any needed registers - and to any needed reservation stations (include store buffer) - stores also write to memory unit in this step when both operands are available reservation station tags indicates name of reservation station or load buffer entry in place of register names because there are more reservation stations that actual regisers, we have no WAW and WAR other approches use extended register set or reorder buffer (more below) reservation station fields Op - the operation to perform Qj Qk - the source reservation stations (zero means satisified) Vj Vk - the operand values A - address information, starts as immediate offset, then stores effective address Busy - marks station as in use register file field Qi - the reservation station that will supply this register (zero means value is current) 3.3 Dynamic Scheduling: Examples and the Algorithm 189 - Advantage of Tomasulo's Algorithm 1.) distribution of hazard logic CBD broadcast allows all waiting operations to continue in parallel instead of waiting to cotending to access register file 2.) 
      2.) the elimination of stalls for WAW and WAR hazards by register renaming
- Tomasulo's Algorithm: The Details
  Figure 3.5 for details
  loads/stores go through a functional unit for effective address calculation before the load/store buffers
  loads take a second "execution" step to access memory and then a "write result" step to send to the CDB
  stores complete in their first "write result" step, which writes to memory
  note all register/memory writes occur in the "write result" step, critical to Speculation in section 3.7
- Tomasulo's Algorithm: A Loop-Based Example
  loop is logically unrolled by the hardware (assuming correct branch prediction, see below)
  reordering memory operations
  - loads/stores to different addresses can be reordered
  - can be reordered if effective address computation is still done in program order
    (note loads can be reordered relative to each other arbitrarily,
     it's the load/store and store/store ordering that matters,
     that is, RAR does not matter, only RAW, WAR or WAW...)
- negative of Tomasulo's Algorithm: cost of hardware
  each reservation station needs a fast associative buffer
  complex control logic
  each extra CDB adds more associative search logic
3.4 Reducing Branch Costs with Dynamic Hardware Prediction 196
  Analogy
    Tomasulo's Algorithm => data hazards
    Branch Prediction => control hazards
  Crucial for multiple issue processors
    1.) branches arrive n-times faster in an n-way processor
    2.) control stalls loom larger with the lower potential CPI
  Static schemes (Appendix A)
  - delayed branches
  - predict not taken
- Basic Branch Prediction and Branch-Prediction Buffers
  names: "branch-prediction buffer" or "branch history table"
  basic mechanism
    table indexed by *lower* bits of branch PC
    entry hints whether the branch was taken or not on the last visit
    - only a hint because multiple PCs can collide on the same entry
    - no comparison of high bits
    - if the prediction is wrong, we just update the entry to predict the opposite next time
    basically a small cache
  minor improvement
    with only 1 bit, the entry is wrong twice for each loop
    with 2 bits, only wrong once for each loop
    (can generalize to an n-bit saturating counter, but 2 bits is good enough)
  where in pipeline?
can lookup bits in IF with instruction address or attach bits to instruction cache doesn't help for 5-state pipeline generally find out whether branch is taken and branch target at the same time accuracy (Figure 3.8) SPEC89 2bits per entry, 4k entries 99% to 82% accuracy (misprediction 1% to 18% of the time) typically better for floating points apps with more regular access patterns impact on programs accuracy is not enough, need to understand frequency of branches in program size matters not (Figure 3.9) 4k is pretty large, and results not much better with infinitely sizied predictor - Correlating Branch Predictors (aka two-level predictor) observation often consecutive branches have correlated behavior 1 bit of correlation look not just at branch PC, but also whether last branch was taken or not taken (1,1) predictor means look at last 1 branch and keep 1 bit of history state (m,n) predictor means look at last m branch and keep n bit of history state (0,2) predictor is our original 2-bit history predictor implementation global history of last m branches a simple shift register, used to index table with low bits of PC accuracy (2,2) often beats (0,infinite) (for same sized table (because 4x the state, only 1/4x the entries) other extreme (12,2) predictor with NO BRANCH PC INPUT can often outperform (0,2) predictor - Tournament Predictors: Adaptively Combining Local and Global Predictors multi-level branch predictors use several levels of branch-predictors tables with algorithm for choosing between multiple predictors effectively make use of more predictor bits (diminishing returns for other predictors even with infinite size) accuracy (Figure 3.19) local 2-bit < 7% misprediction correlating < 4% misprediction tournament < 3% misprediction An Example: The Alpha 21264 Branch Predictor 4k 2-bit counters indexed by branch address to pick between global and local global also has 4k entries indexed by last 12 branches (same (12,2) predictor as above) local is a two level predictor top level 1024 10-bit entries, 1 bit for each of the last 10 branches selected entry used to index a table of 1k entries with 3-bit saturating counters implementation 29k bits total accuracy less than 1 in 1000 instructions for SPECfp95 less than 11.5 in 1000 instructions for SPECint95 3.5 High-Performance Instruction Delivery 209 - Predicting branches not enough, need to deliver 4-8 instructions per cycle for multi-issue processor - Branch-Target Buffer "branch-target buffer" or "branch-target cache" predicts destination 5-state pipeline needs to predict where to branch by end of IF need to guess before ID if instruction is a branch and if so what next PC will be ** With a basic "branch-prediction buffer" accessed during ID we know - branch target at end of ID - fall through address at end of IF - prediction ** With a "branch-target buffer" accessed during IF we know a prediction for target a cycle earlier A real-cache, different than cache predictor we cannot afford mis-predict in case that instruction is not a branch we must compare tag of high order bits to make sure this is a branch Only store taken branches, since we will predict PC+4 if not in branch-target buffer 2-bit predictors complicate this, since we need to track target in taken case even after a non-taken cases PowerPC uses both a target buffer and prediction buffer to solve this performance in 5-stage pipeline in prediction case, we have no branch stall in misprediction case, we have a 2 cycle stall when we find out the bad news in EX IF: check for 
entry in branch-target buffer
        (if not found and it later turns out to be a branch, in EX we update the branch-target buffer)
    ID: if found, fetch predicted PC
    EX: if correct, win! if incorrect, kill, restart IF for the correct target, update branch-target buffer
  variation
    store one or more target instructions (possibly instead of the target address)
    speculation involves actually starting to execute the actual instruction, not just fetching it
    "branch folding"
      zero-cycle unconditional branches, we can just skip to the next instruction after the branch
      sometimes zero-cycle conditional branches if we can use preset condition codes to select the destination
- Integrated Instruction-Fetch Units
  No longer a simple pipeline stage, a whole unit with several features
  1.) Integrated branch prediction
  2.) Instruction prefetch
  3.) Instruction memory access and buffering
  (See also trace caches in Chapter 5)
- Return Address Predictors
  85% of indirect jumps in SPEC89 are for procedure returns
  branch-target buffer ineffective if a common procedure is called from multiple sites
  Instead, use a small buffer that pushes the return address on call, pops on return
  1-16 entries
  accuracy ("li" is the worst of all the SPEC89 apps)
    1   50% miss rate on "li"
    2   35% miss rate on "li"
    4   15% miss rate on "li"
    8    5% miss rate on "li"
    16  <5% miss rate on "li"
- Alternative to branch prediction
  try to reduce the penalty of misprediction
  - prefetch both paths
    dual-ported memory, interleaved cache, or read in sequence (all costly)
3.6 Taking Advantage of More ILP with Multiple Issue 215
- Below CPI=1.0
  Last two sections try to remove data and control hazards to get CPI=1.0
  To go lower we need multiple instructions to be issued per cycle
  multiple-issue processors
    superscalar
    VLIW (aka EPIC, explicitly parallel instruction computing)
  Summary of 5 primary approaches for multiple issue processors (Figure 3.23)
    Superscalar (static)       Sun UltraSPARC II/III
    Superscalar (dynamic)      IBM Power2
    Superscalar (speculative)  Pentium III/4, MIPS R10K, Alpha 21264, HP PA 8500, IBM RS64 III
    VLIW/LIW                   Trimedia, i860
    EPIC                       Itanium (not pure VLIW because dependencies marked in bundles)
- Statically Scheduled Superscalar Processors
  0-8 instructions per cycle
  instructions issue in order with all hazards accounted for at issue time
    any data hazard or structural hazard causes a stall of that instruction and all after it
  issue packet
    the set of instructions actually delivered for issue as a unit
    logically inspected in program order, in practice in parallel with each other
  issue checks typically complex enough to now be multiple pipeline stages
  - increases complexity because now handling multiple issue packets at once
  - first stage typically checks for dependencies within the issue packet
  - second stage typically checks for dependencies with other in-flight instructions
  - harder to pipeline further, making this a potential limiting factor in the pipeline
  - more stages means higher branch penalties, more reliance on good branch prediction
- A Statically Scheduled Superscalar MIPS Processor
  MIPS example, similar to HP 7100
  - two instructions per cycle
    1 is load/store/branch/integer ALU
    2 is floating point
  - fetch 64 bits per cycle
    early superscalars required fixed placement (for example integer/floating in that order), although not common now
    if unaligned access at end of cache block, simple fetchers may deliver only one instruction
      "independent instruction prefetch unit" (above) can solve this problem
  - integer/floating point case simplifies hazard checking
    need to check for integer-to-floating point transfer
      either register-to-register move or
load/store of FP combined with FP ALU operation - hardware implementation issues floating point can quickly become the bottleneck either need multiple FP units or a more pipelined FP unit need dual-ported FP register file to allow FP load/store with simultaneous FP ALU operation need larger bypass paths with multiple instructions feeding each other - imprecise exceptions an FP instruction can finish after a integer instruction that was logically before it an FP instruction exception could be detected after the integer instruction completes solutions - early detection of FP exceptions (appendix A) - software mechanism to restore a precise exception state - delay instruction completion until we know exception is impossible (See 3.7 on speculation) - load latency increased from one clock cycle to two result of load cannot be used on the *same* clock cycle or the *next* clock cycle since we are issuing them in parallel, we logically have an extra instruction delay - Multiple Instruction Issue with Dynamic Scheduling Don't block issue, just ship instructions to reservation stations - tricky since can be dependencies in issue packet 1.) do each instruction in half a clock 2.) complex logic to handle parallel issue to reservation stations while maintaining in order queueing 3.) a little bit of 1 and 2 - simple loop example bottlenecked on integer unit, not enough FP ops to keep it busy 1.) load imbalance between integer and fp unit... 2.) ...that is too much loop overhead, need to unroll (next chapter 4) 3.) control hazard holding up load, if only we could speculate past branch (next section 3.7) 3.7 Hardware-Based Speculation 224 - Limits of branch prediction prediction reduces stalls from control hazards wide issue processor may need to execution one branch per cycle need to overcome control dependence solution - do not limit prediction to instruction fetch - instead, also start execution based on prediction three ideas 1.) dynamic branch prediction (choose what to execute) 2.) speculation (execute prediction without commiting results) 3.) dynamic scheduling (deal with scheduling of different basic blocks) extending Tomasulo's algorithm to handle speculation (as done in PowerPC 603+, MIPS R10k+, Pentium II+, Alpha 21264, AMD K5+) seperate bypassing of rsults among instructions from completion of instructions instruction commit: when a instruction is no longer speculative, allow update to reg/mem also delay exceptions to commit, preserving precise exceptions EXECUTE OUT OF ORDER, COMMIT IN ORDER in 5 stage pipeline, we always had writes/exceptions at end of pipeline now we add another stage to Tomasulo's, "Commit", to act in same manner reorder buffer (ROB) new hardware to hold speculative results before commit stage writes them to real reg/mem replaces reservation stations as source of operands replaces store buffer as well fields instruction type : branch, store, register destination field : register (loads/alu) or address (stores) value field : value until comit ready field : instruction complete (valid value) stores still two part, second part is now done by new "Commit" phase reservations stations still exist used to schedule units however, references for operands now use ROB indexes steps 1.) Issue (aka "dispatch" in some dynamically scheduled processor) issue requires reservation station and ROB entry full ROB a structural hazard, stalls issuing reservation stations operands get register values or ROB pointers 2.) 
Execute (aka "issue" in some dynamically scheduled processor) when operands available, start possible multi-cycle exection 3.) Write Result write result across CDB to waiting reservations stations and ROB store not ready until destination and value calculated other instructions are always ready at this point free reservation station for reuse 4.) Commit: completion, graduation, retirement Commit instructions in order - if head is ready, perform reg/mem write as needed - if head is not ready, stall - if head is correctly predicted branch, branch is done - if head is incorrectly predicted branch, flush ROB and restart with correct instruction exceptions are raised here, in order, as needed free ROB entries as they commit branch misprediction try to handle even before instruction is ready to be commited if possible store optimization value to store can actually pass from write results immediately to waiting stores for commit the same cycle memory aliasing problems WAW and WAR: no issues because of in-order retirement of stores RAW: 1.) stall 2nd half of load if any ROB entry has matching destination 2.) maintain program order for computation of effective address of load with respect to earlier stores bonus points: bypass store value directly to load when RAW detected - Multiple Issue with Speculation "multiple issue" issues (instruction issue and monitoring CDB's for completion) just keep getting harder in addition, need to support multiple commits per cycle to have CPI<1.0 - Design Considerations for Speculative Machines Register Renaming versus Reorder Buffer replace ROB with set of physical registers larger than the set of architectural registers maintain mapping of architectural registers to physical registers commit is simplifed update map of architectural to physical free old physical register that was held previous architectural value freeing physical registers is a bit complex 1.) need to make sure it is not used for an architectural register AND need to make sure it is not an operand of any functional unit queue OR 2.) just wait until the original architectural register is written again consumes physical register a bit longer but simpler, used in MIPS R10k issue is simplified don't need to check ROB and registers for operand values checkpointing register operations for context switch etc needs a bit of extra logic sometimes (IA-64!?!) physical registers are exposed architecturally as well!?! usage: MIPS R10k/R12k, Alpha 21264, P2,3,4 all use register renaming 20-80 extra registers replace primary function of ROB largely determine how many instructions can be in execution (b/w issue and commit) at one time How Much to Speculate danger in speculating many costly exceptional events such as cache misses in parallel ties up other system resources, potentially hard to stop if misprediction is found often just limit to one such costly event to be speculated at a time examples: L2 miss, TLB miss Speculating through Multiple Branches three reasons for Speculating through Multiple Branches 1.) very high branch frequency 2.) significant clustering of branches 3.) 
long delays in functional units slight complication of speculation recovery harder to predict and speculative on more than one branch per cycle POWER2 can resolve two branches per cycle but not speculate on any other instructions as of 2001, no processor can provide full speculation with resolving multiple branches per cycle 3.8 Studies of the Limitations of ILP 240 - Limits of ILP pipelining started in 60s, responsible for 80s and 90s performance growth even given "perfect" hardware, their are limits to ILP some assumptions on compiler technology may change perhaps some hope from new tech like value prediction however, realistic hardware has problems before fundamental limits - The Hardware Model assumptions 1.) register renaming - infinite registers 2.) branch prediction - perfect prediction 3.) jump prediction - perfect prediction 4.) memory address alias analysis - addresses known exactly and independent load can move before store implications 2/3 remove control dependencies 1/4 remove all but true data dependencies other assumptions no restrictions of types of instructions issued all functional units 1 cycles perfect caches, 1 cycle load/store methodology MIPS optimizing compiler create trace trace analysis to create optimal schedule - Limits on the Windows Size and Maximum Issue Control checking dependencies in "n" instructions needs "n^2-n" comparisons to issue 50 instructions needs 2450 comparisons 2000 instructions needs 4 million "window" size (how many instructions analyzed for simultaneous execution) comparisons = max completion rate * windows size * number of operands total windows size limited by storage, comparisons, and issue rate makes larger window less helpful large window sizes are impractical and inefficient * assume window size of 2k entries * assume maximum issue capability of 64 instructions per clock - The Effects of Realistic Branch and Jump Prediction * assume ambitious tournament predictor that uses two levesl of prediction and a total of 8k entries - The Effects of Finite Registers * assume 256 integer and 256 floating point - The Effects of Imperfect Alias Analysis three models 1.) global/stack perfect perfect predictions for globals and stack, but no heap 2.) inspection assume offsets from same register are not aliases assume addresses in different segments not aliases 3.) none all addresses could conflict NOTE: for fortran without heap, 1==2 results (Figure 3.43/3.44) 1 does very close to ideal 1 factor of 2 better than 2 2 not much better than 3 hardware limits to alias analysis 1.) to disambiguate load, need to know memory address of earlier stores use memory address speculation however, when wrong, need to redo 2.) only a small number can be disambiguated per clock cycle 3.) number of load-store buffers determines how much ealier or later we can move access 3.9 Limitations on ILP for Realizable Processors 253 assume 1.) 64 instructions per clock with no issue restrictions 2.) tournament predictor with 1k entries and 16 entry return predictor 3.) perfect disambiguation of memory references dynamically 4.) register renaming with +64 integer and +64 fp registers (in addition to architected registers) vary window size from 4 to infinite results integer programs reach limits pretty early, curve pretty flat approx 10-15 even in infinite case floating point could use more window size, scaling all the way to infinite - Beyond the Limits of This Study other limitations in the perfect model 1.) 
WAR and WAW hazards through memory speculating through functional call/return/call causes unneeded dependencies on stack frame need to rename stack memory as well 2.) Unnecessary dependences don't really need a loop dependency to do i++ compiler can help here return address register and register stack pointer also cause unnecessary dependencies can't speculate on two two parallel function calls 3.) overcoming the data flow limit value prediction... assume function does not return error code or exception... help for less than perfect processors 1.) address value prediction and speculation easier than value prediction 2.) speculating on multiple paths are HW resources better spent on a bigger predictor? 3.10 Putting It All Together: The P6 Microarchitecture 259 - Intel P6 Microarchitecture Pentium Pro, Pentium II, and Pentium III differences some have MMX and SSE clock rate cache architecture memory interface IA-32 decode translates IA-32 into uops (micro-ops, aka RISC instructions) 3 IA-32 instructions fetched/decoded/translate per cycle if CISC instruction needs more than 4 uops microcode sequence generates uops over multiple cycles max uops per cycle=6 4 for 1st instruction of 3, other two for 2nd and 3rd IA-32 instructions uops execution OOO specualtive pipeline with register renaming and ROB 3 utops per clock can be renamed and dispatched 14 stage pipeline - 8 stages for in-order instruction fetch, decode, and dispatch branch predictor: 512-entry, 2-level 40 virtual registers 20 reservation stations 40 ROB entries - 3 stages for OOO exuction in 1 of 5 functional units 1-integer, 2-FP, 3-branch, 4-memory address, 5-memory access integer ALU can be as little as 1 cycle floating point divide is 32 cycles - 3 stages for instruction commit - Performance of the Pentium Pro Implementation Pro has smaller caches than II, but faster, high BW secondary cache What are sources of stalls largely failure to keep above pipelines busy sometimes because of particilar instructions, sometimes hardware resource limits Stalls in the Decode Cycle integer programs have more instruction cache misses than FP floating point programs have more HW resource capacity stall Data Cache Behavior out of order can mask some l1 miss latency, but not l2 Branch Performance and Speculation Costs branch target buffer (BTB) dominates mispredict frequency so we know which way to go, but not where to go perhaps a bigger BTB, with slightly worse predict rate would be better Putting the Pieces Together: Overall Performance of the P6 Pipeline SPECint CPI is 1.15 SPECfp CPI is 2.0 - The Pentium III versus the Pentium 4 Pentium 4 uses "NetBurst" microarchitecture some similarites fetch up to 3 IA-32 instructions per cycle graduate up to 3 uops per cycle differences much deeper pipeline example integer add: p6: 10 cycles netburst: 20 cycles, including 2 wire only cycles register renaming vs ROB 128 outstanding results vs 40 7 vs 5 integer units extra integer ALU and address computation unit aggresive ALU (2x clock) and aggressive data cache ALU 1/2 cycle vs 1 cycle loads 2 vs 3 cycles trace cache to improve instruction fetch performance versus prefetch buffer and instruction cache BTB is eight times larger and has better predictor algorithm data cache smaller (8k vs 16k) but larger 256k L2 with higher bandwidth SSE2 2 floating point operations per instruction considerable advantage over PIII on floating point A Brief Performance Comparison of the Pentium III and Pentium 4 better absolute performance, especially for FP however, higher 
CPI for integer apps, reasons: 1.) clock overhead because of skew and jitter 2.) deeper pipeline alternative wider issue width instead of deeper pipeline (what is unsaid, basically Intel MHz marketing decided architecture) 3.11 Another View: Thread-Level Parallelism 272 - TLP could be less costly than ILP especially for some workloads... - existing desktop code lacks TLP in software, harder to find, focus remains on ILP... 3.12 Crosscutting Issues: Using an ILP Data Path to Exploit TLP 273 - multithreading (aka simultaneous multithreading (aka Intel hyperthreading)) reuse ILP resources for TLP 3.13 Fallacies and Pitfalls 273 - Fallacy Processors with lower CPI will always be faster - Fallacy Processors with faster clock rates will always be faster Power3-II 400MHz beats Pentium III 800MHz on SPECfp 344 to 237 of course Pentium beats Power on SPECint 344 to 249 - Pitfall Emphasizing an improvement in CPI by increasing issue rate while sacrificing clock rate can lead to lower performance HP PA-7100 with simple dual issue (99MHz) beats TI SuperSPARC flexible three issue (40MHz) Simulation is to blame, rewards of CPI look good, hard to estimate clock impact 1.) hard to know clock rate of an approach until design is well under way 2.) simulation tools for improving for CPI are really good compared to improving cycle time - Pitfall Improving only one aspect of a multiple-issue processor and expecting overall performance improvement Amdahl's law... better predictor may not matter if other resources are limiting facto - Pitfall Sometimes bigger and dumber is better simple 2-bit predictor better than Alpha tournament predictor for OLTP larger code size, higher branch frequency better local predictions 3.14 Concluding Remarks 276 - typically doubling clock is better than doubling issue width however when this is just because of really deep pipelining result is less sure Pentium 4 tradeoff decision - Practical Limitations on Exploiting more ILP more transistors, wire delays for clock and critical path longer clock needed or more stages like Pentium 4 but that costs more branch delays etc Power! Power! Power! 
more switching transisters switching more frequency power/performance is rising rapidly 3.15 Historical Perspective and References 280 - IBM 360 Model 91 tagging of data register renaming dynamic detection of memory hazards generalized forwarding Tomasulo's algorithm branch prediction complexity made it late to market outperformed by Model 85 that had IBM's first cache - Branch-Prediction - The Development of Multiple-Issue processors 70s IBM ACS (surpass Stretch, beat CDC 6600/6800) John Cocke, ..., Amdahl 80s IBM America "superscalar" => Power1 90s Johnson and AMD K-5, speculative superscalar - Studies of ILP and Ideas to Increase ILP - Going Beyond the Data Flow Limit "value prediction" deciding when to speculate also important - Recent Advanced Microprocessors Exercises 288 Chapter 4 Exploiting Instruction-Level Parallelism with Software Approaches 4.1 Basic Compiler Techniques for Exposing ILP 304 - Basic Pipelining Scheduling and Loop Unrolling Need to find sequences of unrelated instructions to be overlapped in pipeline Need to push dependent instructions apart enough to tolerate cycle latency For example 5 stage pipline, we assume: Figure 4.1 Instruction Instruction Latency producing using in result result clock cycles ------------------------------------------------------------------------ FP ALU op Another FP ALU op 3 FP ALU op Store double 2 Load double FP ALU op 1 Load double Store double 0 Loop Unrolling Simple C loop adding constant to a every vector element for (i=100; i>0; i--) x[i] = x[i] + 5; MIPS (optimized with strenth reduction of *8 to -8, elimination of seperate i variable) loop: l.d f0,0(r1) ; f0=array element add.d f4,f0,f2 ; add scalar in f2 s.d f4,0(r1) ; store result daddui r1,r1.#-8 ; decrement pointer 8 bytes bne r1,r2,loop ; branch r1!=r2 with stalls, 10 cycles: loop: l.d f0,0(r1) ; f0=array element stall add.d f4,f0,f2 ; add scalar in f2 stall stall s.d f4,0(r1) ; store result daddui r1,r1.#-8 ; decrement pointer 8 bytes stall bne r1,r2,loop ; branch r1!=r2 stall reordering to reduce stalls, 6 cycles (3 cycles of overhead) loop: l.d f0,0(r1) ; f0=array element daddui r1,r1.#-8 ; decrement pointer 8 bytes add.d f4,f0,f2 ; add scalar in f2 stall bne r1,r2,loop ; branch r1!=r2 s.d f4,0(r1) ; store result overhead 3 cycles to do loop body (load/add/store) 3 cycles of loop overhead unroll of basic loop: 7 cycles per element (4 of overhead) loop: l.d f0,0(r1) add.d f4,f0,f2 s.d f4,0(r1) l.d f6,-8(r1) add.d f8,f6,f2 s.d f8,-8(r1) l.d f10,-16(r1) add.d f12,f10,f2 s.d f12,-16(r1) l.d f14,-24(r1) add.d f16,f14,f2 s.d f16,-24(r1) daddui r1,r1,#-32 bne r1,r2,loop schedule unrolled loop: 3.5 cycles per element (only .5 overhead) loop: l.d f0,0(r1) l.d f6,-8(r1) l.d f10,-16(r1) l.d f14,-24(r1) add.d f4,f0,f2 add.d f8,f6,f2 add.d f12,f10,f2 add.d f16,f14,f2 s.d f4,0(r1) s.d f8,-8(r1) daddui r1,r1,#-32 s.d f12,-16(r1) bne r1,r2,loop s.d f16,8(r1) ; 8-32=-24 - Summary of Loop Unrolling and Scheduling Example Necessary Decisions and Transformations 1.) determine it was legal to move s.d after daddui and bne and adjust the s.d offset 2.) determine that unrolling the loop would be useful by finding that iterations were independent (excluding maintenance code) 3.) use different registers to avoid unnecessary constraints that would be forced by using the same registers 4.) eliminate the extra test and branch instructions and adjust the loop termination and iteration code 5.) 
determine that the loads and stores in the unrolled loop can be interchanged (independence and aliasing analysis) 6.) schedule the code, preserving any dependences needed to yield the same result as the original code Example removing daddui removes data dependency between r1 Example Use register renaming to remove dependencies between "iterations" Three different types of limits to gains 1.) always some loop overhead 2.) growth in code size 3.) register pressure - Using Loop Unrolling and Pipeline Scheduling with Static Multiple Issue We can unroll the loop 5 times and schedule L.D in parallel with ADD.D (integer and fp operations) 2.4 clock cycles per element (versus 3.5 for non-superscalar) 4.2 Static Branch Prediction 313 - See also MIPS branch delay slot in Appendix A) - Fill stall slots with code that has a control dependence 1.) either because its safe to do (target register unused in target block) 2.) or if we are sure we will take it, we can undo the op in unlikely case that we do not - Compile time prediction schemes 1.) always predict as taken 2.) predict backward as taken (loops...) and forward as not taken 3.) predict backward as taken (loops...) and forward as taken (better for SPEC) 4.) use some program context info to improve 2 and 3 5.) use profile information 4.3 Static Multiple Issue: The VLIW Approach 315 - Approaches to multiple issue dynamic super-scalar flexible but lots of hardware static super-scalar easier to implement buts needs some compiler assistance VLIW HW does not need to think, compiler does all the work - The Basic VLIW Approach Each component of VLIW instruction not as wide as RISC instruction because position implies unit Scheduling code for issue local scheduling for loop unrolling etc global scheduling dealing accross branches trace scheduling (type of global) (section 4.4) Unrolled loop example (5 way VLIW with 2-mem, 2-fp, 1-integer) 1.29 cycles per result (versus previous best 2.5 for static supersclar) only 60% of slots used Code bloat causes 1.) from unrolling to generate parallelism 2.) from fillerNOOPs solutions - limit to one immediate field per packet - compress code see also 4.7 and 4.8 lock step operation seems simpler... if one unit suffers a cache miss, all stall... newer VLIW designs only assume no dependencies in issue packet do manage dependencies from stalls of independent functional units binary compatibility typically none in older VLIW either with pre-VLIW code or with one VLIW generation to another, as more units are added solutions translation/emulation IA-64 defines VLIW instructions independent of internal architecture multiple-issue (even not VLIW) versus vector processor multiple-issue can extract some paralellism even from irregular code vector good for very regular numerical programs as extension to multiple-issue processor 4.4 Advanced Compiler Support for Exposing and Exploiting ILP 319 - Detecting and Enhancing Loop-Level Parallelism typically done at or near source level (on loops themselves) basic ILP is typically scheduled at lower level "loop-carried dependence" analysis figuring out dependencies between loop iterations Types of dependencies some depend on previous iteration A[i+1] = A[i] + ... // S1 have to execute in order (S1 depends on S1) sometimes dependence does not prevent parallelism A[i] = A[i] + B[i] // S1 B[i+1] = C[i] + ... 
// S2 although S2 depends on S1, there is no cycle like S1 depends on S1 dependence analysis sometimes inexact dependence "may" exist recurrences (depending on earlier iterations) Y[i] = Y[i-1] + ... // S1 depends on S1 dependences force ordering of successive iterations (because "dependence distance" is 1 however, if distance is greater (say 5) Y[i] = Y[i-5] + ... // S1 depends on S1 then we can still unroll and find significant parallelism Finding Dependencies affine array indexes expressible as a*j+b (or c*k+d) gcd test for two elementsa if a loop-carried dependence exists, then GCD(c,a) must divide (d-b) sparse array indicies (x[y[i]]) are not affine classifying dependence true dependencies false dependencies antidependences output dependence solve with renaming array oriented dependence analysis useless for: - objects referenced via pointers - array index is inderect through another array (sparse array case) - input data dependent dependence (why not generate code for both cases...) - when an optimization needs to know more than possiblity of dependence, but needs to know exactly when "points-to" analysis relies on: 1.) type information 2.) information derived from when a object is allocated or when the address of an object is taken (if p always points to object allocated at line x and q was taken before that line) 3.) information derived from pointer assignments successes - determine if two parameters point to the same object - determine what type of data currently pointed to by pointer (for unions, classes, etc) - separate local vs global pointers limitations 1.) pointer arithmetic 2.) "interprodecural analysis" difficulty modern programming languages (Java) better at fixing both limitations Eliminating Dependent Computations eliminate or reduced dependences through back substitution copy propagation turn daddui r1,r2.#4 daddui r1,r1.#8 to daddui r1,r2.#8 tree height reduction turn add r1,r2,r3 // S1 add r4,r1,r6 // S2 depends on S1 add r8,r4,r7 // S3 depends on S2 to add r1,r2,r3 // S1 add r4,r6,r7 // S2 add r8,r1,r4 // S3 depends on S1 recurrence unrolling: sum = sum + x; creates: sum = sum + x1 + x2 + x3 + x4 + x5; change to: sum = ((sum + x1)) + (x2 + x3)) + (x4 + x5); basically tree height reduction similar solution to array index calculations - Software Pipelining: Symbolic Loop Unrolling software pipelining interleave instructions from different loop iterations separates dependent instructions from within one loop iteration software counterpart to Tomasulo's algorithm example (minus start-up/wind-down code) loop: s.d f4,16(r1) ; stores into M[i] add.d f4,f0,f2 ; adds to M[i-1] l.d f0,0(r1) ; loads M[i-2] daddui r1,r1.#-8 ; bne r1,r2,loop ; usually tricky to get register usage correct much better code density than straight loop unrolling loop unrolling can still reduce loop overhead instructions (can use both together...) - Global Code Scheduling moving code between basic blocks for better scheduling example move assignment after "if" in inner loop before if can we be sure that body of "if" does not affect assignment? 
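      a minimal C sketch of this motion (hedged illustration, not the book's code; arrays a, b, c and scalar x are hypothetical):

        /* before scheduling: the write to c[i] sits below the branch,
           so it cannot start until the branch resolves */
        void before(int *a, int *b, int *c, int x, int n) {
            for (int i = 0; i < n; i++) {
                a[i] = a[i] + 1;
                if (a[i] != 0)
                    b[i] = b[i] + a[i];   /* the "if" body */
                c[i] = x;                 /* candidate to hoist above the if */
            }
        }

        /* after global scheduling: legal only if the "if" body can never
           write c[i] or x, i.e. b and c provably do not alias */
        void after(int *a, int *b, int *c, int x, int n) {
            for (int i = 0; i < n; i++) {
                a[i] = a[i] + 1;
                c[i] = x;                 /* hoisted; its latency now overlaps the branch */
                if (a[i] != 0)
                    b[i] = b[i] + a[i];
            }
        }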
potential use conditional or predicated instructions to make changes safe, especially to preserve exception behavior Trace Scheduling: Focusing on the Critical Path use profiling to find common case in inner loop conditionals "trace selection" determins the likely path, aka "trace" unroll loop "trace compaction" schedule trace for common case if common case is found to be false at runtime jump to more expensive code, may have to compensate for operations moved early in common case trace code rturn downsides: exit and return to trace can lead to very complicated exception code Superblocks similar to trace scheduling however on exception to common case, just bail out and don't return until after "superblock" "superblock" contains common case fast path, exception only slow path "tail duplication" creates duplicate code in both common case and exception paths prevents having to return after exception as was needed in trace scheduling 4.5 Hardware Support for Exposing More Parallelism at Compile Time 340 - Conditional or Predicated Instructions processor can evaluate condition/predicate in parallel with result only commit result if condition turns out to be true example (MIPS conditional move) before bnez r1,L add u r2,r3,r0 L: after CMOVZ r2,r3,r1 aka "if conversion" in vector processors helps eliminates simple branches, simplifying branch issues in pipeline not so useful for conditional execution of longer blocks of code full predication all operations can controlled by predicate convert if/then/else into single block with partial exec via predicate removes all non-loop based branches need to preserve proper exception behavior defer exceptions until condition is resolved, similar to out-of-order execution "annulling" - when to cancel instructions with false predicates early annulling need to know condition early to avoid hazard => pipeline stalls not used in practice for this reason late annulling consumes function unit resources => potential negative for performance complicates forwarding implementation conditional/predication summary pros useful for short alternative control flows eliminate some unpredictable branches reduce overhead of global code scheduling cons annulled instructions waste resources fetch cost at minimum can tie up functional units from other uses (although might be idle units anyway) most useful when predicate can be evaled early otherwise we stall for data hazard might have been better of with branch predictor in this case use limited with complex alternatives need to calculate compound predicates, taking extra cycles speed penalty for conditional over unconditional potentially more cycles or slower clock rate in practice MIPS, Alpha, PowerPC, SPARC, Pentium+ conditional move only Itanium full predication for all instructions - Compiler Speculation with Hardware Support move speculative instructions early not just before branch (like predication) but before condition evaluation needs: 1.) find instructions to move early without affecting data flow (COMPILER) 2.) ignore exceptions until non-speculative (HW) 3.) speculatively interchange loads and stores with potential dependencies (HW) Hardware Support for Preserving Exception Behavior 1.) HW and OS cooperatively ignore exceptions okay for "correct" programs, but SIGSEGV in buggy program might be lost, receiving undefined value consumes registers to hold speculative results no speculative stores 2.) 
special speculative instructions that require explicit exception check two of the same issues as above: consumes registers to hold speculative results no speculative stores extra code 3.) poison bits on result registers that cause exception when bad speculative value read non-precise exceptions, since exception appears to happen on later read no speculative stores special OS support for context switching "poison bits" 4.) hardware buffering for speculative results reorder buffer approach need to indicate in speculative instruction: - how many branches the speculative instruction was moved accross - what the prediction was for the branch compiler still responsible for register renaming Hardware Support for Memory Reference Speculation - "special" instruction at site of load - move speculative version of load early - similar to ll/sc, if we see a store to speculative load address, need to cope - if there was a change, we can just reload - other dependant code was run, "special" instruction needs to provide fixup code address 4.6 Crosscutting Issues: Hardware versus Software Speculation Mechanisms 350 - extensive speculation requires memory disambiguation - HW-based works better with unpredictable control flow - HW-based preserves precise exception model (some SW-based version support this as well) - HW-based does not require compensation or bookkeeping code like many SW-based - Compiler-based approaches benefit from large analysis window than HW-based - HW-based does not require different code for different implementations of the same ISA (IA-64 addresses this somewhat by making issue packet 3 wide even if architecture is narrower/wider) - SUMMARY Even with HW advantages, complexity and additional HW add to cost... - odd interactions condition move interacts with register renaming annulled condition move still has to write something to destination register! 4.7 Putting It All Together: The Intel IA-64 Architecture and Itanium Processor 351 - The Intel IA-64 Instruction Set Architecture The IA-64 Register Model - 128 64-bit GP registers (actually 65 bits...) register stack style 0-31 always avaialble 32-128 registers stack 0-96 for local function use on call, shift so arguments visible in r32... 
alloc instruction to indicate output register usage "register stack engine" HW for handing stack overflow special load/stores for saving registers stack - 128 82-bit FP registers (2 more than 80-bit IEEE) - 64 1-bit predicate regiters - 8 64-bit branch registers (for indirect branches) - misc system registers - register rotation can rotate GP And FP registers 32-128 for software pipelined loops combined with predication, can avoid pipeline prologue/epilogue code Instruction Format and Support for Explicit Parallelism - EPIC = VLIW + ability to specify dependencies implicit parallelism instruction groups can span multiple "bundles" seperated by explict "stops" ease of decode "bundles" of three instructions 128-bit wide 5-bit template and 3x41-bit instructions template implies instruction format of 3 instructions location of "stops" between instruction groups template cannot express all combinations of formats/stops only 24 of 32 possible currently defined Instruction Set Basics A=alu, I=immediate, M=memory, B=branch, F=floating, L+X=-move-immediate-long 41-bit instruction high 4-bits => major opcode low 6-bits => predicate register Predication and Speculation Support nearly all instructions predicated conditional move == move with predicate compare/test instructions set pair of predicates with opposite values control speculation (delay exceptions) 65th bit of GPR serves as poison bit special non-IEEE exponent used for poisoned FP "chk" instruction can test for poison memory reference speculation (dealing with aliasing) "ld.a" performces "advanced load" "ld.c" checks that "ld.a" value is valid "chk.a" checks that register derived from "ld.a" value is valid - The Itanium Processor 2001 800MHz, 6 issues/clock(3 branch,2 memory) 3-level cache, L3 is 4MB off chip (but within package) Functional Units and Instruction Issue 9 units (2 immediate, 2 memory, 3 branch, 2 floating) issue 2 x 3 instructionsbundles at a time max 6/clock worst when split by signle stop 4/clock (1+3) split on blocked functional units, consuming 1 bundle on next cycle Itanium issues MMF causes stop before and after 10 stage pipeline Front-end 1.) IPG 2.) Fetch 3.) Rotate Instruction delivery 4.) EXP 5.) REN (rename) Operand delivery 6.) WLD 7.) REG register Execution 8.) EXE execute 9.) DET detect exceptions 10.) WRB writeback NOTE: even though IA-64 is suppose to depend on smart compiler Itanium has many features of super-scalar out-of-order machine Itanium Performance SPECint 60% of Pentium 4 68% of Alpha 21264 SPECfp 108% of Pentium 4 solely based on 4x "art" performance, w/o art it is slower 120% of Alpha 21264 Alpha wins with non-baseline compiler flags with tuned compiler, beats Itanium Performance/Watt similar to Alpha 21264 58% of Pentium 4... 
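      a C-level sketch of what the full predication described above enables (illustrative only; IA-64 does this with predicate registers on individual instructions, not in source code):

        /* branchy source */
        int absdiff_branch(int a, int b) {
            int x;
            if (a > b) x = a - b;
            else       x = b - a;
            return x;
        }

        /* if-converted form: the compare yields a predicate and its complement,
           both arms run as straight-line code, and the predicate selects the
           result -- no branch left to mispredict, at the cost of doing both arms */
        int absdiff_predicated(int a, int b) {
            int p  = (a > b);   /* models the predicate pair p / !p */
            int t1 = a - b;     /* would execute under p  */
            int t2 = b - a;     /* would execute under !p */
            return p ? t1 : t2;
        }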
4.8 Another View: ILP in the Embedded and Mobile Markets 363 - The Trimedia TM32 Architecture closest to "classic" VLIW 5-wide VLIW each instruction has predicate register to control execution staticly scheduled need noops for unused slots need noops for data dependencies if 2 of 5 are branches, need to be sure they have opposite predicates loads can be speculated no virtual memory comparison to MIPS MIPS is general purpose with some extensions for embedded MIPS has much smaller code size, even with TM32 instruction compression - The Transmeta Crusoe Processor Running x86 on a VLIW using software translation Instruction Set Basics instruction formats 64-bit instructions with 2 operations 128-bit instructions with 4 operations (M,C,A,Immediate or M,C,A,Branch) instruction types ACMBI = ALU, Compute, Memory, Branch, Immediate registers 64 integer, 32 floating point Software Translation and Hardware Support options interpret x86 translate "hot" code to native VLIW cache translation at least a basic block at a time coping with exceptions need to preserve precise exceptions of x86 but need to allow reordering for performance reordering support 1.) shadowed register file 2.) program-controlled store buffer 3.) memory alias detecthion hardware with speculative loads 4.) conditional move instruction (select) 1&2 allow OOOe without commiting state until entire x86 sequence completes 2 is like a small TCC cache! software controls commit/rollback Performance Measures goal of power with low performance 4.22 shows that for mp3/dvd playback TM is .32/.42 of PIII power however, 4.23 shows what a PIII is only 8% of the power during DVD playback I/O is over 74% of power. display and DVD are over 50%. 4.9 Fallacies and Pitfalls 370 - Fallacy There is a simple approach to multiple-issue processors that yields high performance with a significant investment in silicon area or design complexity. 4.10 Concluding Remarks 372 - Symmetry of ideas in HW and SW "Software" techniques in hardware-centric approaches - support for conditional instructions - prefecth instructions and other cache hints - branch prediction hints - specoal support for speculative (nonexcepting) loads "Hardware" techniques in software-centric approaches - scoreboard scheduling of instructions - dynamic branch prediction - rollback or trap-and-fix-up support for speculation - hardware for checking speculated load correctness - IA-64 not a breakthrough - in scaling ILP - in avoiding complexity - in power consumption - Alternative move from ILP to multiprocessors SMT (hyperthreading) is half way point - Chip Multiprocessors Power4 2x Power3 processors and L2 cache memory interface and multiprocessor interconnect allows glueless 8 processor module using 4x Power4 TI TMS320C80 (before Power4!) 
4 DSPS and a RISC processor as controller many multi-core embedded MIPS systems 4.11 Historical Perspective and References 373 - The Development of Multiple-Issue Processors 1981, Charlesworth, Floating Point Systems AP-120B Stanford MIPS had 2 ops per instruction feature (not in commercial MIPS) Yale, Fisher VLIW with 512-bits wide instruction trace scheduling *Ellis* [1986] Multiflow processor, Colwell 1987 Yale + controllable store buffer 100 million sold Cydrome Cydra 5 [1989] more HW than Multiflow Rau's "polycyclic scheduling" => software-pipelining Apollo DN 10000 Intel i860 pipeline organization traditional "self-draining" pipelines alternative is "push-and-catch" (AP-120B and i860) operations advance only when another operation pushes from behin pushing instructions specifies destination register of pushed instruction eliminates need to detect WAW and WAR hazards in HW increases code size because of NOOPS needed to push results - Compiler Technology and Hardware Support for Scheduling D. Kuck et al U o fIllinois 1970s terms "antidependence" "output dependence" GCD and Banerjee dependence tests Recent dependence research variety of exact tests ending with linear programming algo called Fourier-Motzkin D. Maydan and *W. Pugh* shows that the sequences of exact tests pratical solutio uncovering and scheduling ILP *Lam* 1988 software pipelining for Warp VLIW speculative code scheduling Smith, Horowitz, and Lam 1992 checking and recover mechanism similar to IA-64 and Crusoe U Illinois 90s IMPACT project Hwu superblock scheduling profiling based optimizations special buffer for conflict detection delayed branches early RISC processor idea from microprogramming - EPIC and the IA-64 Development HP proposed 64-bit VLIW to replace PA-RISC Intel wanted 64-bit replacement for x86 Together they created IA-64 and the Itanium implementation Exercises 378 Chapter 5 Memory Hierarchy Design 5.1 Introduction 390 - Want unlimited ammounts of fast memory memory heirarcy registers cache memory disk principle of locality CPU performance outstripping memory performance markets desktops concerned more with memory latency server concerned more with memory bandwidth (aka throughput) embedded 1.) worst case performance 2.) power/battery life 3.) often one application, little or no OS 4.) very little memory, often no disk harware supported memory protection 5.2 Review of the ABCs of Caches 392 cache cache hit cache miss block locality temportal spatial virtual memory page page fault - Cache Performance Review memory stall cycles miss penalty miss rate address trace less style ld/st trace used to study miss rate etc missed per instruction - Four Memory Hierarchy Questions Q1: Where can a block be placed in the upper level? BLOCK PLACEMENT direct mapped full associative set n-way set associative Q2: How is a block found if it is in the upper level? BLOCK IDENTIFICATION valid bit block address block offset Q3: Which block should be replaced on a miss BLOCK REPLACEMENT tag field index field random replacement least-recently used Q4: What happens on a write WRITE STRATEGY write buffer write through write back dirty bit write stall write allocate no-write allocate - An Example: The Alpha 21264 Data Cache data cache instruction cache unified cache 5.3 Cache Performance 406 average memory access time (AMAT) AMAT = hit time + miss rate * miss penalty hit time - Average Memory Access Time and Processors Performance 1.) other reasons for stalls such as bus contention with other devices 2.) 
in-order processor is stuck, OOOe can allow other instructions to proceed 3.) - Miss Penalty and Out-of-Order Execution Processors "miss penalty" defined to not be total stall time, but "exposed" stall time need to decide how to measure length of memory latency length of latency overlap measure stall when processors does not retire maximum number of instructions per cycle oldest instructions are faulted for stall latency can be measured from one of: - queued instruction window - effective address generated - request sent to memory system - Impoving Cache Performance 17 optimizations in 4 categories (sections 5.4, 5.5, 5.6, 5.7) 5.4 Reducing Cache Miss Penalty 413 - First Miss Penalty Reduction Technique: Multilevel Caches make L1 smaller so thats its faster, add larger L2 to lower overall penalty local miss rate vs global miss rate associativity usually increased at L2 to lowers its local miss penalty multilevel inclusion all data at level N is found in level N+1 inclusive L2 needs to be bigger, otherwise just mirror of L1 inclusion nice because processsor caches levels are consistent for I/O and SMP consistency challenges to inclusion complexity with different block sizes multilevel exclusion NO data at level N is found in level N+1 AMD Athlon trade offs L1 has more hits, needs faster hit time L2 has fewer hits, make it larger to have fewer misses - Second Miss Penalty Reduction Technique: Critical Word First and Early Restart critical word first (aka wrapped fetch or requested word first) return important word to CPU while filling in rest of cache line early restart fill words in order, but let CPU know when its requested word is done - Third Miss Penalty Reduction Technique: Giving Priority to Read Misses over Writes write-through case, have a read miss with outstanding writes option 1: wait to read until write buffer is empty option 2: check the write buffer on read miss if not address not in write buffer, give read miss priority over draining write buffer write-back cache if read needs to evict dirty block move it to write buffer perform read then peform write from buffer - Fourth Miss Penalty Reduction Technique: Merging Write Buffer merge related blocks into single blocks in write buffer NOTE: with memory mapped I/O regions want to avoid this! 
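      a rough C sketch of the merge check (sizes are assumptions: 4 entries, 64-byte blocks, 8-byte words; a real buffer would also skip uncacheable/memory-mapped I/O addresses per the note above):

        #include <stdbool.h>
        #include <stdint.h>
        #include <string.h>

        #define WB_ENTRIES      4
        #define WORDS_PER_BLOCK 8   /* 64-byte block of 8-byte words */

        struct wb_entry {
            bool     valid;
            uint64_t block_addr;             /* aligned block number */
            uint64_t data[WORDS_PER_BLOCK];
            uint8_t  word_valid;             /* one bit per buffered word */
        };

        static struct wb_entry wb[WB_ENTRIES];

        /* merge a word-sized write into an existing entry for the same block if
           possible, otherwise allocate a new entry; returns false when the
           buffer is full (the CPU would stall) */
        bool wb_write(uint64_t addr, uint64_t value)
        {
            uint64_t block = addr / (8 * WORDS_PER_BLOCK);
            unsigned word  = (unsigned)((addr / 8) % WORDS_PER_BLOCK);

            for (int i = 0; i < WB_ENTRIES; i++) {
                if (wb[i].valid && wb[i].block_addr == block) {   /* merge hit */
                    wb[i].data[word]  = value;
                    wb[i].word_valid |= (uint8_t)(1u << word);
                    return true;
                }
            }
            for (int i = 0; i < WB_ENTRIES; i++) {
                if (!wb[i].valid) {                               /* allocate new entry */
                    memset(&wb[i], 0, sizeof wb[i]);
                    wb[i].valid      = true;
                    wb[i].block_addr = block;
                    wb[i].data[word] = value;
                    wb[i].word_valid = (uint8_t)(1u << word);
                    return true;
                }
            }
            return false;
        }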
- Fifth Penalty Reduction Technique: Victim Caches VC is small fully-associative cache handles conflict misses of directly mapped L1 cache AMD Athlon 8 entries *Jouppi* [1990] found 1-5 are effective for small direct mapped L1s 4 entry VC can remove 1/4 misses of a 4KB direct mapped data cache - Summary of Miss Penalty Reduction Techniques most desktop and servers use techinques 1,2,3,4 5.5 Reducing Miss Rate 423 - 3 C's (4th C is Coherence, see 5.15) Compulsory - "cold start" or "first-references misses" Capacity - program cannot run in cache alone Conflict - "collision misses" or "inteference misses" from associativity (including direct mapped) reduce compulsory with larger block size reduce capacity with a larger cache reduce conflict with larger associativity - First Miss Rate Reduction Technique: Larger Block Size reduces compulsory by taking advantage of spatial locality but too large can increase conflict misses - Second Miss Rate Reduction Technique: Larger Caches but longer hit time and higher cost - Third Miss Rate Reduction Technique: Higher Associativity rule 1: 8-way closely approximates fully-associative rule 2: "2:1 cache rule of thumb" a direct mapped cache of size N has about the same miss rate as a two-way set-associative cace of size N/2 (for caches less than 128kb, which is typically for direct mapped caches) - Fourth Miss Rate Reduction Technique: Way Prediction and Psuedoassociative Caches way prediction cache bits to predict which will be next block accessed Alpha 21264 uses on 2-way I-cache 1 clock cycle hit if predictor correct, 3 cycle if incorrect 85% accuracy for SPEC works by setting multiplexor is set early and only one tag comparison is necessary psuedoassociative (aka column associative) direct mapped cache with two entries (different from 2-way because search is not in parallel but serial) always try "fast" entry first, then check second entry like way prediction, want to guess which one should be in the fast slot - Fifth Miss Rate Reduction Technique: Compiler Optimizations reduce conflict misses in I-cache through layout of related functions reduced misses 50% for 2KB direct mapped i-cache and 75% for 8KB cache align basic block entry points on cache block boundary for spatial locallity cache line aware array access Loop interchange don't stride between lines when you can traverse along a line Blocking work on submatrixes (aka blocks) that are sized to cache lines combines spatial and temporal locaity blocking also can help if register allocator can bring whole block into registers - Summary of Miss Rate Reduction Techniques processor-memory gap means cache misses are common problem with underperforming systems 5.6 Reducing Cache Miss Penalty or Miss Rate via Parallelism 435 - First Miss Penalty/Rate Reduction Technique: Nonblocking Caches to Reduce Stalls on Cache Misses - Second Miss Penalty/Rate Reduction Technique: Hardware Prefetching of Instructions and Data - Third Miss Penalty/Rate Reduction Technique: Compiler-Controlled Prefetching - Summary of Miss Penalty/Rate Reduction Techniques 5.7 Reducing Hit Time 443 - First Hit Time Reduction Technique: Small and Simple Caches SMALL enough to fit L1 on chip SIMPLE direct mapped allows overlap of tag check with transmission of data L2 hack keep tags on chip and data off chip L1 size has not been growing even with larger chips need to keep hit time within single clock cycle (maybe no longer true backoff now that MHz marketing is over) - Second Hit Time Reduction Technique: Avoiding Address Translation 
during Indexing of the Cache virtual cache vs physical cache indexing vs tag bits virtual indexed, virutally tagged caches pros eliminates translation time from cache hit cons need to check protection (could copy info from TLB to cache on miss) need to restrict access by process (could add PID (aka ASID (aka ASN)) to blocks) synonyms or aliases (two virtually addressed cache entries could be for te same physical address) HW: antialiasing can be used to solve this in restricted cases check all possible locations indexes on miss, flushing matches SW: page coloring force aliased pages to map to same cache index (force last bits of virtual/physical to match) I/O uses physical addresses virtual indexed, physically tagged caches pros fast lookup, index does not have to be translated (must be equal for both physical and virtual addresses) tag match happens after access on physical address cons directed mapped cache can be no bigger than page size associativity helps by allowing larger cache sizes - Third Hit Time Reduction Technique: Pipelined Cache Access Pentium Pro and Pentium III I-cache pros increased instruction bandwidth cons higher branch mispredict penalty - Fouth Hit Time Reduction Technique: Trace Caches trace cache instead of static cache blocks, use previous dynamic traces including taken branches acts as form of branch prediction as well Pentium 4 (NetBurst) approach pros potentially better space utilization than partially used wide cache lines cons complexity, especially of alignment instructions are duplicated when conflicting taken branches in recent history - Cache Optimizations Summary 5.8 Main Memory and Organizations for Improving Performance 448 - Introduction cache design is about lowering low latency memory design is about improving bandwidth (latency is more fixed) caches that access memory have long cache lines to take advantage of this bandwidth - First Technique for Higher Bandwidth: Wider Main Memory install pairs or quads of memory modules make memory bus as wider, along with block size pros wider means more bandwidth cons need multiplexor to narrow from wide cache lines to narrower levels closer to processor (okay between L2 and L1, but not really between L1 and CPU for performance reasons) consumer needs to add memory modules in pairs or quads... error correction over wider memory unit means loading whole unit to update ECC on word sized write (can arrange to ECC on words to avoid this problem with extra cost) - Second Technique for Higher Bandwidth: Simple Interleaved Memory organize memory into banks, striping physically addresses accross banks keep memory bus same width, no multiplexing typically strip at word granularity but interleaving factor could be higher pros can pipeline reads accross banks write buffer can write to banks in parallel cons have to time-share narrower bus as sinhle memory chip capacity increases, costly to distribute memory accross multiple banks complexity of interleaving hetergenous size and speeds of different memory modules how many banks? 
    how many banks?
      rule from vector computers: number of banks >= number of clock cycles to access a word in a bank
- Third Technique for Higher Bandwidth: Independent Memory Banks
  independent memory banks with multiple memory controllers
5.9 Memory Technology 454
  memory latency terms
    "access time" - time between when a read is requested and when the desired word arrives
    "cycle time" - time between requests to memory
    cycle time > access time because the address lines need to be stable between accesses
- DRAM Technology
  to reduce the number of pins, address lines are multiplexed
    the address is divided in half
    RAS and CAS (row access strobe and column access strobe)
    row is like a cache index
    column is like selecting a cache set
  D is for Dynamic
    single transistor per bit
    even reading can corrupt the value
      need to write back after a read
      can read and refresh a row at a time
    DRAM hardware periodically refreshes all of memory
      example: every 8 milliseconds
      refresh can interfere with outside memory access
        introduces variability into memory access time
      try to keep refresh to less than 5% of total time
  speed
    row access time growing very slowly, 5%/yr (related to latency)
    CAS (or data transfer time) growing slightly more, 10%/yr (related to bandwidth)
  DIMMs (dual inline memory modules)
    typically 4-16 DRAMs
    normally 8 bytes wide for desktops
- SRAM Technology
  S is for Static
    6 transistors per bit to prevent the need for refresh
    needs only low power to retain values, unlike DRAM's constant refreshing
  DRAM vs SRAM
    DRAM - cost/bit and capacity
    SRAM - speed and capacity
    SRAM has no multiplexed address lines due to speed requirements
    SRAM has no difference between cycle and access time
    DRAM capacity is 4-8 times SRAM for the same transistor budget
    SRAM cycle time is 8-16 times faster than DRAM
    SRAM cost is 8-16 times DRAM
- Embedded Processor Memory Technology: ROM and Flash
  ROM
    single transistor per bit
    nonvolatile, doesn't need power
    indestructible, doesn't need memory protection (for code/data in embedded systems)
  Flash
    read speed approximately that of DRAM
    write speed 10x-100x slower than DRAM
    DRAM 4x-8x more megabytes/dollar (2001)
- Improving Memory Performance in a Standard DRAM Chip
  1.) fast page mode
      repeated accesses to different columns in the same row
      saves repeated RAS time
  2.) SDRAM (Synchronous DRAM)
      replace the asynchronous interface that requires synchronization
      use a clocked interface with registers to store request parameters such as byte count
  3.) DDR (Double Data Rate)
      transfer on both the rising and falling edges of the DRAM clock
  all three are small modifications that exploit high internal DRAM bandwidth
- Improving Memory Performance via a New DRAM Interface: RAMBUS
  RDRAM
    turns the memory component into a memory system
    replaces the RAS/CAS interface with a bus
    allows multiple outstanding requests (aka packet-switched bus or split-transaction bus)
    4 inch maximum bus length
    4 banks
  DRDRAM (Direct RDRAM)
    separate row/column/data buses
    4-16 banks
    4-8 row buffers
- Comparing RAMBUS and DDR SDRAM
  caches need low latency to the first byte and bandwidth for the rest of the block
  RAMBUS focused on only the second part of the problem (an Amdahl's Law issue...)
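  a back-of-the-envelope C calculation of the latency-vs-bandwidth point above, using assumed circa-2001 numbers (a 64-bit DDR-266 DIMM and roughly 50 ns to the first word); the figures are illustrative, not from the book:

    #include <stdio.h>

    int main(void)
    {
        double bus_bytes   = 8.0;       /* 64-bit DIMM                          */
        double clock_mhz   = 133.0;     /* DDR-266: 133 MHz, 2 transfers/clock  */
        double first_ns    = 50.0;      /* assumed RAS+CAS time to first word   */
        double block_bytes = 64.0;      /* cache block being filled             */

        double peak_mb_s = bus_bytes * clock_mhz * 2.0;            /* ~2128 MB/s */
        double burst_ns  = block_bytes / (peak_mb_s * 1e6) * 1e9;  /* ~30 ns     */

        printf("peak bandwidth : %.0f MB/s\n", peak_mb_s);
        printf("miss penalty   : %.0f ns latency + %.0f ns burst\n",
               first_ns, burst_ns);     /* latency dominates a block fill */
        return 0;
    }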
  price premium for RAMBUS in 2001 is 2x
    limited to niche markets willing to pay for absolute performance
5.10 Virtual Memory 460
  virtual memory benefits
    protection
    speed: programs start faster due to demand paging
      easier than manual overlays
    relocation
      don't need contiguous physical memory
      (alternative is Multics/Honeywell/i386-style relocation/segment registers)
      (or rewriting addresses in code at load time instead of link time)
    sharing within a process and between processes
      read-only to share resources such as program executables
      read-write for cross-process data sharing
  in cache terms
    block => page or segment
    miss => page fault or address fault
    memory mapping or address translation: map a virtual address to a physical address
  cache vs VM
    replacement: cache replacement is in HW, VM replacement in SW
      the longer miss penalty means we can take time to make a good decision
    size: cache size is variable, while the processor address size controls the size of VM
    files: cache is memory only; disk is multiplexed between files and VM (except in Multics :))
  pages vs segments (Figure 5.34)
    fixed length                 VS variable length
    single word address          VS segment and offset
    invisible to programmer      VS usually programmer (and compiler) visible
    trivial replacement          VS hard to find contiguous space
    internal fragmentation       VS external fragmentation
    easy mapping to disk blocks  VS need to manage swap file allocation as well
    hybrids: paged segments... multiple page sizes...
- Four Memory Hierarchy Questions Revisited
  Q1: Where can a block be placed in the upper level? BLOCK PLACEMENT
    fully associative
  Q2: How is a block found if it is in the upper level? BLOCK IDENTIFICATION
    page table
    inverted page table
    translation lookaside buffer (TLB) (aka translation buffer)
  Q3: Which block should be replaced on a miss? BLOCK REPLACEMENT
    least-recently used
    use bit (aka reference bit)
  Q4: What happens on a write? WRITE STRATEGY
    write back
    dirty bit
- Techniques for Fast Address Translation
  page tables are usually large
    often paged themselves in main memory
    translating VA to PA could take two memory references
      one for the translation, one for the actual data
  TLB (Translation Lookaside Buffer)
    locality: if addresses have locality, so should translations
    TLB is a cache of page table entries, including
      translations
      protection
      valid bit
      use bit
      dirty bit
    OS may need to invalidate TLB entries as it changes page table entries
      software-managed coherence
    entries may contain a PID/ASID/ASN to identify a specific address space
      prevents the need for invalidation on a context switch
    Alpha example: 128-entry fully associative TLB
    TLB is often on the critical path
      pressure to keep it small to make it fast
      Alpha uses a virtually tagged i-cache to avoid the TLB cost
      (less chance of aliasing since instructions are read-only)
- Selecting a Page Size
  larger page sizes
    pros
      smaller page tables (fewer entries)
      allow larger virtually indexed, physically tagged caches (section 5.7)
      larger unit for I/O
      TLB maps more of memory (the superpages argument)
        TLB misses can have a serious impact on CPI
    cons
      more internal fragmentation
      slower startup from demand paging more than is needed
- Summary of Virtual Memory and Caches
  Figure 5.37 shows an overview of TLB+L1+L2
5.11 Protection and Examples of Virtual Memory 469
  process => address space
  process switch (or context switch)
  swapping approach
    move the entire process out of memory to keep it from other processes' view
    too slow
  VM provides an easier approach by switching page tables per process
- Protecting Processes
  Simple base and bound approach
    two registers that check each address for a valid access range
    absolute (flat) addresses: base <= address <= bound
    relative (segment) addresses: (base + address) <= bound
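  a minimal C sketch of the base-and-bound check above; the struct layout and fault handling are invented for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    struct bounds { uint64_t base, bound; };   /* per-process, set by the OS */

    /* absolute (flat) addressing: base <= address <= bound */
    static bool ok_absolute(struct bounds b, uint64_t addr)
    {
        return addr >= b.base && addr <= b.bound;
    }

    /* relative (segment) addressing: the access goes to base + address,
     * which must not exceed the bound */
    static bool ok_relative(struct bounds b, uint64_t addr, uint64_t *phys)
    {
        if (b.base + addr > b.bound)
            return false;                      /* would raise a protection fault */
        *phys = b.base + addr;
        return true;
    }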
    how do we change these registers without letting the user change them?
      1.) two modes: user vs kernel (aka supervisor or executive)
      2.) read-only CPU state in user mode, such as
          base/bound registers
          user/supervisor mode bit(s)
          exception enable/disable bit
      3.) system call and return
  virtual memory provides finer granularity than base/bound
    permission flags per page or segment
    read only, no execute
  protection rings
    of questionable value for their originally designed purpose
    hypervisor...
  capabilities
    use a "key" to access a "lock"
    need to allow programs to pass keys to each other without forgery
    HW support if performance is desired...
- A Paged Virtual Memory Example: The Alpha Memory Management and the 21264 TLB
  segments based on the upper address bits (similar to the VAX layout)
    kseg for the kernel
      uniform protection, no memory management
    seg0: from the bottom (0) upward
    seg1: from the top of memory downward
  48-bit virtual address
    2 bits for segment (VAX style)
    10 bits for the level-1 page table
    10 bits for the level-2 page table
    10 bits for the level-3 page table
    13 bits for the page offset
    45 bits (so far...)
  three-level page table (walk sketch below)
    8 bytes per PTE
    2^10 = 1024 entries per level
    each level needs 8KB, exactly equal to the page size! not a coincidence
  using larger 64KB pages we get 3 more offset bits, bringing us to the maximum total of 48 bits
- A Segmented Virtual Memory Example: Protection in the Intel Pentium
  8086
    segments (base) but no bound
    no virtual memory
  i386 (aka IA-32)
    4 rings (0 is kernel)
    flexible segments and paging
    operating system and user code in the same address space
    system calls access arguments from the caller with the caller's protection level
      avoids passing a reference to something you cannot read but the kernel can
  Adding Bounds Checking and Memory Mapping
    segments changed from a base to an index into a descriptor table
      similar to the Alpha page table
    segment descriptor includes
      present bit (aka valid)
      base field (base physical address of the segment)
      access bit (reference or use bit)
      attributes (valid operations and protection levels)
      limit field (to go with the base)
    more details
      D for 16- vs 32-bit default address size
      G for granularity (bytes versus pages)
      DPL (descriptor privilege level) => required ring level to access
      code segments
        Conforming = use the code's privileges, not the caller's (for RPC-like libraries)
      data segments
        expand down = used for stacks, implies the limit is a negative offset
      call gates
        word count = amount of stack to copy on a call
        destination selector/destination offset = entry point for the system call (base+offset)
    paging with segmentation
      32-bit address
        upper bits - segment descriptor
        middle bits - index into the page table selected by the descriptor
        lower bits - offset
  Adding Sharing and Protection
    split address space
      global address space vs local address space (1/2 each originally)
      each half has its own descriptor table
    user segment registers
      descriptor index
      - plus 1 bit for global vs local
        can access any segment with a protection level greater than or equal to the current level
      - plus 2-bit "requested protection level" (see below)
  Adding Safe Calls from User to OS Gates and Inheriting Protection Level Parameters
    call gates
      restrict the location of entry points (don't allow jumps into the middle of code)
      word count indicates how much data to copy from the user stack to the kernel stack
        popped from both on return, result copied to the user stack
        does not allow arguments to be passed in registers
        which is ironic because DOS and BIOS interrupt calls did do this :)
    bad pointers
      kernel can set the segment registers' "requested protection level" bits
      allows reading data via a pointer using the caller's protection
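  a C sketch of an Alpha-style three-level page-table walk using the 10/10/10-bit level indexes and 13-bit page offset listed above; the PTE format, field names, and frame_to_table helper are invented for illustration, not the real PALcode/OS data structures (the segment-select bits above the level-1 index are ignored here):

    #include <stdint.h>

    #define PAGE_SHIFT   13u                        /* 8 KB pages             */
    #define LEVEL_BITS   10u                        /* 1024 PTEs of 8 bytes   */
    #define LEVEL_MASK   ((1u << LEVEL_BITS) - 1)

    typedef uint64_t pte_t;                         /* valid bit + frame number */
    #define PTE_VALID    1ull
    #define PTE_PFN(p)   ((p) >> PAGE_SHIFT)

    /* Walk level 1 -> 2 -> 3; returns 0 on a page fault. */
    uint64_t translate(pte_t *level1, uint64_t va,
                       pte_t *(*frame_to_table)(uint64_t pfn))
    {
        unsigned i1 = (va >> (PAGE_SHIFT + 2 * LEVEL_BITS)) & LEVEL_MASK;
        unsigned i2 = (va >> (PAGE_SHIFT + 1 * LEVEL_BITS)) & LEVEL_MASK;
        unsigned i3 = (va >> PAGE_SHIFT) & LEVEL_MASK;
        uint64_t offset = va & ((1u << PAGE_SHIFT) - 1);

        pte_t p1 = level1[i1];
        if (!(p1 & PTE_VALID)) return 0;
        pte_t p2 = frame_to_table(PTE_PFN(p1))[i2];
        if (!(p2 & PTE_VALID)) return 0;
        pte_t p3 = frame_to_table(PTE_PFN(p2))[i3];
        if (!(p3 & PTE_VALID)) return 0;

        return (PTE_PFN(p3) << PAGE_SHIFT) | offset;    /* physical address */
    }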
- Summary: Protection on the Alpha versus the IA-32
  IA-32 is very complex to implement
  even worse, the features are largely unused by Unix and Windows
    even with security problems, no one seems to be using them (except the Nooks proposal...)
5.12 Crosscutting Issues: The Design of Memory Hierarchies 478
- Superscalar CPU and Number of Ports to the Cache
  need a nonblocking cache so we can speculate beyond misses
  x86 allows instructions to be unaligned
  UltraSPARC III
    4 instructions per clock means 128 bits from the I-cache per cycle
    2 of the 4 instructions can be ld/st, needing 2 separate 64-bit values from the D-cache
- Speculative Execution and the Memory System
  need to suppress exceptions of speculative instructions
  don't want to tie up the cache on a speculative miss
    again, a nonblocking cache is needed
- Combining the Instruction Cache with Instruction Fetch and Decode Mechanisms
  NetBurst trace cache stores micro-ops (uops)
    uops are 64 bits long, so they actually take more memory in the trace cache
    relative to x86 instructions in an i-cache
  embedded also has a bigger-instructions issue
    decompress code stored in memory into the i-cache
- Embedded Computer Caches and Real-Time Performance
  I-caches are typically okay because code behavior is relatively predictable
  D-caches allow portions to be locked down
    for set-associative, lock a subset of the n ways
    Xbox 360 does this, allowing DMA from cache to GPU etc.
- Embedded Computer Caches and Power
  more power efficient to access on-chip memory, avoids:
    driving pins
    driving the memory bus
    driving memory and the return trip
  way prediction allows only half of a 2-way cache to be powered up at a time
- I/O Consistency of Cached Data
  cache coherence
  I/O could read from the cache, but:
    this interferes with CPU access
    this could destroy cache data locality
      on a page fault, load a whole page but only need a few words
    hard to access on-chip caches externally
  I/O could read from memory, but:
    need write-through caches
  Input
    1.) mark input buffer pages as non-cacheable
    2.) OS flushes the address range from the cache
    3.) HW can check duplicate tag bits of the cache contents
        (the CPU has its own set of bits for fast access)
  multiprocessors and I/O
    with I/O, rare to want to have copies of data
    with SMP, common to want copies of data
5.13 Putting It All Together: Alpha 21264 Memory Hierarchy 482
- Overview
  detail of bootup through TLBs, PTEs, L1, L2, faults, etc.
  Bootup
    chip loads the cache serially from an external PROM
      64KB for 16k instructions
    chip loads system parameters from the PROM
      off-chip cache speed/timing
      system port speed/timing
      other interface logic parameters
    run instructions in PAL (privileged architecture library) mode
      update the instruction TLB and PTEs for the OS
    let the OS take over from the PAL initialization code
- Performance of the 21264 Memory Hierarchy
  compare the OOOe 21264 to the in-order 21164
    a higher miss rate helps the 21264, which can run other instructions during a miss
  TPC-C puts much higher demands on the overall memory hierarchy than SPEC CPU apps (as in Barroso)
5.14 Another View: The Emotion Engine of the Sony Playstation 2 490
- PS2 Emotion Engine
  64-bit MIPS III core with SIMD extensions
  two vector units
    VPU0 attached to the CPU for DSP
    VPU1 operates independently of the CPU
  Memory: 32 MB DRDRAM
  I/O Processor: old 32-bit MIPS for PS1 games
  Graphics Synthesizer
    16 parallel pixel processors
    1024-bit path each way to 4MB of multiported embedded DRAM
  rather than the traditional desktop/server unified memory model,
  more like embedded with 9 distinct memory modules
    software-controlled coherency
5.15 Another View: The Sun Fire 6800 Server 494
  4th "C": coherence misses
  ECC
    SEC/DED (Single Error Correcting/Double Error Detecting)
    used between the memory chips and the processor to catch end-to-end errors
  NUMA memory access, not a strict hierarchy
  up to 4 dynamic system domains
    partitioning for fault tolerance
    can be changed without a reboot (like Multics :))
  redundant paths both between the 24 processors and to I/O
5.16 Fallacies and Pitfalls 498
- Fallacy: Predicting cache performance of one program from another
- Pitfall: Simulating enough instructions to get accurate performance measures of the memory hierarchy
  1.) a small trace may not have time to fill the cache
  2.) cache behavior may change over a program's run time
  3.) cache behavior may be data dependent
- Pitfall: Too small an address space
  the older IBM/360 outlasted the younger PDP-11
    24-31 bits vs 16 bits
  x86 will die, need IA-64 :) NOT! AMD producing x86-64
- Pitfall: Emphasizing memory bandwidth in DRAMs versus memory latency
  RDRAM takes 20% more die area than SDRAM
    takes space away from the redundant rows that improve yield
  RDRAM is good for small memories with high bandwidth needs
    that is, PS2 and ...
    servers can get better BW through many parallel chips
- Pitfall: Delivering high memory bandwidth in a cache-based system
  for streaming computing, caches hurt since they just get in the way
- Pitfall: Ignoring the impact of the operating system on the performance of the memory hierarchy
  25% of misses from the OS or OS interference evicting user data
- Pitfall: Relying on the operating system to change the page size over time
  Alpha planned to grow the architecture's address space by increasing the page size
    in the end, had to find other ways since OSes wanted to stick with 8k pages
  TLBs and superpages
    OSes have not adopted superpages except for very specific functions
    largely mapping display memory and I/O devices
5.17 Concluding Remarks 504
  scaling: terabyte main memories vs 8KB pages
  tradeoffs
    cost/complexity tradeoffs to getting a good balance
    different tradeoffs for different applications
5.18 Historical Perspective and References 504
  Virtual Memory
    pioneers coined the term "memory hierarchy" (1962)
    Kilburn et al proposed automatic management of 2 levels
      Atlas system, 1 year before the IBM/360
    IBM/370 added VM
      the TSS OS couldn't really handle it though
      coined the term "TLB"
    all computers today except supercomputers and embedded have VM
    GE 645: segmented virtual memory for Multics
      conceptual predecessor to the i286/i386
    capabilities
      simplicity of the idea lost in the complexity of an efficient implementation
    24->32 bits
      IBM/360 programs used the upper 8 bits for their own purposes
      Motorola 68000 (68k) had similar 24->32 bit migration issues
      today, trap if unused bits are non-zero
  Caches
    Wilkes 1965 direct-mapped cache paper
      University of Cambridge built a direct-mapped cache a year before Wilkes' paper
    IBM 360/85: first commercial machine with a cache
      correctly used AMAT, not miss rate, to measure performance impact
      coined the term "cache"
      killed the 360/91 because cache beat OOOe
    Smith 1982: "temporal locality" and "spatial locality" terms
    Hill* 1987: 3 C's
    Jouppi* 1998 said Hill's analysis led to the victim cache idea
    Kroft 1981: nonblocking cache
    Kroft: dual-ported nonblocking cache
    Baer and Wang 1988: inclusion
    Jouppi* and Wilson 1994: exclusion
    Jouppi 1990: prefetching via stream buffers
    90's ISCA and ASPLOS full of cache papers
Exercises 513
Chapter 6 Multiprocessors and Thread-Level Parallelism
6.1 Introduction 528
- Introduction
  always the next big thing...
  current reasons
    1.) easiest way to improve performance over a single processor is to use more processors
    2.) limits of ILP
    3.) progress on parallel software, especially server and embedded; desktop is the challenge
  no room for all the issues in H&P
    Culler, Singh, and Gupta: another 1000 pages just on multiprocessors
  focus on mainstream design: multiprocessors with <= 128 processors
- A Taxonomy of Parallel Architectures
  Flynn 1966
    1.) SISD - uniprocessor
    2.) SIMD - multimedia extensions and vector architectures
    3.) MISD - streaming computers (to some approximation)
    4.) MIMD - multiprocessors
  MIMD rules
    1.) flexible. can run
        - multiple single-threaded apps
        - one multithreaded app
        - a mixture of both
    2.) cost-performance advantage by leveraging existing uniprocessors (both designs and implementations)
  multiprogramming
    running multiple independent processes
  multithreading
    using multiple processors for one process
  thread-level parallelism
    term applies to both multithreading and multiprogramming
  centralized shared-memory architecture
    at most a few dozen processors in 2000
    symmetric (shared-memory) processors (SMP)
    uniform memory access (UMA)
  distributed memory multiprocessors
    non-uniform memory access (NUMA)
    1.) cost-effective way to scale memory bandwidth (if accesses are mostly to the local node)
    2.) reduces latency for local accesses
  two different programming models
    distributed shared memory (DSM)
    message passing interface (MPI)
  I/O is also distributed among the nodes
  DSM nodes may be SMPs internally
- Models for Communication and Memory Architecture
  two models:
    1.) shared address space
        DSM
        NUMA
    2.) multiple private address spaces
        message-passing multiprocessors (aka multicomputers)
        RPC, synchronous or asynchronous
        MPI (Message Passing Interface) standard
          lose some absolute performance for portability
  Performance Metrics for Communication Mechanisms
    1.) Communication bandwidth
        limits of processor, memory, or interconnection bandwidth
          hopefully not the communication mechanism
        bisection bandwidth of the interconnect
        node bandwidth
    2.) Communication latency (see the worked example below)
        Communication latency = Sender overhead + Time of flight + Transmission time + Receiver overhead
        naming and protection
          DSM - implemented by the shared memory model in hardware at low cost
          OS - could be higher overhead if pushed to SW
    3.) Communication latency hiding
        hide latency by overlapping communication with computation
        better to reduce latency than hide it, if possible
        hiding is more application specific, often requiring programmer effort
        characteristics of communication
          size of data
          regularity in the communication pattern
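  a worked example of the communication-latency breakdown above; all the numbers (1 us software overhead per side, 10 m of cable, a 1 Gbit/s link, a 4 KB message) are assumptions for illustration:

    #include <stdio.h>

    int main(void)
    {
        double sender_us   = 1.0;
        double receiver_us = 1.0;
        double flight_us   = 10.0 / 2.0e8 * 1e6;         /* ~0.05 us at ~2/3 c   */
        double xmit_us     = 4096.0 * 8 / 1.0e9 * 1e6;   /* ~32.8 us on the wire */

        double total = sender_us + flight_us + xmit_us + receiver_us;
        printf("total latency: %.1f us\n", total);
        /* for a 64-byte message the transmission term is only ~0.5 us,
         * so the fixed overheads dominate: overhead, not raw bandwidth,
         * limits small transfers */
        return 0;
    }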
  Advantages of Different Communication Mechanisms
    DSM pros
      compatibility with SMP
        example: OpenMP
      easier to program with irregular communication
      simpler compiler support
      develop for SMP, tune for DSM
      lower overhead and better use of bandwidth for small items
      hardware-controlled caching to reduce the frequency of remote communication
        also serves to improve latency and contention for shared data
    Message Passing pros
      simpler hardware
      explicit communication makes analyzing communication patterns clear
      explicit communication makes the programmer aware of the cost of communication
      synchronization is coupled to message passing
      easier to use sender-initiated communication
    can build MP on top of DSM easily
    building DSM on top of MP is harder
      depends on the OS to handle shared memory at large page granularity
      "shared virtual memory" of section 6.10
  MPP (Massively Parallel Processors): > 100 CPUs
  SMP >> DSM (market size)
    DSM: section 6.5
  multicomputers (aka message-passing multiprocessors (aka clusters)): chapter 8
- Challenges of Parallel Processing
  1.) insufficient parallelism
  2.) long-latency remote communication
  chapter focuses on #2
6.2 Characteristics of Application Domains 540
  many issues
    load balance
    synchronization
    sensitivity to memory latency
    data distribution
    structure of the parallel algorithm
    spatial and temporal access patterns to data
- A Commercial Workload (Barroso paper :))
  1.) OLTP (TPC-B)
  2.) DSS (TPC-D)
  3.) Web Search (AltaVista)
  I/O causes more kernel time and more idle time
- Multiprogramming and OS Workload
  2 copies of the Andrew Benchmark
  5 seconds on 8 processors
  no paging needed
  3 phases
    1.) compile: CPU bound
    2.) installing object files in a library: I/O bound
    3.) removing object files: mostly I/O and idle
- Scientific/Technical Applications
  The FFT Kernel
    3 data structures
      1.) input array
      2.) output array
      3.) precomputed read-only roots of unity matrix
    6 steps
      1.) transpose data
      2.) perform a 1D FFT on each row
      3.) multiply the roots of unity by the data and write to the data
      4.) transpose data
      5.) perform a 1D FFT on each row
      6.) transpose data
    partitioning
      contiguous chunks of rows, so all processors have local data
      transpose ensures that data is local for step 5
      all-to-all communication
      no temporal locality, but good spatial locality
  The LU Kernel
    LU factorization of a dense matrix
      similar to QR factorization, Cholesky factorization, and eigenvalue methods
    blocking is key for the sequential version
    partitioning
      1.) blocks assigned using 2D tiling
      2.) dense matrix multiplication done by the owner of the *destination* block
    data structure representation
      natural to use a 2D array
        however, then chunks are not contiguous
      instead use a 4D array
        first two dimensions are the grid partitioning
        last two dimensions are the sub-array
  The Barnes Application
    Barnes-Hut n-body simulation
      n^2 interactions between n bodies
      take advantage of the inverse square law to treat groups of remote bodies as one body
    octree data structure
      8 nodes represent 8 parts of a cube (cells)
      each node represents the bodies in its cell
      density varies, so trees are unbalanced
    algorithm
      for each body we traverse the tree once
      if a cell is far enough away, approximate it as a single mass at the cell's center
      otherwise open the cell and recurse
      if we reach leaves we calculate the force directly
    irregularity because of varying density, but
      1.) the system evolves slowly
      2.) forces fall off quickly
      few cells involved, most were the same as the last timestep
    partitioning
      each processor gets a separate subtree
      frequent communication within a subtree because of the physical locality of the octree
      partition not just on the number of nodes, but on the approximate amount of work
      patterns still relatively irregular
      need efficient fine-grained communication
  The Ocean Application
    red-black Gauss-Seidel multigrid technique
      red-black means alternating grid point updates like a checkerboard
      multigrid techniques solve finite difference equations using iterations over hierarchical grids
        each level of the hierarchy is an approximation with fewer points than the finer grids
        finer grids have more accuracy and need more computation to converge
        move up/down the grid hierarchy depending on the rate of change at the previous step
        solution reached when we have convergence at the finest level
    data representation
      dynamically allocated and sized grids for a particular problem
    partitioning
      the ocean basin is partitioned into square subgrids
      each processor has a local subgrid
    communication
      5 steps per iteration
      data exchanged between subgrids at each step
      communication along boundaries to nearest neighbors
  Computation/Communication for the Parallel Programs
    high comp/comm is desirable for scalability
    often sensitive to data set size and number of processors
    FFT - log n (all-to-all)
    LU - sqrt(n)/sqrt(p)
    Barnes - sqrt(n)/sqrt(p)
    Ocean - sqrt(n)/sqrt(p)
    sqrt(n)/sqrt(p) comes from the partitioning and communication pattern
      (each processor owns an n/p-point subgrid; computation grows with its area n/p,
      while communication grows with its perimeter ~sqrt(n/p), so the ratio is ~sqrt(n)/sqrt(p))
    need to increase problem sizes to keep comp/comm constant as we add processors
    LU and Ocean communication is determined by boundaries
6.3 Symmetric Shared-Memory Architectures 549
  bus-based multiprocessors helped by the addition of caches
    a cache lowers the bus bandwidth needs of a processor to main memory
    allows us to attach more CPUs to an existing bus
- What is Multiprocessor Cache Coherence?
  coherence
    what values can be returned by a read
    behavior of reads and writes to the same memory location
  consistency
    when a written value will be returned by a read
    behavior of reads and writes with respect to other memory locations
  coherence requirements
    1.) PROGRAM ORDER
        P writes A to X, no one else writes to X
        if P reads from X it should see A
    2.) COHERENT VIEW OF MEMORY
        P writes A to X, no one else writes to X
        Q reads X after a "sufficient" period of time, it should see A
    3.) WRITE SERIALIZATION
        two writes to the same location are seen in the same order by all processors
        if values 1 and 2 are written to a location, processors should not see 2 then 1
- Basic Schemes for Enforcing Coherence
  coherent multiprocessor caches provide both migration and replication of shared items
    migration
      reduces latency since the data is in the local cache
      reduces bus bandwidth since a cached access does not use the bus
    replication
      similarly reduces latency
      prevents contention for read-shared data, again lowering bus bandwidth
  two classes of cache coherence protocols
    directory based
      directories maintain the sharing status of each block of physical memory
    snooping based
      caches with copies of blocks also track the blocks' sharing status
      requires a shared bus for caches to snoop to see operations on shared blocks
- Snooping Protocols
  write invalidate
    take exclusive access before a write
    ensures only one copy in all caches
    enforces write serialization
  write update (aka write broadcast)
    track if the block is shared
    if shared, broadcast on a write
  performance differences
    1.) multiple writes to the same word
        write invalidate requires one invalidate
        write update requires multiple broadcasts
    2.) writes to different words of the same cache block
        write invalidate requires one invalidate per block
        write update requires a broadcast per word written
        (can try to merge word writes into a single line in the write buffer (chapter 5))
    3.) write on one processor, read on another
        write invalidate requires the reader to reload the shared block (even if it was previously in cache)
        write update has the updated values available immediately (if the block was in cache)
  performance summary
    bus and memory bandwidth are usually the limiting factor in snooping scalability
    invalidation generally generates less bus and memory traffic
    update also does not play well with memory consistency models
      which makes 3.) less valuable than it seems
    update can be fine for small CPU counts (2-4)
      chip multiprocessors (CMP)
  focus on invalidate as the common case for the rest of the chapter
- Basic Implementation Techniques
  write invalidate
    acquire bus access
    broadcast the invalidation address
    snooping caches check for the address, invalidate if present
  write serialization
    one processor must obtain bus access before the other
    a write to shared data cannot complete until bus access is acquired
  locating a block on a cache miss
    write through: look in memory, although write buffers complicate this
    write back: could be in other caches
      caches snoop loads to memory
      if they see a matching dirty block, cancel the memory request and return the result
    focus on write-back as the common case
  handling writes in write-back
    add a shared bit to the cache block
    only need to invalidate if the block is shared
  contention for cache tags
    don't want to block the CPU from accessing the cache
    - duplicate tags with a second port for snoop access
    - an inclusive L2 provides a similar effect
      but need to keep L1/L2 tags in sync
      need to get data from L1 on a snoop hit
- An Example Protocol
  simple protocol in Figure 6.10
  extensions
    ownership or upgrade misses
      instead of treating a write hit as a write miss, upgrade the read to a write; we already have the data
      the "owner" has an exclusive copy to write
    shared vs shared+private
      shared+private avoids bus traffic on a write hit
    cache-to-cache transfers
      avoid going to memory for blocks in other caches
  CPU state machine
    similar to a plain cache, just need to add a bus action to invalidate on a write miss
  Bus state machine
    only needed for SMP coherence
  protocol property
    shared blocks always mirror memory values
  atomicity problems
    the protocol assumes atomic transactions
    not compatible with modern split-transaction buses
    non-atomic actions can cause deadlock
  small scale implementation
    Pentium III Xeon and Pentium 4 Xeon
    can connect two without any glue logic
    two extra chips for up to a 4-way system
6.4 Performance of Symmetric Shared-Memory Multiprocessors 560
  sharing
    true sharing
      semantic sharing of data between nodes
    false sharing
      unrelated items in the same cache block (see the sketch below)
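  a C sketch of false sharing: two counters that are never logically shared but land in the same cache block, so each thread's writes invalidate the other's copy; the 64-byte block size and the padding fix are assumptions for illustration (C11 threads):

    #include <threads.h>

    #define BLOCK 64                       /* assumed coherence block size */

    struct counters_false {                /* a and b share one block      */
        long a;
        long b;
    };

    struct counters_padded {               /* padding puts them in separate blocks */
        long a;
        char pad[BLOCK - sizeof(long)];
        long b;
    };

    static struct counters_padded ctr;

    static int worker_a(void *arg) { for (int i = 0; i < 1000000; i++) ctr.a++; return 0; }
    static int worker_b(void *arg) { for (int i = 0; i < 1000000; i++) ctr.b++; return 0; }

    int main(void)
    {
        thrd_t ta, tb;
        thrd_create(&ta, worker_a, 0);     /* each thread writes only its own counter */
        thrd_create(&tb, worker_b, 0);
        thrd_join(ta, 0);
        thrd_join(tb, 0);
        return 0;
    }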
- Performance Measurements of the Commercial Workload (Barroso paper)
  Figure 6.14: increasing cache size helps :)
  Figure 6.15: especially the I-cache
  Figure 6.16: true sharing is a scaling issue
  Figure 6.17: increasing block size helps, even up to 256 bytes
    mostly helps true sharing misses (good shared data locality)
    does not help instruction misses (poor instruction locality)
- Performance of the Multiprogramming and OS Workload
  Figure 6.18: higher miss rate for kernel than user
    also drops more slowly for kernel vs user as cache size is increased
    kernel issues
      has to initialize (zero) user pages
      has true sharing
    user processes are not parallel
      only suffer certain misses on process migration
  Figure 6.19: increasing cache size decreases capacity misses for the kernel :)
  Figure 6.20: increasing block size has a significant impact on kernel misses...
    (remember all those zeroed pages?)
  Figure 6.21 confirms that compulsory misses are the reason for the 6.20 improvement
  Figure 6.22: however, increasing block size increases memory traffic
    bytes/reference grows quickly for kernel traffic
- Performance of the Scientific/Technical Workload
  Figure 6.23: miss rate breakdown as processor count increases
    FFT - capacity misses become coherence misses
    LU - flat
    Barnes - capacity becomes coherence, with coherence dominating
    Ocean - capacity increases, then decreases, with flat coherence
  Figure 6.24: miss rate breakdown as cache size is increased
    FFT - slight drops in capacity and coherence
    LU - flat
    Barnes - capacity to zero, coherence scaling
    Ocean - drop in capacity, coherence flat
    working set effect (plateaus)
      FFT isn't helped much because the primary working set is already in cache
      need to have both matrices fit before a noticeable improvement
  Figure 6.25: miss rate breakdown as block size is increased
    FFT - both capacity and coherence improve
    LU - both capacity and coherence improve
    Barnes - mostly coherence misses; miss rate actually worsens due to false sharing (but still less than 1%)
    Ocean - both capacity and coherence improve
  Figure 6.26: bus traffic for figure 6.25 as block size increases
    FFT - getting worse (high miss rate)
    LU - low and flat
    Barnes - low and flat
    Ocean - getting worse (high miss rate)
- Summary of Snooping Cache Schemes
  commercial
    very large off-chip caches (4+MB) with large block sizes (64-128)
  multiprogrammed
    user time similar to the best scientific/technical
    kernel time similar to the worst scientific/technical
  scientific/technical
    these highly tuned apps have a reasonable component of coherence misses
    untuned apps may have a lot more (false sharing, etc.)
6.5 Distributed Shared-Memory Architectures 576
- Scalable shared memory
  one approach: non-coherent memory
    Cray T3D/E
    shared data marked uncacheable
    the application can copy shared data to private to build its own caches
      and be responsible for its own coherence and consistency
    problems
      1.) poor compiler support for managing such coherence
          large granularity is too conservative
      2.) no small-block spatial locality
          no nearly free cache hits for approximately the cost of a word access
      3.) prefetching not as effective
          want to grab small-granularity cache blocks
          want to keep them coherent
  snooping issues
    even if hardware buses scale to more devices,
    snooping cache coherency protocols don't scale
  the directory alternative
    basic approaches scale up to approximately 200 processors
      additional "scalable directory" work needed beyond that
    no central directory
      the directory is distributed among the node memories
    key property
      the sharing status of a block is known in a single location
      avoids broadcast
- Directory-Based Cache Coherence Protocols: The Basics
  two primary operations
    read miss
    write to a shared, clean block
    (a write miss is a combination of the two)
  cache block states
    shared - one or more processors and memory have an up-to-date copy
    uncached - no processor has a copy
    exclusive - the owner has written the block, memory is out-of-date
  sharers vector (see the sketch below)
    bit vector listing sharers in the shared state (and the owner in exclusive)
  local caches mirror the directory state for the blocks they contain
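  a C sketch of a per-block directory entry with the three states above plus a sharers bit vector, and the home-node mapping from a statically partitioned physical address space; the field widths, 64-node limit, and per-node memory size are assumptions for illustration:

    #include <stdint.h>

    enum dir_state { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE };

    struct dir_entry {
        enum dir_state state;
        uint64_t       sharers;       /* bit i set => node i has a copy
                                         (the owner, if state is exclusive) */
    };

    #define BLOCK_BYTES     64u
    #define BLOCKS_PER_NODE (1u << 24)     /* assumed memory per node */

    /* home node: which node's memory (and directory) holds this block */
    static unsigned home_node(uint64_t paddr)
    {
        return (unsigned)((paddr / BLOCK_BYTES) / BLOCKS_PER_NODE);
    }

    /* a read miss arriving at the directory: add the requester as a sharer
     * (a real protocol would first fetch the dirty copy back if exclusive) */
    static void dir_read_miss(struct dir_entry *e, unsigned requester)
    {
        e->sharers |= 1ull << requester;
        if (e->state == DIR_UNCACHED) e->state = DIR_SHARED;
    }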
  comparison to snooping
    identical states and transitions in the cache state machine, with slightly different actions
    the interconnect is not a bus
      1.) not a source of arbitration
      2.) message oriented, many messages need responses
  nodes
    local node - where the request originates
    home node - location of the directory entry for a specific cache block
    remote nodes - other involved parties, such as sharers or the owner
  mapping a block to its home node
    statically partitioned physical address space
  example assumes an ordered interconnect
- An Example Directory Protocol
  the directory entry has a global view of all sharers
    unlike snoopy, where no one had a global view
  sharers vector
    up to 64 nodes, usually a simple bit vector
    beyond 64, a "scalable directory" approach is needed
      switch from a dense to a sparse representation
      if more than "n" sharers, only track clusters of sharers
  extensions
    forwarding
      ship data directly from the remote node to the new local node, not via the home node
6.6 Performance of Distributed Shared-Memory Multiprocessors 584
  location of shared data depends on
    initial allocation
    sharing patterns
  local versus remote misses
    break them down separately since they have significantly different miss penalties
  Figure 6.31: miss rate as processor count is increased
    FFT - flat
    LU - flat
    Barnes - flat
    Ocean - rises at 64 cpus because of coherence misses (data set too small...)
  Figure 6.32: miss rate as cache size is increased
    FFT - remote drops, local constant
    LU - drops
    Barnes - drops
    Ocean - remote drops, local constant
  Figure 6.33: miss rate as block size is increased
    FFT - drops
    LU - drops
    Barnes - increases, but still less than 1%
    Ocean - drops
  Figure 6.34: bytes/reference (tracking 6.33)
    FFT - increasing
    LU - flat
    Barnes - increasing
    Ocean - increasing
  Figure 6.36: average cycles per reference (similar to AMAT)
    FFT - decreasing
    LU - flat
    Barnes - flat
    Ocean - 8,16,32,64 CPUs => 3, 2.5, 2.75, *5.5*
      (matching figure 6.31, higher miss rate => higher memory cost)
  issues in real directory protocols
    nonatomicity of operations
    finite buffering increases the possibilities of deadlock
6.7 Synchronization 590
  high-level synchronization is built with user-level software supported by hardware primitives
  small-scale and low-contention cases
    an uninterruptible instruction (or instruction pair) that atomically reads and changes a value
    spin locks are easy to build with coherence-based instructions
- Basic Hardware Primitives
  need to be able to atomically read and update a value, testing if the operation was successful
  - atomic exchange
    unlocked = 0, locked = 1
    lock(): atomically exchange 1 for the lock value, test the old value
  - test-and-set
    very similar to atomic exchange
  - fetch-and-increment
    same basic code for lock()
  - load linked / store conditional (aka load locked)
    two instructions instead of one
    easier hardware implementation (especially on split-transaction buses?)
    can be used to implement the other "primitives" such as fetch-and-increment
    issues
      fails if another store to the address occurs between the ll and the sc
        keep the sequence short to prevent livelock
      fails on any processor exception
        generally only safe to do register operations between ll/sc
        other memory references could fault
- Implementing Locks Using Coherence
  spin locks: busy wait to acquire the lock
  with no cache coherence
    LOCK:   daddui r2,r0,#1   ; load immediate
    lockit: exch   r2,0(r1)   ; atomic exchange
            bnez   r2,lockit  ; already locked?
    UNLOCK: st     r0,0(r1)   ; just write 0
    keep lock variables in memory
      atomic instructions operate on memory directly
      (in fact they may be implemented there for better scalability even on cache-based systems)
      (SGI Origin paper: moved fetch-and-op to run directly at the memory system)
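  a C11-atomics sketch of the exchange-based spin lock above, together with the "spin on a local cached copy" refinement discussed next (test-and-test-and-set); a sketch for illustration, not the book's code:

    #include <stdatomic.h>

    typedef atomic_int spinlock_t;         /* 0 = unlocked, 1 = locked */

    static void lock_simple(spinlock_t *l)
    {
        /* every iteration is a write (exchange), so every spinner
           generates bus/interconnect traffic */
        while (atomic_exchange(l, 1) != 0)
            ;
    }

    static void lock_ttas(spinlock_t *l)
    {
        for (;;) {
            while (atomic_load(l) != 0)        /* spin reading the cached copy */
                ;
            if (atomic_exchange(l, 1) == 0)    /* try the real exchange once   */
                return;
        }
    }

    static void unlock(spinlock_t *l)
    {
        atomic_store(l, 0);                    /* just write 0 */
    }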
    benefits with cache coherence
      1.) spin on a local cached copy
      2.) locality in lock access
    #1 requires that we spin with read-only operations, no writes
      LOCK:
      lockit: ld     r2,0(r1)   ; read lock
              bnez   r2,lockit  ; spin if bad value
              daddui r2,r0,#1   ; same as before
              exch   r2,0(r1)   ; same as before
              bnez   r2,lockit  ; same as before
    the ll/sc version already splits the read from the write
      easy to skip the sc unless the load returned the expected value
- Synchronization Performance Challenges
  spin lock contention problems
    the directory or bus acts as a point of serialization
    interconnect traffic is an issue in addition to contention
    bus "fairness" delays quick release
  Barrier Synchronization
    a simple barrier has a problem with reuse
      a sense-reversing barrier solves this by alternating the test on each reuse
    barrier performance
      lots of little lock operations, therefore lots of traffic
      contention on the barrier lock is not uncommon
- Synchronization Mechanisms for Larger-Scale Multiprocessors
  GOAL: a synchronization mechanism with:
    1.) low latency in uncontended cases
    2.) minimal serialization under contention
  Software Implementations
    spin lock issues
      contention overhead when many processes are contending
      one solution is exponential backoff
      queueing locks
    barrier issues
      contention during both the gather and release stages
      one solution is a combining tree
        a 4-way tree structure seems to be a good way to avoid contention on a single lock
      MPPs: hardware support for barriers (T3D and CM-5)
  Hardware Primitives
    queueing locks
      a spin lock suffers from the notifyAll() problem of waking up all waiters when only one can take the lock
      instead use notify() to just wake up one waiter from the "queue"
      bus-based: just implement in software
      directory-based
        the directory tracks waiters of locks using a sharing-vector-like structure
        need to identify access to the lock
        need to identify lock release
        need to cope with a context switch (and possible migration) of the process holding the lock
    barrier performance
      fetch-and-increment
6.8 Models of Memory Consistency: An Introduction 605
  trying to run an if block on at most 1 processor (C11 sketch below):
    P1:  A = 0;               P2:  B = 0;
         .....                     .....
         A = 1;                    B = 1;
    L1:  if (B == 0) ...      L2:  if (A == 0) ...
  however, if the "= 1;" assignments are delayed relative to program order, both ifs can run
  "sequential consistency"
    maintain program order
    no surprises
    SC reduces potential performance
    option 1.) try to hide latency to preserve performance
    option 2.) relaxed consistency models
- The Programmer's View
  sequential consistency
    simple to understand
    but limits hardware and compiler reordering techniques
  synchronized programs (aka data-race-free programs)
    avoid "data races"
    all accesses to shared data are ordered by synchronization primitives
      one synchronization primitive after a write of shared data
      one synchronization primitive before a read of shared data
    most programs are already synchronized
      using sequential consistency directly can be tricky
      building your own synchronization primitives can be error prone
      most people use library routines for synchronization primitives
        the library can be architecture specific, dealing with SC or relaxed consistency as needed
- Relaxed Consistency Models: The Basics (Adve paper)
  relaxed consistency allows reordering of reads and writes
    1.) relax W->R ordering
        TOTAL STORE ORDERING or PROCESSOR CONSISTENCY
        retains order among writes
        most sequentially consistent programs work with TSO/PC
    2.) relax W->W ordering
        PARTIAL STORE ORDER
    3.) relax R->W and R->R ordering
        WEAK ORDERING, ALPHA, POWERPC, RELEASE CONSISTENCY
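  the two-processor example from 6.8 above written with C11 atomics; with memory_order_relaxed the stores may be reordered past the loads (as on a weakly ordered machine) and both ifs can fire, while the default memory_order_seq_cst restores the sequentially consistent outcome; a sketch for illustration, not code from the book:

    #include <stdatomic.h>
    #include <threads.h>

    static atomic_int A, B;
    static atomic_int both_ran;

    static int p1(void *arg)
    {
        atomic_store_explicit(&A, 1, memory_order_relaxed);        /* A = 1; */
        if (atomic_load_explicit(&B, memory_order_relaxed) == 0)   /* L1     */
            atomic_fetch_add(&both_ran, 1);
        return 0;
    }

    static int p2(void *arg)
    {
        atomic_store_explicit(&B, 1, memory_order_relaxed);        /* B = 1; */
        if (atomic_load_explicit(&A, memory_order_relaxed) == 0)   /* L2     */
            atomic_fetch_add(&both_ran, 1);
        return 0;
    }

    int main(void)
    {
        thrd_t t1, t2;
        thrd_create(&t1, p1, 0);
        thrd_create(&t2, p2, 0);
        thrd_join(t1, 0);
        thrd_join(t2, 0);
        /* both_ran == 2 is possible with relaxed ordering,
           impossible under sequential consistency */
        return 0;
    }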
- Final Remarks on Consistency Models
  1.) write synchronized programs using standard synchronization with architecture-specific implementations
  2.) speculation allows SC or PC to be implemented cheaply
      see pages 618-619 below for more
6.9 Multithreading: Exploiting Thread-Level Parallelism within a Processor 608
- multithreading
  allow multiple threads to share the functional units
  multiple sets of thread state
    register file
    PC
    separate page table
  fine-grained multithreading
    switch threads on each instruction
    round robin, skipping any stalled threads
    pro: hides the latency of stalls
    con: slows down non-stalled threads
  coarse-grained multithreading
    switch threads on "costly" stalls (L2 misses)
    pro: does not slow down non-stalled threads
    con: doesn't cope with short pipeline stalls
- Simultaneous Multithreading: Converting Thread-Level Parallelism into Instruction-Level Parallelism
  run multiple threads on different functional units at the same time
    1.) superscalar with no multithreading support:      "AA" "AX" "XX" "XX" "AA"
    2.) superscalar with coarse-grained multithreading:  "AA" "AX" "BB" "BX" "AA"
    3.) superscalar with fine-grained multithreading:    "AA" "BB" "AX" "BX" "BB" "AA"
    4.) superscalar with simultaneous multithreading:    "AB" "AB" "AB" "BB" "AA"
  hardware
    per-thread renaming table
    separate PCs
    multi-thread commit: logically separate reorder buffer
  Design Challenges in SMT Processors
    preserving single-thread performance
      a "preferred" thread gets priority
      not perfect
        the preferred thread is run most of the time
        less mix of other threads to choose instructions from
        when the preferred thread stalls, not many options are available
    fetch issues
      single-thread performance needs to prefetch as much as possible
        space being consumed by other threads
      mispredicted branches need to fetch immediately
        could be fetching for another thread
    generally the preferred-thread approach is sufficient, it's just not perfect
      1.) overhead is small and simple choices (preferred thread) are good enough
      2.) superscalar efficiency (utilization) is low, so there are enough extra resources
    notes
      - highly attractive for the throughput-oriented server market (Niagara)
      - can be disabled for single-threaded performance (Intel Hyperthreading)
  Potential Performance Advantages from SMT
    - 8 contexts
      apps
        multiprogrammed SPEC CPU INT
        Apache
        Barroso (OLTP and DSS)
      speedups of 1.7 to 4.2
    - utilization much higher
      functional unit utilization much higher
      fetch utilization slightly lower
      data cache hit rate
    - commercial
      Alpha 21464 with 4 SMT threads (canceled)
      Pentium 4 Xeon with 2 SMT threads
        30% performance improvement for server applications
        not as good because of segregated instruction issue
          need to share everything for best performance
6.10 Crosscutting Issues 615
- Memory System Issues
  Inclusion and Its Implementation
    "multilevel inclusion" (aka "subset property")
    large L2 blocks contain multiple L1 blocks
      on an L2 miss, we potentially evict multiple L1 blocks
    most processors direct the L2 to evict L1 blocks
      otherwise L1 has to snoop to see changes to blocks it contains that are not in L2
  Nonblocking Caches and Latency Hiding
    nonblocking caches are important for multiprocessors:
      1.) hide long multiprocessor memory latencies
      2.) allow weak and relaxed consistency models
      3.) are critical for prefetching
    prefetching and coherence
      "nonbinding" prefetching brings values into the cache, kept coherent
      "binding" prefetching brings a value into a register, out of view of the coherence model
      nonbinding prefetch is critical for multiprocessor cache coherence
    nonbinding prefetch implementation
      1.) the local node needs to keep track of multiple outstanding requests
      2.) need to check for multiple requests for the same block
      3.) cannot assume stall on miss, increasing cache controller complexity (both snoop and directory)
  Compiler Optimizations and the Consistency Model
    sequential consistency prevents reordering of loads and stores to shared data
      prevents register allocation because it can cause load/store reordering
    synchronized program model
      the programmer tells the compiler the synchronization points explicitly (or implicitly in HPF)
  Using Speculation to Hide Latency in Strict Consistency Models
    use speculation to provide stronger consistency without a big performance impact
      a write invalidate causes the speculation to fail
      the common case is no invalidation, no extra cost
    Hill* 1998 paper
      1.) SC/PC with speculation is fast enough relative to relaxed models
      2.) very little hardware cost
      3.) easier to program
    MIPS R10000 (mid-1990s) followed this approach
    downside
      compilers with the ability to perform alias analysis might want access to relaxed models
- Using Virtual Memory to Build Shared Memory
  let the OS/VM system implement coherence instead of hardware
  "distributed virtual memory" or "shared virtual memory"
    similar to a directory, just with very, very large blocks
    each page has a home node
  migration and replication problems
    false sharing is a major issue with large blocks
      fixing it leads to sparsely used pages
    significant software overhead
  a substitute for loosely coupled message passing
  future?
    a better implementation could lower OS overhead
    hardware support could lower OS overhead
    the compiler could help with false sharing
    multiple (smaller) page sizes could help with page usage and false sharing
- Performance Measurement of Parallel Processors
  relative to uniprocessor code and hardware
  not just performance for a single configuration but scalability
    measure for a fixed problem size
    measure by scaling the problem size
      memory-constrained scaling: keep memory/processor constant
      time-constrained scaling: keep total execution time constant
    often need to change the algorithm with the problem size
      example: number of iterations for convergence
    see the "linear speedup" pitfall below
6.11 Putting It All Together: Sun's Wildfire Prototype 622
- Intro
  hybrid SMP/DSM
  trying to keep memory access as uniform as possible
    more important for commercial workloads with less predictable access patterns
- The Wildfire Architecture
  2-4 nodes
    each node is an E6000 with 14 boards with 2 processors
    112 total processors (4*14*2)
  upper 2 bits of the physical address partition memory among the 2-4 nodes
  simple directory
    the directory is a local cache backed by remote main memory
    remote requests take 4-5 cycles (need to check the local cache first)
- Using Page Replication and Migration to Reduce NUMA Effects
  (similar to the SGI Origin 2000 ccNUMA)
  "coherent memory replication" CMR
    a version of Simple-COMA
    "cache-only memory architecture" COMA
  counters track misses to remote pages
    replicate or migrate based on sharing patterns
  still HW cache coherent, page movement is just an optimization to help
  need a node map of "local" addresses to "global" addresses for moved pages
- Performance of Wildfire
  Basic Performance Measures: Latency and Bandwidth
    Wildfire prototype
      system up to 112 processors
      GOAL: hide NUMA for commercial workloads
    SGI Origin 2000
      system up to 2048 nodes
      GOAL: scalable bandwidth for scientific/technical workloads
    latency measurement summary
      Wildfire's large nodes increase the amount of "local" memory, but have typically higher "remote" memory costs
    bandwidth measurement summary
      Origin generally has greater bandwidth, 3x/processor for local and bisection
- Application Performance of Wildfire
  Performance on the OLTP Workload
    numbers validate the importance of:
      CMR replication and migration
      a locality-aware scheduler
      as many processors per node as possible
      intelligent data placement
    relative performance proportional to local access
    unfortunately they only evaluate a 16-processor system, where a single E6000 is better than Wildfire
  Performance of Wildfire on a Scientific Application
    24 cpus: 3-way (3*8) Wildfire beats single-node E10000 (1*24) and E6000 (1*24)
    36 cpus: 4-way (4*9) Wildfire beats 2-way (2*18) Wildfire and E10000 (1*36)
      super-linear speedup for 4-way vs 24 cpus
      2-way suffers from bus congestion
    CMR is powerful
      can start with the data all on one node
      after 2 or 3 minutes the data has moved to where it should be
      costs
        migration is very low cost
        replication is harder, needs reverse memory maps
- Concluding Remarks on Wildfire
  the commercial market favors fat nodes
    1.) more people want to buy a box with a dozen processors than a 1000-cpu beast
    2.) irregular memory access favors UMA over NUMA
  cons
    lower scalability and higher remote access costs (latency and bandwidth)
6.12 Another View: Multithreading in a Commercial Server 635
  IBM RS64 III (Pulsar)
    PowerPC processor
    RS/6000 and AS/400 (where it's known as the A50)
  goals
    1.) reuse the statically scheduled Northstar processor
    2.) small performance penalty in silicon and clock
    3.) single-threaded performance must not suffer
  implementation
    1.) two threads (#2 above)
    2.) coarse-grained multithreading (#1 and #3)
  details
    < 10% area impact for the duplicate register file and PC
    new register to limit the time spent on one thread
    significant improvement in server application throughput
6.13 Another View: Embedded Multiprocessors 636
  MXP processor for VoIP systems
    1.) serial voice stream interface
    2.) packet routing and channel lookup
    3.) ethernet interface including the MAC layer
    4.) 4x MIPS32 R4k processors with private caches
  the 4 MIPS cores handle
    - QoS
    - echo cancellation
    - simple compression
    - packet encoding
  embedded is good for multiprocessing
    1.) binary compatibility not as critical
    2.) natural parallelism in set-top box, network, game systems
6.14 Fallacies and Pitfalls 637
- Pitfall: Measuring performance of multiprocessors by linear speedup versus execution time.
  relative speedup vs true speedup
    machine-relative vs best-case uniprocessor
  superlinear speedup is suspicious unless due to cache effects
- Fallacy: Amdahl's Law doesn't apply to parallel computers. (see the sketch below)
- Fallacy: Linear speedups are needed to make multiprocessors cost-effective.
- Fallacy: Multiprocessors are "free."
  glueless processors are nice but still need a better memory controller, etc.
  puts more demands on other links in the system
- Fallacy: Scalability is almost free.
  Hardware
    an AlphaServer is much cheaper price/performance than a Cray
    in the Sun Enterprise line, interconnect cost grows faster than linear
  Software
    much more work to create parallel versions of code
    load balance, locality, contention, serial sections of code
- Pitfall: Not developing the software to take advantage of, or optimize for, a multiprocessor architecture.
  SPARCCenter: SunOS was not ready for a NUMA machine
    no page placement, etc.
  IRIX: a single lock on the page table prevents scaling of even a multiprogrammed workload
    need to move to fine-grained locking
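  a quick numeric illustration of why Amdahl's Law still applies to parallel computers: speedup = 1 / ((1 - f) + f/p) for parallel fraction f on p processors; the fractions below are assumed values for illustration only:

    #include <stdio.h>

    static double speedup(double f, int p) { return 1.0 / ((1.0 - f) + f / p); }

    int main(void)
    {
        int procs[] = { 8, 64, 1024 };
        for (int i = 0; i < 3; i++)
            printf("f=0.99, p=%4d -> speedup %.1f\n", procs[i], speedup(0.99, procs[i]));
        /* even with 99% parallel code, 1024 processors give only ~91x,
           so linear speedup requires essentially no serial fraction */
        return 0;
    }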
- Pitfall: Neglecting data distribution in a distributed shared-memory multiprocessor.
  Ocean would be 2x slower with round-robin versus tiled memory partitioning
6.15 Concluding Remarks 643
  - the future is bright
  - better understanding of software via scientific/technical and commercial workloads
  - multiprocessors and clusters are a commonly accepted approach for scaling performance
  - multiprocessors great for multiprogrammed workloads (OS issues fixed)
  - multiprocessors effective for certain commercial workloads (although some cluster)
  - chip multiprocessors:
    1.) natural parallelism in embedded
    2.) more effective use of more transistors in desktop/server
- The Future of MPP Architecture
  4 alternatives
    1.) scale up large-scale multiprocessors using proprietary interconnects
        SGI Origin and Cray T3E
        1.) not cost effective at small scale
        2.) programming model may be more complex
    2.) clusters of midrange multiprocessors with proprietary and standard interconnects
        Wildfire (HP, Sun, SGI)
        sometimes need to switch to message passing
    3.) clusters of uniprocessors with a custom interconnect
        IBM SP-2
        the custom interconnect is costly
    4.) clusters of all off-the-shelf components
        PC or workstation nodes
        ATM or GigE
        standard OS and tools
        message passing
- The Future of Multiprocessor Architecture
  ILP dead
    performance/area, performance/connection, performance/watt all dropping
  chip multiprocessors?
    Power4 (2x Power3 cores)
- Evolution versus Revolution and the Challenges to Paradigm Shifts in the Computer Industry
  Level 1: microprogramming, pipelining, cache, time-sharing, MIMD
  Level 2: virtual memory, vector instructions
  Level 3: there is no level 3
  Level 4: RISC
  Level 5: massive SIMD, latency MIMD, special purpose
  How could multiprocessor adoption accelerate?
    1.) no other choice
    2.) hw/sw evolution
    (today #1 and #2 are in play)
6.16 Historical Perspective and References 649
- SIMD Computers: Several Attempts, No Lasting Successes
  Illiac IV 1972
    disaster of cost and performance
  Connection Machine 1985
    65k 1-bit computers
  need SISD to do control flow for SIMD
    also want to mask operation of individual SIMD nodes
  data parallelism great for loop parallelism
  custom processors cannot leverage the microprocessor industry
    often have very bad single-CPU performance
  array processor special cases good for image and signal processing
- Other Early Experiments
  the first Eckert-Mauchly computer had duplicate units for availability
  CMU 1970s
    C.mmp
      16x PDP-11s
      crossbar to 16 memory units
      shared memory
      focus on OS
    Cm*
      cluster with NUMA distributed memory
      no caches
- Great Debates in Parallel Processing
  TFLOP machine by 1995?
    well, 1999, and it cost the estimated $100 million
    not exactly scalable performance
  few (<100) or many (>1000) data streams
    few wins, MPP dead...
- More Recent Advances and Developments
  The Development of Bus-Based Coherent Multiprocessors
    Bell 1985: multiprocessor vs multicomputer
    Goodman: snooping cache coherence
  Towards Large-Scale Multiprocessors
    Message passing
      Caltech Cosmic Cube
        advances in routing and interconnect, lowering cost
      Intel iPSC 860 based on these ideas
      Intel Paragon (also i860)
        lower dimensionality, more individual links
      CM-5
        off-the-shelf processor and fat-tree interconnect
        direct user access to communication to lower latency
    Shared memory
      IBM RP3, NYU Ultracomputer, UIUC Cedar, BBN Butterfly and Monarch
        NUMA without cache coherence
        RP3 and Ultracomputer: synchronization research such as fetch-and-op
        interconnect more costly than the processing nodes
      DSM (before snooping): IBM 3081
    Distributed Directories
      Agarwal* 1988
      Stanford DASH
        first operational ccNUMA
        nodes were "plump" 4-way SMPs
      KSR-1
        first commercial COMA approach (memory as cache)
      Swedish Institute of Computer Science
        true COMA, no predefined nodes for memory
      COMA-F (Flat COMA)
        home location allocated
      Reactive NUMA
        Wood*: best of NUMA + S-COMA
      Convex Exemplar
        two level: 8-processor nodes, 32-module ring
      SGI Origin (Stanford DASH++)
        bit vector
          fine grained: 1 bit = 1 node = 2 procs
          coarse grained: 1 bit = 8 nodes = 16 procs
      MIT Alewife
        Agarwal*
        processor support for multithreading
        cooperative hw/sw handling of coherence
      Stanford FLASH
        programmable (MAGIC) processor for:
          coherence
          message passing
          synchronization
          performance measurement
  Developments in Synchronization and Consistency Models
    Lamport 1979: sequential consistency
    weak ordering, 1986
    Adve and Hill*, 1990: better definition of weak ordering, data-race-free
    Gharachorloo 1990: release consistency
- Other References
  Kai Li* Ivy: using VM to implement DSM
- Multithreading and Simultaneous Multithreading
  multithreading dates back to the TX-2
    MIT Lincoln Labs 1959
    one of the first transistor computers
    context switching to handle I/O
  CDC 6600 used it in the I/O processor
  HEP: fine-grained multithreading to hide cache and pipeline latencies
  SPARCLE in MIT Alewife
    Agarwal*
    coarse grained
  1990s
    1.) fine-grained for maximum performance (keep pipelines going) Horowitz*
    2.) need ILP+TLP to fill all units
        SMT came quickly after dynamically scheduled superscalars
        Eggers*: first realistic simulation and coined the "SMT" term
Exercises 665
Chapter 7 Storage Systems
7.1 Introduction 678
- Introduction
  neglected by CPU architects
    institutional prejudice: use of CPU time for measurement
    even the term "peripheral" is a bit pejorative
  common sense
    a computer is useless without I/O
    response time is what matters to users
- Does I/O Performance Matter?
  just run other compute while waiting for I/O?
    a throughput-only view ignores interactive users
    on single-user systems, there is nothing to do while waiting for I/O
    even in TPC and other throughput benchmarks, response time requirements matter
  I/O's revenge
    with great increases in CPU performance, Amdahl's law leaves I/O as the performance limit
- Does CPU Performance Matter?
  old view: need to keep the big expensive CPU busy (scientific/technical)
  new view: cheap CPUs, just enough to keep the I/O busy (commercial)
- Does Performance Matter?
  would you rather have a faster or more stable CPU?
    people tolerate machine crashes for faster performance
  would you rather have a faster or more stable disk?
    people DO NOT tolerate data loss
7.2 Types of Storage Devices 679
- Magnetic Disks
  cylinder, platters, tracks, sectors
  constant bit density
    more sectors on the outer tracks than the inner tracks
  arm, read/write head
  seek, seek time
    average seek time
      sum of the time for all possible seeks divided by the number of possible seeks
  rotational latency (aka rotational delay)
  transfer time
  read ahead
  disk controller
    controller time
    queuing delay
  (a worked access-time example appears below, after 7.3)
  disk arrays
    array controller
- The Future of Magnetic Disks
  areal density
  diameter
    smaller diameter means smaller mass
    saves power as well as physical volume
  access time
    DRAM latency 100,000x better
    DRAM bandwidth only 50x better
- Optical Disks
  optical compact disks (CDs)
  digital video discs (aka digital versatile discs) (DVDs)
- Magnetic Tapes
  disk vs tape
    - fixed rotating platters, limited area, sealed in the device
    - linear scan only, "unlimited" area, removable media
  helical scan tapes
    the tape moves at a slower speed
    the head records at an angle to the tape direction and moves faster
    20x-50x improvement in recording density
    VCRs, camcorders, ... brought costs down
  issues
    - tapes wear out, helical faster than longitudinal tapes
    - helical read/write heads wear out
    - rewind, eject, load times for all tapes
      spin-up times for helical
    - R&D costs higher for industry compatibility
  future
    - tapes no longer bigger than disks (2001)
    - price per gigabyte no longer less (2001: a 40GB disk and tape cost the same)
    - bandwidth falling behind, hurting recovery times for disks backed up to tape
    - future: backup over the network to remote disks...
- Automated Tape Libraries
  "nearline" tapes
    tape robots
    terabytes within ten seconds
  STC PowderHorn
    6000 tapes for 300TB
    10 tape readers
    450 tapes per hour
    1.1 kilowatts
    4+ tons
    can hook 16 together and have them pass tapes!
    10x failure rate for all those moving parts
- Flash Memory
  similar to EPROM or EEPROM
    larger block size than EEPROM to reduce die area for control logic
  compared to disks
    low power
    small sizes
    DRAM read times
  compared to DRAMs
    slow write times
      erase first, write new data second
  NOR vs NAND
    NOR based: erases faster, writes slower, 100,000-cycle wear out
    NAND based: erases slower, writes faster, 1,000,000-cycle wear out
  2001: 6x the cost of DRAM, 600x the cost of disk
7.3 Buses - Connecting I/O Devices to CPU/Memory 692
  bus
    communication bottleneck
    replaced by networks and switches
      storage area networks, next chapter
  CPU-memory buses versus I/O buses (versus mezzanine)
    CPU-memory
      short
      high speed
      proprietary
      fixed speed
      synchronous
    I/O buses
      longer
      slower
      standard
      devices vary in latency and bandwidth
      asynchronous
    mezzanine (a little of both)
      PCI
      short
      synchronous
      often bridges to slower I/O buses
  bus transactions
    read transactions and write transactions
    send the address and read/write the data
    reads often require waiting for data
    writes often just write and forget
- Bus Design Decisions
  decisions affecting cost/performance
    FEATURE                          FASTER                           CHEAPER
    separate address and data lines  separate                         multiplexed
    data width                       wider                            narrower
    transfer size                    multiple words                   single word
    bus masters                      multiple (requires arbitration)  single (no arbitration)
    split transaction                yes (request/reply packets,      no
                                     needs multiple bus masters)
    clocking                         synchronous                      asynchronous
  split transaction
    aka connect/disconnect, aka pipelined bus, aka pended bus, aka packet-switched bus
  synchronization
    skew limits bus length
    asynchronous tolerates varying-speed devices, but adds handshake overhead
- Bus Standards
  PDP-11 Unibus
  IBM PC-AT Bus (16-bit ISA)
  PCI
- Examples of Buses
  parallel: IDE/Ultra ATA, SCSI, PCI, PCI-X
  serial: I2C, 1-wire, RS-232, SPI (USB?)
  CPU-memory: HP HyperPlane Crossbar, IBM SP, Sun Gigaplane-XB
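  a worked example of average disk access time using the terms from the magnetic-disk discussion in 7.2 above (access = seek + rotational latency + transfer + controller overhead); the drive parameters are assumed values for illustration:

    #include <stdio.h>

    int main(void)
    {
        double seek_ms   = 6.0;                   /* assumed average seek     */
        double rpm       = 10000.0;
        double rot_ms    = 0.5 * 60000.0 / rpm;   /* half a rotation = 3.0 ms */
        double xfer_mb_s = 40.0;
        double sector_kb = 8.0;
        double xfer_ms   = sector_kb / 1024.0 / xfer_mb_s * 1000.0;  /* ~0.2 ms */
        double ctrl_ms   = 0.3;

        printf("average access time = %.2f ms\n",
               seek_ms + rot_ms + xfer_ms + ctrl_ms);   /* ~9.5 ms, dominated
                                                           by seek + rotation */
        return 0;
    }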
CPU-memory HP HyperPlane Crossbar IBM SP Sun Gigaplane-XB
- Interfacing Storage Devices to a CPU where? memory or cache? memory is typical how? memory-mapped is typical I/O instructions (Intel x86 and IBM/370) what? status and control DEC LP11 line printer status register done bit error bit data register wait for done, then write data *polling* interrupt-driven I/O (versus polling) DEC LP11 can interrupt CPU when done/error status changes interrupts can be higher overhead than polling hybrid clock interrupt specifies when to poll dynamic livelock paper: switch from interrupt to polling under high load
- Delegating I/O Responsibility from the CPU direct memory access (DMA) let I/O devices talk to and from memory themselves scatter/gather allows I/O from virtually contiguous addresses which are not physically contiguous allows I/O from multiple callers to be sent together I/O processors (aka channel controllers) I/O control blocks (little programs for the I/O processor) only one interrupt when control block is complete typically restricted programs usually do not change data, just transfer it embedded often has lots of I/O Au1000 MIPS chip 10 DMA channels 20 I/O device controllers all on chip
7.4 Reliability, Availability, and Dependability 702
what is failure programming mistakes design time or runtime alpha particle error if we don't access changed bit? error if caught by ECC? operator error
- Defining Failure "dependability" quality of delivered "service" observed "actual behavior" specified behavior specified by "service specification" "failure" when actual deviates from specified failure due to "error" in module "fault" is cause of error "fault" creates "latent error" latent becomes "effective error" after "error latency" error is manifestation of fault "in the system" failure is manifestation of error "in the service" alpha particle fault that causes latent error on read, we have an error if ECC corrects error on read, we have no error summary faults => latent errors errors => more errors if error affects service => failure service 1.) service accomplishment service delivered as specified 2.) service interruption service differs from specification 1 => 2 on failure 2 => 1 on "restorations" module reliability MTTF - time from 1=>2 MTTR - time from 2=>1 module availability = MTTF / (MTTF + MTTR) MTBF (mean time between failures) = MTTF + MTTR faults 1.) hardware faults - device failure 2.) design faults - software or hardware design issues 3.) operational faults - operation or maintenance human error 4.) environmental faults - fire, flood, power, sabotage!!! duration transient intermittent permanent reliability improvements valid construction error correction latent error processing effective error processing masking recovering error correction backward (undo) forward (redo) methods 1.) fault avoidance - by construction 2.) fault tolerance - by redundancy 3.) error removal - by verification 4.) error forecasting - by evaluation
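quick sketch of the availability arithmetic above (the MTTF/MTTR values are made up for illustration):

    def availability(mttf_hours, mttr_hours):
        # module availability = MTTF / (MTTF + MTTR)
        return mttf_hours / (mttf_hours + mttr_hours)

    mttf, mttr = 1_000_000, 24          # hypothetical: 1M-hour MTTF, 24-hour repair
    print(availability(mttf, mttr))     # 0.999976...
    print("MTBF =", mttf + mttr)        # MTBF = MTTF + MTTR = 1000024 hours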
7.5 RAID: Redundant Arrays of Inexpensive Disks 705
disk arrays striping MTTR/MTTF RAID Redundant Arrays of Inexpensive Disks Redundant Arrays of Independent Disks hot spares hot swapping
- No Redundancy (RAID 0)
- Mirroring (RAID 1) mirroring (aka shadowing) RAID 1+0 (aka RAID 10) striped mirrors 4 pairs of RAID 1 disks RAID 0+1 (aka RAID 01) mirrored stripes 2 sets of 4 RAID 0 disks
- (Memory-Style ECC (RAID 2))
- Bit-Interleaved Parity (RAID 3) protection group parity RAID 1 mirror is a special case of RAID 3 with 2 disks RAID 3 has to write all disks since bits are interleaved between disks
- Block-Interleaved Parity and Distributed Block-Interleaved Parity (RAID 4 and RAID 5) RAID 4 parity disk is bottleneck, all writes need to update it RAID 5 rotates location of parity block to prevent bottleneck
- P+Q Redundancy (RAID 6) tolerate two failures instead of parity's one
- RAID Summary higher throughput (MB/s and I/O/s) as well as higher availability size and power savings from small disks
7.6 Errors and Failures in Real Systems 710
1.) academics can't afford to build real systems 2.) industry does not want to disclose failure data
- Berkeley's Tertiary Disk before assumed SCSI data disks would be highest source of failure assumed passive objects like cables would be error free after 50% of ethernet switches died 28% of disk enclosure backplanes 23% of IDE disks a higher percentage of SCSI cables died than SCSI disks!
- Tandem hardware ever more reliable less maintenance fewer chips and other components (operator error and environmental factors underreported) software the ever growing problem
- VAX OS from 70% to 28% of failures from 1985 to 1993 operator error increasing problem (better data than Tandem for this category)
- FCC mandated error reporting with separate investigators from operators include measure of impact (number of customers, length, etc) results overload less and less of a problem, cheaper to add capacity as a percentage, human error ever growing component did not include vandalism (very small) or environment (very big)
7.7 I/O Performance Measures 716
I/O throughput (bandwidth) I/O latency (response time)
- Throughput versus Response Time transaction time entry time system response time think time transactions per hour ***people need less time to think if given a faster response!!! response time under one second shows non-linear productivity increase novice engineer with fast response time as productive as experienced engineer with slow response time major economic motivation for keeping response time low response time today launching applications slow due to disk I/O web surfing due to network delays
- Response Time versus Throughput in Benchmarks TPC 90% of transactions must be below specific time SPEC average below a specific time
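a toy calculation (my numbers) of why shaving response time raises transactions per hour, even before the "people think faster" effect:

    def transactions_per_hour(entry_s, response_s, think_s):
        # one transaction = entry time + system response time + think time
        return 3600.0 / (entry_s + response_s + think_s)

    for response in (3.0, 1.0, 0.3):    # hypothetical response times, 5 s entry, 10 s think
        print(response, round(transactions_per_hour(5.0, response, 10.0)))
        # 3.0 -> 200/hr, 1.0 -> 225/hr, 0.3 -> 235/hr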
7.8 A Little Queuing Theory 720
initial startup to steady state? Markov chains or simulation too much effort instead focus on steady state input rate == output rate interarrival times are exponentially distributed infinite population model (number of in service requests does not affect number of requesters) server starts one job right after another no queue length limit one server this is M/M/1 Markov / Markov / 1 server Markov => exponentially random request arrival rate Little's Law Mean number of tasks in system = Arrival rate * mean response time queue (aka waiting line) single server for now metrics Time[server] (1/mu) Time[queue] = Length[queue]*Time[server] + "average residual service time" (time left of existing job) = ... = Time[server] * (server utilization/(1 - server utilization)) Time[system] = Time[server] + Time[queue] Arrival rate (lambda) Length[server] = Arrival rate * Time[server] (Little's Law) Length[queue] = Arrival rate * Time[queue] (Little's Law) = ... = (server utilization^2) / (1 - server utilization) Length[system] = Length[server] + Length[queue] service rate = 1/Time[server] server utilization (rho) = Arrival rate * Time[server] must be between 0 and 1 in steady state aka traffic intensity random variables distribution histogram with buckets arithmetic mean time (amt) variance square of standard deviation squared coefficient of variance (unitless) C^2 = variance/(amt^2) C = standard-deviation/amt exponential distribution C=1 memoryless Poisson distribution Poisson process average residual service time = 1/2 * weighted mean time * (1+C^2) M/M/m N[servers] Utilization = (Arrival Rate * Time[server]) / N[servers] Length[queue] = Arrival rate * Time[queue] (Little's Law) Time[queue] = Time[server] * P[tasks >= N[servers]] / (N[servers] * (1 - Utilization)) that is, we need to factor in the probability that sometimes we have empty servers
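a minimal M/M/1 calculator for the formulas above (the arrival rate and service time below are invented inputs):

    def mm1(arrival_rate, service_time):
        rho = arrival_rate * service_time            # server utilization, must be < 1
        assert 0.0 <= rho < 1.0, "not in steady state"
        t_queue = service_time * rho / (1.0 - rho)   # Time[queue]
        t_system = service_time + t_queue            # Time[system] = response time
        l_queue = arrival_rate * t_queue             # Length[queue], Little's Law (= rho^2/(1-rho))
        l_system = arrival_rate * t_system           # Length[system], Little's Law
        return rho, t_queue, t_system, l_queue, l_system

    # hypothetical disk: 50 I/Os per second arriving, 10 ms average service time
    print(mm1(50.0, 0.010))   # (0.5, 0.01, 0.02, 0.5, 1.0): 10 ms queued + 10 ms served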
7.9 Benchmarks of Storage Performance and Availability 731
queueing theory is nice, but real numbers are meaningful
- Transaction Processing Benchmarks transaction processing (TP) (aka online transaction processing (OLTP)) I/O rate (disk access/second) not data rate (bytes/second) banking and airline reservation systems TPC (Transaction Processing Council) first benchmark was anonymous! TPC-C order entry of wholesale supplier entering orders, delivering orders, recording payment, checking order status, stock level SPECjbb... tpmC (transactions per minute) TPC reporting - price is included with benchmark results - the data set generally must scale in size as the throughput increases - the benchmark results are audited - throughput is the performance metric, but response times are limited - an independent organization maintains the benchmarks
- SPEC System-Level File Server (SFS) and Web Benchmarks SFS is for NFS scales data with throughput like TPC also limits average response time no price/performance information retired because of bugs! SPECWeb load increased until response time suffers significantly response sizes 35% < 1k 50% 1-10k 14% 10-100k 1% 100k-1m content static pages rotating advertisements dynamic pages user registration linux crushing windows!
- Examples of Benchmarks of Dependability and Availability TPC-C requires tolerating a disk failure => usually RAID study of injecting disk faults into SPECweb system performance impact linux auto correct, rebuilds slow, no impact on performance during rebuild solaris auto correct, rebuilds fast, major impact on performance during rebuild windows operator correct, rebuilds fast, major impact on performance during rebuild handling of transient faults linux too paranoid, gives up too easily on disks, consuming hot spares quickly solaris and windows not paranoid enough, ignore transient errors that cause performance degradation
7.10 Crosscutting Issues 737
- Unix anecdote early unix boxes had 16-bit I/O processors system code limited requests to 63kb to work well with these later controllers were limited to this request size, even if they were more capable...
- DMA and Virtual Memory issue: sequential virtual pages are not necessarily sequential physically scatter gather I/O issue: OS moving or paging out pages that are involved in DMA pinning page scatter/gather I/O can also avoid copying from user to kernel just to do DMA "virtual DMA" each page to be transferred needs relocation information in controller OS can update this information on page relocation
- Asynchronous I/O and Operating Systems synchronous I/O issue request and wait for it to be done asynchronous I/O do something else while waiting (just like tolerating cache miss delays in OOOe)
- Block Servers versus Filers logical units, logical volumes, physical volumes collections of disk blocks logical units configured with a certain RAID layout physical volumes OS interface to logical unit logical volumes combining multiple physical volumes into bigger volumes split or stripe Q: where is file defined? A1: server direct attached storage A2: disk subsystem network attached storage (NAS) access via network protocols such as NFS and CIFS "filer" is a NAS with only file interface NASs can also provide block interface for use by databases
- Caches Cause Problems for Operating Systems - Stale Data cache, memory, disk stale data keeping data updated by CPU and/or I/O in sync 1.) I/O system writes bad data because memory is not up-to-date - write through cache - write back cache with OS flush - DMA controller snoops second set of cache tags 2.) CPU reads bad data because cache is not up-to-date - flush from cache before I/O operation - mark region as non-cacheable - cache snoops bus, invalidates blocks
- Switches Replacing Buses active component costs dropping passive bus cost savings is less substantial IDE - from PATA to SATA PCI => infiniband?!? how about PCI Express
- Replication of Processors for Dependability IBM 390 ECC for caches and main memory... ECC for register file as well 14 processors including spares redundant instruction fetch/decode, execution units, L1 cache, register file all to check for errors retry inconsistent instructions to mask transient failures swap in backup processor in less than one second for non-transient failure
7.11 Designing an I/O System in Five Easy Pieces 741
tradeoffs 1.) cost 2.) variety of devices 3.) performance 4.) expandability 5.) dependability steps 1.) list types of devices or standard busses to support 2.) list physical requirements size, power, connectors, bus slots, expansion cabinets 3.) cost of each I/O device and controller 4.) list the reliability of each device 5.)
record cpu resource demands for each device - clock cycles for instructions to initiate I/O, handle interrupts, complete I/O - CPU clock stalls due to waiting for I/O to finish using memory, bus, or cache - CPU clock cycles to recover from I/O activity such as a cache flush 6.) list the memory and I/O bus demands of each I/O device 7.) assess performance and availability of different configurations using simulation (or approximate with queueing theory) reliability with independent failures and exponential distribution of time to failure availability = f(reliability, estimated MTTF, estimated MTTR) use cost, performance, and availability goals to evaluate
- First Example: Naive Design and Cost-Performance I/O system with two types of disks ignores dependability simplifying assumption: 100% utilization of I/O resources
- Second Example: Calculating MTTF of First Example poor availability of #1
- Third Example: Calculating Response Time of First Example impact of response time due to 100% utilization
- Fourth Example: More Realistic Design and Cost-Performance more realistic utilization to keep response time reasonable rules of thumb - no disk should be used more than 80% of the time - no disk arm should be seeking more than 60% of the time - no disk string should be utilized more than 40% SCSI string bandwidth low because extra 20% for command overhead - no I/O bus should be utilized more than 75%
- Fifth Example: Designing for Availability use RAID 5 to improve availability MTDL = mean time to data loss to improve it, increase MTTF or lower MTTR orthogonal RAID don't put all disks for one logical volume on one controller controller death makes all data unavailable instead use orthogonal approach each controller gets one block from each logical volume
7.12 Putting It All Together: EMC Symmetrix and Celerra 754
Symmetrix is directly attached Celerra is network attached front end to Symmetrix
- EMC Symmetrix 8000 stats disk 384 disks RAID-1 or RAID-S (variant of RAID-5) up to 28TB raw capacity with 73GB drives interconnect 4 buses with ECC each component on 2 buses for reliability controllers channel directors server interface SCSI, FC-AL (fiberchannel arbitrated loop), ESCON (IBM mainframe I/O) up to 16, 2 per card manage the disk caches disk cache memory 2-32GB 90-95% read hit rate for largest size acts as write buffer entire array has battery backup to avoid losing write buffer cache block size is 32kb disk directors up to 16, 2 per card each has 2 Ultra1 SCSI strings each string is connected to two directors for redundancy 12 SCSI disks per string * 2 strings per director * 16 disk directors = 384 disks director architecture (channel and disk) 2x PowerPC 750s each disk enhancements for priority queueing provided by hardware manufacturers at EMC's request :) CRC is tracked between disk/cache and cache/host finding latent errors cache is scrubbed (re-read) to correct latent errors with ECC frequently bad cache blocks are "fenced" off from future use disks also scrubbed during idle time fences off bad tracks with many bad sectors move parity XOR from controllers to disks moves operation from single location to N disks allows read/write without intervening seek mirror policy depending on access pattern: sequential pattern "interleaves" access between disk 0 and disk 1 random pattern uses "mixed" policy sending requests for the first "half" to disk 0, the second to disk 1 service processor laptop connected to directors via ethernet predicting potential failures errors are logged to service processor calls home to
EMC via telephone to report issues can be called by EMC to collect more data also for installation, upgrades, configuration
- EMC Celerra 500 "filer" for Symmetrix stats 14 slots for "data movers" data movers are just PC motherboards two SCSI buses to hook to Symmetrix channel directors 700MHz Pentium III network options ATM FDDI two GigE 8 100Mbit Ethernet DART real time OS each appears as autonomous file server control stations (similar to service processor) two for reliability same hardware as "data mover" OS is Linux instead of DART dependability - redundant fans, power, batteries, power cords - each data mover can reach all Symmetrix disks - each data mover has two ethernet cards to talk to each control station - each data mover has at least two client interfaces - data movers can act as standby spares - space for a redundant control station Symmetrix service processor also reports Celerra errors
- EMC Symmetrix and Celerra Performance and Availability focus on dependability, but also has leading performance 1.) cache fault different director processors discover fault over time after fault repair, I/O rate increases above steady state to catch up 2.) lock fault response time suffers because of rogue director holding lock nothing to do but wait and suffer through it
7.13 Another View: Sanyo VPC-SX500 Digital Camera 760
CCD -> 4mb frame buffer -> 8mb Flash (or 340mb IBM Microdrive) 19 fine or 31 normal quality images on 8mb flash CompactFlash PCMCIA-ATA card interface ATA interface makes it look like a disk allows easy interface to existing systems with ATA controllers allows easy yield improvements by treating problems as bad disk blocks SOC (system on a chip) lowers battery requirement from 4 AA to 2 AA batteries buses 10 bits from CCD 32 bits internally 16 bits to video d/2 16 bits to SDRAM 16 bits to smart media, serial, irda, pcmcia, audio, ...
7.14 Fallacies and Pitfalls 763
- Fallacy The rated mean time to failure of disks is 1,200,000 hours or almost 140 years, so disks practically never fail. This industry deception assumes replacement every 5 years at "life of disk" Looked at another way, about 5.5% of disks will fail in the five year period 1 in 20 chance it will be your disk :)
- Fallacy Components fail fast. Tertiary disk project many intermittent failures operator had to decide to "terminate" component
- Fallacy Computer systems achieve 99.999% availability ("five nines") as advertised excludes operator error, application faults, and environmental faults excludes scheduled downtime Microsoft and the five nines their own website was down for 22 hours in January 2001 needs 250 years to get back to five nines. well managed servers in 2001 get 99% to 99.9% availability
- Pitfall Where a function is implemented affects its reliability. software RAID culture of release and patch compatibility issues between versions hardware RAID more testing before initial release because of high cost to fix more independent of OS versions
- Fallacy Semiconductor memory will soon replace magnetic disks in desktop and server computer systems. first edition memory on course to wipe out tapes in between disk industry realized they needed to match DRAM growth current edition disk wiping out tapes
- Fallacy Since head-disk assemblies of disks are the same technology independent of disk interface, the disk interface matters little in price. 1.) SCSI controllers sold in lower volume => less competition 2.) SCSI disks often have higher RPM and lower seek 3.)
SCSI controllers sold in lower volume => higher cost because of "manufacturing learning curve"
- Fallacy The time of an average seek of a disk in a computer system is the time for a seek of one-third the number of cylinders 1.) seek time not linear with distance accelerate, decelerate, settle time also some waiting to control vibration 2.) average assumes no locality
7.15 Concluding Remarks 769
dependability reliability and availability maintainability
7.16 Historical Perspective and References 770
- Magnetic Storage tape invented to record sound, competitive in 1941 ENIAC 1947 used mag tape for digital data 1970s reel to reel for removable media 1980s IBM 3480 tapes 9840 cartridge in PowderHorn serpentine recording zigzag pattern (back and forth, reversing head each time) disk 1953 IBM Almaden (San Jose, CA) concept to prototype for magnetic disk in 24 months RAMAC-350 50 platters, 24 inch each, 5MB, access time 1 second later innovations head flies on cushion of air removable disk "winchester" disk integrated circuits sealed enclosure "30-30" (two 30MB spindles) led to "winchester" nickname magnetoresistive heads (enabling current densities) competition keeps gap between research and development short future market quiet drives for PVRs small drives for portables versus flash for portables a little bit of both
- RAID EMC had been a DRAM manufacturer for IBM computers switched from DRAM to RAID because of business issues
- I/O Buses and Controllers intelligent devices offload tasks and queuing to devices (example SCSI disks) disk bus standards SCSI (small computer systems interface) FC-AL (fiberchannel arbitrated loop) I/O bus standards PDP-11 Unibus PCI, PCI-X, Infiniband interrupts from anomalies to async I/O DMA IBM 360 channel = I/O bus
Exercises 778
Chapter 8 Interconnection Networks and Clusters
8.1 Introduction 788
networks (aka communication subnets) end systems (aka hosts) internetworking architecture and networks 1.) moore's law allows networks to be used within computer 2.)
almost all computers are connected to networks cluster naming wide area network (WAN) (aka long-haul network) ATM local area network (LAN) Ethernet storage area network (SAN) (aka system area network) Infiniband
Terms from Figure 8.2 adaptive routing picking path based on measure of delay of outgoing links ATM attenuation loss of signal over distance back pressure flow control wires to tell sender to stop or slow down bandwidth base station wireless to wired bisection bandwidth bit error rate (BER) typically in per million bits blade blocking contention bridge OSI layer 2 carrier sensing detect collisions (Ethernet and wireless (aloha)) category 5 wire (cat5) channel wireless channels like in 802.11 checksum circuit switching cluster coaxial cable collision collision detection colocation site communication subnets credit-based flow control sort of like leases, get bandwidth allotment ahead of time, need to check to get more cut-through routing start transmitting based on header, before all data received message may get strung out, but often tail catches up with head within switch destination-based routing deterministic routing end systems end-to-end argument Ethernet fat tree extra links at each level so bandwidth between levels is constant FC-AL fiber channel arbitrated loop (SAN) frequency-division multiplexing share medium using different frequencies, not time slicing full duplex two way communication header host hub OSI layer 1 Infiniband SAN interference internetworking IP OSI layer 3 iSCSI SCSI over IP SAN using Ethernet LAN message multimode fiber cheap optical fiber that reduces bandwidth and distance for cost multistage switch OSI layers overhead packet switching vs circuit switching payload peer-to-peer protocol peer-to-peer wireless protocol rack unit 1U = 1.7" single slot in standard 19" VME rack 44U in standard 6 foot rack receiver overhead router OSI layer 3 SAN storage area network (aka system area network) sender overhead shadow fading wireless signal blocked by objects signal-to-noise ratio (SNR) simplex one way communication (half duplex: both directions, but only one at a time) single-mode fiber more expensive (and narrower) fiber with more bandwidth and distance source-based routing store-and-forward TCP OSI layer 4 throughput time of flight trailer transmission time transport latency twisted pairs two wires twisted together to reduce electrical interference virtual circuit WAN wavelength division multiplexing send several signals on one fiber with different wavelengths of light window wireless network wormhole routing similar to cut-through routing difference is tail of message does not catch up with head
8.2 A Simple Network 793
paraphrased quote: bandwidth can be cured with money latency is harder because of speed of light message header payload trailer checksum protocol single vs multiple messages in flight byte order duplicates ordering flow control performance terms bandwidth aka throughput bits/second time of flight time from sender to receiver transmission time time from first bit to last bit of the message (message size / bandwidth) transport latency = time of flight + transmission time sender overhead hw and sw cost to send message receiver overhead usually higher than sender interrupt costs total latency = sender overhead + time of flight + message-size/bandwidth + receiver overhead ~= overhead + message-size/bandwidth effective bandwidth = message-size/total latency message size message size is important to get full benefit of fast networks unfortunately, most apps send many small messages request/reply (ack) messages are more common than data
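a sketch of the latency/bandwidth bookkeeping above (link speed, overhead, and message sizes are invented):

    def effective_bandwidth(msg_bytes, link_bits_per_sec, overhead_sec, time_of_flight_sec=0.0):
        # total latency = overhead + time of flight + message size / bandwidth
        total_latency = overhead_sec + time_of_flight_sec + (msg_bytes * 8) / link_bits_per_sec
        return (msg_bytes * 8) / total_latency            # bits/sec actually delivered

    for size in (64, 1500, 65536):                        # hypothetical 1 Gbit/s link, 10 us overhead
        print(size, round(effective_bandwidth(size, 1e9, 10e-6) / 1e6), "Mbit/s")
        # 64 B -> ~49, 1500 B -> ~545, 64 KB -> ~981: small messages waste a fast link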
8.3 Interconnection Network Media 802
medium twisted pair Cat1 cable mainstay of phone company slow signal over many kilometers Cat3 10Mb/s Cat5 100Mb/sec <=100 meters coaxial cable 50ohm 10Mb/s over a kilometer T-junction vs vampire tap T requires cut and splice (annoying) vampire tap allows new drops to be added without cut fiber optics simplex vs duplex like twisted pair and coax multimode cheap LED 1000Mb/sec < 1km 100Mb/sec few kilometers single-mode expensive laser diode gigabits/second over hundreds of kilometers phone company backbone attenuation limits length have limits on bending radius glass instead of plastic fiber wavelength division multiplexing multiple wavelengths on same fiber 8 wavelengths for 40Gb/sec (2001) 80 wavelengths for 400Gb/sec (planned) cable vs fiber fiber has to use t-junctions 1.) passive fused into fiber failure means cable replacement 2.) active repeater optical to electrical to optical failure means repeater replacement, but network is down fiber optoelectronic components add significant cost
8.4 Connecting More Than Two Computers 805
- Shared versus Switched Media shared media like a bus inexpensive limited bandwidth arbitration scheme for sharing sharing via arbiter okay for small scale (like bus) need some way to talk to arbiter sharing via carrier sensing (aka collision detection) look before you leap if busy wait until free transmit if detect collision retry random retry delay helps prevent repeated collision exponential backoff helps prevent network overloading (livelock) token passing point-to-point communication switching aka data switching exchanges aka multistage interconnection networks aka interface message processors (IMPs) each node has dedicated line to switch switch has dedicated lines to everyone else tradeoff arbitration for switch overhead moore's law has made switches more attractive single chip 64x64 switch in 2001 shared advantages broadcast sending to all nodes of shared medium is easy multicast just send to all and let some ignore switch advantages higher aggregate bandwidth pairs of nodes can communicate at full link speed simultaneously
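a toy sketch of the collision-retry idea above (not Ethernet's exact slot times or truncated-backoff rules):

    import random

    def send_with_backoff(collided, max_attempts=10):
        # carrier-sense retry loop: double the random backoff window after each collision
        waited_slots = 0
        for attempt in range(max_attempts):
            if not collided(attempt):
                return attempt, waited_slots              # success: attempts used, slots spent waiting
            window = 2 ** min(attempt + 1, 10)            # backoff window doubles, capped
            waited_slots += random.randrange(window)      # wait a random slot count, then retry
        raise RuntimeError("too many collisions, giving up")

    print(send_with_backoff(lambda attempt: attempt < 2))  # first two attempts collide -> e.g. (2, 3)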
- Connection-oriented versus Connectionless Communication connection phone call frequency-division multiplexing each call gets dedicated frequency to use "busy" based on number of connections, not "amount" of usage connectionless postal model circuit switching traditional way to deliver connection-based service packet switching packets (aka frames) queueing theory says cannot use all of bandwidth however uses more of the bandwidth given waste of circuit switching beware connection does not imply circuit switching connectionless does not imply packet switching TCP is a connection-based service built on top of packet switching
- Routing: Delivering Messages shared media routing broadcast switched media routing source-based routing message specifies path to destination virtual circuit establish circuit between source and destination message names circuit to take ATM simpler switch destination-based routing message states destination switch has to decide where to route IP more complex switch approaches: deterministic vs adaptive vs randomized deterministic always takes same path adaptive chooses between paths based on network status (load, failures, congestion, cost) randomized (either deterministic or adaptive) tries to distribute load packet storage in routing store-and-forward routing receive whole packet before acting on it cut-through routing receive header before acting on it room for whole packet if output blocked wormhole routing receive header before acting on it no room for entire packet if output blocked, blocks inbound store-and-forward has higher latency switch latency proportional to packet length * number of switches ordering can be provided at this layer but often managed by higher layer (often in software)
- Congestion Control circuit switched avoid dynamic congestion congestion checks happen at circuit setup time once connection established, dedicated resource packet switches dynamic congestion control packet discarding UDP flow control backpressure separate channel (wires) to let receiver suppress sender supercomputer networks SANs GigE switches via fake collision signals credit-based flow control sender needs "credit" from receiver to send "window" approach is form of credit choke packets overloaded switches return packets to source as "choke packets" allows adaptive sender to reduce percentage of packets sent via outbound link
8.5 Network Topology 814
- Centralized Switch single logical switch multistage switches switches made up of other switches crossbar n^2 switches source routing: list in header destination routing: destination picks outbound by table lookup spanning tree routing: on the fly omega network fewer switches n/2 lg n switches more chance of congestion fat tree more bandwidth on higher levels multiple paths between nodes CM-5
- Distributed Switch switch at each node ring pipelined bus + point-to-point links - bus repeater delays EIB between Cell PPE and SPEs token ring combines ring with arbitration fully connected very expensive 2D grid or mesh 2D torus hypercube tree bisection bandwidth divide nodes approximately in half sum the bandwidth of all lines crossing the dividing line fully connected: (n/2)^2 bus: link speed
  example with 64 nodes:
                   BUS    RING    2D TORUS    6-CUBE    FULLY CONNECTED
  bisection          1       2          16        32               1024
  ports/switch      NA       3           5         7                 64
  lines              1     128         192       256               2080
  implementation issues 1.) transforming 3d into 2d chips and cabinets 2.) switch latency depends on routing pattern, which depends on topology 3.) bandwidth from processor with single port can be limiting factor quality of implementation matters more than topology redundancy matters at all levels: SAN, LAN, WAN
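a small sketch of that bisection bookkeeping (one unit of bandwidth per link; values match the 64-node row above):

    import math

    def bisection_links(topology, n):
        # links crossing a worst-case halving of n nodes
        if topology == "bus":             return 1
        if topology == "ring":            return 2                      # the cut severs the ring in two places
        if topology == "2d-torus":        return 2 * int(math.sqrt(n))  # wraparound doubles the links cut
        if topology == "hypercube":       return n // 2                 # all links of one dimension
        if topology == "fully-connected": return (n // 2) ** 2          # each node on one side linked to each on the other
        raise ValueError(topology)

    for t in ("bus", "ring", "2d-torus", "hypercube", "fully-connected"):
        print(t, bisection_links(t, 64))   # 1, 2, 16, 32, 1024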
8.6 Practical Issues for Commercial Interconnection Networks 821
- Connectivity 100s easier than millions affects protocol complexity
- Connecting the Network to the Computer where to connect? I/O bus most SANs and all LANs and WANs not usually hardware cache coherent memory bus some SANs can be hardware cache coherent DMA good for large messages to use memory bandwidth many small messages perhaps better to use PIO (programmed I/O) since data can come from cache
- Standardization: Cross-Company Interoperability low cost and stability volume and multiple sources for products when to standardize before it's built? useless and omitted features can suppress innovation... WANs no proprietary standards since different organizations with different suppliers must connect LANs Ethernet... SANs some standards, more proprietary: okay since spans fewest machines
- Message Failure Tolerance end-to-end argument... but some lower level network retry as performance optimization
- Node Failure Tolerance WANs have strongest requirement for failure tolerance LANs are close, but rogue machine hosing network can be hunted down easily (one organization) SANs: sacrifice fault tolerance for low-latency. build with more redundancy. interrupt network to add new nodes? WANs: are you crazy LANs: that is why t-junctions sucked SANs: not as much of an issue
8.7 Examples of Interconnection Networks 825
- 10 design decisions 1.) what is the target bandwidth 2.) what is the message format 3.) which media are used 4.) is the network shared or switched 5.) is it connection oriented or connectionless 6.) does it use store-and-forward or cut-through routing 7.) is routing source based, destination based, or virtual-circuit based 8.) what is used for congestion control 9.) what topologies are supported 10.) does it follow a standard
- Ethernet: The Local Area Network growth 1978 10Mb/sec 1994 100Mb/sec GigE (10GigE) coax, cat3, cat5, optical, wireless parallelism hubs OSI level 1 extend Ethernet segment bridges OSI level 2 connect two Ethernet routers (aka gateways) OSI level 3 connect Ethernet to other networks switches GigE+ requirement cut-through routing
- Storage Area Network: Infiniband 2 Gbits/sec per link point-to-point links can be bundled in groups of 4-12 packet switched connectionless switched Cat5 - 17 meters optical - 100 meters backpressure for flow control SCSI command set to talk to storage SAN vs LAN SAN arguments 1.) lower protocol overhead 2.) no protection requirements (behind server firewall) 3.) graceful degradation under congestion LAN arguments 1.) cheaper switches 2.) IP networks allow scaling to many nodes 3.) TCP/IP off-loading engines reduce protocol overhead iSCSI use SCSI over TCP/IP + GigE as SAN solution
- Wide Area Network: ATM stats WAN only (no more LAN dreams) 155Mb/sec baseline with 4x scaling increments: 155, 620, 2480, ... switched connections store-and-forward small, fixed sized packet (48 bytes) credit-based flow control connections and small packets? QoS (Quality of Service) predictability matters for voice traffic, not just bandwidth small packet allows fast routers and switches
- Summary crossover ATM tried and failed to move from WAN to LAN iSCSI tries to move Ethernet from LAN to SAN
8.8 Internetworking 830
protocol families (aka protocol suites) TCP/IP internetworking protocol even used within a LAN such as for NFS layering OSI levels: 1 physical (IEEE 802) hub; 2 data link (Ethernet) bridge, NIC; 3 network (IP) router, switch; 4 transport (TCP) gateway; 5 session (named pipes, RPC) gateway; 6 presentation (format conversion) gateway; 7 application (FTP, DNS, NFS, http) gateway, smart switch logically communication at same level ("peer-to-peer") services of lower levels implement upper levels protocol stack latency an issue in layering however protocol defines interface, not implementation can try and merge levels in implementation
8.9 Crosscutting Issues for Interconnection Networks 834
- Density-Optimized Processors versus SPEC-Optimized Processors colocation charge for network, space, and power SPEC optimization processors haven't cared about space or power use mobile processors designed for small space and lower power better performance/watt better performance/cubic-foot
- Smart Switches versus Smart Interface Cards Ethernet cheap interface card means costly switches Infiniband cheaper switch with more expensive "host" interface trying to have restricted "target" interface to lower cost for disks
- Protection and User Access to the Network Cray fine grained network access to global address space protection by TLB block access OS needs to be involved to check permissions for block transfers adds OS overhead
- Efficient Interface to Memory Hierarchy versus Interconnection Network SPEC CPU reward integration of cache with processor
avoid memory at all costs add more levels of caches do not reward network integration have to travel via memory each cache layer adds more cost
- Compute-Oriented Processors versus Receiver Overhead SPEC doesn't need to take interrupts fancy OOOe processors can take huge penalties on interrupt processing
8.10 Clusters 838
Commodity, Commodity, Commodity
- Performance Challenges of Clusters 1.) I/O bus interconnect versus memory bus interconnection software-based communication vs hardware 2.) division of memory single program cannot use more than cluster's worth of memory in 1995 this was a cost issue, in 2001 the problem is lack of expandability
- Dependability and Scalability Advantage of Clusters replacement hotswapping a machine on a LAN is easier than hot swapping processor in running machine no special machine hardware or OS software needed applications end up designed to tolerate down machines often makes it easy to handle adding machines as well
- Pros and Cons of Cost of Clusters cost of administration cluster proportional to number of nodes multiprocessor closer to single machine price/compute (price/bandwidth) cluster wins on commodity (both compute and interconnect) multiprocessor loses because of low volume falling memory costs have favored clusters
- Shooting for the Best of Both Worlds multiprocessors dynamic partitioning to allow online hardware replacement and software upgrade clusters reduce admin cost by using multiprocessor nodes clusters use SAN approach of isolating storage nodes from compute nodes but GoogleFS and MapReduce don't do this (locality matters as bandwidth requirements could be huge between the two pools of nodes)
- Popularity of Clusters scaling Internet applications scientific computing own your own, versus time-share a super TPC clusters fastest with good cost/performance TPC-C SMP has worst cost/performance NUMA typically not as offensive :) TPC-H general pattern: cluster beats NUMA beats SMP
8.11 Designing a Cluster 843
- Intro similar process to Section 7.11 building a cluster with 32 CPUs, 32GB DRAM, 32 or 64 disks costs larger 1MB L2 cache higher DRAM costs for SMP nodes, especially between 2-way and 8-way (no particularly good reason...)
- First Example: Cost of Cluster Hardware Alternatives with Local Disk fatter nodes fewer switches however, more cost for SMP interconnect less rack space however can't hold as many disks, need expansion unit on cost: 2-way system beats uniprocessor beats 8-way
- Second Example: Using a SAN for Disks cost for FC-AL interface cards brings uniprocessor and 8-way to same cost point 2-way still best option SAN adds $40k-$100k to cost brings dependability: reliability and availability
- Third Example: Accounting for Other Costs TCO (total cost of ownership for 3 years) OS license DB license computer maintenance network maintenance rack space rack power bandwidth operator tapes operator costs start to dominate SAN requires half the operator time, justifying the additional cost 2-way with SAN wins HW is only 1/3 to 1/2 of TCO
- Fourth Example: Cost and Performance of a Cluster for Transaction Processing - disk space: bigger and faster and more of them - RAID: instead of SAN, since TPC-C only counts HW costs, not operator - Memory: max each node out - Processor: max out - PCI: 7 of 12 used for RAID - tape reader, 4 monitors, 4 UPSs: ... - maintenance and spares: spare ethernet switches, host adapters, and cables
- Summary of Examples 1.) HW is only fraction of TCO SAN adds HW cost to lower operator cost 2.)
smaller computers generally cheaper SMP needs large caches to reduce bus demand, increasing cost and limiting clock rate 3.) space and power matter for high end as well as embedded
8.12 Putting It All Together: The Google Cluster of PCs 855
- Intro Google tackled 1.) scale 2.) relevancy 3.) reliability 4.) speed 5.) bandwidth 6.) index freshness
- Description of Google Infrastructure big picture stats December 2000 6000 processors 12000 disks 1 petabyte of storage no RAID, GoogleFS provides data replication 3 data centers (2 NorCal, 1 VA, soon 2 VA) each with OC48 2488Mbit/sec redundant OC12 between sister sites (CA-CA VA-VA) data center network stats 1 monster Cisco 12000 switch to OC48 2 Foundry BigIron 8000 switches for redundant access to racks from Cisco each can handle 128 1Gb/sec lines each rack has 4 1Gb/sec lines, 2 to each Foundry switch 64 racks total capacity per data center rack stats 44U rack = 2x (4U switch + 40x1U PCs) 2x because use both front and back side of rack 4U HP ethernet switch 5 blades with 8 100Mb/sec links for 40 PCs 2 blades with 1 GigE link to 2 Foundry switches 40x1U PCs only ethernet and power ethernet on left, power to strips on right edge of rack cooling via 3" chimney up center to fans on top pc stats standard issue 2 ATA drives 256 MB SDRAM modest Intel processor PC motherboard 1 power supply few fans linux upgrades over time larger disks faster CPUs March-November 2000 533MHz Celeron to 800MHz Pentium III 40-80GB disk with 5400-7200rpm speed memory bus 100-133MHz
- Performance most of bandwidth for indexing, not responding to queries latency means picking right data center based on "geographic" location of user VA for Europe, NorCal for CA
- Cost $4.5 million to $6 million per data center PC $1,300-1,500 HP rack switch $1,500 Foundry switch $100,000 racks $1000-$2000 much better price/performance than IBM TPC-C system from 8.11 TPC-C cost $10.8 million for 1/8 the CPU and 5/8 the DRAM power 4500W in 10 square feet 60amps/rack colocation site comparison 1000W and 20amps
- Reliability software is biggest source of failure 20 machine reboots a day fix most problems no remote reboot call colocation site and have operator push reset button more errors here from wrong button or wrong machine label :) hardware rate 1/10 of software 2-3% PC/year 95% disk and DRAM 5% motherboard, power supply, connectors, etc DRAM 1/3 of failures bits inside the DRAM bits on the bus no ECC at this time ECC going forward disks 1/2 of failures are performance only 28MB/sec disk will only deliver 4MB/sec or even 0.8MB/sec sent back to manufacturer on PC failure: dynamically removed from pool physical replacement once a week repair for later reuse switches HP 200 total 2 or 3 failures Foundry 6 total with 16 blades/switch some DOA but none have died completely in production 2 or 3 blades have failed (out of 96) colocation reliability power once a year power outage even with diesel backup network once a year outage of several hours also "bathtub curve" issues new site lots of people and new equipment cause problems steady state no people or equipment changes, working fine upgrade time people and new equipment return, causing problems people turnover also an issue tolerating colocation reliability multiple sites different network providers backup leased lines between sites
8.13 Another View: Inside a Cell Phone 862
- Background on Wireless Networks terms wireless networks carrier signal bit error rate (BER) signal-to-noise ratio (SNR) challenges path loss - signal drop over distance
shadow fading - physical object interference multipath fading - self-interference because of same signal arriving at slightly different times interference - frequency reuse, adjacent channels, etc issues dynamic network makes routing an issue signal issues from movement power issues from batteries BER 1,000x-1,000,000x of copper wire base station vs peer-to-peer cellular telephony uses base stations non-overlapping hexagonal cells adjacent cells do not reuse frequencies base-stations at hexagon intersections cells have 2-10 mile radius depending on topography, population, ...
- The Cell Phone radio antenna -> RF amp -> mixer -> filter -> demodulator -> decoder move from analog to digital components multiband has led to software programming of DSPs frequency sweep to find best signal every 7 seconds (or more if signal drops below threshold) channels forward path reverse path 4-80 per cell 9600 bits/s variable power to save battery but also frequency interference between cells Nokia example has a Z-80!
- Cell Phone Standards and Evolution AMPS (advanced mobile phone service) mostly analog send 9600b/s signal 5x to deal with interference CDMA (code division multiple access) mostly digital wider frequency band harder to block uses spread spectrum callers share channel convert 9600b/s signal to 1.25Mb/sec stretch bits 11x to deal with interference base station converts back to 9600b/s SNR gradually degrades, adding background noise speech compression and variable data rate to save bandwidth TDMA (time division multiple access) mostly digital GSM (global system for mobile communication) mostly digital
8.14 Fallacies and Pitfalls 867
- Pitfall Using bandwidth as the only measure of network performance latency matters, especially if many small messages overhead matters 155Mb/s ATM only beats 10Mb/s ethernet for NFS traffic by about 1.2x
- Pitfall Ignoring software overhead when determining performance CM-5 sending a message SW 20us HW .5us Intel Paragon SW 250us HW .2us at first SW 25us HW .2us later CM-5 wins with slower HW, because SW cost is lower
- Pitfall Trying to provide features only within the network versus end-to-end beware: errors can happen in node as well as in transmission however: performance improvement, especially latency, of lower-level correction
- Pitfall Relying on TCP/IP for all networks, regardless of latency, bandwidth, or software requirements. significant processor resources to process TCP/IP that grow proportional to network bandwidth 1MHz per 1Mb/sec
8.15 Concluding Remarks 870
hot field dealing with failure is expected switches (vs buses) is area of growth and change SANs, clusters; continuing technology catch-up among SANs, LANs, WANs
8.16 Historical Perspective and References 871
- Wide Area Networks 1970s TCP/IP Vint Cerf and Robert Kahn 1975 100 1983 200 1994 50,000 the web was the killer app 1989 Tim Berners-Lee 1992 UIUC Marc Andreessen Mosaic (later Netscape) 1995 30,000 pages 2001 1.3 billion pages ATM 48-byte packet compromise North America wanted 64-byte packet to match existing equipment Europe wanted 32-byte packet to match existing equipment compromise so that no one wins!
(has an advantage over the other) WANs now all fiber bandwidth is not the precious commodity that packet switching assumed
- Local Area Networks Xerox PARC Alto [Thacker] and Ethernet [Metcalfe and Boggs] desktop and LAN David Boggs, Butler Lampson, Ed McCreight, Bob Sproull, Chuck Thacker 1974 3Mb/sec with 50us latency Boggs was ham radio operator, leading to realization that central arbiter was unneeded DEC, Intel, Xerox came up with 10Mb/sec standard, avoiding slow IEEE effort FDDI IEEE effort led to expensive system that couldn't compete with Ethernet
- Massively Parallel Processors Cosmic Cube ethernet to connect 8086 computers in hypercube SAN for high bandwidth and low latency topology faddish, hypercubes not common today cut-through routing has persisted Dally* ASCI MPPs are basically clusters of SMPs
- Clusters 1960s existed without name 1975 Tandem 16-node cluster 1984 VAX cluster shared I/O devices distributed OS geographic distribution for availability 25,000 clusters sold by 1993 (both Tandem Non-Stop and DEC VAX clusters ended up at Compaq/HP) 1993 NASA Beowulf project to move scientific computing from MPP to clusters 1994 16-node cluster from 486's achieved NASA goal of 1GFLOPS for <$50,000 networking to improve latency (and bandwidth) VI interface standard => Infiniband Network of Workstations (Patterson*) Berkeley Myrinet switches (160MB/sec) 100 Ultra-SPARC record breaking database sort (8.6GB in 1 minute) DES key cracking (3.5hrs for 40-bit key) Inktomi web search engine
- System or Storage Area Networks (SANs) features (original focused on clusters, now storage connection) single room wider bandwidth and faster for less cost than LAN/WAN hardware in-order delivery less switch handshaking no need for fiber source based routing 2001 FC-AL (fiber channel arbitrated loop) October 2000 Infiniband (Intel, HP, IBM, Sun) launched to create point-to-point PCI replacement iSCSI (SCSI over TCP/IP on 1GigE ethernet) TCP/IP offloading
Exercises 877
Appendix A Pipelining: Basic and Intermediate Concepts A.1 Introduction A-2 A.2 The Major Hurdle of Pipelining - Pipeline Hazards A-11 A.3 How Is Pipelining Implemented? A-26 A.4 What Makes Pipelining Hard to Implement?
A-37 A.5 Extending the MIPS Pipeline to Handle Multicycle Operations A-47 A.6 Putting It All Together: The MIPS R4000 Pipeline A-57 A.7 Another View: The MIPS R4300 Pipeline A-66 A.8 Crosscutting Issues A-67 A.9 Fallacies and Pitfalls A-77 A.10 Concluding Remarks A-78 A.11 Historical Perspective and References A-78 Exercises A-81 Appendix B Solutions to Selected Exercises Introduction B-2 B.1 Chapter 1 Solutions B-2 B.2 Chapter 2 Solutions B-7 B.3 Chapter 3 Solutions B-11 B.4 Chapter 4 Solutions B-16 B.5 Chapter 5 Solutions B-21 B.6 Chapter 6 Solutions B-25 B.7 Chapter 7 Solutions B-29 B.8 Chapter 8 Solutions B-30 B.9 Appendix A Solutions B-35 Online Appendices (www.mkp.com/CA3/) Appendix C A Survey of RISC Architectures for Desktop, Server, and Embedded Computers Appendix D An Alternative to RISC: The Intel 80x86 Appendix E Another Alternative to RISC: The VAX Architecture Appendix F The IBM 360/370 Architecture for Mainframe Computers Appendix G Vector Processors Revised by Krste Asanovic Appendix H Computer Arithmetic by David Goldberg Appendix I Implementing Coherence Protocols -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- CS315a Introduction to Parallel Computing: Second Edition by Ananth Grama, George Karypis, Vipin Kumar, Anshul Gupta -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- Contents Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument -- From Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed Argument 3 1.1.3 The Data Communication Argument 4 1.2 Scope of Parallel Computing 4 1.2.1 Applications in Engineering and Design 4 1.2.2 Scientific Applications 5 1.2.3 Commercial Applications 5 1.2.4 Applications in Computer Systems 6 1.3 Organization and Contents of the Text 6 1.4 Bibliographic Remarks 8 Problems 9 CHAPTER 2 Parallel Programming Platforms 11 2.1 Implicit Parallelism: Trends in Microprocessor Architectures 12 2.1.1 Pipelining and Superscalar Execution 12 2.1.2 Very Long Instruction Word Processors 15 2.2 Limitations of Memory System Performance* 16 2.2.1 Improving Effective Memory Latency Using Caches 17 2.2.2 Impact of Memory Bandwidth 18 2.2.3 Alternate Approaches for Hiding Memory Latency 21 2.2.4 Tradeoffs of Multithreading and Prefetching 23 2.3 Dichotomy of Parallel Computing Platforms 24 2.3.1 Control Structure of Parallel Platforms 25 2.3.2 Communication Model of Parallel Platforms 27 2.4 Physical Organization of Parallel Platforms 31 2.4.1 Architecture of an Ideal Parallel Computer 31 2.4.2 Interconnection Networks for Parallel Computers 32 2.4.3 Network Topologies 33 2.4.4 Evaluating Static Interconnection Networks 43 2.4.5 Evaluating Dynamic Interconnection Networks 44 2.4.6 Cache Coherence in Multiprocessor Systems 45 2.5 Communication Costs in Parallel Machines 53 2.5.1 Message Passing Costs in Parallel Computers 53 2.5.2 Communication Costs in Shared-Address-Space Machines 61 2.6 Routing Mechanisms for Interconnection Networks 63 2.7 Impact of Process-Processor Mapping and Mapping Techniques 65 2.7.1 Mapping Techniques for Graphs 66 2.7.2 Cost-Performance Tradeoffs 73 2.8 Bibliographic Remarks 74 Problems 76 CHAPTER 3 Principles of Parallel Algorithm Design 85 3.1 Preliminaries 86 3.1.1 Decomposition, Tasks, and Dependency Graphs 86 3.1.2 Granularity, Concurrency, and Task-Interaction 89 3.1.3 Processes and Mapping 93 3.1.4 Processes versus Processors 94 3.2 Decomposition Techniques 95 3.2.1 Recursive 
Decomposition 95 3.2.2 Data Decomposition 97 3.2.3 Exploratory Decomposition 105 3.2.4 Speculative Decomposition 107 3.2.5 Hybrid Decompositions 109 3.3 Characteristics of Tasks and Interactions 110 3.3.1 Characteristics of Tasks 110 3.3.2 Characteristics of Inter-Task Interactions 112 3.4 Mapping Techniques for Load Balancing 115 3.4.1 Schemes for Static Mapping 117 3.4.2 Schemes for Dynamic Mapping 130 3.5 Methods for Containing Interaction Overheads 132 3.5.1 Maximizing Data Locality 132 3.5.2 Minimizing Contention and Hot Spots 134 3.5.3 Overlapping Computations with Interactions 135 3.5.4 Replicating Data or Computations 136 3.5.5 Using Optimized Collective Interaction Operations 137 3.5.6 Overlapping Interactions with Other Interactions 138 3.6 Parallel Algorithm Models 139 3.6.1 The Data-Parallel Model 139 3.6.2 The Task Graph Model 140 3.6.3 The Work Pool Model 140 3.6.4 The Master-Slave Model 141 3.6.5 The Pipeline or Producer-Consumer Model 141 3.6.6 Hybrid Models 142 3.7 Bibliographic Remarks 142 Problems 143 CHAPTER 4 Basic Communication Operations 147 4.1 One-to-All Broadcast and All-to-One Reduction 149 4.1.1 Ring or Linear Array 149 4.1.2 Mesh 152 4.1.3 Hypercube 153 4.1.4 Balanced Binary Tree 153 4.1.5 Detailed Algorithms 154 4.1.6 Cost Analysis 156 4.2 All-to-All Broadcast and Reduction 157 4.2.1 Linear Array and Ring 158 4.2.2 Mesh 160 4.2.3 Hypercube 161 4.2.4 Cost Analysis 164 4.3 All-Reduce and Prefix-Sum Operations 166 4.4 Scatter and Gather 167 4.5 All-to-All Personalized Communication 170 4.5.1 Ring 173 4.5.2 Mesh 174 4.5.3 Hypercube 175 4.6 Circular Shift 179 4.6.1 Mesh 179 4.6.2 Hypercube 181 4.7 Improving the Speed of Some Communication Operations 184 4.7.1 Splitting and Routing Messages in Parts 184 4.7.2 All-Port Communication 186 4.8 Summary 187 4.9 Bibliographic Remarks 188 Problems 190 CHAPTER 5 Analytical Modeling of Parallel Programs 195 5.1 Sources of Overhead in Parallel Programs 195 5.2 Performance Metrics for Parallel Systems 197 5.2.1 Execution Time 197 5.2.2 Total Parallel Overhead 197 5.2.3 Speedup 198 5.2.4 Efficiency 202 5.2.5 Cost 203 5.3 The Effect of Granularity on Performance 205 5.4 Scalability of Parallel Systems 208 5.4.1 Scaling Characteristics of Parallel Programs 209 5.4.2 The Isoefficiency Metric of Scalability 212 5.4.3 Cost-Optimality and the Isoefficiency Function 217 5.4.4 A Lower Bound on the Isoefficiency Function 217 5.4.5 The Degree of Concurrency and the Isoefficiency Function 218 5.5 Minimum Execution Time and Minimum Cost-Optimal Execution Time 218 5.6 Asymptotic Analysis of Parallel Programs 221 5.7 Other Scalability Metrics 222 5.8 Bibliographic Remarks 226 Problems 228 CHAPTER 6 Programming Using the Message-Passing Paradigm 233 6.1 Principles of Message-Passing Programming 233 6.2 The Building Blocks: Send and Receive Operations 235 6.2.1 Blocking Message Passing Operations 236 6.2.2 Non-Blocking Message Passing Operations 239 6.3 MPI: the Message Passing Interface 240 6.3.1 Starting and Terminating the MPI Library 242 6.3.2 Communicators 242 6.3.3 Getting Information 243 6.3.4 Sending and Receiving Messages 244 6.3.5 Example: Odd-Even Sort 248 6.4 Topologies and Embedding 250 6.4.1 Creating and Using Cartesian Topologies 251 6.4.2 Example: Cannon's Matrix-Matrix Multiplication 253 6.5 Overlapping Communication with Computation 255 6.5.1 Non-Blocking Communication Operations 255 6.6 Collective Communication and Computation Operations 260 6.6.1 Barrier 260 6.6.2 Broadcast 260 6.6.3 Reduction 261 6.6.4 Prefix 263 6.6.5 
Gather 263 6.6.6 Scatter 264 6.6.7 All-to-All 265 6.6.8 Example: One-Dimensional Matrix-Vector Multiplication 266 6.6.9 Example: Single-Source Shortest-Path 268 6.6.10 Example: Sample Sort 270 6.7 Groups and Communicators 272 6.7.1 Example: Two-Dimensional Matrix-Vector Multiplication 274 6.8 Bibliographic Remarks 276 Problems 277 CHAPTER 7 Programming Shared Address Space Platforms 279 7.1 Thread Basics 280 7.2 Why Threads? 281 7.3 The POSIX Thread API 282 7.4 Thread Basics: Creation and Termination 282 7.5 Synchronization Primitives in Pthreads 287 7.5.1 Mutual Exclusion for Shared Variables 287 7.5.2 Condition Variables for Synchronization 294 7.6 Controlling Thread and Synchronization Attributes 298 7.6.1 Attributes Objects for Threads 299 7.6.2 Attributes Objects for Mutexes 300 7.7 Thread Cancellation 301 7.8 Composite Synchronization Constructs 302 7.8.1 Read-Write Locks 302 7.8.2 Barriers 307 7.9 Tips for Designing Asynchronous Programs 310 7.10 OpenMP: a Standard for Directive Based Parallel Programming 311 7.10.1 The OpenMP Programming Model 312 7.10.2 Specifying Concurrent Tasks in OpenMP 315 7.10.3 Synchronization Constructs in OpenMP 322 7.10.4 Data Handling in OpenMP 327 7.10.5 OpenMP Library Functions 328 7.10.6 Environment Variables in OpenMP 330 7.10.7 Explicit Threads versus OpenMP Based Programming 331 7.11 Bibliographic Remarks 332 Problems 332 CHAPTER 8 Dense Matrix Algorithms 337 8.1 Matrix-Vector Multiplication 337 8.1.1 Rowwise 1-D Partitioning 338 8.1.2 2-D Partitioning 341 8.2 Matrix-Matrix Multiplication 345 8.2.1 A Simple Parallel Algorithm 346 8.2.2 Cannon's Algorithm 347 8.2.3 The DNS Algorithm 349 8.3 Solving a System of Linear Equations 352 8.3.1 A Simple Gaussian Elimination Algorithm 353 8.3.2 Gaussian Elimination with Partial Pivoting 366 8.3.3 Solving a Triangular System: Back-Substitution 369 8.3.4 Numerical Considerations in Solving Systems of Linear Equations 370 8.4 Bibliographic Remarks 371 Problems 372 CHAPTER 9 Sorting 379 9.1 Issues in Sorting on Parallel Computers 380 9.1.1 Where the Input and Output Sequences are Stored 380 9.1.2 How Comparisons are Performed 380 9.2 Sorting Networks 382 9.2.1 Bitonic Sort 384 9.2.2 Mapping Bitonic Sort to a Hypercube and a Mesh 387 9.3 Bubble Sort and its Variants 394 9.3.1 Odd-Even Transposition 395 9.3.2 Shellsort 398 9.4 Quicksort 399 9.4.1 Parallelizing Quicksort 401 9.4.2 Parallel Formulation for a CRCW PRAM 402 9.4.3 Parallel Formulation for Practical Architectures 404 9.4.4 Pivot Selection 411 9.5 Bucket and Sample Sort 412 9.6 Other Sorting Algorithms 414 9.6.1 Enumeration Sort 414 9.6.2 Radix Sort 415 9.7 Bibliographic Remarks 416 Problems 419 CHAPTER 10 Graph Algorithms 429 10.1 Definitions and Representation 429 10.2 Minimum Spanning Tree: Prim's Algorithm 432 10.3 Single-Source Shortest Paths: Dijkstra's Algorithm 436 10.4 All-Pairs Shortest Paths 437 10.4.1 Dijkstra's Algorithm 438 10.4.2 Floyd's Algorithm 440 10.4.3 Performance Comparisons 445 10.5 Transitive Closure 445 10.6 Connected Components 446 10.6.1 A Depth-First Search Based Algorithm 446 10.7 Algorithms for Sparse Graphs 450 10.7.1 Finding a Maximal Independent Set 451 10.7.2 Single-Source Shortest Paths 455 10.8 Bibliographic Remarks 462 Problems 465 CHAPTER 11 Search Algorithms for Discrete Optimization Problems 469 11.1 Definitions and Examples 469 11.2 Sequential Search Algorithms 474 11.2.1 Depth-First Search Algorithms 474 11.2.2 Best-First Search Algorithms 478 11.3 Search Overhead Factor 478 11.4 Parallel Depth-First Search 480 
11.4.1 Important Parameters of Parallel DFS 482 11.4.2 A General Framework for Analysis of Parallel DFS 485 11.4.3 Analysis of Load-Balancing Schemes 488 11.4.4 Termination Detection 490 11.4.5 Experimental Results 492 11.4.6 Parallel Formulations of Depth-First Branch-and-Bound Search 495 11.4.7 Parallel Formulations of IDA* 496 11.5 Parallel Best-First Search 496 11.6 Speedup Anomalies in Parallel Search Algorithms 501 11.6.1 Analysis of Average Speedup in Parallel DFS 502 11.7 Bibliographic Remarks 505 Problems 510 CHAPTER 12 Dynamic Programming 515 12.1 Overview of Dynamic Programming 515 12.2 Serial Monadic DP Formulations 518 12.2.1 The Shortest-Path Problem 518 12.2.2 The 0/1 Knapsack Problem 520 12.3 Nonserial Monadic DP Formulations 523 12.3.1 The Longest-Common-Subsequence Problem 523 12.4 Serial Polyadic DP Formulations 526 12.4.1 Floyd's All-Pairs Shortest-Paths Algorithm 526 12.5 Nonserial Polyadic DP Formulations 527 12.5.1 The Optimal Matrix-Parenthesization Problem 527 12.6 Summary and Discussion 530 12.7 Bibliographic Remarks 531 Problems 532 CHAPTER 13 Fast Fourier Transform 537 13.1 The Serial Algorithm 538 13.2 The Binary-Exchange Algorithm 541 13.2.1 A Full Bandwidth Network 541 13.2.2 Limited Bandwidth Network 548 13.2.3 Extra Computations in Parallel FFT 551 13.3 The Transpose Algorithm 553 13.3.1 Two-Dimensional Transpose Algorithm 553 13.3.2 The Generalized Transpose Algorithm 556 13.4 Bibliographic Remarks 560 Problems 562 APPENDIX A Complexity of Functions and Order Analysis 565 A.1 Complexity of Functions 565 A.2 Order Analysis of Functions 566 Bibliography 569 Author Index 611 Subject Index 621 # turn off auto-fill in this file with wide lines. # only works because I have auto-fill on by default, # so specifying it toggles it off...duh. ;;; Local Variables: *** ;;; mode:text *** ;;; mode:auto-fill *** ;;; End: ***