See also: My Computer Architecture Notes from Comps 2004 http://carlstrom.com/stanford/comps/Computer-Architecture.txt -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- CS282 Computer Architecture: A Quantitative Approach, Third Edition by John L. Hennessy and David A. Patterson Foreword vii by Bill Joy - RISC pioneered by John Cocke IBM promoted and named by P&H commericalized in PowerPC, MIPS, and SPARC Preface xvii - embedded vs desktop vs server markets emeddded low power no active cooling game consoles, cameras, cell phones EEMBC benchmarks desktop latency SPEC2000 benchmarks server reliability, scalability, throughput TPC-C benchmarks SPEC FS (File System) Acknowledgments xxv Chapter 1 Fundamentals of Computer Design 1.1 Introduction 2 - moore's law improvements steady architecture improvements more bursty however, figure 1.1 shows that architecture improvements have major impact on relatively performance curve (15x in 2001 starting from 1984) - the move to microprocessors 35% baseline performance improvement per year without architecture improvements new microprocessor architectures easier to adopt because: 1.) easier to switch with mostly non-assembly code 2.) easier to switch with unix/linux to port - microprocessor have meant 1.) supercomputer power for the desktop user 2.) even supercomputers are built with microprocessors - huge transistor budgets have allowed CISC to adopt RISC internally 1.2 The Changing Face of Computing and the Task of the Computer Designer 4 - timeline 60s mainframe 70s mini-computer 80s microprocessors 90s handheld devices - computing markets Desktop - price performance Servers 1.) availability (not reliability) 2.) scalability 3.) throughput Embedded - widest power and cost ranges 8-bit to 32-bit low-end 8-bit 16-bit $00.50 middle 32-bit $10.00 high 64-bit $$$$$$ console, high end network switch , $$$ - real-time performance requirement soft real-time - minimizing 1.) memory can be significant portion of system cost sometimes want to stick to on processor memory code density can be an issue 2.) power battery live lower power can use plastic instead of more expensive ceramic packaging - custom logic 1.) custom logic circuts on chip 2.) custom software 3.) DSP processor with custom hw/sw - The Task of the Computer Designer computer architecture - not just instruction set architecture (ISA) design - also architecture implementation instruction set architecture - programmer visible instruction set - hw/sw interface architecture implementation - organization - hardware organization - memory system - bus structure - CPU design hardware - detailed logic design - packaging technology computer architecture 1.) ISA design 2.) organization 3.) hardware computer architect makes tradeoffs between functional requirements - price - power - performance today, different designs for different markets! 1.3 Technology Trends 11 - Growth Rates - Integrated circuit logic tecehnology 1.) transistor density 35%/yr 2.) die size 10-20%/yr 1/2 => transistor count 35%/yr - Transistor DRAM density 40-60%/yr cycle time groth slow 33% in 10 years bandwidth increases 2x as fast as latency decreases (future is high bandwidth, but relatively high latency) - Magnetic Disk Technology disk density 100%/yr access time growth slow also 33% in 10 years - Network Technology bandwith focus of growth latency room for improvement - Designer must plan ahead, anticipating these growth rates - Discrete leaps, not just continuous changes 32-bit processor on one chip... 
When cache could move on chip... When multiple-cores can move on chip... - Scaling of Transister Performance, Wires, and Power in ICs feature size - three dimensional devices shrink quadratically in horizontal devices also shrink in vertical affects operating volatage - growth transistor performance improves linearly with decreasing feature size transistor count improves quadratically with decreasing feature size big budget - move from 4,7,16,32,64 bit processors - pipelining - caches wires do not improve - signal delay proportional to R*C (resistance*capacitance) - wires get shorter but R*C per unit length increase - move to copper provided one-time improvement in wire-delay - wire delay is now a major issue, rivaling transistor switching delay - large ammount of cycle time for signal propagation Pentium 4 has 2 stages of 20+ stage pipleline just for signal propagation power scaling - CMOS processor power dominated by switching transistors - energy/transistor proportional to load capacitance * switching frequency * square of voltage - parameter trends don't favor power decreasing :) transistor count ++ switching frequency ++ load capacitance - voltage - - power likey to be limiting factor for chips, not raw transistor count 1.4 Cost, Price, and Their Trends 14 - Impact of Time, Volume, and Commodification key factors in cost 1.) learning curve - manufactuing costs decrease over time yield (measure of learning curve) - percentage of manufactured devices that pass testing - 2x the yield = 1/2 the cost DRAM price=~cost, as yield improves, price comes down directly CPU price~!cost, yield improvements, ..., profit!!! :) (to be fair, the book says that price decreases, just not as fast as for commodity DRAMS) 2.) volume - rule of thumb: cost decreases 10% each time volume doubles - ammortize development cost commodities - clear product definition - drives volume - pressure to lower price down to (or below) cost - drives low end growth, while sacrificing profits - Cost of an Integrated Circuit rest of system is so commodity oriented, IC cost can be differentiating factor wafers, chopped into dies simple math... cost of ic = (cost of die + cost of testing die + cost of packagine + cost of final test) / final test yield cost of die = cost of wafer / (dies per wafer * die yield) diers per wafer = area of wafer / area per die (approximately...) can use redundancy to raise yield example, DRAM includes redundant memory cells to accomidate flaws also in stand alone SRAMs also in processing embedded cache SRAM architect can control die size I/O pins redundancy and binning hacks mask set cost significant for low production runs $1 million for 4-6 high density metal masks cost going up use reconfigutable logic for flexibility use gate arrays which have fewer custom mask levels - Distribution of Costs in a system: An example $1000 PC overall costs 6% cabinet 37% processor board 37% I/O devices 20% software (OS+office) 22% for processor and 19% for monitor - Cost versus Price - Why they differ and by how much 1.) component costs 2.) direct costs = labor, purchasing components, scrap, warranty = 10-30% 3.) gross margin = indirect costs R&D, marketing, sales, manufacturing equipment maintenance, real estate, cost of financing, pretax profits, taxes 4.) average discount 1-3 = average selling priace 1-4 = list price R&D percentage 4% commodity PC business 12% high end server business 15%-20% is unsustainable expensive, large machine problem... 
sell few AND higher R&D costs need to have higher gross margin options high-performance design (supercomputer, high end servers) low-cost design (embedded) cost-performance (desktop, workstation, low end server) 1.5 Measuring and Reporting Performance 24 - Metrics response time (aka execution time) - how long does it take to get my job done - one task vs many tasks => latency throughput (vs response time) - how much work can get done in a certain time - many tasks vs one task - important measure for batch, server, supercomputer operations execution time is only completely reliable metric (see section 1.9) - Measuring Performance wall-clock time, response time, elapsed time latency to complete a task including disk access, memory access, I/O, OS overhead, everything CPU time does not include time waiting for I/O or other running programs divided into user CPU time (program) and sytem CPU time (OS) system performance vs CPU performance system performance: wall-clock time on unloaded system CPU performance: user CPU time - Choosing Program to Evalutate Performance workload typical mixture of programs and OS commands that users run on a machine bests to work approximations of workload 1.) real applications compilers, word processors, photoshop portability problems for OS dependant apps 2.) modified (or scripted) applications modified for portability modified to focus on one part of execution scripts to similate user response, esp in multi-user case 3.) kernels isolate performance of individual features of a machine Linpack and "Livermore Loops" 4.) toy benchmarks 10-100 line programs Sieve of Eratosthenes, Puzzle, Quicksort 5.) synthetic benchmarks kernels are at least from real programs synthetic are artificial either compiler optimizer can't make sense of synthetic program OR compiler can over optimize over simplified synthetic program Whetstone and Dhrystone - Benchmark suites SPEC - 1980's effort for better workstation benchmarks Ziff Davis does a lot of Windows machine benchmarks Business Winstone: browser and office product multitasking CC Winstone: content creation (browser, photoshop, premier, audio editing) Winbench: kernels for CPU, video, disk performance - Desktop Benchmarks for CPU and Graphics SPEC CPU benchmarks: SPEC89, SPEC92, SPEC95, SPEC CPU2000 CPU2000 CINT2000 11 integer benchmarks c-compiler, VLSI place-and-route, graphics app CFP2000 14 floating point benchmarks quantum chromodynamics, finent element modeling, fluid dynamics focus on portability and minimizing role of I/O C, C++, F77, F90 for desktops and single-CPU servers results based on machine, compiler, and OS configuration SPECviewperf for OpenGL 3D rendering SPECapc uses graphics applications Pro/Enginner - solid modeling of photocopier SolidWorks 2001 - 3D CAD/CAM of assembly line Unigraphics V15 - aircraft model assembly, drafting, numeric control machining, solid modeling, and optimization - Server Benchmarks SPECrate - run multiple copies of SPEC CPU2000, usually one copy per CPU SPECSFS - NFS file server benchmark - throughput measurement but with requirements on response time SPECWeb - web server benchmark - static and dynamic pages generatio - server responses to client posts TP (transaction processing) benchmarks measurement is transactions/second however, response time requirement slow reponding transactions don't count in total higher transactions rates means large systems that is more concurrent users and large data set system cost must be reported to allow cost-performance analysis TPC-A 1985 TPC-C 
      1992
      complex query environment
    TPC-H
      ad hoc decision support
      queries unrelated, past queries can't help optimize future
    TPC-R
      business decision support
      standard set of queries allows pre-optimization
    TPC-W
      web-based transaction benchmark
      covers database and web server
- Embedded Benchmarks
  EEMBC (pronounced "embassy")
  5 classes of kernels
  1.) 16 automotive/industrial - 6 microbenchmarks, 5 automotive, 5 filter/FFT
  2.) 5 consumer - multimedia
  3.) 3 networking - shortest path, routing, packet flow
  4.) 4 office automation - graphics and text
  5.) 6 telecommunication - filtering and DSP
- Reporting performance results
  reproducibility
    list everything another experimenter would need to duplicate the results
  SPEC requires
    fairly complete description of machine
    compiler flags
    baseline and optimized results (see software configuration below)
    actual performance times, in tabular form and graph
  TPC requires even more
    benchmarking audit
    cost information
  software configuration
    often run in single-user mode to reduce overhead
    OS enhancements help TPC
    compiler enhancements help CPU
      especially when source modification allowed
      benchmark-specific optimization flags!!!
        which is why SPEC requires both baseline and optimized reporting
        baseline optimizations have to work for ALL apps in same language
  modifications allowed?
  1.) no modification
      SPEC
  2.) modifications allowed but are difficult or impossible
      TPC-C typically uses closed-source databases from Oracle and Microsoft
  3.) modifications are allowed
      NAS Supercomputing benchmark
        as long as you produce same output on same input...
      EEMBC
        have to report results with original code as well as modified source
  4.) hand coding (assembly)
      EEMBC allows hand coding of all benchmarks
- Comparing and Summarizing Performance
  faster: is it clear what this means?
  lying with statistics
- Total Execution Time: A Consistent Summary Measure
  arithmetic mean tracks total execution time
- Weighted Execution Time
  weighted arithmetic mean for when some apps are more important than others...
    for example, if one does 4x of the work...
    but isn't that dependent on the apps scaling the same on new configurations?
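(Illustrative sketch, not from the book: a little C program comparing the
arithmetic mean with a weighted arithmetic mean for two machines running two
programs. All execution times and weights are made-up numbers, chosen so the
weighting reverses which machine looks faster.)

#include <stdio.h>

#define NPROGS 2

int main(void) {
    /* made-up seconds to run each program on machine A and machine B */
    double time_a[NPROGS] = {1.0, 1000.0};
    double time_b[NPROGS] = {10.0, 100.0};
    /* hypothetical weights: program 0 dominates the workload */
    double weight[NPROGS] = {0.995, 0.005};

    double total_a = 0.0, total_b = 0.0;   /* total execution time */
    double wmean_a = 0.0, wmean_b = 0.0;   /* weighted arithmetic mean */
    for (int i = 0; i < NPROGS; i++) {
        total_a += time_a[i];
        total_b += time_b[i];
        wmean_a += weight[i] * time_a[i];
        wmean_b += weight[i] * time_b[i];
    }

    /* the arithmetic mean tracks total execution time (total / NPROGS):
     * by this measure machine B looks roughly 9x faster */
    printf("arithmetic mean: A=%8.2f  B=%8.2f\n",
           total_a / NPROGS, total_b / NPROGS);
    /* with weights that favor program 0, machine A comes out ahead instead;
     * the summary depends entirely on the assumed workload mix */
    printf("weighted mean:   A=%8.2f  B=%8.2f\n", wmean_a, wmean_b);
    return 0;
}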
- Normalized Execution Time and the Pros and Cons of Geometric Means
  normalize execution time to reference machine
  take the average of the normalized execution times
  used by SPEC, where a SPARCstation is used for reference
  geometric mean is consistent no matter which machine is the reference
    also makes it independent of biasing by using larger inputs on your better apps
      of course, most benchmarks specify inputs
  HOWEVER
    violates fundamental rule of measuring performance by execution time
    encourages vendors to focus on what is easy, not what is hard
      especially on small, short-running programs, not large programs with multiple phases

1.6 Quantitative Principles of Computer Design 39
- Make the Common Case Fast
- Amdahl's Law
  the law of diminishing returns
  speedup_overall = 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)
- CPU Performance Equation

               Seconds    Instructions   Clock Cycles     Seconds
   CPU time = --------- = ------------ * ------------ * -----------
               Program      Program      Instruction    Clock Cycle

  Several parameters are interdependent:
    Clock cycle time  - hardware technology and organization
    CPI               - organization and instruction set architecture
    Instruction count - instruction set architecture and compiler technology
  specifically
    organization affects clock cycle time and CPI
    ISA affects CPI and instruction count
    hardware technology largely affects cycle time, but also CPI
    compiler technology affects instruction count
- Measuring and Modeling the Components of the CPU Performance Equation
  existing CPU
    measure execution time
    clock speed is known
    instruction count from hardware performance counters
    statistical CPI from hardware performance counters
    simulation can give more detail
  simulation
    profile based, static modeling
      get a program profile via
      1.) hardware sampling
      2.) instrumented execution
      3.) ISA simulator
    trace based
      particularly useful for memory system performance
      can be combined with static analysis of simple pipeline performance
        or more detailed dynamic analysis for complicated pipelines
      isolates model of memory from pipeline
    execution driven simulations
      detailed simulation of both memory and pipeline
- Principle of Locality
  temporal locality
  spatial locality
- Take Advantage of Parallelism
  system level
    SPECWeb or TPC
    multiple processors
    multiple disks
  processor level
    ILP
    pipelining
  digital design
    set associative caches
    ALU with carry-lookahead adders
  speculation
    evaluate multiple possible outcomes, late selection of result
    carry select adders
    set-associative caches
    handling branches in pipelines

1.7 Putting It All Together: Performance and Price-Performance 48
- Performance and Price-Performance for Desktop Systems
  Figure 1.19, integer performance roughly parallel to price-performance
    notably, there was not a significant premium for extra performance on high end systems,
    in other words you get what you pay for
  Figure 1.20, FP performance vs price-performance more varied
    Pentium 4 system price-performance was notably higher
    in other words, P4 a good deal if you need FP
- Performance and Price-Performance for Transaction-Processing Servers
  Measuring OLTP with TPC-C
  1.) reasonable approximation of real OLTP
  2.) total system performance
  3.) rules are very complete
  4.) vendors spend effort to make it run well
  5.)
requirement to report performance and price-performance TPM (transactions/minute) and TPM/$ Figure 1.22 shows 6 highest absolute performance OLTP systems, clusters of 48-280 cpus Figure 1.23 shows 6 highest price-performance OLTP systems, SMPs with 3-8 cpus 4x better in terms of price-performance than their big brothers - Performance and Price-Performance for Embedded Processors cost can be affected by integrated memory control and I/O controller not reflected in benchmarks x86, PowerPC, MIPS64 systems wide range in price makes slower processors more cost-effective (not true in Desktop systems, similar to high performance premium in servers) 1.8 Another View: Power Consumption and Efficiency as the Metric 56 - Performance/Watt Power consumption varies in embedded processors by a factor of 10x desktop processors pushing against air cooling limits server operating costs affected by power consumption 1.9 Fallacies and Pitfalls 57 - Fallacy The relative performance of two processors with the same instruction set architecture (ISA) can be judge by clock rate or by the performance of a single benchmark suite - Fallacy Benchmarks remain valid indefinitely maxtrix300 in SPEC89 99% of time in one line blocking optimization owned this benchmark adding compiler optimizations for specific benchmarks SPEC92 increased GCC input sizes to be more interesting half of benchmarks added in 1992 and 1995 were dropped in next release - Pitfall Comparing hand-coded assembly and compiler-generated, high-level language performance - Fallacy Peak performance tracks observed performance difference is usually 10x in supercomputer applications - Fallacy The best design for a computer is the one that optimizes the primary object without considering implementation - Pitfall Neglecting the cost of software in either evaluating a system or examining cost-performance - Pitfall Falling prey to Amdahl's Law - Fallacy Synthetic benchmarks predict performance for real programs - Fallacy MIPS is an accurate measure for comparing performance among computers 1.10 Concluding Remarks 65 1.11 Historical Perspective and References 67 - The First General-Purpose Electronic Computers ENIAC J. Presper Eckert, John Mauchly University of Pennsylvania EDVAC stored program computer proposal "von Neumann machine" John von Neumann, J. 
Presper Eckert, John Mauchly EDSAC Maurice Wilkes Cambridge University IAS machine Institute for Advanced Study, Princeton Burks, Holdstine, von Neumann basis for IBM 701 Mark-I, Mark-II, Mark-III, Mark-IB Howard Aiken Harvard "Harvard architecture" had seperate memories for instructions and data term sometimes used today for split L1 caches for instructions and data Whirlwind MIT Magnetic Core Memory - Important Special-Purposes Machines BOMB UK Alan Turing Electro-mechanical COLOSSUS UK Newman and Flowers Electronic ERA machines US acquired by Sperry-Rand (which also owned Eckert Mauchly Computer Corporation) Konrad Zuse Germany first floating point, deemed unnecessary by von Neumann John Atanasoff Iowa State University ABC computer, never completely operations binary representation however, Mauchly visited, probably was influenced by ABC - Commerical Developments BINAC Eckert-Mauchly Computer Corporation for Northrup UNIVAC I Eckert-Mauchly -> Remington-Rand -> Sperry-Rand -> UNIVAC division IBM 701, IBM 702 modest successes IBM 650, 704, 705 significant successes - Development of Quantitative Performance Measures: Successes and Failures Gibson mix average instruction execution time MIPS Whetstone Algol 60 -> FORTRAN Livermore FORTRAN Kernels Relative MIPS VAX-11/780 same speed relative to IBM 370/158 which was 1 MIPS MFLOPS usually only peak reported... SPEC see above Exercises 74 Chapter 2 Instruction Set Principles and Examples 2.1 Introduction 90 - MIPS architecture spans embedded, desktop, server - x86 shows more than purely aesthetic design is important - DSP instructions largely used by special purpose libraries never generated by compiler... - Vector instructions again, largely used by platform specific libraries including memcpy... note quite as special purpose as DSP instructions GCC 4 does some auto-vectorization... 
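(Illustrative sketch, not from the book: the kind of C loop an auto-vectorizer
can turn into SIMD/vector instructions - independent element-wise work with no
loop-carried dependence. Whether it actually gets vectorized depends on the
compiler version, target, and flags; on GCC something like -O3 or an explicit
-ftree-vectorize is the usual way to ask for it.)

#include <stdio.h>

/* y[i] = a*x[i] + y[i]; each iteration is independent, so the compiler may
 * process several floats per instruction (e.g., 4 with 128-bit SSE/AltiVec) */
static void saxpy(float *restrict y, const float *restrict x, float a, int n) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    float x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float y[8] = {0};
    saxpy(y, x, 2.0f, 8);
    printf("y[7] = %g\n", y[7]);   /* prints 16 */
    return 0;
}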
2.2 Classifying Instruction Set Architectures 92 - categories Stack implicit arguments from stack example: C=A+B PUSH A PUSH B ADD POP C Symbolics, Java Virtual Machine Accumulator implicit argument of accumulator example: C=A+B LOAD A ADD B STORE C General-purpose register architectures no implicit arguments Register-memory (CISC) operands from register and memory example: C=A+B LOAD R1, A ADD R3, R1,B STORE R3, C IBM 360/370, Intel 80x86, Morotola 68k, TI TMS320C45x pros data can be accessed w/o seperate load instruction format easy to encode good code density cons operands not equivalent since a source operatnd in binary operation destroyed encoding register and memory in each instruction can restrict number of registers clocks per instruction vary by operand location Register-register/load-store (RISC) example: C=A+B LOAD R1, A LOAD R2, B ADD R3, R1, R2 STORE R3, C operands from registers only, except load/store from/to memory Alpha, ARM, MIPS, PowerPC, SPARC, SuperH, Trimedia TM5200 pros simple fixed-length instruction coding simple code generation model instructions take similar number of clocks to execute cons higher instruction count because of extra memory reference instructions more instructions and lower density leads to larger programs Memory-Memory operands from memory only (not in use today) example: C=A+B ADD C, A, B VAX pros most compact doesn't waste registers on temporaries cons large variation in instruction size especially for 3 operand instructios large variation in work per instruction memory access create memory bottlenck extended accumulator or special purpose register computer x86 certain registers are implicit in some instructions - advantage of general-purpose register architectures 1.) registers are faster than memory 2.) registers are more efficient for a compiler to use allows easy reordering without maintaining stack discipline registers can hold variables reduces memory traffic register are faster as well (stated in 1.) 
code density improves (fewer bits to name register than memory) general-purpose registers best for compiler special purpose users constrain compiler DSP crowd still has a lot of special purpose registers ...and hand coding as well so that isn't as much of an issue calling conventions dictate some register usage register for argument passing volatile vs non-volatile registers across function calls SUMMARY NEW architectures are register-register/load-store 2.3 Memory Addressing 95 - Interpretting Memory Addresses endianness little endian 76543210 strings appear backwards (DRAWKCAB) in registers big endian 01234567 alignment byte, half word, word, double word (ie, 8, 16, 32, 64 bit) - Addressing Modes Effective Address the computed address actually used to access memory, regardless of mode PC-relative addressing primarily for code addresses in control transfer Figure 2.6 Register Add R4, R3 When a value is in a register Immediate Add R4, #3 For constants, embedded in instruction Displacement Add R4, 100(R1) constant offset from address in register Register indirect Add R4, (R1) special case of displacement with no offset Indexed Add R4, (R1+R2) useful for array indexing, base in R1, offset in R2 Direct or absolute Add R1, (1001) useful for static data, need large address Memory indirect Add R1, @(R3) double indirect Autoincrement Add R1, (R2)+ Useful for stepping through an array Autodecrement Add R1, -(R2) useful for push/pop with autoincrement Scaled Add R1, 100(R2)[R3] similar to indexed, but r3 is scaled by constant such as word size Figure 2.7 Frequently used addressing moves for 3 SPEC89 apps on VAX (which had the richest modes) Displacement, Immediate, Register Indirect, Scaled, Memory indirect (excluding PC-relative) - Displacement Addressing Mode affects code density Figure 2.8 On Alpha with 16-bit displacement field, running SPEC CPU2000 integer programs use smaller displacements on average (1-bit) floating point programs use large displacements on average (13 bits) - Immediate or Literal Addressing Mode Figure 2.9 1/4 of data transfers and ALU operations use immediate operand Figure 2.10 On Alpha with 16-bit immediate field, running SPEC CPU2000 integer programs use LARGE displacements on average (14-bit) floating point programs use SMALLER displacements on average (4 bits) On VAX with 32-bit immediates, 20%-25% of immediates were bigger than 16-bits 2.4 Addressing Modes for Signal Processing 101 - Circuluar buffers are common in DSP applications Special "modulo" or "circular" addressing mode - FFT needs a particular permution Special "bit reverse" addressing mode (just for FFT!) SUMMARY: Memory Addressing Expectations for new architecture: 1.) addressing modes: displacement, immediate, register indirect Figure 2.7 shows these cover 75-99% of SPEC CPU2000 apps 2.) displacement 12-16 bits Figure 2.8 shows these cover 75-99% of displacements 3.) 
immediate 8-16 bits Figure 2.10 shows these cover 50-80% of displacements Modes that cannot be targeted by compiler of questionable value even not frequently used ones are not usually worth implementation cost 2.5 Type and Size of Operands 104 - type implied by operation tagged in hardware only in museums - examples integer byte, half-word, word floating point single-precision, double-precision before 1980, vendor specific implementation IEEE standard 754 character (8-bit byte or 16-bit unicode) now rare "string" operations binary coded decimal (BCD) 4-bits per digit pack and unpack into string form exact arithmetic - registers contents typically store full word or double precision values e.g., byte becomes word, single becomes double VAX had operations for shorter values but only used 6% of operand accesses 2.6 Operands for Media and Signal Processing 105 - vertex x,y,z,w a 3d point with additional "w" information (such as color) 3 vertexes is triangle... - pixels render that triable to get r,g,b,a (color + alpha transparency) - fixed point fractions between -1 and +1 in binary SUMMARY: Type and Size Operands - 32-bit architecture 8,16,32 integers 32,64 doubles - 64-bit adds 64-bit integers - probably not decimal (BCD) support - DSP's have wider accumulator registers than memory for more accuracy often more that 2x more (x86 floating point similar, 80-bit precision or something like that -bdc) 2.7 Operations in the Instruction Set 108 - Categories of instructions: Figure 2.15 Arithmetic and logical Data transfer Control System Floating point Decimal String Graphics - Most frequently used are most simple Top 10 for x86 covering 96% (Figure 2.16) 1 22% load 2 20% conditional branch 3 16% compre 4 12% store 5 8% add 6 6% and 7 5% sub 8 4% move register-register 9 1% call 10 1% return 2.8 Operations for Media and Signal Processing 109 - Media processor output judged by human perception which can't perceieve 64-bit values :) therefore use less precision example graphics using single precision - partitioned add since using 4 16-bit values, let 64-bit ALU do 4 adds in parallel just need to prevent carries between partitions this is just special case of ... - vector instructions single-instruction multiple-data (SIMD) - commerical examples Alpha MAX HP PA-RISC MAX2 Intel Pentium MMX PowerPC AltiVec SPARC VIS - categories add/subtract saturating add/sub multiply compare shift right/left shift right arithmetic multiple and add shift and add (satuarying) absolute difference max/min pack (2n bits->n bits) unkac/merge permute/shuffle - AltiVec seems to be most complete uses 128-bit ALU AltiVec followed by MMX - DSP processor issues 1.) saturating artitmetic don't want exception on overflow, just go to max value of type 2.) special rounding modes for storing from wide accumulator to memory 3.) multiple-accumualte instruction for dot product other wacky DSP stuff - some memory mapped registers! 
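(Illustrative sketch, not from the book or any particular ISA: plain C versions
of two ideas from this section - a saturating 16-bit add, and a "partitioned"
add that does four independent 16-bit adds packed in one 64-bit word by keeping
carries from crossing lane boundaries.)

#include <stdint.h>
#include <stdio.h>

/* Saturating add: on overflow, clamp to the largest/smallest value
 * instead of wrapping around (the DSP behavior described above). */
static int16_t sat_add16(int16_t a, int16_t b) {
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Partitioned (SIMD-style) add: four 16-bit lanes at once, with no carry
 * allowed to spill into the neighboring lane.  Trick: clear the top bit of
 * each lane, add, then patch the top bits back in with XOR so no carry ever
 * crosses a 16-bit boundary. */
static uint64_t padd16x4(uint64_t a, uint64_t b) {
    const uint64_t MASK = 0x7FFF7FFF7FFF7FFFULL;   /* low 15 bits of each lane */
    uint64_t low  = (a & MASK) + (b & MASK);       /* these adds cannot cross lanes */
    uint64_t high = (a ^ b) & ~MASK;               /* top bit of each lane, mod 2 */
    return low ^ high;                             /* wrapping add per 16-bit lane */
}

int main(void) {
    printf("sat_add16(30000, 10000) = %d\n", sat_add16(30000, 10000)); /* 32767 */
    uint64_t r = padd16x4(0x0001000200030004ULL, 0x0010002000300040ULL);
    printf("padd16x4 -> 0x%016llx\n", (unsigned long long)r); /* 0x0011002200330044 */
    return 0;
}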
- usually no FP unit, all fixed point SUMMARY simple instructions are popular and important load, store, add, subtract, move register-register, shift DSP adds multiply and multiply-accumulate 2.9 Instructions for Control Flow 111 - many names for control flow transfers, branch, jump jump => unconditional branch => conditional - 4 major types of control flow instructions conditional branches jumps procedure calls procedure returns - frequencey in alpha runinng SPEC CPU2000 approx 80% condition branch approx 10% jump approx 10% call/return about the same for INT and FP, see Figure 2.19 - Addressing Modes for Control Flow Instructions Destination Address the computed address actually used as target for control flow mode examples PC-relative position independance position independant code (PIC) works for conditional branches jumps quantitative results from Alpha (Figure 2.20) 4-8 bits of displacement needed for integer and floating point respectively 75% of are forward relative RISC has lower code density, but knows code is word aligned, saving lower bits CISC has higher code density, but needs byte indexed offset taking more space register indirect works when target is not known at compile (really link) time case/switch statements virtual functions or methods high-order functions or function pointers dynamically shared libraries - Conditional Branch Options Figure 2.21 Condition code test special bits set by ALU operations, possibly under program control x86, ARM, PowerPC, SPARC, SuperH pros sometimes condition is set for free (w/o extra instruction) measurement shows this is rare cons CC is extra state (special register on PowerPC, many flags in x86 -bdc) constrain ordering of instructions especially when codes are set unconditional (not a bit in instruction) Condition register tests arbitrary register with result of comparison Alpha, MIPS pros simple cons uses up a register Compare and branch compre is part of the branch, often limited to subset PA-RISC, VAX pros one instructions rather than two for branch cons may be too much work per instruction for pipelined execution by limiting to subset, such as branch conditionally on zero, can overcome pipeline issue Figure 2.22 approximate comparison type frequency on alpha running SPEC CPU2000 <= 35% < 35% == 15% != 5% >= 10% > 0% DSP additional condition branch type repeat mode repeat instruction or block a fixed number of times say 256 used with autoincrememnt/autodecrement to reduce kernel loop overhead - Procedure Invocation Options Requirement transfer control save return address in special link register(LR) or general purpose register(GPR) Options save additional register state Conventions caller saved (volatile) save needed registers before function call callee saving (non-volatile) save changed before use, restore before return separate compilation makes it hard for compiler to optimize away caller saved application binary interface(ABI) convention defining register usage caller saved and callee saved also argument passing, frame pointer, stack pointer SUMMARY PC-relative branch with at least 8-bit displacement register indirect for returns, etc 2.10 Encoding an Instruction Set 117 - opcode - address specifier CISC way of identifying operand location RISC encodes as logically part of opcode - tradeoffs between 1.) want to have as many registers AND addressing modes as possible 2.) size of register and addressing mode fields affects size of program and therefore average program ize 3.) 
a desire for instruction lengths to allow efficient instruction decoding of for pipelining. fixed size preferable, multiple of bytes at the minimum - options (Figure 2.23) variable VAX, x86 opcode, number of operands, address specifier1, address field1, ... Fixed Alpha, ARM, MIPS, PowerPC, SPARC, SuperH opcode, address field 1, address field 3, address field 3 hybrid (small set of regular options, can be different lengths) IBM 360/70, MIPS16, Thumb, TI TMS320C54x opcode, address specifier, address field opcode, address specifier1, address specifier2, address field opcode, address specifier, address field1, address field2 - Reduced Code Size in RISCs RISC variants with 16-bit wide and 32-bit wide instructions examples ARM Thumb and MIPS 16 claim code size reduction of 40% IBM offers CodePack which uses code compression decompressed on fetch to i-cache downside: i-cache filled with 32-bit instructions hybrid's have 25% more logically instructions in i-cache CodePack uses RLE of PowerPC - specific to program - 2k table on chip for decode - hashtable to map aligned to unaligned addresses Performance? - 10% hit - 35%-40% code reduction SuperH - 16-bit wide instructions only - only 16 registers vs 32 to make it fit - otherwise looks like classic RISC SUMMARY - speed => fixed length - size => variable length 2.11 Crosscutting Issues: The Role of Compilers 120 - today almost all programming done in high level languages ISA is simply a compiler target - historically (and today's DSP market) focus on making it easy on assembly programmers - broken thinking trying to isolate ISA from compiler as bad as seperating ISA from hardware implementation tradeoffs between each interface - The Structure of Recent Compilers (Figure 2.34) Front end per language -> intermediate representation High-level optimizations -> loop transformations -> procedure inlining (aka procedure integration) Global optimizier -> global and local optimizations -> register allocation Code Generator -> detailed instruction selection -> machine-dependant optimizations -> (peep hole optimziations -bdc) - Goals 1.) correctness 2.) speed - Complexity #1 limits optimizations that can help with #2 phase ordering problem need to decide inlining before know size from code generator - compiler-ISA interaction example global common subexpression elimination big impact if shared temporary location is register allocated - Register Allocation probably the most important optimization graph coloring to try and achieve 100% allocation to registers NP-complete but heuristics work well However, need 16+ registers for heuristics to work well - Impact of Optimizations on Performance Figure 2.25 High-level (at or near the source level) Procedure inliming Local (within straightline code) 18% common subexpression elimination 22% constant propagation stack height reduction minimize resourced needed for expression evaluation Global (across a branch) 13% global common subexpression elimination 11% copy propagation 16% copy motion 2% induction variable elimination Processor-dependant strength reduction pipeline scheduling branch offset optimization Figure 2.26 optimizations eliminate 25%-90% of instructions - The Impact of Compiler Technology on the Architect's Decisions Question 1 of 2: How are variables allocated and addressed? Question 2 of 2: How many registers are needed to allocate variables appropraitely? 
Answer 1 stack local variables global data area static declared objects frequently arrays or other large data structures heap dynamically allocated objects that don't follow stack discipline accessed via pointers, typically not scalars Answer 2 registers are more effective for stack than global very hard for heap allocated objects "aliasing" problem for heap and some stack variables - How the Architect Can Help the Compiler Writer Make the frequent cases fast and the rare cases correct 1.) provide regularity orthogonality of operations, data types, addressing modes 2.) provide primitives, not solutions 3.) simplify trade-offs among alternatives 4.) provide instructions that bind the qualatites known at compile time as constants example of badness: VAX calls instruction dynamically interprets register save mask known at compiletime at runtime - Compiler Support (or Lack Thereof) for Multimedia Instructions SIMD problems - solutions not primitives - short of registers - data types don't match existing programming languages general vector architectures not bad compiler support for those limits on size MMX 64-bit total, altivec 128 SSE introduced for 128-bit support (MMX not extensible) better design? seperate length and size let processor determine scheduling of 128 8-bit adds old processor might do 8 at at time, new 16 (64=8*8, 128=8*16) general vector architectures good for hiding memory latency access pattern visible ahead of time options like - strided addressing - gather/scatter addressing allow more programs to be vectorized the "register indirect" mode of vector computers because of problems, SIMD typically only found in hand coded libraries (out of date: - GCC4 does target SIMD now... - but what is usage in SPEC CPU2000? ) SUMMARY of role of compilers - 16 general purpose registers to simplify register allocation not counting floating point - orthogonality: all addressing modes for all load/store instructions - KISS (keep it simply stupid) principple primitives not solutions simplify tradeoffs don't bind constants at runtime - current SIMD is good marketing, not good design 2.12 Putting It All Together: The MIPS Architecture 129 - based on rationalizations of previous sections we come up with our ISA design which happes to be MIPS-64 2.2 GPR registers with load-store architecture 2.3 Addressing modes displacement with offset of 12-16 bits immediates of 8-16 bits register indirect 2.5 data sizes and types 8,16,32,64-bit integers 64-bit IEEE 754 floating point 2.7 load,store,add,sub.mv,,shift 2.9 eq,ne,lt branch with PC-relative 8-bit offset jump, call, return 2.10 fixed instruction encoding for performance (or variable for size) 2.11 16+ GPRs, supported in all addressing modes separate set of FPRs - MIPS64 (aka MIPS) - simple load-store instruction set - design for pipelining efficiency, including a fixed instruction set encoding - efficiency as a compiler target - Registers for MIPS 32 64-bit GPRs R0..R31 32 64-bit FPRs F0..F31 (hold 1 float, 1 double, or 2 floats) R0 is always 0 SPRs (for example floating-point status) Can move between GPR and FPR - Data Types for MIPS 8,16,32,64-bit integers 32,64-bit floating point ALU operations on 64-bit integers, 32/64 floating point smaller integer types are zero or sign extended - Addressing Modes for MIPS Data Transfers immediate and displacement both with 16-bit offset fields register indirect: use 0 for displacement absolute addressing with 16 bit field using R0 as base register byte addressable aligned memory accessed based on 
8,16,32,64-bit types bit endian or little endian - MIPS Instruction Format 32-bit fixed size 6-bit for primary opcode I-type (immediate) 6-opcode 5-rs 5-rt 16-immediate loads and stores immediates conditional branch rs is register, rt unused jump register, junp and link register rt=0, rs=dest, immediate=0 R-type (register) 6-opcode 5-rs 5-rt 5-rd 5-shamt 6-function register-register ALU operations rd = rs funct rt (shamt for shift?) read/write special registers move registers J-type (jump) 6-opcode 28-offset jump and jump and lick trap and return from exception - MIPS Operations loads and stores ALU operations loading r0 always is 0 all register-to-register add,sub,and,or,xor,shifts with and without immediate operands lui (load upper immediate) allows two instruction sequence to load 32-bit immediate branches and jumps floating point operations explict conversion between single and double precision r0 LI: loading a constant is just adding immediate to r0 MOV: register-register move is just adding r0 to a register - MIPS Control Flow Instructions compare two registers, store result in third register set-equal, set-not-equal, set-less-than (really just ALU operation) control via jumps and branches 4 jumps, choose between PC-relative, absolute choose between plain or jump-and-link branches test register for zero or non-zero test register for negative test registers equal test floatint-point status register predicated instructions branches cause problems for pipeline change branches into conditional ALU operation instead MIPS example: MOVZ R1,R2,R3 (move r2 to r1 if r3 is zero) - MIPS Floating-Point Operations move without type or precision conversion move with precision conversion move with type conversion to/from integer add,sub,mul,div FP compare sets floating point status register FP branch uses FP status registeer, testing true or false paired single (for graphics) special ops for operating one two 32-bit flats in one register MADD (for DSP) multiple-add for fast dot product - MIPS Instruction Set Usage Figure 2.32 SPECint2000 MIPS instruction usage load 26% store 10% add 19% compare 5% cond branch 12% and 4% or (including mov) 9% Figure 2.33 SPECfp2000 MIPS instruction usage load 15% add 23% load imm 5% cond branch 4% load FP 15% store FP 7% add FP 7% mul FP 8% 2.13 Another View: The Trimedia TM32 CPU 136 - media processors Figure 2.35 example apps, benchmark data communication - Viterb deconding audio coding - AC3 decode video coding - mpeg2 encode, dvd decode video processing - layer natural motion, dynamic nouse, reduction, peaking graphics - 3d renderer library - Trimedia TM32 CPU set-top boxes, advanced TVs 1.) 128 32-bit registers, integer or fp 2.) partitioned ALU (or SIMD) instructions for operations on narrow types 3.) saturating DSP arithmetic, in addition to twos complement 5-way VLIW many NOPs, uses compressed instructions, decoded in I-cache 2-3x code size over MIPS, even with with compression hand coded EEMBC apps have many pack/merge instructions to align data for SIMD 2.14 Fallacies and Pitfalls 142 - Pitfall Designing a "high-level" instruction set feature specifically oriented to supporting a high-level language structure fixing the "semantic gap" often results in a "semantic clash" too much overhead in general case that compiler might be able to optimize away standarize calling convetion in ABI, not in ISA - Fallacy There is such a thing as a typical program. 
- Pitfall Innovating at the ISA to reduce code size with accounting for the compiler Compilers often target fastest code, not smallest code - Pitfall Expecting to get good performance from a compiler for DSPs - Fallacy A architecture with flaws cannot be successful 80x86 1.) PC compatible big marketplace factor 2.) Moore's law allowed internal RISC design 3.) high volulmes cover extra design cost - Fallacy You can design a flawless architecture Trade-offs in HW/SW VAX (1975) did not foresee need for lowcost decoding and pipeling RISC branch delay slots all architectures have problems with size of address space over time Planning for the future means losing efficieny now why not go from 32-bits to 128-bit addressing? 2.15 Concluding Remarks 147 - Supporting high level languages 60s stack architectures 70s reduce software costs by moving into hardware (CISC) 80s compilers bring RISC revolution 90s address size doubles optimization of conditional branches via conditional execution optimization of cache performance via prefetch support for multimedia faster floating-point operations 00s long instruction words increased conditional execution blending of general-purpose and DSP architectures 80x80 emulation - the role of architect 60s computer arithmetic 70-85 ISA design (until binary compatibility mattered) today design and evaluation of full computer systems 2.16 Historical Perspective and References 148 - accumulator univac I, edsac, IAS computers - general-purpose register computers first: pegasus by Ferranti, 1956, 8 registers with R0 always 0 - Stack Architectures Burroughs B5000 (1963) first real HW/SW tradeoffs support for ALGOL and OS (MCP) written in high-level language first US computer with VM Burroughs B6500 (1968) added hardware managed activation records top two stack elements in processor, rest in memory good code density but only two high speed locations IBM 360 (1964) and PDP-11 (1970) desginers argue against stack arch 1.) performance is derived from fast registers, not the way they are used 2.) the stack organization is too limiting and requires many swap and copy operations 3.) the stack has a bottom, and when placed in slower memory there is a performance loss Only remnants are 80x87 floating point stack Java bytecode ISA is a stack, but JIT compiled to regular architectures - Computer Architecture Defined IBM coined the term "computer architecture" (Amdahl et al) described the move to a singal ISA for the 360 line the begining of binary compatibility IBM/360 ISA: first widespread with byte addressing and general purpose registers CDC 6600 (Thorton, Cray et al) early pipelined architecture fist general-purpose, load-store architecture figured out in the 60s link between simple ISA and efficient pipelining - High-Level Language Computer Architecture Andrew Tanenbaum argued in 1978 for stack architecture :) VAX, the ultimate CISC, super-othogonal design for compiler writers HLLCA (high level language computer architecture) code size problem disappeared, allowing RISC... 
- Reduced Instruction Set Computing Ditzel and Patterson (1980) argued to lower the operations targeted by compiler Clark and Strecker (1980) of VAX argued the opposite CDC 6600 wins Berkely RISC, IBM 801, Stanford MIPS IBM 801 (John Cocke) 24-bit experimental machine Berkeley RISC (Patterson) supported hybrid 16/32-bit instruction encodings smalltalk and Lisp support Stanford MIPS (Hennessy) (Microprocessor with Interlocked Pipeline Stages) Hennessy explains performance with low CPI of 2 vs 10 for VAX Commerical MIPS R2000 1986 HP PA-RISC 1989 IBM RT-PC 1986 IBM RS 6000 1990 super-scalar Sun SPARC (Berkeley RISC-II) 1988 PowerPC (Apple, IBM, Motorola) RISC wins on designs, RISC wins on volume embedded processors are 90% RISC (embedded volume 2x PC volume) - A Brief History of Digital Signal Processors 80s first designs, most not commercial successful, including from Intel NEC uPD7710 1980 first in volumne, weak tools AT&T DSP1 1980 used within AT&T but later sold outside TI DSP TMS32010, strong tools, solid success mid80s-mid90s evolution of "conventional DSP" 1995+ TI TMS320C62xx family VLIW (1997) SIMD designs Today very commerical 32-bit microprocessor has some DSP meatures system-on-a-chip (90s) processor core, memory, app specific HW, peripherals, I/O interfaces example: second generation cellphones - Multimedia Support in Desktop Instruction Sets color 8-bit VGA 16-bit high color RGB = 5-bits, 6-bits, 5-bits 32-bit true color RGBA sound 16-bit samples, 24-bit for professional Intel i860 partitioned ALU for 8/16/32/64 bit operations useful for video, 3d graphics, digital photography, audio/image processing names partitioned ALU, subword parallelism, vector, SIMD - Summary CISC is dead (VAX and Intel iAPX 432) Long live CISC (80x86) embraced and extend RISC binary compatibility is king VLIW? Itanium? Transmeta? NO Exercises 161 Chapter 3 Instruction-Level Parallelism and Its Dynamic Exploitation 3.1 Instruction-Level Parallelism: Concepts and Challenges 172 - Intro all processors since 1985 have a pipeline to improve performance ILP = potential overlap among instructions limits due to data and control hazards dynamic (hardware) and static (compiler/vliw) scheduling dynamic examples Intel Pentium III, Pentium 4 AMD Athlon MIPS R10k R12K Sun UltraSPARC III PowerPC 603, G3, G4, Alpha 21264 Pipeline CPI = Ideal piple CPI + Structural stalls + Data hazard stalls + Control stalls Figure 3.1 major techniques for lower CPI TECHNIQUE REDUCES SECTION Forwarding and bypassing Potential data hazard stalls A.2 Delayed branches and simple branch scheduling Control hazard stalls A.2 Basic dynamic scheduling (scoreboarding) Data hazard stalls from true dependencies A.8 Dynamic scheduling with renaming Data hazard stalls and stalls from antidependences and output dependencies 3.2 Dynamic branch prediction Control stalls 3.4 Issuing multiple instructions per cycle Ideal CPI 3.6 Speculation Data hazard and control hazard stalls 3.7 Dynamic memory disambiguation Data hazard stalls with memory 3.2 3.7 Loop unrolling Control hazard stalls 4.1 Basic compiler pipeline scheduling Data hazard stalls A.2 4.1 Compiler dependence analysis Ideal CPI, data hazard stalls 4.4 Software pipelining, trace scheduling Ideal CPI, data hazard stalls 4.3 Compiler speculation Ideal CPI, data, control stalls 4.4 - What is Instruction-Level Parallelism? 
basic block - not much ILP convert loop-level parallelism to ILP statically via loop unrolling dynamically via hardware (or use vector instructions) - Data Dependencies and Hazards need to determine if two instructions are in fact parallel at ISA level (free from R/W dependencies) even if so, structural hazards may prevent parallel execution Data Dependencies instruction i is "data dependent" on instruction j if either: - instruction i produces a result that may be used by instruction j - instruction j is data dependent on instruction j, and instruction k is data dependent on instruction i hardware with interlocks can stall to avoid dependencies or software can schedule to avoid pipeline dependencies if part of ISA dependencies are part of programs but hazards are a property of pipeline organization dependencies indicate potential for hazard but actual hazard and length of stall are property of pipeline dependencies 1.) indicate potential hazard 2.) determine order of evaluation 3.) upper bound on potential parallelism overcoming dependencies 1.) maintaining the dependence but avoiding the hazard scheduling instructions to avoid hazards... 2.) eliminating the dependence by changing the code sometimes by software, but also by hardware register data dependencies are relatively easy to spot, but are complicated by branches memory dependencies caused by aliasing are much more difficult to disambiguate Name Dependencies name dependence is when two instructions used the same register or memory location (called a name) but there is no flow of data between the instructions associate witht that name 1.) antidependence - when j writes an register/memory that i reads - i needs old value 2.) output dependence - when i and j write the same register/memory - j needs to win register renaming, either by compiler or hardware, can eliminate name dependencies Data Hazards we must preserve program order but only where it affects the outcome of the program RAW j tries to read a source from i writes it, j incorrectly gets old value true data dependence example load followed by a ALU operation in 5 stage pipeline WAW j tries to write an operand before it is written by i output dependence present if write in multiple stages present if later instructions can proceed past earlier stalled instructions WAR j tries to rwite a destination before it is read by i, i incorrectly gets new value antidependence cannot occur in most static issues pipelines because reads are done early and writes are done late need to have instructions that write results realy and some that read late or when instructions are reordered RAR not a hazard - Control Dependencies "control dependence" determines the order of instruction i so that is in correct program order with respect to a branch 1.) an instruction that is control dependent on the branch cannot be moved before the branch so that its execution is no longer controlled by the branch (move "then" code before "if") 2.) an instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch (move code before "if" to "then") Simple pipeline 1.) instructions execute in program order 2.) 
pipeline prevents execution of dependant instructions until branch direction is known control dependence is not fundamental limit we can execute extra instructions so long as they do not affect program correctness however "exception behavior" and "data flow" must be preserved exception behavior reordering must not cause any new exceptions moving load before branch cannot cause exception if branch ends up skipping load data flow branches make data flow dynamic sometimes have to evaluate branch before we know "predecessor" for data flow input sometimes violating control dependence has no affect on correctness example move operation before branch if target location is dead after branch an example of compiler speculation delayed branches compiler preserves the data flow 3.2 Overcoming Data Hazards with Dynamic Scheduling 181 simple static pipeline if data dependence cannot be hidden via bypassing or forwarding, stall the pipeline dynamic scheduling rearrange ordering while preserving exception behavior and data flow handles dependencies unknown at compile time, such as memory aliasing compiler doesn't have to model pipeline and schedule accordingly and output code can run on different pipeline implementations... compiler static scheduleing separate dependant instructions to avoid hazards Dynamic Scheduling: The Idea simple pipeline in order issue and execution structural and data hazards all checked during instruction decode dynamic pipeline check for structural hazards during instruction decode dynamically wait for data hazards out-of-order execution implies out-of-order completion causes WAR and WAW hazards which simple pipeline avoids exception behavior allow imprecise execptions may have already complete later instructions or not started earlier instructions make it hard to restart execution after exception better to maintain precise exceptions (use in-order commit (aka in-order retirement)) converting the pipeline IF fetch into buffer or queue ID 1.) issue deocde instructions, check for structural hazards 2.) read operations - wait until no data hazards, then read operands EX potentially multiple cycles depending on operation "begins execution" -> "in execution" -> "completes execution" parallel execution needs 1.) multiple execution units OR 2.) pipelined execution units OR 3.) both scoreboarding allows out-of-order execution if sufficient resources and no data dependencies Tomasulo's algorithm several major enhancements over scoreboarding Dynamic Scheduling Using Tomasulo's Approach From IBM 360/91 floating-point unit tracks when operands are available to minimize RAW hazards introduces register renaming to minimize WAW and RAW hazards designed to overcome few (4) FP registers and long memory access and FP execution times (sounds like modern x86: few registers, big memory wall, long pipelines for many ALU operations) RAW hazards avoided by delaying for operands WAR and WAR are handled by register renaming register renaming effectively reintroduces compiler temporary names from before register allocation register renaming in Tomasulo's algorithm uses reservation stations - buffers for operands issue logic - picks correct reservation station for execution output sucessive writes to the same register go to different stations, only last need to to actual register removes name dependencies caused by lack of unique names for compiler to use (lack of free registers) centralized to decentralized control 1.) 
hazard detection and execution control are distributed information held in reservation stations at each function unit determine when execution can start at that unit 2.) results are passed directly to functional units from the reservation station rather than going through registers common data bus allows all units waiting for a result to be loaded simultaneously in pipelines with multiple execution units and issuing multiple instructions per cycles, more than one bus needed reservation station contains instruction to be issues either the values of the operand if already computed or names of the reservation stations that will provide the operand load and store buffers act like reservation stations store buffer takes values from common data base load does not take input but does output to common data bus stages of Tomasulo's algorithm 1.) Issue - get instruction from queue - if available reservation station at functional unit, send there, otherwise stall (structural hazard) - if operands available, fetch values from regisers from and send to selected reservation station otherwise note which other reservation station's results will provide operand value This renames registers, removing WAR and WAW hazards 2.) Execute - if operands not available, just keep watching bus - if result available, fill operand value - if all operands available, execute instruction This waiting removes RAW hazards NOTE: multiple instructions could become ready in the same cycles for floating point, the choice of which to execute can be arbitrary loads and stores are maintained in program order to prevent hazards through memory loads only have one operand, the address, and execute when memory unit is available stores need two operands, value and address exception behavior no instruction starts execution until all previous branches are executed wait until branch prediction is verified alternative is to note exception but not raise it until write result more on this in speculation below 3.) Write result - write result to common data bus (CDB) - and to any needed registers - and to any needed reservation stations (include store buffer) - stores also write to memory unit in this step when both operands are available reservation station tags indicates name of reservation station or load buffer entry in place of register names because there are more reservation stations that actual regisers, we have no WAW and WAR other approches use extended register set or reorder buffer (more below) reservation station fields Op - the operation to perform Qj Qk - the source reservation stations (zero means satisified) Vj Vk - the operand values A - address information, starts as immediate offset, then stores effective address Busy - marks station as in use register file field Qi - the reservation station that will supply this register (zero means value is current) 3.3 Dynamic Scheduling: Examples and the Algorithm 189 - Advantage of Tomasulo's Algorithm 1.) distribution of hazard logic CBD broadcast allows all waiting operations to continue in parallel instead of waiting to cotending to access register file 2.) 
      2.) the elimination of stalls for WAW and WAR hazards by register renaming
- Tomasulo's Algorithm: The Details
  Figure 3.5 for details
  loads/stores go through a functional unit for effective address calculation before the load/store buffers
  loads take a second "execution" step to access memory and then a "write result" step to send to the CDB
  stores complete in their first "write result" step, which writes to memory
  note all register/memory writes occur in the "write result" step, critical to Speculation in section 3.7
- Tomasulo's Algorithm: A Loop-Based Example
  loop is logically unrolled by the hardware (assuming correct branch prediction, see below)
  reordering memory operations
  - loads/stores to different addresses can be reordered
  - can be reordered if effective address computation is still done in program order
    (note loads can be reordered relative to each other arbitrarily,
     it's the load/store and store/store ordering that matters,
     that is, RAR does not matter, only RAW, WAR or WAW...)
- negative of Tomasulo's Algorithm: cost of hardware
  each reservation station needs a fast associative buffer
  complex control logic
  each extra CDB adds more associative search logic
3.4 Reducing Branch Costs with Dynamic Hardware Prediction 196
  Analogy
    Tomasulo's Algorithm => data hazards
    Branch Prediction => control hazards
  Crucial for multiple issue processors
    1.) branches arrive n-times faster in an n-way processor
    2.) control stalls loom larger with the lower potential CPI
  Static schemes (Appendix A)
  - delayed branches
  - predict not taken
- Basic Branch Prediction and Branch-Prediction Buffers
  names: "branch-prediction buffer" or "branch history table"
  basic mechanism
    table indexed by *lower* bits of branch PC
    entry hints whether the branch was taken or not on the last visit
    - only a hint because multiple PCs can collide on the same entry
    - no comparison of high bits
    - if the prediction is wrong, we just update the entry to predict the opposite next time
    basically a small cache
  minor improvement
    with only 1 bit, the entry is wrong twice for each loop
    with 2 bits, only wrong once for each loop
    (can generalize to an n-bit saturating counter, but 2 bits is good enough)
  where in pipeline?
can lookup bits in IF with instruction address or attach bits to instruction cache doesn't help for 5-state pipeline generally find out whether branch is taken and branch target at the same time accuracy (Figure 3.8) SPEC89 2bits per entry, 4k entries 99% to 82% accuracy (misprediction 1% to 18% of the time) typically better for floating points apps with more regular access patterns impact on programs accuracy is not enough, need to understand frequency of branches in program size matters not (Figure 3.9) 4k is pretty large, and results not much better with infinitely sizied predictor - Correlating Branch Predictors (aka two-level predictor) observation often consecutive branches have correlated behavior 1 bit of correlation look not just at branch PC, but also whether last branch was taken or not taken (1,1) predictor means look at last 1 branch and keep 1 bit of history state (m,n) predictor means look at last m branch and keep n bit of history state (0,2) predictor is our original 2-bit history predictor implementation global history of last m branches a simple shift register, used to index table with low bits of PC accuracy (2,2) often beats (0,infinite) (for same sized table (because 4x the state, only 1/4x the entries) other extreme (12,2) predictor with NO BRANCH PC INPUT can often outperform (0,2) predictor - Tournament Predictors: Adaptively Combining Local and Global Predictors multi-level branch predictors use several levels of branch-predictors tables with algorithm for choosing between multiple predictors effectively make use of more predictor bits (diminishing returns for other predictors even with infinite size) accuracy (Figure 3.19) local 2-bit < 7% misprediction correlating < 4% misprediction tournament < 3% misprediction An Example: The Alpha 21264 Branch Predictor 4k 2-bit counters indexed by branch address to pick between global and local global also has 4k entries indexed by last 12 branches (same (12,2) predictor as above) local is a two level predictor top level 1024 10-bit entries, 1 bit for each of the last 10 branches selected entry used to index a table of 1k entries with 3-bit saturating counters implementation 29k bits total accuracy less than 1 in 1000 instructions for SPECfp95 less than 11.5 in 1000 instructions for SPECint95 3.5 High-Performance Instruction Delivery 209 - Predicting branches not enough, need to deliver 4-8 instructions per cycle for multi-issue processor - Branch-Target Buffer "branch-target buffer" or "branch-target cache" predicts destination 5-state pipeline needs to predict where to branch by end of IF need to guess before ID if instruction is a branch and if so what next PC will be ** With a basic "branch-prediction buffer" accessed during ID we know - branch target at end of ID - fall through address at end of IF - prediction ** With a "branch-target buffer" accessed during IF we know a prediction for target a cycle earlier A real-cache, different than cache predictor we cannot afford mis-predict in case that instruction is not a branch we must compare tag of high order bits to make sure this is a branch Only store taken branches, since we will predict PC+4 if not in branch-target buffer 2-bit predictors complicate this, since we need to track target in taken case even after a non-taken cases PowerPC uses both a target buffer and prediction buffer to solve this performance in 5-stage pipeline in prediction case, we have no branch stall in misprediction case, we have a 2 cycle stall when we find out the bad news in EX IF: check for 
entry in branch-target buffer
        (if not found and it later turns out to be a branch, in EX we update the branch-target buffer)
    ID: if found, fetch predicted PC
    EX: if correct, win! if incorrect, kill, restart IF for the correct target, update branch-target buffer
  variation
    store one or more target instructions (possibly instead of the target address)
    speculation involves actually starting to execute the actual instruction, not just fetching it
    "branch folding"
      zero-cycle unconditional branches, we can just skip to the next instruction after the branch
      sometimes zero-cycle conditional branches if we can use preset condition codes to select the destination
- Integrated Instruction-Fetch Units
  No longer a simple pipeline stage, a whole unit with several features
  1.) Integrated branch prediction
  2.) Instruction prefetch
  3.) Instruction memory access and buffering
  (See also trace caches in Chapter 5)
- Return Address Predictors
  85% of indirect jumps in SPEC89 are for procedure returns
  branch-target buffer ineffective if a common procedure is called from multiple sites
  Instead, use a small buffer that pushes the return address on call, pops on return
  1-16 entries
  accuracy ("li" is the worst of all the SPEC89 apps)
    1   50% miss rate on "li"
    2   35% miss rate on "li"
    4   15% miss rate on "li"
    8    5% miss rate on "li"
    16  <5% miss rate on "li"
- Alternative to branch prediction
  try to reduce the penalty of misprediction
  - prefetch both paths
    dual-ported memory, interleaved cache, or read in sequence (all costly)
3.6 Taking Advantage of More ILP with Multiple Issue 215
- Below CPI=1.0
  Last two sections try to remove data and control hazards to get CPI=1.0
  To go lower we need multiple instructions to be issued per cycle
  multiple-issue processors
    superscalar
    VLIW (aka EPIC, explicitly parallel instruction computing)
  Summary of 5 primary approaches for multiple issue processors (Figure 3.23)
    Superscalar (static)       Sun UltraSPARC II/III
    Superscalar (dynamic)      IBM Power2
    Superscalar (speculative)  Pentium III/4, MIPS R10K, Alpha 21264, HP PA 8500, IBM RS64 III
    VLIW/LIW                   Trimedia, i860
    EPIC                       Itanium (not pure VLIW because dependencies marked in bundles)
- Statically Scheduled Superscalar Processors
  0-8 instructions per cycle
  instructions issue in order with all hazards accounted for at issue time
    any data hazard or structural hazard causes a stall of that instruction and all after it
  issue packet
    the set of instructions actually delivered for issue as a unit
    logically inspected in program order, in practice in parallel with each other
  issue checks typically complex enough to now be multiple pipeline stages
  - increases complexity because now handling multiple issue packets at once
  - first stage typically checks for dependencies within the issue packet
  - second stage typically checks for dependencies with other in-flight instructions
  - harder to pipeline further, making this a potential limiting factor in the pipeline
  - more stages means higher branch penalties, more reliance on good branch prediction
- A Statically Scheduled Superscalar MIPS Processor
  MIPS example, similar to HP 7100
  - two instructions per cycle
    1 is load/store/branch/integer ALU
    2 is floating point
  - fetch 64 bits per cycle
    early superscalars required fixed placement (for example integer/floating in that order), although not common now
    if unaligned access at end of cache block, simple fetchers may deliver only one instruction
      "independent instruction prefetch unit" (above) can solve this problem
  - integer/floating point case simplifies hazard checking
    need to check for integer-to-floating point transfer
      either register-to-register move or
load/store of FP combined with FP ALU operation - hardware implementation issues floating point can quickly become the bottleneck either need multiple FP units or a more pipelined FP unit need dual-ported FP register file to allow FP load/store with simultaneous FP ALU operation need larger bypass paths with multiple instructions feeding each other - imprecise exceptions an FP instruction can finish after a integer instruction that was logically before it an FP instruction exception could be detected after the integer instruction completes solutions - early detection of FP exceptions (appendix A) - software mechanism to restore a precise exception state - delay instruction completion until we know exception is impossible (See 3.7 on speculation) - load latency increased from one clock cycle to two result of load cannot be used on the *same* clock cycle or the *next* clock cycle since we are issuing them in parallel, we logically have an extra instruction delay - Multiple Instruction Issue with Dynamic Scheduling Don't block issue, just ship instructions to reservation stations - tricky since can be dependencies in issue packet 1.) do each instruction in half a clock 2.) complex logic to handle parallel issue to reservation stations while maintaining in order queueing 3.) a little bit of 1 and 2 - simple loop example bottlenecked on integer unit, not enough FP ops to keep it busy 1.) load imbalance between integer and fp unit... 2.) ...that is too much loop overhead, need to unroll (next chapter 4) 3.) control hazard holding up load, if only we could speculate past branch (next section 3.7) 3.7 Hardware-Based Speculation 224 - Limits of branch prediction prediction reduces stalls from control hazards wide issue processor may need to execution one branch per cycle need to overcome control dependence solution - do not limit prediction to instruction fetch - instead, also start execution based on prediction three ideas 1.) dynamic branch prediction (choose what to execute) 2.) speculation (execute prediction without commiting results) 3.) dynamic scheduling (deal with scheduling of different basic blocks) extending Tomasulo's algorithm to handle speculation (as done in PowerPC 603+, MIPS R10k+, Pentium II+, Alpha 21264, AMD K5+) seperate bypassing of rsults among instructions from completion of instructions instruction commit: when a instruction is no longer speculative, allow update to reg/mem also delay exceptions to commit, preserving precise exceptions EXECUTE OUT OF ORDER, COMMIT IN ORDER in 5 stage pipeline, we always had writes/exceptions at end of pipeline now we add another stage to Tomasulo's, "Commit", to act in same manner reorder buffer (ROB) new hardware to hold speculative results before commit stage writes them to real reg/mem replaces reservation stations as source of operands replaces store buffer as well fields instruction type : branch, store, register destination field : register (loads/alu) or address (stores) value field : value until comit ready field : instruction complete (valid value) stores still two part, second part is now done by new "Commit" phase reservations stations still exist used to schedule units however, references for operands now use ROB indexes steps 1.) Issue (aka "dispatch" in some dynamically scheduled processor) issue requires reservation station and ROB entry full ROB a structural hazard, stalls issuing reservation stations operands get register values or ROB pointers 2.) 
Execute (aka "issue" in some dynamically scheduled processor) when operands available, start possible multi-cycle exection 3.) Write Result write result across CDB to waiting reservations stations and ROB store not ready until destination and value calculated other instructions are always ready at this point free reservation station for reuse 4.) Commit: completion, graduation, retirement Commit instructions in order - if head is ready, perform reg/mem write as needed - if head is not ready, stall - if head is correctly predicted branch, branch is done - if head is incorrectly predicted branch, flush ROB and restart with correct instruction exceptions are raised here, in order, as needed free ROB entries as they commit branch misprediction try to handle even before instruction is ready to be commited if possible store optimization value to store can actually pass from write results immediately to waiting stores for commit the same cycle memory aliasing problems WAW and WAR: no issues because of in-order retirement of stores RAW: 1.) stall 2nd half of load if any ROB entry has matching destination 2.) maintain program order for computation of effective address of load with respect to earlier stores bonus points: bypass store value directly to load when RAW detected - Multiple Issue with Speculation "multiple issue" issues (instruction issue and monitoring CDB's for completion) just keep getting harder in addition, need to support multiple commits per cycle to have CPI<1.0 - Design Considerations for Speculative Machines Register Renaming versus Reorder Buffer replace ROB with set of physical registers larger than the set of architectural registers maintain mapping of architectural registers to physical registers commit is simplifed update map of architectural to physical free old physical register that was held previous architectural value freeing physical registers is a bit complex 1.) need to make sure it is not used for an architectural register AND need to make sure it is not an operand of any functional unit queue OR 2.) just wait until the original architectural register is written again consumes physical register a bit longer but simpler, used in MIPS R10k issue is simplified don't need to check ROB and registers for operand values checkpointing register operations for context switch etc needs a bit of extra logic sometimes (IA-64!?!) physical registers are exposed architecturally as well!?! usage: MIPS R10k/R12k, Alpha 21264, P2,3,4 all use register renaming 20-80 extra registers replace primary function of ROB largely determine how many instructions can be in execution (b/w issue and commit) at one time How Much to Speculate danger in speculating many costly exceptional events such as cache misses in parallel ties up other system resources, potentially hard to stop if misprediction is found often just limit to one such costly event to be speculated at a time examples: L2 miss, TLB miss Speculating through Multiple Branches three reasons for Speculating through Multiple Branches 1.) very high branch frequency 2.) significant clustering of branches 3.) 
long delays in functional units slight complication of speculation recovery harder to predict and speculative on more than one branch per cycle POWER2 can resolve two branches per cycle but not speculate on any other instructions as of 2001, no processor can provide full speculation with resolving multiple branches per cycle 3.8 Studies of the Limitations of ILP 240 - Limits of ILP pipelining started in 60s, responsible for 80s and 90s performance growth even given "perfect" hardware, their are limits to ILP some assumptions on compiler technology may change perhaps some hope from new tech like value prediction however, realistic hardware has problems before fundamental limits - The Hardware Model assumptions 1.) register renaming - infinite registers 2.) branch prediction - perfect prediction 3.) jump prediction - perfect prediction 4.) memory address alias analysis - addresses known exactly and independent load can move before store implications 2/3 remove control dependencies 1/4 remove all but true data dependencies other assumptions no restrictions of types of instructions issued all functional units 1 cycles perfect caches, 1 cycle load/store methodology MIPS optimizing compiler create trace trace analysis to create optimal schedule - Limits on the Windows Size and Maximum Issue Control checking dependencies in "n" instructions needs "n^2-n" comparisons to issue 50 instructions needs 2450 comparisons 2000 instructions needs 4 million "window" size (how many instructions analyzed for simultaneous execution) comparisons = max completion rate * windows size * number of operands total windows size limited by storage, comparisons, and issue rate makes larger window less helpful large window sizes are impractical and inefficient * assume window size of 2k entries * assume maximum issue capability of 64 instructions per clock - The Effects of Realistic Branch and Jump Prediction * assume ambitious tournament predictor that uses two levesl of prediction and a total of 8k entries - The Effects of Finite Registers * assume 256 integer and 256 floating point - The Effects of Imperfect Alias Analysis three models 1.) global/stack perfect perfect predictions for globals and stack, but no heap 2.) inspection assume offsets from same register are not aliases assume addresses in different segments not aliases 3.) none all addresses could conflict NOTE: for fortran without heap, 1==2 results (Figure 3.43/3.44) 1 does very close to ideal 1 factor of 2 better than 2 2 not much better than 3 hardware limits to alias analysis 1.) to disambiguate load, need to know memory address of earlier stores use memory address speculation however, when wrong, need to redo 2.) only a small number can be disambiguated per clock cycle 3.) number of load-store buffers determines how much ealier or later we can move access 3.9 Limitations on ILP for Realizable Processors 253 assume 1.) 64 instructions per clock with no issue restrictions 2.) tournament predictor with 1k entries and 16 entry return predictor 3.) perfect disambiguation of memory references dynamically 4.) register renaming with +64 integer and +64 fp registers (in addition to architected registers) vary window size from 4 to infinite results integer programs reach limits pretty early, curve pretty flat approx 10-15 even in infinite case floating point could use more window size, scaling all the way to infinite - Beyond the Limits of This Study other limitations in the perfect model 1.) 
WAR and WAW hazards through memory speculating through functional call/return/call causes unneeded dependencies on stack frame need to rename stack memory as well 2.) Unnecessary dependences don't really need a loop dependency to do i++ compiler can help here return address register and register stack pointer also cause unnecessary dependencies can't speculate on two two parallel function calls 3.) overcoming the data flow limit value prediction... assume function does not return error code or exception... help for less than perfect processors 1.) address value prediction and speculation easier than value prediction 2.) speculating on multiple paths are HW resources better spent on a bigger predictor? 3.10 Putting It All Together: The P6 Microarchitecture 259 - Intel P6 Microarchitecture Pentium Pro, Pentium II, and Pentium III differences some have MMX and SSE clock rate cache architecture memory interface IA-32 decode translates IA-32 into uops (micro-ops, aka RISC instructions) 3 IA-32 instructions fetched/decoded/translate per cycle if CISC instruction needs more than 4 uops microcode sequence generates uops over multiple cycles max uops per cycle=6 4 for 1st instruction of 3, other two for 2nd and 3rd IA-32 instructions uops execution OOO specualtive pipeline with register renaming and ROB 3 utops per clock can be renamed and dispatched 14 stage pipeline - 8 stages for in-order instruction fetch, decode, and dispatch branch predictor: 512-entry, 2-level 40 virtual registers 20 reservation stations 40 ROB entries - 3 stages for OOO exuction in 1 of 5 functional units 1-integer, 2-FP, 3-branch, 4-memory address, 5-memory access integer ALU can be as little as 1 cycle floating point divide is 32 cycles - 3 stages for instruction commit - Performance of the Pentium Pro Implementation Pro has smaller caches than II, but faster, high BW secondary cache What are sources of stalls largely failure to keep above pipelines busy sometimes because of particilar instructions, sometimes hardware resource limits Stalls in the Decode Cycle integer programs have more instruction cache misses than FP floating point programs have more HW resource capacity stall Data Cache Behavior out of order can mask some l1 miss latency, but not l2 Branch Performance and Speculation Costs branch target buffer (BTB) dominates mispredict frequency so we know which way to go, but not where to go perhaps a bigger BTB, with slightly worse predict rate would be better Putting the Pieces Together: Overall Performance of the P6 Pipeline SPECint CPI is 1.15 SPECfp CPI is 2.0 - The Pentium III versus the Pentium 4 Pentium 4 uses "NetBurst" microarchitecture some similarites fetch up to 3 IA-32 instructions per cycle graduate up to 3 uops per cycle differences much deeper pipeline example integer add: p6: 10 cycles netburst: 20 cycles, including 2 wire only cycles register renaming vs ROB 128 outstanding results vs 40 7 vs 5 integer units extra integer ALU and address computation unit aggresive ALU (2x clock) and aggressive data cache ALU 1/2 cycle vs 1 cycle loads 2 vs 3 cycles trace cache to improve instruction fetch performance versus prefetch buffer and instruction cache BTB is eight times larger and has better predictor algorithm data cache smaller (8k vs 16k) but larger 256k L2 with higher bandwidth SSE2 2 floating point operations per instruction considerable advantage over PIII on floating point A Brief Performance Comparison of the Pentium III and Pentium 4 better absolute performance, especially for FP however, higher 
CPI for integer apps, reasons: 1.) clock overhead because of skew and jitter 2.) deeper pipeline alternative wider issue width instead of deeper pipeline (what is unsaid, basically Intel MHz marketing decided architecture) 3.11 Another View: Thread-Level Parallelism 272 - TLP could be less costly than ILP especially for some workloads... - existing desktop code lacks TLP in software, harder to find, focus remains on ILP... 3.12 Crosscutting Issues: Using an ILP Data Path to Exploit TLP 273 - multithreading (aka simultaneous multithreading (aka Intel hyperthreading)) reuse ILP resources for TLP 3.13 Fallacies and Pitfalls 273 - Fallacy Processors with lower CPI will always be faster - Fallacy Processors with faster clock rates will always be faster Power3-II 400MHz beats Pentium III 800MHz on SPECfp 344 to 237 of course Pentium beats Power on SPECint 344 to 249 - Pitfall Emphasizing an improvement in CPI by increasing issue rate while sacrificing clock rate can lead to lower performance HP PA-7100 with simple dual issue (99MHz) beats TI SuperSPARC flexible three issue (40MHz) Simulation is to blame, rewards of CPI look good, hard to estimate clock impact 1.) hard to know clock rate of an approach until design is well under way 2.) simulation tools for improving for CPI are really good compared to improving cycle time - Pitfall Improving only one aspect of a multiple-issue processor and expecting overall performance improvement Amdahl's law... better predictor may not matter if other resources are limiting facto - Pitfall Sometimes bigger and dumber is better simple 2-bit predictor better than Alpha tournament predictor for OLTP larger code size, higher branch frequency better local predictions 3.14 Concluding Remarks 276 - typically doubling clock is better than doubling issue width however when this is just because of really deep pipelining result is less sure Pentium 4 tradeoff decision - Practical Limitations on Exploiting more ILP more transistors, wire delays for clock and critical path longer clock needed or more stages like Pentium 4 but that costs more branch delays etc Power! Power! Power! 
more switching transisters switching more frequency power/performance is rising rapidly 3.15 Historical Perspective and References 280 - IBM 360 Model 91 tagging of data register renaming dynamic detection of memory hazards generalized forwarding Tomasulo's algorithm branch prediction complexity made it late to market outperformed by Model 85 that had IBM's first cache - Branch-Prediction - The Development of Multiple-Issue processors 70s IBM ACS (surpass Stretch, beat CDC 6600/6800) John Cocke, ..., Amdahl 80s IBM America "superscalar" => Power1 90s Johnson and AMD K-5, speculative superscalar - Studies of ILP and Ideas to Increase ILP - Going Beyond the Data Flow Limit "value prediction" deciding when to speculate also important - Recent Advanced Microprocessors Exercises 288 Chapter 4 Exploiting Instruction-Level Parallelism with Software Approaches 4.1 Basic Compiler Techniques for Exposing ILP 304 - Basic Pipelining Scheduling and Loop Unrolling Need to find sequences of unrelated instructions to be overlapped in pipeline Need to push dependent instructions apart enough to tolerate cycle latency For example 5 stage pipline, we assume: Figure 4.1 Instruction Instruction Latency producing using in result result clock cycles ------------------------------------------------------------------------ FP ALU op Another FP ALU op 3 FP ALU op Store double 2 Load double FP ALU op 1 Load double Store double 0 Loop Unrolling Simple C loop adding constant to a every vector element for (i=100; i>0; i--) x[i] = x[i] + 5; MIPS (optimized with strenth reduction of *8 to -8, elimination of seperate i variable) loop: l.d f0,0(r1) ; f0=array element add.d f4,f0,f2 ; add scalar in f2 s.d f4,0(r1) ; store result daddui r1,r1.#-8 ; decrement pointer 8 bytes bne r1,r2,loop ; branch r1!=r2 with stalls, 10 cycles: loop: l.d f0,0(r1) ; f0=array element stall add.d f4,f0,f2 ; add scalar in f2 stall stall s.d f4,0(r1) ; store result daddui r1,r1.#-8 ; decrement pointer 8 bytes stall bne r1,r2,loop ; branch r1!=r2 stall reordering to reduce stalls, 6 cycles (3 cycles of overhead) loop: l.d f0,0(r1) ; f0=array element daddui r1,r1.#-8 ; decrement pointer 8 bytes add.d f4,f0,f2 ; add scalar in f2 stall bne r1,r2,loop ; branch r1!=r2 s.d f4,0(r1) ; store result overhead 3 cycles to do loop body (load/add/store) 3 cycles of loop overhead unroll of basic loop: 7 cycles per element (4 of overhead) loop: l.d f0,0(r1) add.d f4,f0,f2 s.d f4,0(r1) l.d f6,-8(r1) add.d f8,f6,f2 s.d f8,-8(r1) l.d f10,-16(r1) add.d f12,f10,f2 s.d f12,-16(r1) l.d f14,-24(r1) add.d f16,f14,f2 s.d f16,-24(r1) daddui r1,r1,#-32 bne r1,r2,loop schedule unrolled loop: 3.5 cycles per element (only .5 overhead) loop: l.d f0,0(r1) l.d f6,-8(r1) l.d f10,-16(r1) l.d f14,-24(r1) add.d f4,f0,f2 add.d f8,f6,f2 add.d f12,f10,f2 add.d f16,f14,f2 s.d f4,0(r1) s.d f8,-8(r1) daddui r1,r1,#-32 s.d f12,-16(r1) bne r1,r2,loop s.d f16,8(r1) ; 8-32=-24 - Summary of Loop Unrolling and Scheduling Example Necessary Decisions and Transformations 1.) determine it was legal to move s.d after daddui and bne and adjust the s.d offset 2.) determine that unrolling the loop would be useful by finding that iterations were independent (excluding maintenance code) 3.) use different registers to avoid unnecessary constraints that would be forced by using the same registers 4.) eliminate the extra test and branch instructions and adjust the loop termination and iteration code 5.) 
determine that the loads and stores in the unrolled loop can be interchanged (independence and aliasing analysis) 6.) schedule the code, preserving any dependences needed to yield the same result as the original code Example removing daddui removes data dependency between r1 Example Use register renaming to remove dependencies between "iterations" Three different types of limits to gains 1.) always some loop overhead 2.) growth in code size 3.) register pressure - Using Loop Unrolling and Pipeline Scheduling with Static Multiple Issue We can unroll the loop 5 times and schedule L.D in parallel with ADD.D (integer and fp operations) 2.4 clock cycles per element (versus 3.5 for non-superscalar) 4.2 Static Branch Prediction 313 - See also MIPS branch delay slot in Appendix A) - Fill stall slots with code that has a control dependence 1.) either because its safe to do (target register unused in target block) 2.) or if we are sure we will take it, we can undo the op in unlikely case that we do not - Compile time prediction schemes 1.) always predict as taken 2.) predict backward as taken (loops...) and forward as not taken 3.) predict backward as taken (loops...) and forward as taken (better for SPEC) 4.) use some program context info to improve 2 and 3 5.) use profile information 4.3 Static Multiple Issue: The VLIW Approach 315 - Approaches to multiple issue dynamic super-scalar flexible but lots of hardware static super-scalar easier to implement buts needs some compiler assistance VLIW HW does not need to think, compiler does all the work - The Basic VLIW Approach Each component of VLIW instruction not as wide as RISC instruction because position implies unit Scheduling code for issue local scheduling for loop unrolling etc global scheduling dealing accross branches trace scheduling (type of global) (section 4.4) Unrolled loop example (5 way VLIW with 2-mem, 2-fp, 1-integer) 1.29 cycles per result (versus previous best 2.5 for static supersclar) only 60% of slots used Code bloat causes 1.) from unrolling to generate parallelism 2.) from fillerNOOPs solutions - limit to one immediate field per packet - compress code see also 4.7 and 4.8 lock step operation seems simpler... if one unit suffers a cache miss, all stall... newer VLIW designs only assume no dependencies in issue packet do manage dependencies from stalls of independent functional units binary compatibility typically none in older VLIW either with pre-VLIW code or with one VLIW generation to another, as more units are added solutions translation/emulation IA-64 defines VLIW instructions independent of internal architecture multiple-issue (even not VLIW) versus vector processor multiple-issue can extract some paralellism even from irregular code vector good for very regular numerical programs as extension to multiple-issue processor 4.4 Advanced Compiler Support for Exposing and Exploiting ILP 319 - Detecting and Enhancing Loop-Level Parallelism typically done at or near source level (on loops themselves) basic ILP is typically scheduled at lower level "loop-carried dependence" analysis figuring out dependencies between loop iterations Types of dependencies some depend on previous iteration A[i+1] = A[i] + ... // S1 have to execute in order (S1 depends on S1) sometimes dependence does not prevent parallelism A[i] = A[i] + B[i] // S1 B[i+1] = C[i] + ... 
// S2 although S2 depends on S1, there is no cycle like S1 depends on S1 dependence analysis sometimes inexact dependence "may" exist recurrences (depending on earlier iterations) Y[i] = Y[i-1] + ... // S1 depends on S1 dependences force ordering of successive iterations (because "dependence distance" is 1 however, if distance is greater (say 5) Y[i] = Y[i-5] + ... // S1 depends on S1 then we can still unroll and find significant parallelism Finding Dependencies affine array indexes expressible as a*j+b (or c*k+d) gcd test for two elementsa if a loop-carried dependence exists, then GCD(c,a) must divide (d-b) sparse array indicies (x[y[i]]) are not affine classifying dependence true dependencies false dependencies antidependences output dependence solve with renaming array oriented dependence analysis useless for: - objects referenced via pointers - array index is inderect through another array (sparse array case) - input data dependent dependence (why not generate code for both cases...) - when an optimization needs to know more than possiblity of dependence, but needs to know exactly when "points-to" analysis relies on: 1.) type information 2.) information derived from when a object is allocated or when the address of an object is taken (if p always points to object allocated at line x and q was taken before that line) 3.) information derived from pointer assignments successes - determine if two parameters point to the same object - determine what type of data currently pointed to by pointer (for unions, classes, etc) - separate local vs global pointers limitations 1.) pointer arithmetic 2.) "interprodecural analysis" difficulty modern programming languages (Java) better at fixing both limitations Eliminating Dependent Computations eliminate or reduced dependences through back substitution copy propagation turn daddui r1,r2.#4 daddui r1,r1.#8 to daddui r1,r2.#8 tree height reduction turn add r1,r2,r3 // S1 add r4,r1,r6 // S2 depends on S1 add r8,r4,r7 // S3 depends on S2 to add r1,r2,r3 // S1 add r4,r6,r7 // S2 add r8,r1,r4 // S3 depends on S1 recurrence unrolling: sum = sum + x; creates: sum = sum + x1 + x2 + x3 + x4 + x5; change to: sum = ((sum + x1)) + (x2 + x3)) + (x4 + x5); basically tree height reduction similar solution to array index calculations - Software Pipelining: Symbolic Loop Unrolling software pipelining interleave instructions from different loop iterations separates dependent instructions from within one loop iteration software counterpart to Tomasulo's algorithm example (minus start-up/wind-down code) loop: s.d f4,16(r1) ; stores into M[i] add.d f4,f0,f2 ; adds to M[i-1] l.d f0,0(r1) ; loads M[i-2] daddui r1,r1.#-8 ; bne r1,r2,loop ; usually tricky to get register usage correct much better code density than straight loop unrolling loop unrolling can still reduce loop overhead instructions (can use both together...) - Global Code Scheduling moving code between basic blocks for better scheduling example move assignment after "if" in inner loop before if can we be sure that body of "if" does not affect assignment? 
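      a minimal C sketch of this motion (hedged illustration, not the book's code; arrays a, b, c and scalar x are hypothetical):

        /* before scheduling: the write to c[i] sits below the branch,
           so it cannot start until the branch resolves */
        void before(int *a, int *b, int *c, int x, int n) {
            for (int i = 0; i < n; i++) {
                a[i] = a[i] + 1;
                if (a[i] != 0)
                    b[i] = b[i] + a[i];   /* the "if" body */
                c[i] = x;                 /* candidate to hoist above the if */
            }
        }

        /* after global scheduling: legal only if the "if" body can never
           write c[i] or x, i.e. b and c provably do not alias */
        void after(int *a, int *b, int *c, int x, int n) {
            for (int i = 0; i < n; i++) {
                a[i] = a[i] + 1;
                c[i] = x;                 /* hoisted; its latency now overlaps the branch */
                if (a[i] != 0)
                    b[i] = b[i] + a[i];
            }
        }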
potential use conditional or predicated instructions to make changes safe, especially to preserve exception behavior Trace Scheduling: Focusing on the Critical Path use profiling to find common case in inner loop conditionals "trace selection" determins the likely path, aka "trace" unroll loop "trace compaction" schedule trace for common case if common case is found to be false at runtime jump to more expensive code, may have to compensate for operations moved early in common case trace code rturn downsides: exit and return to trace can lead to very complicated exception code Superblocks similar to trace scheduling however on exception to common case, just bail out and don't return until after "superblock" "superblock" contains common case fast path, exception only slow path "tail duplication" creates duplicate code in both common case and exception paths prevents having to return after exception as was needed in trace scheduling 4.5 Hardware Support for Exposing More Parallelism at Compile Time 340 - Conditional or Predicated Instructions processor can evaluate condition/predicate in parallel with result only commit result if condition turns out to be true example (MIPS conditional move) before bnez r1,L add u r2,r3,r0 L: after CMOVZ r2,r3,r1 aka "if conversion" in vector processors helps eliminates simple branches, simplifying branch issues in pipeline not so useful for conditional execution of longer blocks of code full predication all operations can controlled by predicate convert if/then/else into single block with partial exec via predicate removes all non-loop based branches need to preserve proper exception behavior defer exceptions until condition is resolved, similar to out-of-order execution "annulling" - when to cancel instructions with false predicates early annulling need to know condition early to avoid hazard => pipeline stalls not used in practice for this reason late annulling consumes function unit resources => potential negative for performance complicates forwarding implementation conditional/predication summary pros useful for short alternative control flows eliminate some unpredictable branches reduce overhead of global code scheduling cons annulled instructions waste resources fetch cost at minimum can tie up functional units from other uses (although might be idle units anyway) most useful when predicate can be evaled early otherwise we stall for data hazard might have been better of with branch predictor in this case use limited with complex alternatives need to calculate compound predicates, taking extra cycles speed penalty for conditional over unconditional potentially more cycles or slower clock rate in practice MIPS, Alpha, PowerPC, SPARC, Pentium+ conditional move only Itanium full predication for all instructions - Compiler Speculation with Hardware Support move speculative instructions early not just before branch (like predication) but before condition evaluation needs: 1.) find instructions to move early without affecting data flow (COMPILER) 2.) ignore exceptions until non-speculative (HW) 3.) speculatively interchange loads and stores with potential dependencies (HW) Hardware Support for Preserving Exception Behavior 1.) HW and OS cooperatively ignore exceptions okay for "correct" programs, but SIGSEGV in buggy program might be lost, receiving undefined value consumes registers to hold speculative results no speculative stores 2.) 
special speculative instructions that require explicit exception check two of the same issues as above: consumes registers to hold speculative results no speculative stores extra code 3.) poison bits on result registers that cause exception when bad speculative value read non-precise exceptions, since exception appears to happen on later read no speculative stores special OS support for context switching "poison bits" 4.) hardware buffering for speculative results reorder buffer approach need to indicate in speculative instruction: - how many branches the speculative instruction was moved accross - what the prediction was for the branch compiler still responsible for register renaming Hardware Support for Memory Reference Speculation - "special" instruction at site of load - move speculative version of load early - similar to ll/sc, if we see a store to speculative load address, need to cope - if there was a change, we can just reload - other dependant code was run, "special" instruction needs to provide fixup code address 4.6 Crosscutting Issues: Hardware versus Software Speculation Mechanisms 350 - extensive speculation requires memory disambiguation - HW-based works better with unpredictable control flow - HW-based preserves precise exception model (some SW-based version support this as well) - HW-based does not require compensation or bookkeeping code like many SW-based - Compiler-based approaches benefit from large analysis window than HW-based - HW-based does not require different code for different implementations of the same ISA (IA-64 addresses this somewhat by making issue packet 3 wide even if architecture is narrower/wider) - SUMMARY Even with HW advantages, complexity and additional HW add to cost... - odd interactions condition move interacts with register renaming annulled condition move still has to write something to destination register! 4.7 Putting It All Together: The Intel IA-64 Architecture and Itanium Processor 351 - The Intel IA-64 Instruction Set Architecture The IA-64 Register Model - 128 64-bit GP registers (actually 65 bits...) register stack style 0-31 always avaialble 32-128 registers stack 0-96 for local function use on call, shift so arguments visible in r32... 
alloc instruction to indicate output register usage "register stack engine" HW for handing stack overflow special load/stores for saving registers stack - 128 82-bit FP registers (2 more than 80-bit IEEE) - 64 1-bit predicate regiters - 8 64-bit branch registers (for indirect branches) - misc system registers - register rotation can rotate GP And FP registers 32-128 for software pipelined loops combined with predication, can avoid pipeline prologue/epilogue code Instruction Format and Support for Explicit Parallelism - EPIC = VLIW + ability to specify dependencies implicit parallelism instruction groups can span multiple "bundles" seperated by explict "stops" ease of decode "bundles" of three instructions 128-bit wide 5-bit template and 3x41-bit instructions template implies instruction format of 3 instructions location of "stops" between instruction groups template cannot express all combinations of formats/stops only 24 of 32 possible currently defined Instruction Set Basics A=alu, I=immediate, M=memory, B=branch, F=floating, L+X=-move-immediate-long 41-bit instruction high 4-bits => major opcode low 6-bits => predicate register Predication and Speculation Support nearly all instructions predicated conditional move == move with predicate compare/test instructions set pair of predicates with opposite values control speculation (delay exceptions) 65th bit of GPR serves as poison bit special non-IEEE exponent used for poisoned FP "chk" instruction can test for poison memory reference speculation (dealing with aliasing) "ld.a" performces "advanced load" "ld.c" checks that "ld.a" value is valid "chk.a" checks that register derived from "ld.a" value is valid - The Itanium Processor 2001 800MHz, 6 issues/clock(3 branch,2 memory) 3-level cache, L3 is 4MB off chip (but within package) Functional Units and Instruction Issue 9 units (2 immediate, 2 memory, 3 branch, 2 floating) issue 2 x 3 instructionsbundles at a time max 6/clock worst when split by signle stop 4/clock (1+3) split on blocked functional units, consuming 1 bundle on next cycle Itanium issues MMF causes stop before and after 10 stage pipeline Front-end 1.) IPG 2.) Fetch 3.) Rotate Instruction delivery 4.) EXP 5.) REN (rename) Operand delivery 6.) WLD 7.) REG register Execution 8.) EXE execute 9.) DET detect exceptions 10.) WRB writeback NOTE: even though IA-64 is suppose to depend on smart compiler Itanium has many features of super-scalar out-of-order machine Itanium Performance SPECint 60% of Pentium 4 68% of Alpha 21264 SPECfp 108% of Pentium 4 solely based on 4x "art" performance, w/o art it is slower 120% of Alpha 21264 Alpha wins with non-baseline compiler flags with tuned compiler, beats Itanium Performance/Watt similar to Alpha 21264 58% of Pentium 4... 
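      a C-level sketch of what the full predication described above enables (illustrative only; IA-64 does this with predicate registers on individual instructions, not in source code):

        /* branchy source */
        int absdiff_branch(int a, int b) {
            int x;
            if (a > b) x = a - b;
            else       x = b - a;
            return x;
        }

        /* if-converted form: the compare yields a predicate and its complement,
           both arms run as straight-line code, and the predicate selects the
           result -- no branch left to mispredict, at the cost of doing both arms */
        int absdiff_predicated(int a, int b) {
            int p  = (a > b);   /* models the predicate pair p / !p */
            int t1 = a - b;     /* would execute under p  */
            int t2 = b - a;     /* would execute under !p */
            return p ? t1 : t2;
        }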
4.8 Another View: ILP in the Embedded and Mobile Markets 363 - The Trimedia TM32 Architecture closest to "classic" VLIW 5-wide VLIW each instruction has predicate register to control execution staticly scheduled need noops for unused slots need noops for data dependencies if 2 of 5 are branches, need to be sure they have opposite predicates loads can be speculated no virtual memory comparison to MIPS MIPS is general purpose with some extensions for embedded MIPS has much smaller code size, even with TM32 instruction compression - The Transmeta Crusoe Processor Running x86 on a VLIW using software translation Instruction Set Basics instruction formats 64-bit instructions with 2 operations 128-bit instructions with 4 operations (M,C,A,Immediate or M,C,A,Branch) instruction types ACMBI = ALU, Compute, Memory, Branch, Immediate registers 64 integer, 32 floating point Software Translation and Hardware Support options interpret x86 translate "hot" code to native VLIW cache translation at least a basic block at a time coping with exceptions need to preserve precise exceptions of x86 but need to allow reordering for performance reordering support 1.) shadowed register file 2.) program-controlled store buffer 3.) memory alias detecthion hardware with speculative loads 4.) conditional move instruction (select) 1&2 allow OOOe without commiting state until entire x86 sequence completes 2 is like a small TCC cache! software controls commit/rollback Performance Measures goal of power with low performance 4.22 shows that for mp3/dvd playback TM is .32/.42 of PIII power however, 4.23 shows what a PIII is only 8% of the power during DVD playback I/O is over 74% of power. display and DVD are over 50%. 4.9 Fallacies and Pitfalls 370 - Fallacy There is a simple approach to multiple-issue processors that yields high performance with a significant investment in silicon area or design complexity. 4.10 Concluding Remarks 372 - Symmetry of ideas in HW and SW "Software" techniques in hardware-centric approaches - support for conditional instructions - prefecth instructions and other cache hints - branch prediction hints - specoal support for speculative (nonexcepting) loads "Hardware" techniques in software-centric approaches - scoreboard scheduling of instructions - dynamic branch prediction - rollback or trap-and-fix-up support for speculation - hardware for checking speculated load correctness - IA-64 not a breakthrough - in scaling ILP - in avoiding complexity - in power consumption - Alternative move from ILP to multiprocessors SMT (hyperthreading) is half way point - Chip Multiprocessors Power4 2x Power3 processors and L2 cache memory interface and multiprocessor interconnect allows glueless 8 processor module using 4x Power4 TI TMS320C80 (before Power4!) 
4 DSPS and a RISC processor as controller many multi-core embedded MIPS systems 4.11 Historical Perspective and References 373 - The Development of Multiple-Issue Processors 1981, Charlesworth, Floating Point Systems AP-120B Stanford MIPS had 2 ops per instruction feature (not in commercial MIPS) Yale, Fisher VLIW with 512-bits wide instruction trace scheduling *Ellis* [1986] Multiflow processor, Colwell 1987 Yale + controllable store buffer 100 million sold Cydrome Cydra 5 [1989] more HW than Multiflow Rau's "polycyclic scheduling" => software-pipelining Apollo DN 10000 Intel i860 pipeline organization traditional "self-draining" pipelines alternative is "push-and-catch" (AP-120B and i860) operations advance only when another operation pushes from behin pushing instructions specifies destination register of pushed instruction eliminates need to detect WAW and WAR hazards in HW increases code size because of NOOPS needed to push results - Compiler Technology and Hardware Support for Scheduling D. Kuck et al U o fIllinois 1970s terms "antidependence" "output dependence" GCD and Banerjee dependence tests Recent dependence research variety of exact tests ending with linear programming algo called Fourier-Motzkin D. Maydan and *W. Pugh* shows that the sequences of exact tests pratical solutio uncovering and scheduling ILP *Lam* 1988 software pipelining for Warp VLIW speculative code scheduling Smith, Horowitz, and Lam 1992 checking and recover mechanism similar to IA-64 and Crusoe U Illinois 90s IMPACT project Hwu superblock scheduling profiling based optimizations special buffer for conflict detection delayed branches early RISC processor idea from microprogramming - EPIC and the IA-64 Development HP proposed 64-bit VLIW to replace PA-RISC Intel wanted 64-bit replacement for x86 Together they created IA-64 and the Itanium implementation Exercises 378 Chapter 5 Memory Hierarchy Design 5.1 Introduction 390 - Want unlimited ammounts of fast memory memory heirarcy registers cache memory disk principle of locality CPU performance outstripping memory performance markets desktops concerned more with memory latency server concerned more with memory bandwidth (aka throughput) embedded 1.) worst case performance 2.) power/battery life 3.) often one application, little or no OS 4.) very little memory, often no disk harware supported memory protection 5.2 Review of the ABCs of Caches 392 cache cache hit cache miss block locality temportal spatial virtual memory page page fault - Cache Performance Review memory stall cycles miss penalty miss rate address trace less style ld/st trace used to study miss rate etc missed per instruction - Four Memory Hierarchy Questions Q1: Where can a block be placed in the upper level? BLOCK PLACEMENT direct mapped full associative set n-way set associative Q2: How is a block found if it is in the upper level? BLOCK IDENTIFICATION valid bit block address block offset Q3: Which block should be replaced on a miss BLOCK REPLACEMENT tag field index field random replacement least-recently used Q4: What happens on a write WRITE STRATEGY write buffer write through write back dirty bit write stall write allocate no-write allocate - An Example: The Alpha 21264 Data Cache data cache instruction cache unified cache 5.3 Cache Performance 406 average memory access time (AMAT) AMAT = hit time + miss rate * miss penalty hit time - Average Memory Access Time and Processors Performance 1.) other reasons for stalls such as bus contention with other devices 2.) 
in-order processor is stuck, OOOe can allow other instructions to proceed 3.) - Miss Penalty and Out-of-Order Execution Processors "miss penalty" defined to not be total stall time, but "exposed" stall time need to decide how to measure length of memory latency length of latency overlap measure stall when processors does not retire maximum number of instructions per cycle oldest instructions are faulted for stall latency can be measured from one of: - queued instruction window - effective address generated - request sent to memory system - Impoving Cache Performance 17 optimizations in 4 categories (sections 5.4, 5.5, 5.6, 5.7) 5.4 Reducing Cache Miss Penalty 413 - First Miss Penalty Reduction Technique: Multilevel Caches make L1 smaller so thats its faster, add larger L2 to lower overall penalty local miss rate vs global miss rate associativity usually increased at L2 to lowers its local miss penalty multilevel inclusion all data at level N is found in level N+1 inclusive L2 needs to be bigger, otherwise just mirror of L1 inclusion nice because processsor caches levels are consistent for I/O and SMP consistency challenges to inclusion complexity with different block sizes multilevel exclusion NO data at level N is found in level N+1 AMD Athlon trade offs L1 has more hits, needs faster hit time L2 has fewer hits, make it larger to have fewer misses - Second Miss Penalty Reduction Technique: Critical Word First and Early Restart critical word first (aka wrapped fetch or requested word first) return important word to CPU while filling in rest of cache line early restart fill words in order, but let CPU know when its requested word is done - Third Miss Penalty Reduction Technique: Giving Priority to Read Misses over Writes write-through case, have a read miss with outstanding writes option 1: wait to read until write buffer is empty option 2: check the write buffer on read miss if not address not in write buffer, give read miss priority over draining write buffer write-back cache if read needs to evict dirty block move it to write buffer perform read then peform write from buffer - Fourth Miss Penalty Reduction Technique: Merging Write Buffer merge related blocks into single blocks in write buffer NOTE: with memory mapped I/O regions want to avoid this! 
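      a rough C sketch of the merge check (sizes are assumptions: 4 entries, 64-byte blocks, 8-byte words; a real buffer would also skip uncacheable/memory-mapped I/O addresses per the note above):

        #include <stdbool.h>
        #include <stdint.h>
        #include <string.h>

        #define WB_ENTRIES      4
        #define WORDS_PER_BLOCK 8   /* 64-byte block of 8-byte words */

        struct wb_entry {
            bool     valid;
            uint64_t block_addr;             /* aligned block number */
            uint64_t data[WORDS_PER_BLOCK];
            uint8_t  word_valid;             /* one bit per buffered word */
        };

        static struct wb_entry wb[WB_ENTRIES];

        /* merge a word-sized write into an existing entry for the same block if
           possible, otherwise allocate a new entry; returns false when the
           buffer is full (the CPU would stall) */
        bool wb_write(uint64_t addr, uint64_t value)
        {
            uint64_t block = addr / (8 * WORDS_PER_BLOCK);
            unsigned word  = (unsigned)((addr / 8) % WORDS_PER_BLOCK);

            for (int i = 0; i < WB_ENTRIES; i++) {
                if (wb[i].valid && wb[i].block_addr == block) {   /* merge hit */
                    wb[i].data[word]  = value;
                    wb[i].word_valid |= (uint8_t)(1u << word);
                    return true;
                }
            }
            for (int i = 0; i < WB_ENTRIES; i++) {
                if (!wb[i].valid) {                               /* allocate new entry */
                    memset(&wb[i], 0, sizeof wb[i]);
                    wb[i].valid      = true;
                    wb[i].block_addr = block;
                    wb[i].data[word] = value;
                    wb[i].word_valid = (uint8_t)(1u << word);
                    return true;
                }
            }
            return false;
        }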
- Fifth Penalty Reduction Technique: Victim Caches VC is small fully-associative cache handles conflict misses of directly mapped L1 cache AMD Athlon 8 entries *Jouppi* [1990] found 1-5 are effective for small direct mapped L1s 4 entry VC can remove 1/4 misses of a 4KB direct mapped data cache - Summary of Miss Penalty Reduction Techniques most desktop and servers use techinques 1,2,3,4 5.5 Reducing Miss Rate 423 - 3 C's (4th C is Coherence, see 5.15) Compulsory - "cold start" or "first-references misses" Capacity - program cannot run in cache alone Conflict - "collision misses" or "inteference misses" from associativity (including direct mapped) reduce compulsory with larger block size reduce capacity with a larger cache reduce conflict with larger associativity - First Miss Rate Reduction Technique: Larger Block Size reduces compulsory by taking advantage of spatial locality but too large can increase conflict misses - Second Miss Rate Reduction Technique: Larger Caches but longer hit time and higher cost - Third Miss Rate Reduction Technique: Higher Associativity rule 1: 8-way closely approximates fully-associative rule 2: "2:1 cache rule of thumb" a direct mapped cache of size N has about the same miss rate as a two-way set-associative cace of size N/2 (for caches less than 128kb, which is typically for direct mapped caches) - Fourth Miss Rate Reduction Technique: Way Prediction and Psuedoassociative Caches way prediction cache bits to predict which will be next block accessed Alpha 21264 uses on 2-way I-cache 1 clock cycle hit if predictor correct, 3 cycle if incorrect 85% accuracy for SPEC works by setting multiplexor is set early and only one tag comparison is necessary psuedoassociative (aka column associative) direct mapped cache with two entries (different from 2-way because search is not in parallel but serial) always try "fast" entry first, then check second entry like way prediction, want to guess which one should be in the fast slot - Fifth Miss Rate Reduction Technique: Compiler Optimizations reduce conflict misses in I-cache through layout of related functions reduced misses 50% for 2KB direct mapped i-cache and 75% for 8KB cache align basic block entry points on cache block boundary for spatial locallity cache line aware array access Loop interchange don't stride between lines when you can traverse along a line Blocking work on submatrixes (aka blocks) that are sized to cache lines combines spatial and temporal locaity blocking also can help if register allocator can bring whole block into registers - Summary of Miss Rate Reduction Techniques processor-memory gap means cache misses are common problem with underperforming systems 5.6 Reducing Cache Miss Penalty or Miss Rate via Parallelism 435 - First Miss Penalty/Rate Reduction Technique: Nonblocking Caches to Reduce Stalls on Cache Misses - Second Miss Penalty/Rate Reduction Technique: Hardware Prefetching of Instructions and Data - Third Miss Penalty/Rate Reduction Technique: Compiler-Controlled Prefetching - Summary of Miss Penalty/Rate Reduction Techniques 5.7 Reducing Hit Time 443 - First Hit Time Reduction Technique: Small and Simple Caches SMALL enough to fit L1 on chip SIMPLE direct mapped allows overlap of tag check with transmission of data L2 hack keep tags on chip and data off chip L1 size has not been growing even with larger chips need to keep hit time within single clock cycle (maybe no longer true backoff now that MHz marketing is over) - Second Hit Time Reduction Technique: Avoiding Address Translation 
during Indexing of the Cache virtual cache vs physical cache indexing vs tag bits virtual indexed, virutally tagged caches pros eliminates translation time from cache hit cons need to check protection (could copy info from TLB to cache on miss) need to restrict access by process (could add PID (aka ASID (aka ASN)) to blocks) synonyms or aliases (two virtually addressed cache entries could be for te same physical address) HW: antialiasing can be used to solve this in restricted cases check all possible locations indexes on miss, flushing matches SW: page coloring force aliased pages to map to same cache index (force last bits of virtual/physical to match) I/O uses physical addresses virtual indexed, physically tagged caches pros fast lookup, index does not have to be translated (must be equal for both physical and virtual addresses) tag match happens after access on physical address cons directed mapped cache can be no bigger than page size associativity helps by allowing larger cache sizes - Third Hit Time Reduction Technique: Pipelined Cache Access Pentium Pro and Pentium III I-cache pros increased instruction bandwidth cons higher branch mispredict penalty - Fouth Hit Time Reduction Technique: Trace Caches trace cache instead of static cache blocks, use previous dynamic traces including taken branches acts as form of branch prediction as well Pentium 4 (NetBurst) approach pros potentially better space utilization than partially used wide cache lines cons complexity, especially of alignment instructions are duplicated when conflicting taken branches in recent history - Cache Optimizations Summary 5.8 Main Memory and Organizations for Improving Performance 448 - Introduction cache design is about lowering low latency memory design is about improving bandwidth (latency is more fixed) caches that access memory have long cache lines to take advantage of this bandwidth - First Technique for Higher Bandwidth: Wider Main Memory install pairs or quads of memory modules make memory bus as wider, along with block size pros wider means more bandwidth cons need multiplexor to narrow from wide cache lines to narrower levels closer to processor (okay between L2 and L1, but not really between L1 and CPU for performance reasons) consumer needs to add memory modules in pairs or quads... error correction over wider memory unit means loading whole unit to update ECC on word sized write (can arrange to ECC on words to avoid this problem with extra cost) - Second Technique for Higher Bandwidth: Simple Interleaved Memory organize memory into banks, striping physically addresses accross banks keep memory bus same width, no multiplexing typically strip at word granularity but interleaving factor could be higher pros can pipeline reads accross banks write buffer can write to banks in parallel cons have to time-share narrower bus as sinhle memory chip capacity increases, costly to distribute memory accross multiple banks complexity of interleaving hetergenous size and speeds of different memory modules how many banks? 
    how many banks?
      rule from vector computers: number of banks >= number of clock cycles to access a word in a bank
- Third Technique for Higher Bandwidth: Independent Memory Banks
  independent memory banks with multiple memory controllers
5.9 Memory Technology 454
  memory latency terms
    "access time" - time between when a read is requested and when the desired word arrives
    "cycle time" - time between requests to memory
    cycle time > access time because the address lines need to be stable between accesses
- DRAM Technology
  to reduce the number of pins, address lines are multiplexed
    the address is divided in half
    RAS and CAS (row access strobe and column access strobe)
    row is like a cache index
    column is like selecting a cache set
  D is for Dynamic
    single transistor per bit
    even reading can corrupt the value
      need to write back after a read
      can read and refresh a row at a time
    DRAM hardware periodically refreshes all of memory
      example: every 8 milliseconds
      refresh can interfere with outside memory access
        introduces variability into memory access time
      try to keep refresh to less than 5% of total time
  speed
    row access time growing very slowly, 5%/yr (related to latency)
    CAS (or data transfer time) growing slightly more, 10%/yr (related to bandwidth)
  DIMMs (dual inline memory modules)
    typically 4-16 DRAMs
    normally 8 bytes wide for desktops
- SRAM Technology
  S is for Static
    6 transistors per bit to prevent the need for refresh
    needs only low power to retain values, unlike DRAM's constant refreshing
  DRAM vs SRAM
    DRAM - cost/bit and capacity
    SRAM - speed and capacity
    SRAM has no multiplexed address lines due to speed requirements
    SRAM has no difference between cycle and access time
    DRAM capacity is 4-8 times SRAM for the same transistor budget
    SRAM cycle time is 8-16 times faster than DRAM
    SRAM cost is 8-16 times DRAM
- Embedded Processor Memory Technology: ROM and Flash
  ROM
    single transistor per bit
    nonvolatile, doesn't need power
    indestructible, doesn't need memory protection (for code/data in embedded systems)
  Flash
    read speed approximately that of DRAM
    write speed 10x-100x slower than DRAM
    DRAM 4x-8x more megabytes/dollar (2001)
- Improving Memory Performance in a Standard DRAM Chip
  1.) fast page mode
      repeated accesses to different columns in the same row
      saves repeated RAS time
  2.) SDRAM (Synchronous DRAM)
      replace the asynchronous interface that requires synchronization
      use a clocked interface with registers to store request parameters such as byte count
  3.) DDR (Double Data Rate)
      transfer on both the rising and falling edges of the DRAM clock
  all three are small modifications that exploit high internal DRAM bandwidth
- Improving Memory Performance via a New DRAM Interface: RAMBUS
  RDRAM
    turns the memory component into a memory system
    replaces the RAS/CAS interface with a bus
    allows multiple outstanding requests (aka packet-switched bus or split-transaction bus)
    4 inch maximum bus length
    4 banks
  DRDRAM (Direct RDRAM)
    separate row/column/data buses
    4-16 banks
    4-8 row buffers
- Comparing RAMBUS and DDR SDRAM
  caches need low latency to the first byte and bandwidth for the rest of the block
  RAMBUS focused on only the second part of the problem (an Amdahl's Law issue...)
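  a back-of-the-envelope C calculation of the latency-vs-bandwidth point above, using assumed circa-2001 numbers (a 64-bit DDR-266 DIMM and roughly 50 ns to the first word); the figures are illustrative, not from the book:

    #include <stdio.h>

    int main(void)
    {
        double bus_bytes   = 8.0;       /* 64-bit DIMM                          */
        double clock_mhz   = 133.0;     /* DDR-266: 133 MHz, 2 transfers/clock  */
        double first_ns    = 50.0;      /* assumed RAS+CAS time to first word   */
        double block_bytes = 64.0;      /* cache block being filled             */

        double peak_mb_s = bus_bytes * clock_mhz * 2.0;            /* ~2128 MB/s */
        double burst_ns  = block_bytes / (peak_mb_s * 1e6) * 1e9;  /* ~30 ns     */

        printf("peak bandwidth : %.0f MB/s\n", peak_mb_s);
        printf("miss penalty   : %.0f ns latency + %.0f ns burst\n",
               first_ns, burst_ns);     /* latency dominates a block fill */
        return 0;
    }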
  price premium for RAMBUS in 2001 is 2x
    limited to niche markets willing to pay for absolute performance
5.10 Virtual Memory 460
  virtual memory benefits
    protection
    speed: programs start faster due to demand paging
      easier than manual overlays
    relocation
      don't need contiguous physical memory
      (alternative is Multics/Honeywell/i386-style relocation/segment registers)
      (or rewriting addresses in code at load time instead of link time)
    sharing within a process and between processes
      read-only to share resources such as program executables
      read-write for cross-process data sharing
  in cache terms
    block => page or segment
    miss => page fault or address fault
    memory mapping or address translation: map a virtual address to a physical address
  cache vs VM
    replacement: cache replacement is in HW, VM replacement in SW
      the longer miss penalty means we can take time to make a good decision
    size: cache size is variable, while the processor address size controls the size of VM
    files: cache is memory only; disk is multiplexed between files and VM (except in Multics :))
  pages vs segments (Figure 5.34)
    fixed length                 VS variable length
    single word address          VS segment and offset
    invisible to programmer      VS usually programmer (and compiler) visible
    trivial replacement          VS hard to find contiguous space
    internal fragmentation       VS external fragmentation
    easy mapping to disk blocks  VS need to manage swap file allocation as well
    hybrids: paged segments... multiple page sizes...
- Four Memory Hierarchy Questions Revisited
  Q1: Where can a block be placed in the upper level? BLOCK PLACEMENT
    fully associative
  Q2: How is a block found if it is in the upper level? BLOCK IDENTIFICATION
    page table
    inverted page table
    translation lookaside buffer (TLB) (aka translation buffer)
  Q3: Which block should be replaced on a miss? BLOCK REPLACEMENT
    least-recently used
    use bit (aka reference bit)
  Q4: What happens on a write? WRITE STRATEGY
    write back
    dirty bit
- Techniques for Fast Address Translation
  page tables are usually large
    often paged themselves in main memory
    translating VA to PA could take two memory references
      one for the translation, one for the actual data
  TLB (Translation Lookaside Buffer)
    locality: if addresses have locality, so should translations
    TLB is a cache of page table entries, including
      translations
      protection
      valid bit
      use bit
      dirty bit
    OS may need to invalidate TLB entries as it changes page table entries
      software-managed coherence
    entries may contain a PID/ASID/ASN to identify a specific address space
      prevents the need for invalidation on a context switch
    Alpha example: 128-entry fully associative TLB
    TLB is often on the critical path
      pressure to keep it small to make it fast
      Alpha uses a virtually tagged i-cache to avoid the TLB cost
      (less chance of aliasing since instructions are read-only)
- Selecting a Page Size
  larger page sizes
    pros
      smaller page tables (fewer entries)
      allow larger virtually indexed, physically tagged caches (section 5.7)
      larger unit for I/O
      TLB maps more of memory (the superpages argument)
        TLB misses can have a serious impact on CPI
    cons
      more internal fragmentation
      slower startup from demand paging more than is needed
- Summary of Virtual Memory and Caches
  Figure 5.37 shows an overview of TLB+L1+L2
5.11 Protection and Examples of Virtual Memory 469
  process => address space
  process switch (or context switch)
  swapping approach
    move the entire process out of memory to keep it from other processes' view
    too slow
  VM provides an easier approach by switching page tables per process
- Protecting Processes
  Simple base and bound approach
    two registers that check each address for a valid access range
    absolute (flat) addresses: base <= address <= bound
    relative (segment) addresses: (base + address) <= bound
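  a minimal C sketch of the base-and-bound check above; the struct layout and fault handling are invented for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    struct bounds { uint64_t base, bound; };   /* per-process, set by the OS */

    /* absolute (flat) addressing: base <= address <= bound */
    static bool ok_absolute(struct bounds b, uint64_t addr)
    {
        return addr >= b.base && addr <= b.bound;
    }

    /* relative (segment) addressing: the access goes to base + address,
     * which must not exceed the bound */
    static bool ok_relative(struct bounds b, uint64_t addr, uint64_t *phys)
    {
        if (b.base + addr > b.bound)
            return false;                      /* would raise a protection fault */
        *phys = b.base + addr;
        return true;
    }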
    how do we change these registers without letting the user change them?
      1.) two modes: user vs kernel (aka supervisor or executive)
      2.) read-only CPU state in user mode, such as
          base/bound registers
          user/supervisor mode bit(s)
          exception enable/disable bit
      3.) system call and return
  virtual memory provides finer granularity than base/bound
    permission flags per page or segment
    read only, no execute
  protection rings
    of questionable value for their originally designed purpose
    hypervisor...
  capabilities
    use a "key" to access a "lock"
    need to allow programs to pass keys to each other without forgery
    HW support if performance is desired...
- A Paged Virtual Memory Example: The Alpha Memory Management and the 21264 TLB
  segments based on the upper address bits (similar to the VAX layout)
    kseg for the kernel
      uniform protection, no memory management
    seg0: from the bottom (0) upward
    seg1: from the top of memory downward
  48-bit virtual address
    2 bits for segment (VAX style)
    10 bits for the level-1 page table
    10 bits for the level-2 page table
    10 bits for the level-3 page table
    13 bits for the page offset
    45 bits (so far...)
  three-level page table (walk sketch below)
    8 bytes per PTE
    2^10 = 1024 entries per level
    each level needs 8KB, exactly equal to the page size! not a coincidence
  using larger 64KB pages we get 3 more offset bits, bringing us to the maximum total of 48 bits
- A Segmented Virtual Memory Example: Protection in the Intel Pentium
  8086
    segments (base) but no bound
    no virtual memory
  i386 (aka IA-32)
    4 rings (0 is kernel)
    flexible segments and paging
    operating system and user code in the same address space
    system calls access arguments from the caller with the caller's protection level
      avoids passing a reference to something you cannot read but the kernel can
  Adding Bounds Checking and Memory Mapping
    segments changed from a base to an index into a descriptor table
      similar to the Alpha page table
    segment descriptor includes
      present bit (aka valid)
      base field (base physical address of the segment)
      access bit (reference or use bit)
      attributes (valid operations and protection levels)
      limit field (to go with the base)
    more details
      D for 16- vs 32-bit default address size
      G for granularity (bytes versus pages)
      DPL (descriptor privilege level) => required ring level to access
      code segments
        Conforming = use the code's privileges, not the caller's (for RPC-like libraries)
      data segments
        expand down = used for stacks, implies the limit is a negative offset
      call gates
        word count = amount of stack to copy on a call
        destination selector/destination offset = entry point for the system call (base+offset)
    paging with segmentation
      32-bit address
        upper bits - segment descriptor
        middle bits - index into the page table selected by the descriptor
        lower bits - offset
  Adding Sharing and Protection
    split address space
      global address space vs local address space (1/2 each originally)
      each half has its own descriptor table
    user segment registers
      descriptor index
      - plus 1 bit for global vs local
        can access any segment with a protection level greater than or equal to the current level
      - plus 2-bit "requested protection level" (see below)
  Adding Safe Calls from User to OS Gates and Inheriting Protection Level Parameters
    call gates
      restrict the location of entry points (don't allow jumps into the middle of code)
      word count indicates how much data to copy from the user stack to the kernel stack
        popped from both on return, result copied to the user stack
        does not allow arguments to be passed in registers
        which is ironic because DOS and BIOS interrupt calls did do this :)
    bad pointers
      kernel can set the segment registers' "requested protection level" bits
      allows reading data via a pointer using the caller's protection
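  a C sketch of an Alpha-style three-level page-table walk using the 10/10/10-bit level indexes and 13-bit page offset listed above; the PTE format, field names, and frame_to_table helper are invented for illustration, not the real PALcode/OS data structures (the segment-select bits above the level-1 index are ignored here):

    #include <stdint.h>

    #define PAGE_SHIFT   13u                        /* 8 KB pages             */
    #define LEVEL_BITS   10u                        /* 1024 PTEs of 8 bytes   */
    #define LEVEL_MASK   ((1u << LEVEL_BITS) - 1)

    typedef uint64_t pte_t;                         /* valid bit + frame number */
    #define PTE_VALID    1ull
    #define PTE_PFN(p)   ((p) >> PAGE_SHIFT)

    /* Walk level 1 -> 2 -> 3; returns 0 on a page fault. */
    uint64_t translate(pte_t *level1, uint64_t va,
                       pte_t *(*frame_to_table)(uint64_t pfn))
    {
        unsigned i1 = (va >> (PAGE_SHIFT + 2 * LEVEL_BITS)) & LEVEL_MASK;
        unsigned i2 = (va >> (PAGE_SHIFT + 1 * LEVEL_BITS)) & LEVEL_MASK;
        unsigned i3 = (va >> PAGE_SHIFT) & LEVEL_MASK;
        uint64_t offset = va & ((1u << PAGE_SHIFT) - 1);

        pte_t p1 = level1[i1];
        if (!(p1 & PTE_VALID)) return 0;
        pte_t p2 = frame_to_table(PTE_PFN(p1))[i2];
        if (!(p2 & PTE_VALID)) return 0;
        pte_t p3 = frame_to_table(PTE_PFN(p2))[i3];
        if (!(p3 & PTE_VALID)) return 0;

        return (PTE_PFN(p3) << PAGE_SHIFT) | offset;    /* physical address */
    }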
- Summary: Protection on the Alpha versus the IA-32
  IA-32 is very complex to implement
  even worse, the features are largely unused by Unix and Windows
    even with security problems, no one seems to be using them (except the Nooks proposal...)
5.12 Crosscutting Issues: The Design of Memory Hierarchies 478
- Superscalar CPU and Number of Ports to the Cache
  need a nonblocking cache so we can speculate beyond misses
  x86 allows instructions to be unaligned
  UltraSPARC III
    4 instructions per clock means 128 bits from the I-cache per cycle
    2 of the 4 instructions can be ld/st, needing 2 separate 64-bit values from the D-cache
- Speculative Execution and the Memory System
  need to suppress exceptions of speculative instructions
  don't want to tie up the cache on a speculative miss
    again, a nonblocking cache is needed
- Combining the Instruction Cache with Instruction Fetch and Decode Mechanisms
  NetBurst trace cache stores micro-ops (uops)
    uops are 64 bits long, so they actually take more memory in the trace cache
    relative to x86 instructions in an i-cache
  embedded also has a bigger-instructions issue
    decompress code stored in memory into the i-cache
- Embedded Computer Caches and Real-Time Performance
  I-caches are typically okay because code behavior is relatively predictable
  D-caches allow portions to be locked down
    for set-associative, lock a subset of the n ways
    Xbox 360 does this, allowing DMA from cache to GPU etc.
- Embedded Computer Caches and Power
  more power efficient to access on-chip memory, avoids:
    driving pins
    driving the memory bus
    driving memory and the return trip
  way prediction allows only half of a 2-way cache to be powered up at a time
- I/O Consistency of Cached Data
  cache coherence
  I/O could read from the cache, but:
    this interferes with CPU access
    this could destroy cache data locality
      on a page fault, load a whole page but only need a few words
    hard to access on-chip caches externally
  I/O could read from memory, but:
    need write-through caches
  Input
    1.) mark input buffer pages as non-cacheable
    2.) OS flushes the address range from the cache
    3.) HW can check duplicate tag bits of the cache contents
        (the CPU has its own set of bits for fast access)
  multiprocessors and I/O
    with I/O, rare to want to have copies of data
    with SMP, common to want copies of data
5.13 Putting It All Together: Alpha 21264 Memory Hierarchy 482
- Overview
  detail of bootup through TLBs, PTEs, L1, L2, faults, etc.
  Bootup
    chip loads the cache serially from an external PROM
      64KB for 16k instructions
    chip loads system parameters from the PROM
      off-chip cache speed/timing
      system port speed/timing
      other interface logic parameters
    run instructions in PAL (privileged architecture library) mode
      update the instruction TLB and PTEs for the OS
    let the OS take over from the PAL initialization code
- Performance of the 21264 Memory Hierarchy
  compare the OOOe 21264 to the in-order 21164
    a higher miss rate helps the 21264, which can run other instructions during a miss
  TPC-C puts much higher demands on the overall memory hierarchy than SPEC CPU apps (as in Barroso)
5.14 Another View: The Emotion Engine of the Sony Playstation 2 490
- PS2 Emotion Engine
  64-bit MIPS III core with SIMD extensions
  two vector units
    VPU0 attached to the CPU for DSP
    VPU1 operates independently of the CPU
  Memory: 32 MB DRDRAM
  I/O Processor: old 32-bit MIPS for PS1 games
  Graphics Synthesizer
    16 parallel pixel processors
    1024-bit path each way to 4MB of multiported embedded DRAM
  rather than the traditional desktop/server unified memory model,
  more like embedded with 9 distinct memory modules
    software-controlled coherency
5.15 Another View: The Sun Fire 6800 Server 494
  4th "C": coherence misses
  ECC
    SEC/DED (Single Error Correcting/Double Error Detecting)
    used between the memory chips and the processor to catch end-to-end errors
  NUMA memory access, not a strict hierarchy
  up to 4 dynamic system domains
    partitioning for fault tolerance
    can be changed without a reboot (like Multics :))
  redundant paths both between the 24 processors and to I/O
5.16 Fallacies and Pitfalls 498
- Fallacy: Predicting cache performance of one program from another
- Pitfall: Simulating enough instructions to get accurate performance measures of the memory hierarchy
  1.) a small trace may not have time to fill the cache
  2.) cache behavior may change over a program's run time
  3.) cache behavior may be data dependent
- Pitfall: Too small an address space
  the older IBM/360 outlasted the younger PDP-11
    24-31 bits vs 16 bits
  x86 will die, need IA-64 :) NOT! AMD producing x86-64
- Pitfall: Emphasizing memory bandwidth in DRAMs versus memory latency
  RDRAM takes 20% more die area than SDRAM
    takes space away from the redundant rows that improve yield
  RDRAM is good for small memories with high bandwidth needs
    that is, PS2 and ...
    servers can get better BW through many parallel chips
- Pitfall: Delivering high memory bandwidth in a cache-based system
  for streaming computing, caches hurt since they just get in the way
- Pitfall: Ignoring the impact of the operating system on the performance of the memory hierarchy
  25% of misses from the OS or OS interference evicting user data
- Pitfall: Relying on the operating system to change the page size over time
  Alpha planned to grow the architecture's address space by increasing the page size
    in the end, had to find other ways since OSes wanted to stick with 8k pages
  TLBs and superpages
    OSes have not adopted superpages except for very specific functions
    largely mapping display memory and I/O devices
5.17 Concluding Remarks 504
  scaling: terabyte main memories vs 8KB pages
  tradeoffs
    cost/complexity tradeoffs to getting a good balance
    different tradeoffs for different applications
5.18 Historical Perspective and References 504
  Virtual Memory
    pioneers coined the term "memory hierarchy" (1962)
    Kilburn et al proposed automatic management of 2 levels
      Atlas system, 1 year before the IBM/360
    IBM/370 added VM
      the TSS OS couldn't really handle it though
      coined the term "TLB"
    all computers today except supercomputers and embedded have VM
    GE 645: segmented virtual memory for Multics
      conceptual predecessor to the i286/i386
    capabilities
      simplicity of the idea lost in the complexity of an efficient implementation
    24->32 bits
      IBM/360 programs used the upper 8 bits for their own purposes
      Motorola 68000 (68k) had similar 24->32 bit migration issues
      today, trap if unused bits are non-zero
  Caches
    Wilkes 1965 direct-mapped cache paper
      University of Cambridge built a direct-mapped cache a year before Wilkes' paper
    IBM 360/85: first commercial machine with a cache
      correctly used AMAT, not miss rate, to measure performance impact
      coined the term "cache"
      killed the 360/91 because cache beat OOOe
    Smith 1982: "temporal locality" and "spatial locality" terms
    Hill* 1987: 3 C's
    Jouppi* 1998 said Hill's analysis led to the victim cache idea
    Kroft 1981: nonblocking cache
    Kroft: dual-ported nonblocking cache
    Baer and Wang 1988: inclusion
    Jouppi* and Wilson 1994: exclusion
    Jouppi 1990: prefetching via stream buffers
    90's ISCA and ASPLOS full of cache papers
Exercises 513
Chapter 6 Multiprocessors and Thread-Level Parallelism
6.1 Introduction 528
- Introduction
  always the next big thing...
  current reasons
    1.) easiest way to improve performance over a single processor is to use more processors
    2.) limits of ILP
    3.) progress on parallel software, especially server and embedded; desktop is the challenge
  no room for all the issues in H&P
    Culler, Singh, and Gupta: another 1000 pages just on multiprocessors
  focus on mainstream design: multiprocessors with <= 128 processors
- A Taxonomy of Parallel Architectures
  Flynn 1966
    1.) SISD - uniprocessor
    2.) SIMD - multimedia extensions and vector architectures
    3.) MISD - streaming computers (to some approximation)
    4.) MIMD - multiprocessors
  MIMD rules
    1.) flexible. can run
        - multiple single-threaded apps
        - one multithreaded app
        - a mixture of both
    2.) cost-performance advantage by leveraging existing uniprocessors (both designs and implementations)
  multiprogramming
    running multiple independent processes
  multithreading
    using multiple processors for one process
  thread-level parallelism
    term applies to both multithreading and multiprogramming
  centralized shared-memory architecture
    at most a few dozen processors in 2000
    symmetric (shared-memory) processors (SMP)
    uniform memory access (UMA)
  distributed memory multiprocessors
    non-uniform memory access (NUMA)
    1.) cost-effective way to scale memory bandwidth (if accesses are mostly to the local node)
    2.) reduces latency for local accesses
  two different programming models
    distributed shared memory (DSM)
    message passing interface (MPI)
  I/O is also distributed among the nodes
  DSM nodes may be SMPs internally
- Models for Communication and Memory Architecture
  two models:
    1.) shared address space
        DSM
        NUMA
    2.) multiple private address spaces
        message-passing multiprocessors (aka multicomputers)
        RPC, synchronous or asynchronous
        MPI (Message Passing Interface) standard
          lose some absolute performance for portability
  Performance Metrics for Communication Mechanisms
    1.) Communication bandwidth
        limits of processor, memory, or interconnection bandwidth
          hopefully not the communication mechanism
        bisection bandwidth of the interconnect
        node bandwidth
    2.) Communication latency (see the worked example below)
        Communication latency = Sender overhead + Time of flight + Transmission time + Receiver overhead
        naming and protection
          DSM - implemented by the shared memory model in hardware at low cost
          OS - could be higher overhead if pushed to SW
    3.) Communication latency hiding
        hide latency by overlapping communication with computation
        better to reduce latency than hide it, if possible
        hiding is more application specific, often requiring programmer effort
        characteristics of communication
          size of data
          regularity in the communication pattern
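  a worked example of the communication-latency breakdown above; all the numbers (1 us software overhead per side, 10 m of cable, a 1 Gbit/s link, a 4 KB message) are assumptions for illustration:

    #include <stdio.h>

    int main(void)
    {
        double sender_us   = 1.0;
        double receiver_us = 1.0;
        double flight_us   = 10.0 / 2.0e8 * 1e6;         /* ~0.05 us at ~2/3 c   */
        double xmit_us     = 4096.0 * 8 / 1.0e9 * 1e6;   /* ~32.8 us on the wire */

        double total = sender_us + flight_us + xmit_us + receiver_us;
        printf("total latency: %.1f us\n", total);
        /* for a 64-byte message the transmission term is only ~0.5 us,
         * so the fixed overheads dominate: overhead, not raw bandwidth,
         * limits small transfers */
        return 0;
    }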
  Advantages of Different Communication Mechanisms
    DSM pros
      compatibility with SMP
        example: OpenMP
      easier to program with irregular communication
      simpler compiler support
      develop for SMP, tune for DSM
      lower overhead and better use of bandwidth for small items
      hardware-controlled caching to reduce the frequency of remote communication
        also serves to improve latency and contention for shared data
    Message Passing pros
      simpler hardware
      explicit communication makes analyzing communication patterns clear
      explicit communication makes the programmer aware of the cost of communication
      synchronization is coupled to message passing
      easier to use sender-initiated communication
    can build MP on top of DSM easily
    building DSM on top of MP is harder
      depends on the OS to handle shared memory at large page granularity
      "shared virtual memory" of section 6.10
  MPP (Massively Parallel Processors): > 100 CPUs
  SMP >> DSM (market size)
    DSM: section 6.5
  multicomputers (aka message-passing multiprocessors (aka clusters)): chapter 8
- Challenges of Parallel Processing
  1.) insufficient parallelism
  2.) long-latency remote communication
  chapter focuses on #2
6.2 Characteristics of Application Domains 540
  many issues
    load balance
    synchronization
    sensitivity to memory latency
    data distribution
    structure of the parallel algorithm
    spatial and temporal access patterns to data
- A Commercial Workload (Barroso paper :))
  1.) OLTP (TPC-B)
  2.) DSS (TPC-D)
  3.) Web Search (AltaVista)
  I/O causes more kernel time and more idle time
- Multiprogramming and OS Workload
  2 copies of the Andrew Benchmark
  5 seconds on 8 processors
  no paging needed
  3 phases
    1.) compile: CPU bound
    2.) installing object files in a library: I/O bound
    3.) removing object files: mostly I/O and idle
- Scientific/Technical Applications
  The FFT Kernel
    3 data structures
      1.) input array
      2.) output array
      3.) precomputed read-only roots of unity matrix
    6 steps
      1.) transpose data
      2.) perform a 1D FFT on each row
      3.) multiply the roots of unity by the data and write to the data
      4.) transpose data
      5.) perform a 1D FFT on each row
      6.) transpose data
    partitioning
      contiguous chunks of rows, so all processors have local data
      transpose ensures that data is local for step 5
      all-to-all communication
      no temporal locality, but good spatial locality
  The LU Kernel
    LU factorization of a dense matrix
      similar to QR factorization, Cholesky factorization, and eigenvalue methods
    blocking is key for the sequential version
    partitioning
      1.) blocks assigned using 2D tiling
      2.) dense matrix multiplication done by the owner of the *destination* block
    data structure representation
      natural to use a 2D array
        however, then chunks are not contiguous
      instead use a 4D array
        first two dimensions are the grid partitioning
        last two dimensions are the sub-array
  The Barnes Application
    Barnes-Hut n-body simulation
      n^2 interactions between n bodies
      take advantage of the inverse square law to treat groups of remote bodies as one body
    octree data structure
      8 nodes represent 8 parts of a cube (cells)
      each node represents the bodies in its cell
      density varies, so trees are unbalanced
    algorithm
      for each body we traverse the tree once
      if a cell is far enough away, approximate it as a single mass at the cell's center
      otherwise open the cell and recurse
      if we reach leaves we calculate the force directly
    irregularity because of varying density, but
      1.) the system evolves slowly
      2.) forces fall off quickly
      few cells involved, most were the same as the last timestep
    partitioning
      each processor gets a separate subtree
      frequent communication within a subtree because of the physical locality of the octree
      partition not just on the number of nodes, but on the approximate amount of work
      patterns still relatively irregular
      need efficient fine-grained communication
  The Ocean Application
    red-black Gauss-Seidel multigrid technique
      red-black means alternating grid point updates like a checkerboard
      multigrid techniques solve finite difference equations using iterations over hierarchical grids
        each level of the hierarchy is an approximation with fewer points than the finer grids
        finer grids have more accuracy and need more computation to converge
        move up/down the grid hierarchy depending on the rate of change at the previous step
        solution reached when we have convergence at the finest level
    data representation
      dynamically allocated and sized grids for a particular problem
    partitioning
      the ocean basin is partitioned into square subgrids
      each processor has a local subgrid
    communication
      5 steps per iteration
      data exchanged between subgrids at each step
      communication along boundaries to nearest neighbors
  Computation/Communication for the Parallel Programs
    high comp/comm is desirable for scalability
    often sensitive to data set size and number of processors
    FFT - log n (all-to-all)
    LU - sqrt(n)/sqrt(p)
    Barnes - sqrt(n)/sqrt(p)
    Ocean - sqrt(n)/sqrt(p)
    sqrt(n)/sqrt(p) comes from the partitioning and communication pattern
      (each processor owns an n/p-point subgrid; computation grows with its area n/p,
      while communication grows with its perimeter ~sqrt(n/p), so the ratio is ~sqrt(n)/sqrt(p))
    need to increase problem sizes to keep comp/comm constant as we add processors
    LU and Ocean communication is determined by boundaries
6.3 Symmetric Shared-Memory Architectures 549
  bus-based multiprocessors helped by the addition of caches
    a cache lowers the bus bandwidth needs of a processor to main memory
    allows us to attach more CPUs to an existing bus
- What is Multiprocessor Cache Coherence?
  coherence
    what values can be returned by a read
    behavior of reads and writes to the same memory location
  consistency
    when a written value will be returned by a read
    behavior of reads and writes with respect to other memory locations
  coherence requirements
    1.) PROGRAM ORDER
        P writes A to X, no one else writes to X
        if P reads from X it should see A
    2.) COHERENT VIEW OF MEMORY
        P writes A to X, no one else writes to X
        Q reads X after a "sufficient" period of time, it should see A
    3.) WRITE SERIALIZATION
        two writes to the same location are seen in the same order by all processors
        if values 1 and 2 are written to a location, processors should not see 2 then 1
- Basic Schemes for Enforcing Coherence
  coherent multiprocessor caches provide both migration and replication of shared items
    migration
      reduces latency since the data is in the local cache
      reduces bus bandwidth since a cached access does not use the bus
    replication
      similarly reduces latency
      prevents contention for read-shared data, again lowering bus bandwidth
  two classes of cache coherence protocols
    directory based
      directories maintain the sharing status of each block of physical memory
    snooping based
      caches with copies of blocks also track the blocks' sharing status
      requires a shared bus for caches to snoop to see operations on shared blocks
- Snooping Protocols
  write invalidate
    take exclusive access before a write
    ensures only one copy in all caches
    enforces write serialization
  write update (aka write broadcast)
    track if the block is shared
    if shared, broadcast on a write
  performance differences
    1.) multiple writes to the same word
        write invalidate requires one invalidate
        write update requires multiple broadcasts
    2.) writes to different words of the same cache block
        write invalidate requires one invalidate per block
        write update requires a broadcast per word written
        (can try to merge word writes into a single line in the write buffer (chapter 5))
    3.) write on one processor, read on another
        write invalidate requires the reader to reload the shared block (even if it was previously in cache)
        write update has the updated values available immediately (if the block was in cache)
  performance summary
    bus and memory bandwidth are usually the limiting factor in snooping scalability
    invalidation generally generates less bus and memory traffic
    update also does not play well with memory consistency models
      which makes 3.) less valuable than it seems
    update can be fine for small CPU counts (2-4)
      chip multiprocessors (CMP)
  focus on invalidate as the common case for the rest of the chapter
- Basic Implementation Techniques
  write invalidate
    acquire bus access
    broadcast the invalidation address
    snooping caches check for the address, invalidate if present
  write serialization
    one processor must obtain bus access before the other
    a write to shared data cannot complete until bus access is acquired
  locating a block on a cache miss
    write through: look in memory, although write buffers complicate this
    write back: could be in other caches
      caches snoop loads to memory
      if they see a matching dirty block, cancel the memory request and return the result
    focus on write-back as the common case
  handling writes in write-back
    add a shared bit to the cache block
    only need to invalidate if the block is shared
  contention for cache tags
    don't want to block the CPU from accessing the cache
    - duplicate tags with a second port for snoop access
    - an inclusive L2 provides a similar effect
      but need to keep L1/L2 tags in sync
      need to get data from L1 on a snoop hit
- An Example Protocol
  simple protocol in Figure 6.10
  extensions
    ownership or upgrade misses
      instead of treating a write hit as a write miss, upgrade the read to a write; we already have the data
      the "owner" has an exclusive copy to write
    shared vs shared+private
      shared+private avoids bus traffic on a write hit
    cache-to-cache transfers
      avoid going to memory for blocks in other caches
  CPU state machine
    similar to a plain cache, just need to add a bus action to invalidate on a write miss
  Bus state machine
    only needed for SMP coherence
  protocol property
    shared blocks always mirror memory values
  atomicity problems
    the protocol assumes atomic transactions
    not compatible with modern split-transaction buses
    non-atomic actions can cause deadlock
  small scale implementation
    Pentium III Xeon and Pentium 4 Xeon
    can connect two without any glue logic
    two extra chips for up to a 4-way system
6.4 Performance of Symmetric Shared-Memory Multiprocessors 560
  sharing
    true sharing
      semantic sharing of data between nodes
    false sharing
      unrelated items in the same cache block (see the sketch below)
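  a C sketch of false sharing: two counters that are never logically shared but land in the same cache block, so each thread's writes invalidate the other's copy; the 64-byte block size and the padding fix are assumptions for illustration (C11 threads):

    #include <threads.h>

    #define BLOCK 64                       /* assumed coherence block size */

    struct counters_false {                /* a and b share one block      */
        long a;
        long b;
    };

    struct counters_padded {               /* padding puts them in separate blocks */
        long a;
        char pad[BLOCK - sizeof(long)];
        long b;
    };

    static struct counters_padded ctr;

    static int worker_a(void *arg) { for (int i = 0; i < 1000000; i++) ctr.a++; return 0; }
    static int worker_b(void *arg) { for (int i = 0; i < 1000000; i++) ctr.b++; return 0; }

    int main(void)
    {
        thrd_t ta, tb;
        thrd_create(&ta, worker_a, 0);     /* each thread writes only its own counter */
        thrd_create(&tb, worker_b, 0);
        thrd_join(ta, 0);
        thrd_join(tb, 0);
        return 0;
    }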
- Performance Measurements of the Commercial Workload (Barroso paper)
  Figure 6.14: increasing cache size helps :)
  Figure 6.15: especially the I-cache
  Figure 6.16: true sharing is a scaling issue
  Figure 6.17: increasing block size helps, even up to 256 bytes
    mostly helps true sharing misses (good shared data locality)
    does not help instruction misses (poor instruction locality)
- Performance of the Multiprogramming and OS Workload
  Figure 6.18: higher miss rate for kernel than user
    also drops more slowly for kernel vs user as cache size is increased
    kernel issues
      has to initialize (zero) user pages
      has true sharing
    user processes are not parallel
      only suffer certain misses on process migration
  Figure 6.19: increasing cache size decreases capacity misses for the kernel :)
  Figure 6.20: increasing block size has a significant impact on kernel misses...
    (remember all those zeroed pages?)
  Figure 6.21 confirms that compulsory misses are the reason for the 6.20 improvement
  Figure 6.22: however, increasing block size increases memory traffic
    bytes/reference grows quickly for kernel traffic
- Performance of the Scientific/Technical Workload
  Figure 6.23: miss rate breakdown as processor count increases
    FFT - capacity misses become coherence misses
    LU - flat
    Barnes - capacity becomes coherence, with coherence dominating
    Ocean - capacity increases, then decreases, with flat coherence
  Figure 6.24: miss rate breakdown as cache size is increased
    FFT - slight drops in capacity and coherence
    LU - flat
    Barnes - capacity to zero, coherence scaling
    Ocean - drop in capacity, coherence flat
    working set effect (plateaus)
      FFT isn't helped much because the primary working set is already in cache
      need to have both matrices fit before a noticeable improvement
  Figure 6.25: miss rate breakdown as block size is increased
    FFT - both capacity and coherence improve
    LU - both capacity and coherence improve
    Barnes - mostly coherence misses; miss rate actually worsens due to false sharing (but still less than 1%)
    Ocean - both capacity and coherence improve
  Figure 6.26: bus traffic for figure 6.25 as block size increases
    FFT - getting worse (high miss rate)
    LU - low and flat
    Barnes - low and flat
    Ocean - getting worse (high miss rate)
- Summary of Snooping Cache Schemes
  commercial
    very large off-chip caches (4+MB) with large block sizes (64-128)
  multiprogrammed
    user time similar to the best scientific/technical
    kernel time similar to the worst scientific/technical
  scientific/technical
    these highly tuned apps have a reasonable component of coherence misses
    untuned apps may have a lot more (false sharing, etc.)
6.5 Distributed Shared-Memory Architectures 576
- Scalable shared memory
  one approach: non-coherent memory
    Cray T3D/E
    shared data marked uncacheable
    the application can copy shared data to private to build its own caches
      and be responsible for its own coherence and consistency
    problems
      1.) poor compiler support for managing such coherence
          large granularity is too conservative
      2.) no small-block spatial locality
          no nearly free cache hits for approximately the cost of a word access
      3.) prefetching not as effective
          want to grab small-granularity cache blocks
          want to keep them coherent
  snooping issues
    even if hardware buses scale to more devices,
    snooping cache coherency protocols don't scale
  the directory alternative
    basic approaches scale up to approximately 200 processors
      additional "scalable directory" work needed beyond that
    no central directory
      the directory is distributed among the node memories
    key property
      the sharing status of a block is known in a single location
      avoids broadcast
- Directory-Based Cache Coherence Protocols: The Basics
  two primary operations
    read miss
    write to a shared, clean block
    (a write miss is a combination of the two)
  cache block states
    shared - one or more processors and memory have an up-to-date copy
    uncached - no processor has a copy
    exclusive - the owner has written the block, memory is out-of-date
  sharers vector (see the sketch below)
    bit vector listing sharers in the shared state (and the owner in exclusive)
  local caches mirror the directory state for the blocks they contain
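  a C sketch of a per-block directory entry with the three states above plus a sharers bit vector, and the home-node mapping from a statically partitioned physical address space; the field widths, 64-node limit, and per-node memory size are assumptions for illustration:

    #include <stdint.h>

    enum dir_state { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE };

    struct dir_entry {
        enum dir_state state;
        uint64_t       sharers;       /* bit i set => node i has a copy
                                         (the owner, if state is exclusive) */
    };

    #define BLOCK_BYTES     64u
    #define BLOCKS_PER_NODE (1u << 24)     /* assumed memory per node */

    /* home node: which node's memory (and directory) holds this block */
    static unsigned home_node(uint64_t paddr)
    {
        return (unsigned)((paddr / BLOCK_BYTES) / BLOCKS_PER_NODE);
    }

    /* a read miss arriving at the directory: add the requester as a sharer
     * (a real protocol would first fetch the dirty copy back if exclusive) */
    static void dir_read_miss(struct dir_entry *e, unsigned requester)
    {
        e->sharers |= 1ull << requester;
        if (e->state == DIR_UNCACHED) e->state = DIR_SHARED;
    }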
  comparison to snooping
    identical states and transitions in the cache state machine, with slightly different actions
    the interconnect is not a bus
      1.) not a source of arbitration
      2.) message oriented, many messages need responses
  nodes
    local node - where the request originates
    home node - location of the directory entry for a specific cache block
    remote nodes - other involved parties, such as sharers or the owner
  mapping a block to its home node
    statically partitioned physical address space
  example assumes an ordered interconnect
- An Example Directory Protocol
  the directory entry has a global view of all sharers
    unlike snoopy, where no one had a global view
  sharers vector
    up to 64 nodes, usually a simple bit vector
    beyond 64, a "scalable directory" approach is needed
      switch from a dense to a sparse representation
      if more than "n" sharers, only track clusters of sharers
  extensions
    forwarding
      ship data directly from the remote node to the new local node, not via the home node
6.6 Performance of Distributed Shared-Memory Multiprocessors 584
  location of shared data depends on
    initial allocation
    sharing patterns
  local versus remote misses
    break them down separately since they have significantly different miss penalties
  Figure 6.31: miss rate as processor count is increased
    FFT - flat
    LU - flat
    Barnes - flat
    Ocean - rises at 64 cpus because of coherence misses (data set too small...)
  Figure 6.32: miss rate as cache size is increased
    FFT - remote drops, local constant
    LU - drops
    Barnes - drops
    Ocean - remote drops, local constant
  Figure 6.33: miss rate as block size is increased
    FFT - drops
    LU - drops
    Barnes - increases, but still less than 1%
    Ocean - drops
  Figure 6.34: bytes/reference (tracking 6.33)
    FFT - increasing
    LU - flat
    Barnes - increasing
    Ocean - increasing
  Figure 6.36: average cycles per reference (similar to AMAT)
    FFT - decreasing
    LU - flat
    Barnes - flat
    Ocean - 8,16,32,64 CPUs => 3, 2.5, 2.75, *5.5*
      (matching figure 6.31, higher miss rate => higher memory cost)
  issues in real directory protocols
    nonatomicity of operations
    finite buffering increases the possibilities of deadlock
6.7 Synchronization 590
  high-level synchronization is built with user-level software supported by hardware primitives
  small-scale and low-contention cases
    an uninterruptible instruction (or instruction pair) that atomically reads and changes a value
    spin locks are easy to build with coherence-based instructions
- Basic Hardware Primitives
  need to be able to atomically read and update a value, testing if the operation was successful
  - atomic exchange
    unlocked = 0, locked = 1
    lock(): atomically exchange 1 for the lock value, test the old value
  - test-and-set
    very similar to atomic exchange
  - fetch-and-increment
    same basic code for lock()
  - load linked / store conditional (aka load locked)
    two instructions instead of one
    easier hardware implementation (especially on split-transaction buses?)
    can be used to implement the other "primitives" such as fetch-and-increment
    issues
      fails if another store to the address occurs between the ll and the sc
        keep the sequence short to prevent livelock
      fails on any processor exception
        generally only safe to do register operations between ll/sc
        other memory references could fault
- Implementing Locks Using Coherence
  spin locks: busy wait to acquire the lock
  with no cache coherence
    LOCK:   daddui r2,r0,#1   ; load immediate
    lockit: exch   r2,0(r1)   ; atomic exchange
            bnez   r2,lockit  ; already locked?
    UNLOCK: st     r0,0(r1)   ; just write 0
    keep lock variables in memory
      atomic instructions operate on memory directly
      (in fact they may be implemented there for better scalability even on cache-based systems)
      (SGI Origin paper: moved fetch-and-op to run directly at the memory system)
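  a C11-atomics sketch of the exchange-based spin lock above, together with the "spin on a local cached copy" refinement discussed next (test-and-test-and-set); a sketch for illustration, not the book's code:

    #include <stdatomic.h>

    typedef atomic_int spinlock_t;         /* 0 = unlocked, 1 = locked */

    static void lock_simple(spinlock_t *l)
    {
        /* every iteration is a write (exchange), so every spinner
           generates bus/interconnect traffic */
        while (atomic_exchange(l, 1) != 0)
            ;
    }

    static void lock_ttas(spinlock_t *l)
    {
        for (;;) {
            while (atomic_load(l) != 0)        /* spin reading the cached copy */
                ;
            if (atomic_exchange(l, 1) == 0)    /* try the real exchange once   */
                return;
        }
    }

    static void unlock(spinlock_t *l)
    {
        atomic_store(l, 0);                    /* just write 0 */
    }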
    benefits with cache coherence
      1.) spin on a local cached copy
      2.) locality in lock access
    #1 requires that we spin with read-only operations, no writes
      LOCK:
      lockit: ld     r2,0(r1)   ; read lock
              bnez   r2,lockit  ; spin if bad value
              daddui r2,r0,#1   ; same as before
              exch   r2,0(r1)   ; same as before
              bnez   r2,lockit  ; same as before
    the ll/sc version already splits the read from the write
      easy to skip the sc unless the load returned the expected value
- Synchronization Performance Challenges
  spin lock contention problems
    the directory or bus acts as a point of serialization
    interconnect traffic is an issue in addition to contention
    bus "fairness" delays quick release
  Barrier Synchronization
    a simple barrier has a problem with reuse
      a sense-reversing barrier solves this by alternating the test on each reuse
    barrier performance
      lots of little lock operations, therefore lots of traffic
      contention on the barrier lock is not uncommon
- Synchronization Mechanisms for Larger-Scale Multiprocessors
  GOAL: a synchronization mechanism with:
    1.) low latency in uncontended cases
    2.) minimal serialization under contention
  Software Implementations
    spin lock issues
      contention overhead when many processes are contending
      one solution is exponential backoff
      queueing locks
    barrier issues
      contention during both the gather and release stages
      one solution is a combining tree
        a 4-way tree structure seems to be a good way to avoid contention on a single lock
      MPPs: hardware support for barriers (T3D and CM-5)
  Hardware Primitives
    queueing locks
      a spin lock suffers from the notifyAll() problem of waking up all waiters when only one can take the lock
      instead use notify() to just wake up one waiter from the "queue"
      bus-based: just implement in software
      directory-based
        the directory tracks waiters of locks using a sharing-vector-like structure
        need to identify access to the lock
        need to identify lock release
        need to cope with a context switch (and possible migration) of the process holding the lock
    barrier performance
      fetch-and-increment
6.8 Models of Memory Consistency: An Introduction 605
  trying to run an if block on at most 1 processor (C11 sketch below):
    P1:  A = 0;               P2:  B = 0;
         .....                     .....
         A = 1;                    B = 1;
    L1:  if (B == 0) ...      L2:  if (A == 0) ...
  however, if the "= 1;" assignments are delayed relative to program order, both ifs can run
  "sequential consistency"
    maintain program order
    no surprises
    SC reduces potential performance
    option 1.) try to hide latency to preserve performance
    option 2.) relaxed consistency models
- The Programmer's View
  sequential consistency
    simple to understand
    but limits hardware and compiler reordering techniques
  synchronized programs (aka data-race-free programs)
    avoid "data races"
    all accesses to shared data are ordered by synchronization primitives
      one synchronization primitive after a write of shared data
      one synchronization primitive before a read of shared data
    most programs are already synchronized
      using sequential consistency directly can be tricky
      building your own synchronization primitives can be error prone
      most people use library routines for synchronization primitives
        the library can be architecture specific, dealing with SC or relaxed consistency as needed
- Relaxed Consistency Models: The Basics (Adve paper)
  relaxed consistency allows reordering of reads and writes
    1.) relax W->R ordering
        TOTAL STORE ORDERING or PROCESSOR CONSISTENCY
        retains order among writes
        most sequentially consistent programs work with TSO/PC
    2.) relax W->W ordering
        PARTIAL STORE ORDER
    3.) relax R->W and R->R ordering
        WEAK ORDERING, ALPHA, POWERPC, RELEASE CONSISTENCY
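  the two-processor example from 6.8 above written with C11 atomics; with memory_order_relaxed the stores may be reordered past the loads (as on a weakly ordered machine) and both ifs can fire, while the default memory_order_seq_cst restores the sequentially consistent outcome; a sketch for illustration, not code from the book:

    #include <stdatomic.h>
    #include <threads.h>

    static atomic_int A, B;
    static atomic_int both_ran;

    static int p1(void *arg)
    {
        atomic_store_explicit(&A, 1, memory_order_relaxed);        /* A = 1; */
        if (atomic_load_explicit(&B, memory_order_relaxed) == 0)   /* L1     */
            atomic_fetch_add(&both_ran, 1);
        return 0;
    }

    static int p2(void *arg)
    {
        atomic_store_explicit(&B, 1, memory_order_relaxed);        /* B = 1; */
        if (atomic_load_explicit(&A, memory_order_relaxed) == 0)   /* L2     */
            atomic_fetch_add(&both_ran, 1);
        return 0;
    }

    int main(void)
    {
        thrd_t t1, t2;
        thrd_create(&t1, p1, 0);
        thrd_create(&t2, p2, 0);
        thrd_join(t1, 0);
        thrd_join(t2, 0);
        /* both_ran == 2 is possible with relaxed ordering,
           impossible under sequential consistency */
        return 0;
    }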
- Final Remarks on Consistency Models
  1.) write synchronized programs using standard synchronization with architecture-specific implementations
  2.) speculation allows SC or PC to be implemented cheaply
      see pages 618-619 below for more
6.9 Multithreading: Exploiting Thread-Level Parallelism within a Processor 608
- multithreading
  allow multiple threads to share the functional units
  multiple sets of thread state
    register file
    PC
    separate page table
  fine-grained multithreading
    switch threads on each instruction
    round robin, skipping any stalled threads
    pro: hides the latency of stalls
    con: slows down non-stalled threads
  coarse-grained multithreading
    switch threads on "costly" stalls (L2 misses)
    pro: does not slow down non-stalled threads
    con: doesn't cope with short pipeline stalls
- Simultaneous Multithreading: Converting Thread-Level Parallelism into Instruction-Level Parallelism
  run multiple threads on different functional units at the same time
    1.) superscalar with no multithreading support:      "AA" "AX" "XX" "XX" "AA"
    2.) superscalar with coarse-grained multithreading:  "AA" "AX" "BB" "BX" "AA"
    3.) superscalar with fine-grained multithreading:    "AA" "BB" "AX" "BX" "BB" "AA"
    4.) superscalar with simultaneous multithreading:    "AB" "AB" "AB" "BB" "AA"
  hardware
    per-thread renaming table
    separate PCs
    multi-thread commit: logically separate reorder buffer
  Design Challenges in SMT Processors
    preserving single-thread performance
      a "preferred" thread gets priority
      not perfect
        the preferred thread is run most of the time
        less mix of other threads to choose instructions from
        when the preferred thread stalls, not many options are available
    fetch issues
      single-thread performance needs to prefetch as much as possible
        space being consumed by other threads
      mispredicted branches need to fetch immediately
        could be fetching for another thread
    generally the preferred-thread approach is sufficient, it's just not perfect
      1.) overhead is small and simple choices (preferred thread) are good enough
      2.) superscalar efficiency (utilization) is low, so there are enough extra resources
    notes
      - highly attractive for the throughput-oriented server market (Niagara)
      - can be disabled for single-threaded performance (Intel Hyperthreading)
  Potential Performance Advantages from SMT
    - 8 contexts
      apps
        multiprogrammed SPEC CPU INT
        Apache
        Barroso (OLTP and DSS)
      speedups of 1.7 to 4.2
    - utilization much higher
      functional unit utilization much higher
      fetch utilization slightly lower
      data cache hit rate
    - commercial
      Alpha 21464 with 4 SMT threads (canceled)
      Pentium 4 Xeon with 2 SMT threads
        30% performance improvement for server applications
        not as good because of segregated instruction issue
          need to share everything for best performance
6.10 Crosscutting Issues 615
- Memory System Issues
  Inclusion and Its Implementation
    "multilevel inclusion" (aka "subset property")
    large L2 blocks contain multiple L1 blocks
      on an L2 miss, we potentially evict multiple L1 blocks
    most processors direct the L2 to evict L1 blocks
      otherwise L1 has to snoop to see changes to blocks it contains that are not in L2
  Nonblocking Caches and Latency Hiding
    nonblocking caches are important for multiprocessors:
      1.) hide long multiprocessor memory latencies
      2.) allow weak and relaxed consistency models
      3.) are critical for prefetching
    prefetching and coherence
      "nonbinding" prefetching brings values into the cache, kept coherent
      "binding" prefetching brings a value into a register, out of view of the coherence model
      nonbinding prefetch is critical for multiprocessor cache coherence
    nonbinding prefetch implementation
      1.) the local node needs to keep track of multiple outstanding requests
      2.) need to check for multiple requests for the same block
      3.) cannot assume stall on miss, increasing cache controller complexity (both snoop and directory)
  Compiler Optimizations and the Consistency Model
    sequential consistency prevents reordering of loads and stores to shared data
      prevents register allocation because it can cause load/store reordering
    synchronized program model
      the programmer tells the compiler the synchronization points explicitly (or implicitly in HPF)
  Using Speculation to Hide Latency in Strict Consistency Models
    use speculation to provide stronger consistency without a big performance impact
      a write invalidate causes the speculation to fail
      the common case is no invalidation, no extra cost
    Hill* 1998 paper
      1.) SC/PC with speculation is fast enough relative to relaxed models
      2.) very little hardware cost
      3.) easier to program
    MIPS R10000 (mid-1990s) followed this approach
    downside
      compilers with the ability to perform alias analysis might want access to relaxed models
- Using Virtual Memory to Build Shared Memory
  let the OS/VM system implement coherence instead of hardware
  "distributed virtual memory" or "shared virtual memory"
    similar to a directory, just with very, very large blocks
    each page has a home node
  migration and replication problems
    false sharing is a major issue with large blocks
      fixing it leads to sparsely used pages
    significant software overhead
  a substitute for loosely coupled message passing
  future?
    a better implementation could lower OS overhead
    hardware support could lower OS overhead
    the compiler could help with false sharing
    multiple (smaller) page sizes could help with page usage and false sharing
- Performance Measurement of Parallel Processors
  relative to uniprocessor code and hardware
  not just performance for a single configuration but scalability
    measure for a fixed problem size
    measure by scaling the problem size
      memory-constrained scaling: keep memory/processor constant
      time-constrained scaling: keep total execution time constant
    often need to change the algorithm with the problem size
      example: number of iterations for convergence
    see the "linear speedup" pitfall below
6.11 Putting It All Together: Sun's Wildfire Prototype 622
- Intro
  hybrid SMP/DSM
  trying to keep memory access as uniform as possible
    more important for commercial workloads with less predictable access patterns
- The Wildfire Architecture
  2-4 nodes
    each node is an E6000 with 14 boards with 2 processors
    112 total processors (4*14*2)
  upper 2 bits of the physical address partition memory among the 2-4 nodes
  simple directory
    the directory is a local cache backed by remote main memory
    remote requests take 4-5 cycles (need to check the local cache first)
- Using Page Replication and Migration to Reduce NUMA Effects
  (similar to the SGI Origin 2000 ccNUMA)
  "coherent memory replication" CMR
    a version of Simple-COMA
    "cache-only memory architecture" COMA
  counters track misses to remote pages
    replicate or migrate based on sharing patterns
  still HW cache coherent, page movement is just an optimization to help
  need a node map of "local" addresses to "global" addresses for moved pages
- Performance of Wildfire
  Basic Performance Measures: Latency and Bandwidth
    Wildfire prototype
      system up to 112 processors
      GOAL: hide NUMA for commercial workloads
    SGI Origin 2000
      system up to 2048 nodes
      GOAL: scalable bandwidth for scientific/technical workloads
    latency measurement summary
      Wildfire's large nodes increase the amount of "local" memory, but have typically higher "remote" memory costs
    bandwidth measurement summary
      Origin generally has greater bandwidth, 3x/processor for local and bisection
- Application Performance of Wildfire
  Performance on the OLTP Workload
    numbers validate the importance of:
      CMR replication and migration
      a locality-aware scheduler
      as many processors per node as possible
      intelligent data placement
    relative performance proportional to local access
    unfortunately they only evaluate a 16-processor system, where a single E6000 is better than Wildfire
  Performance of Wildfire on a Scientific Application
    24 cpus: 3-way (3*8) Wildfire beats single-node E10000 (1*24) and E6000 (1*24)
    36 cpus: 4-way (4*9) Wildfire beats 2-way (2*18) Wildfire and E10000 (1*36)
      super-linear speedup for 4-way vs 24 cpus
      2-way suffers from bus congestion
    CMR is powerful
      can start with the data all on one node
      after 2 or 3 minutes the data has moved to where it should be
      costs
        migration is very low cost
        replication is harder, needs reverse memory maps
- Concluding Remarks on Wildfire
  the commercial market favors fat nodes
    1.) more people want to buy a box with a dozen processors than a 1000-cpu beast
    2.) irregular memory access favors UMA over NUMA
  cons
    lower scalability and higher remote access costs (latency and bandwidth)
6.12 Another View: Multithreading in a Commercial Server 635
  IBM RS64 III (Pulsar)
    PowerPC processor
    RS/6000 and AS/400 (where it's known as the A50)
  goals
    1.) reuse the statically scheduled Northstar processor
    2.) small performance penalty in silicon and clock
    3.) single-threaded performance must not suffer
  implementation
    1.) two threads (#2 above)
    2.) coarse-grained multithreading (#1 and #3)
  details
    < 10% area impact for the duplicate register file and PC
    new register to limit the time spent on one thread
    significant improvement in server application throughput
6.13 Another View: Embedded Multiprocessors 636
  MXP processor for VoIP systems
    1.) serial voice stream interface
    2.) packet routing and channel lookup
    3.) ethernet interface including the MAC layer
    4.) 4x MIPS32 R4k processors with private caches
  the 4 MIPS cores handle
    - QoS
    - echo cancellation
    - simple compression
    - packet encoding
  embedded is good for multiprocessing
    1.) binary compatibility not as critical
    2.) natural parallelism in set-top box, network, game systems
6.14 Fallacies and Pitfalls 637
- Pitfall: Measuring performance of multiprocessors by linear speedup versus execution time.
  relative speedup vs true speedup
    machine-relative vs best-case uniprocessor
  superlinear speedup is suspicious unless due to cache effects
- Fallacy: Amdahl's Law doesn't apply to parallel computers. (see the sketch below)
- Fallacy: Linear speedups are needed to make multiprocessors cost-effective.
- Fallacy: Multiprocessors are "free."
  glueless processors are nice but still need a better memory controller, etc.
  puts more demands on other links in the system
- Fallacy: Scalability is almost free.
  Hardware
    an AlphaServer is much cheaper price/performance than a Cray
    in the Sun Enterprise line, interconnect cost grows faster than linear
  Software
    much more work to create parallel versions of code
    load balance, locality, contention, serial sections of code
- Pitfall: Not developing the software to take advantage of, or optimize for, a multiprocessor architecture.
  SPARCCenter: SunOS was not ready for a NUMA machine
    no page placement, etc.
  IRIX: a single lock on the page table prevents scaling of even a multiprogrammed workload
    need to move to fine-grained locking
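  a quick numeric illustration of why Amdahl's Law still applies to parallel computers: speedup = 1 / ((1 - f) + f/p) for parallel fraction f on p processors; the fractions below are assumed values for illustration only:

    #include <stdio.h>

    static double speedup(double f, int p) { return 1.0 / ((1.0 - f) + f / p); }

    int main(void)
    {
        int procs[] = { 8, 64, 1024 };
        for (int i = 0; i < 3; i++)
            printf("f=0.99, p=%4d -> speedup %.1f\n", procs[i], speedup(0.99, procs[i]));
        /* even with 99% parallel code, 1024 processors give only ~91x,
           so linear speedup requires essentially no serial fraction */
        return 0;
    }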
- Pitfall: Neglecting data distribution in a distributed shared-memory multiprocessor.
  Ocean would be 2x slower with round-robin versus tiled memory partitioning
6.15 Concluding Remarks 643
  - the future is bright
  - better understanding of software via scientific/technical and commercial workloads
  - multiprocessors and clusters are a commonly accepted approach for scaling performance
  - multiprocessors great for multiprogrammed workloads (OS issues fixed)
  - multiprocessors effective for certain commercial workloads (although some cluster)
  - chip multiprocessors:
    1.) natural parallelism in embedded
    2.) more effective use of more transistors in desktop/server
- The Future of MPP Architecture
  4 alternatives
    1.) scale up large-scale multiprocessors using proprietary interconnects
        SGI Origin and Cray T3E
        1.) not cost effective at small scale
        2.) programming model may be more complex
    2.) clusters of midrange multiprocessors with proprietary and standard interconnects
        Wildfire (HP, Sun, SGI)
        sometimes need to switch to message passing
    3.) clusters of uniprocessors with a custom interconnect
        IBM SP-2
        the custom interconnect is costly
    4.) clusters of all off-the-shelf components
        PC or workstation nodes
        ATM or GigE
        standard OS and tools
        message passing
- The Future of Multiprocessor Architecture
  ILP dead
    performance/area, performance/connection, performance/watt all dropping
  chip multiprocessors?
    Power4 (2x Power3 cores)
- Evolution versus Revolution and the Challenges to Paradigm Shifts in the Computer Industry
  Level 1: microprogramming, pipelining, cache, time-sharing, MIMD
  Level 2: virtual memory, vector instructions
  Level 3: there is no level 3
  Level 4: RISC
  Level 5: massive SIMD, latency MIMD, special purpose
  How could multiprocessor adoption accelerate?
    1.) no other choice
    2.) hw/sw evolution
    (today #1 and #2 are in play)
6.16 Historical Perspective and References 649
- SIMD Computers: Several Attempts, No Lasting Successes
  Illiac IV 1972
    disaster of cost and performance
  Connection Machine 1985
    65k 1-bit computers
  need SISD to do control flow for SIMD
    also want to mask operation of individual SIMD nodes
  data parallelism great for loop parallelism
  custom processors cannot leverage the microprocessor industry
    often have very bad single-CPU performance
  array processor special cases good for image and signal processing
- Other Early Experiments
  the first Eckert-Mauchly computer had duplicate units for availability
  CMU 1970s
    C.mmp
      16x PDP-11s
      crossbar to 16 memory units
      shared memory
      focus on OS
    Cm*
      cluster with NUMA distributed memory
      no caches
- Great Debates in Parallel Processing
  TFLOP machine by 1995?
    well, 1999, and it cost the estimated $100 million
    not exactly scalable performance
  few (<100) or many (>1000) data streams
    few wins, MPP dead...
- More Recent Advances and Developments
  The Development of Bus-Based Coherent Multiprocessors
    Bell 1985: multiprocessor vs multicomputer
    Goodman: snooping cache coherence
  Towards Large-Scale Multiprocessors
    Message passing
      Caltech Cosmic Cube
        advances in routing and interconnect, lowering cost
      Intel iPSC 860 based on these ideas
      Intel Paragon (also i860)
        lower dimensionality, more individual links
      CM-5
        off-the-shelf processor and fat-tree interconnect
        direct user access to communication to lower latency
    Shared memory
      IBM RP3, NYU Ultracomputer, UIUC Cedar, BBN Butterfly and Monarch
        NUMA without cache coherence
        RP3 and Ultracomputer: synchronization research such as fetch-and-op
        interconnect more costly than the processing nodes
      DSM (before snooping): IBM 3081
    Distributed Directories
      Agarwal* 1988
      Stanford DASH
        first operational ccNUMA
        nodes were "plump" 4-way SMPs
      KSR-1
        first commercial COMA approach (memory as cache)
      Swedish Institute of Computer Science
        true COMA, no predefined nodes for memory
      COMA-F (Flat COMA)
        home location allocated
      Reactive NUMA
        Wood*: best of NUMA + S-COMA
      Convex Exemplar
        two level: 8-processor nodes, 32-module ring
      SGI Origin (Stanford DASH++)
        bit vector
          fine grained: 1 bit = 1 node = 2 procs
          coarse grained: 1 bit = 8 nodes = 16 procs
      MIT Alewife
        Agarwal*
        processor support for multithreading
        cooperative hw/sw handling of coherence
      Stanford FLASH
        programmable (MAGIC) processor for:
          coherence
          message passing
          synchronization
          performance measurement
  Developments in Synchronization and Consistency Models
    Lamport 1979: sequential consistency
    weak ordering, 1986
    Adve and Hill*, 1990: better definition of weak ordering, data-race-free
    Gharachorloo 1990: release consistency
- Other References
  Kai Li* Ivy: using VM to implement DSM
- Multithreading and Simultaneous Multithreading
  multithreading dates back to the TX-2
    MIT Lincoln Labs 1959
    one of the first transistor computers
    context switching to handle I/O
  CDC 6600 used it in the I/O processor
  HEP: fine-grained multithreading to hide cache and pipeline latencies
  SPARCLE in MIT Alewife
    Agarwal*
    coarse grained
  1990s
    1.) fine-grained for maximum performance (keep pipelines going) Horowitz*
    2.) need ILP+TLP to fill all units
        SMT came quickly after dynamically scheduled superscalars
        Eggers*: first realistic simulation and coined the "SMT" term
Exercises 665
Chapter 7 Storage Systems
7.1 Introduction 678
- Introduction
  neglected by CPU architects
    institutional prejudice: use of CPU time for measurement
    even the term "peripheral" is a bit pejorative
  common sense
    a computer is useless without I/O
    response time is what matters to users
- Does I/O Performance Matter?
  just run other compute while waiting for I/O?
    a throughput-only view ignores interactive users
    on single-user systems, there is nothing to do while waiting for I/O
    even in TPC and other throughput benchmarks, response time requirements matter
  I/O's revenge
    with great increases in CPU performance, Amdahl's law leaves I/O as the performance limit
- Does CPU Performance Matter?
  old view: need to keep the big expensive CPU busy (scientific/technical)
  new view: cheap CPUs, just enough to keep the I/O busy (commercial)
- Does Performance Matter?
  would you rather have a faster or more stable CPU?
    people tolerate machine crashes for faster performance
  would you rather have a faster or more stable disk?
    people DO NOT tolerate data loss
7.2 Types of Storage Devices 679
- Magnetic Disks
  cylinder, platters, tracks, sectors
  constant bit density
    more sectors on the outer tracks than the inner tracks
  arm, read/write head
  seek, seek time
    average seek time
      sum of the time for all possible seeks divided by the number of possible seeks
  rotational latency (aka rotational delay)
  transfer time
  read ahead
  disk controller
    controller time
    queuing delay
  (a worked access-time example appears below, after 7.3)
  disk arrays
    array controller
- The Future of Magnetic Disks
  areal density
  diameter
    smaller diameter means smaller mass
    saves power as well as physical volume
  access time
    DRAM latency 100,000x better
    DRAM bandwidth only 50x better
- Optical Disks
  optical compact disks (CDs)
  digital video discs (aka digital versatile discs) (DVDs)
- Magnetic Tapes
  disk vs tape
    - fixed rotating platters, limited area, sealed in the device
    - linear scan only, "unlimited" area, removable media
  helical scan tapes
    the tape moves at a slower speed
    the head records at an angle to the tape direction and moves faster
    20x-50x improvement in recording density
    VCRs, camcorders, ... brought costs down
  issues
    - tapes wear out, helical faster than longitudinal tapes
    - helical read/write heads wear out
    - rewind, eject, load times for all tapes
      spin-up times for helical
    - R&D costs higher for industry compatibility
  future
    - tapes no longer bigger than disks (2001)
    - price per gigabyte no longer less (2001: a 40GB disk and tape cost the same)
    - bandwidth falling behind, hurting recovery times for disks backed up to tape
    - future: backup over the network to remote disks...
- Automated Tape Libraries
  "nearline" tapes
    tape robots
    terabytes within ten seconds
  STC PowderHorn
    6000 tapes for 300TB
    10 tape readers
    450 tapes per hour
    1.1 kilowatts
    4+ tons
    can hook 16 together and have them pass tapes!
    10x failure rate for all those moving parts
- Flash Memory
  similar to EPROM or EEPROM
    larger block size than EEPROM to reduce die area for control logic
  compared to disks
    low power
    small sizes
    DRAM read times
  compared to DRAMs
    slow write times
      erase first, write new data second
  NOR vs NAND
    NOR based: erases faster, writes slower, 100,000-cycle wear out
    NAND based: erases slower, writes faster, 1,000,000-cycle wear out
  2001: 6x the cost of DRAM, 600x the cost of disk
7.3 Buses - Connecting I/O Devices to CPU/Memory 692
  bus
    communication bottleneck
    replaced by networks and switches
      storage area networks, next chapter
  CPU-memory buses versus I/O buses (versus mezzanine)
    CPU-memory
      short
      high speed
      proprietary
      fixed speed
      synchronous
    I/O buses
      longer
      slower
      standard
      devices vary in latency and bandwidth
      asynchronous
    mezzanine (a little of both)
      PCI
      short
      synchronous
      often bridges to slower I/O buses
  bus transactions
    read transactions and write transactions
    send the address and read/write the data
    reads often require waiting for data
    writes often just write and forget
- Bus Design Decisions
  decisions affecting cost/performance
    FEATURE                          FASTER                           CHEAPER
    separate address and data lines  separate                         multiplexed
    data width                       wider                            narrower
    transfer size                    multiple words                   single word
    bus masters                      multiple (requires arbitration)  single (no arbitration)
    split transaction                yes (request/reply packets,      no
                                     needs multiple bus masters)
    clocking                         synchronous                      asynchronous
  split transaction
    aka connect/disconnect, aka pipelined bus, aka pended bus, aka packet-switched bus
  synchronization
    skew limits bus length
    asynchronous tolerates varying-speed devices, but adds handshake overhead
- Bus Standards
  PDP-11 Unibus
  IBM PC-AT Bus (16-bit ISA)
  PCI
- Examples of Buses
  parallel: IDE/Ultra ATA, SCSI, PCI, PCI-X
  serial: I2C, 1-wire, RS-232, SPI (USB?)
  CPU-memory: HP HyperPlane Crossbar, IBM SP, Sun Gigaplane-XB
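  a worked example of average disk access time using the terms from the magnetic-disk discussion in 7.2 above (access = seek + rotational latency + transfer + controller overhead); the drive parameters are assumed values for illustration:

    #include <stdio.h>

    int main(void)
    {
        double seek_ms   = 6.0;                   /* assumed average seek     */
        double rpm       = 10000.0;
        double rot_ms    = 0.5 * 60000.0 / rpm;   /* half a rotation = 3.0 ms */
        double xfer_mb_s = 40.0;
        double sector_kb = 8.0;
        double xfer_ms   = sector_kb / 1024.0 / xfer_mb_s * 1000.0;  /* ~0.2 ms */
        double ctrl_ms   = 0.3;

        printf("average access time = %.2f ms\n",
               seek_ms + rot_ms + xfer_ms + ctrl_ms);   /* ~9.5 ms, dominated
                                                           by seek + rotation */
        return 0;
    }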
CPU-memory HP HyperPlane Crossbar IBM SP Sun Gigaplane-XB
- Interfacing Storage Devices to a CPU where? memory or cache? memory is typical how? memory-mapped is typical I/O instructions (Intel x86 and IBM/370) what? status and control DEC LP11 line printer status register done bit error bit data register wait for done, then write data *polling* interrupt-driven I/O (versus polling) DEC LP11 can interrupt CPU when done/error status changes interrupts can be higher overhead than polling hybrid clock interrupt specifies when to poll dynamic livelock paper: switch from interrupt to polling under high load
- Delegating I/O Responsibility from the CPU direct memory access (DMA) let I/O devices talk to and from memory themselves scatter/gather allows I/O from virtually contiguous addresses which are not physically contiguous allows I/O from multiple callers to be sent together I/O processors (aka channel controllers) I/O control blocks (little programs for the I/O processor) only one interrupt when control block is complete typically restricted programs usually do not change data, just transfer it embedded often has lots of I/O Au1000 MIPS chip 10 DMA channels 20 I/O device controllers all on chip
7.4 Reliability, Availability, and Dependability 702
what is failure programming mistakes design time or runtime alpha particle error if we don't access changed bit? error if caught by ECC? operator error
- Defining Failure "dependability" quality of delivered "service" observed "actual behavior" specified behavior specified by "service specification" "failure" when actual deviates from specified failure due to "error" in module "fault" is cause of error "fault" creates "latent error" latent becomes "effective error" after "error latency" error is manifestation of fault "in the system" failure is manifestation of error "in the service" alpha particle fault that causes latent error on read, we have an error if ECC corrects error on read, we have no error summary faults => latent errors errors => more errors if error affects service => failure service 1.) service accomplishment service delivered as specified 2.) service interruption service differs from specification 1 => 2 on failure 2 => 1 on "restorations" module reliability MTTF - time from 1=>2 MTTR - time from 2=>1 module availability = MTTF / (MTTF + MTTR) MTBF (mean time between failures) = MTTF + MTTR faults 1.) hardware faults - device failure 2.) design faults - software or hardware design issues 3.) operational faults - operation or maintenance human error 4.) environmental faults - fire, flood, power, sabotage!!! duration transient intermittent permanent reliability improvements valid construction error correction latent error processing effective error processing masking recovering error correction backward (undo) forward (redo) methods 1.) fault avoidance - by construction 2.) fault tolerance - by redundancy 3.) error removal - by verification 4.) error forecasting - by evaluation
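quick sketch of the availability arithmetic above (the MTTF/MTTR values are made up for illustration):

    def availability(mttf_hours, mttr_hours):
        # module availability = MTTF / (MTTF + MTTR)
        return mttf_hours / (mttf_hours + mttr_hours)

    mttf, mttr = 1_000_000, 24          # hypothetical: 1M-hour MTTF, 24-hour repair
    print(availability(mttf, mttr))     # 0.999976...
    print("MTBF =", mttf + mttr)        # MTBF = MTTF + MTTR = 1000024 hours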
7.5 RAID: Redundant Arrays of Inexpensive Disks 705
disk arrays striping MTTR/MTTF RAID Redundant Arrays of Inexpensive Disks Redundant Arrays of Independent Disks hot spares hot swapping
- No Redundancy (RAID 0)
- Mirroring (RAID 1) mirroring (aka shadowing) RAID 1+0 (aka RAID 10) striped mirrors 4 pairs of RAID 1 disks RAID 0+1 (aka RAID 01) mirrored stripes 2 sets of 4 RAID 0 disks
- (Memory-Style ECC (RAID 2))
- Bit-Interleaved Parity (RAID 3) protection group parity RAID 1 mirror is a special case of RAID 3 with 2 disks RAID 3 has to write all disks since bits are interleaved between disks
- Block-Interleaved Parity and Distributed Block-Interleaved Parity (RAID 4 and RAID 5) RAID 4 parity disk is bottleneck, all writes need to update it RAID 5 rotates location of parity block to prevent bottleneck
- P+Q Redundancy (RAID 6) tolerate two failures instead of parity's one
- RAID Summary higher throughput (MB/s and I/O/s) as well as higher availability size and power savings from small disks
7.6 Errors and Failures in Real Systems 710
1.) academics can't afford to build real systems 2.) industry does not want to disclose failure data
- Berkeley's Tertiary Disk before assumed SCSI data disks would be highest source of failure assumed passive objects like cables would be error free after 50% of ethernet switches died 28% of disk enclosure backplanes 23% of IDE disks a higher percentage of SCSI cables died than SCSI disks!
- Tandem hardware ever more reliable less maintenance fewer chips and other components (operator error and environmental factors underreported) software the ever growing problem
- VAX OS from 70% to 28% of failures from 1985 to 1993 operator error increasing problem (better data than Tandem for this category)
- FCC mandated error reporting with separate investigators from operators include measure of impact (number of customers, length, etc) results overload less and less of a problem, cheaper to add capacity as a percentage, human error ever growing component did not include vandalism (very small) or environment (very big)
7.7 I/O Performance Measures 716
I/O throughput (bandwidth) I/O latency (response time)
- Throughput versus Response Time transaction time entry time system response time think time transactions per hour ***people need less time to think if given a faster response!!! response time under one second shows non-linear productivity increase novice engineer with fast response time as productive as experienced engineer with slow response time major economic motivation for keeping response time low response time today launching applications slow due to disk I/O web surfing due to network delays
- Response Time versus Throughput in Benchmarks TPC 90% of transactions must be below specific time SPEC average below a specific time
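a toy calculation (my numbers) of why shaving response time raises transactions per hour, even before the "people think faster" effect:

    def transactions_per_hour(entry_s, response_s, think_s):
        # one transaction = entry time + system response time + think time
        return 3600.0 / (entry_s + response_s + think_s)

    for response in (3.0, 1.0, 0.3):    # hypothetical response times, 5 s entry, 10 s think
        print(response, round(transactions_per_hour(5.0, response, 10.0)))
        # 3.0 -> 200/hr, 1.0 -> 225/hr, 0.3 -> 235/hr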
7.8 A Little Queuing Theory 720
initial startup to steady state? Markov chains or simulation too much effort instead focus on steady state input rate == output rate interarrival times are exponentially distributed infinite population model (number of in service requests does not affect number of requesters) server starts one job right after another no queue length limit one server this is M/M/1 Markov / Markov / 1 server Markov => exponentially random request arrival rate Little's Law Mean number of tasks in system = Arrival rate * mean response time queue (aka waiting line) single server for now metrics Time[server] (1/mu) Time[queue] = Length[queue]*Time[server] + "average residual service time" (time left of existing job) = ... = Time[server] * (server utilization/(1 - server utilization)) Time[system] = Time[server] + Time[queue] Arrival rate (lambda) Length[server] = Arrival rate * Time[server] (Little's Law) Length[queue] = Arrival rate * Time[queue] (Little's Law) = ... = (server utilization^2) / (1 - server utilization) Length[system] = Length[server] + Length[queue] service rate = 1/Time[server] server utilization (rho) = Arrival rate * Time[server] must be between 0 and 1 in steady state aka traffic intensity random variables distribution histogram with buckets arithmetic mean time (amt) variance square of standard deviation squared coefficient of variance (unitless) C^2 = variance/(amt^2) C = standard-deviation/amt exponential distribution C=1 memoryless Poisson distribution Poisson process average residual service time = 1/2 * weighted mean time * (1+C^2) M/M/m N[servers] Utilization = (Arrival Rate * Time[server]) / N[servers] Length[queue] = Arrival rate * Time[queue] (Little's Law) Time[queue] = Time[server] * P[tasks >= N[servers]] / (N[servers] * (1 - Utilization)) that is, we need to factor in the probability that sometimes we have empty servers
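a minimal M/M/1 calculator for the formulas above (the arrival rate and service time below are invented inputs):

    def mm1(arrival_rate, service_time):
        rho = arrival_rate * service_time            # server utilization, must be < 1
        assert 0.0 <= rho < 1.0, "not in steady state"
        t_queue = service_time * rho / (1.0 - rho)   # Time[queue]
        t_system = service_time + t_queue            # Time[system] = response time
        l_queue = arrival_rate * t_queue             # Length[queue], Little's Law (= rho^2/(1-rho))
        l_system = arrival_rate * t_system           # Length[system], Little's Law
        return rho, t_queue, t_system, l_queue, l_system

    # hypothetical disk: 50 I/Os per second arriving, 10 ms average service time
    print(mm1(50.0, 0.010))   # (0.5, 0.01, 0.02, 0.5, 1.0): 10 ms queued + 10 ms served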
7.9 Benchmarks of Storage Performance and Availability 731
queueing theory is nice, but real numbers are meaningful
- Transaction Processing Benchmarks transaction processing (TP) (aka online transaction processing (OLTP)) I/O rate (disk access/second) not data rate (bytes/second) banking and airline reservation systems TPC (Transaction Processing Council) first benchmark was anonymous! TPC-C order entry of wholesale supplier entering orders, delivering orders, recording payment, checking order status, stock level SPECjbb... tpmC (transactions per minute) TPC reporting - price is included with benchmark results - the data set generally must scale in size as the throughput increases - the benchmark results are audited - throughput is the performance metric, but response times are limited - an independent organization maintains the benchmarks
- SPEC System-Level File Server (SFS) and Web Benchmarks SFS is for NFS scales data with throughput like TPC also limits average response time no price/performance information retired because of bugs! SPECWeb load increased until response time suffers significantly response sizes 35% < 1k 50% 1-10k 14% 10-100k 1% 100k-1m content static pages rotating advertisements dynamic pages user registration linux crushing windows!
- Examples of Benchmarks of Dependability and Availability TPC-C requires tolerating a disk failure => usually RAID study of injecting disk faults into SPECweb system performance impact linux auto correct, rebuilds slow, no impact on performance during rebuild solaris auto correct, rebuilds fast, major impact on performance during rebuild windows operator correct, rebuilds fast, major impact on performance during rebuild handling of transient faults linux too paranoid, gives up too easily on disks, consuming hot spares quickly solaris and windows not paranoid enough, ignore transient errors that cause performance degradation
7.10 Crosscutting Issues 737
- Unix anecdote early unix boxes had 16-bit I/O processors system code limited requests to 63kb to work well with these later controllers were limited to this request size, even if they were more capable...
- DMA and Virtual Memory issue: sequential virtual pages are not necessarily sequential physically scatter gather I/O issue: OS moving or paging out pages that are involved in DMA pinning page scatter/gather I/O can also avoid copying from user to kernel just to do DMA "virtual DMA" each page to be transferred needs relocation information in controller OS can update this information on page relocation
- Asynchronous I/O and Operating Systems synchronous I/O issue request and wait for it to be done asynchronous I/O do something else while waiting (just like tolerating cache miss delays in OOOe)
- Block Servers versus Filers logical units, logical volumes, physical volumes collections of disk blocks logical units configured with a certain RAID layout physical volumes OS interface to logical unit logical volumes combining multiple physical volumes into bigger volumes split or stripe Q: where is file defined? A1: server direct attached storage A2: disk subsystem network attached storage (NAS) access via network protocols such as NFS and CIFS "filer" is a NAS with only file interface NASs can also provide block interface for use by databases
- Caches Cause Problems for Operating Systems - Stale Data cache, memory, disk stale data keeping data updated by CPU and/or I/O in sync 1.) I/O system writes bad data because memory is not up-to-date - write through cache - write back cache with OS flush - DMA controller snoops second set of cache tags 2.) CPU reads bad data because cache is not up-to-date - flush from cache before I/O operation - mark region as non-cacheable - cache snoops bus, invalidates blocks
- Switches Replacing Buses active component costs dropping passive bus cost savings is less substantial IDE - from PATA to SATA PCI => infiniband?!? how about PCI Express
- Replication of Processors for Dependability IBM 390 ECC for caches and main memory... ECC for register file as well 14 processors including spares redundant instruction fetch/decode, execution units, L1 cache, register file all to check for errors retry inconsistent instructions to mask transient failures swap in backup processor in less than one second for non-transient failure
7.11 Designing an I/O System in Five Easy Pieces 741
tradeoffs 1.) cost 2.) variety of devices 3.) performance 4.) expandability 5.) dependability steps 1.) list types of devices or standard busses to support 2.) list physical requirements size, power, connectors, bus slots, expansion cabinets 3.) cost of each I/O device and controller 4.) list the reliability of each device 5.)
record cpu resource demands for each device - clock cycles for instructions to initiate I/O, handle interrupts, complete I/O - CPU clock stalls due to waiting for I/O to finish using memory, bus, or cache - CPU clock cycles to recover from I/O activity such as a cache flush 6.) list the memory and I/O bus demands of each I/O device 7.) assess performance and availability of different configurations using simulation (or approximate with queueing theory) reliability with independent failures and exponential distribution of time to failure availability = f(reliability, estimated MTTF, estimated MTTR) use cost, performance, and availability goals to evaluate
- First Example: Naive Design and Cost-Performance I/O system with two types of disks ignores dependability simplifying assumption: 100% utilization of I/O resources
- Second Example: Calculating MTTF of First Example poor availability of #1
- Third Example: Calculating Response Time of First Example impact of response time due to 100% utilization
- Fourth Example: More Realistic Design and Cost-Performance more realistic utilization to keep response time reasonable rules of thumb - no disk should be used more than 80% of the time - no disk arm should be seeking more than 60% of the time - no disk string should be utilized more than 40% SCSI string bandwidth low because extra 20% for command overhead - no I/O bus should be utilized more than 75%
- Fifth Example: Designing for Availability use RAID 5 to improve availability MTDL = mean time to data loss to improve it, increase MTTF or lower MTTR orthogonal RAID don't put all disks for one logical volume on one controller controller death makes all data unavailable instead use orthogonal approach each controller gets one block from each logical volume
7.12 Putting It All Together: EMC Symmetrix and Celerra 754
Symmetrix is directly attached Celerra is network attached front end to Symmetrix
- EMC Symmetrix 8000 stats disk 384 disks RAID-1 or RAID-S (variant of RAID-5) up to 28TB raw capacity with 73GB drives interconnect 4 buses with ECC each component on 2 buses for reliability controllers channel directors server interface SCSI, FC-AL (fiberchannel arbitrated loop), ESCON (IBM mainframe I/O) up to 16, 2 per card manage the disk caches disk cache memory 2-32GB 90-95% read hit rate for largest size acts as write buffer entire array has battery backup to avoid losing write buffer cache block size is 32kb disk directors up to 16, 2 per card each has 2 Ultra1 SCSI strings each string is connected to two directors for redundancy 12 SCSI disks per string * 2 strings per director * 16 disk directors = 384 disks director architecture (channel and disk) 2x PowerPC 750s each disk enhancements for priority queueing provided by hardware manufacturers at EMC's request :) CRC is tracked between disk/cache and cache/host finding latent errors cache is scrubbed (re-read) to correct latent errors with ECC frequently bad cache blocks are "fenced" off from future use disks also scrubbed during idle time fences off bad tracks with many bad sectors move parity XOR from controllers to disks moves operation from single location to N disks allows read/write without intervening seek mirror policy depending on access pattern: sequential pattern "interleaves" access between disk 0 and disk 1 random pattern uses "mixed" policy sending requests for the first "half" to disk 0, the second to disk 1 service processor laptop connected to directors via ethernet predicting potential failures errors are logged to service processor calls home to
EMC via telephone to report issues can be called by EMC to collect more data also for installation, upgrades, configuration
- EMC Celerra 500 "filer" for Symmetrix stats 14 slots for "data movers" data movers are just PC motherboards two SCSI buses to hook to Symmetrix channel directors 700MHz Pentium III network options ATM FDDI two GigE 8 100Mbit Ethernet DART real time OS each appears as autonomous file server control stations (similar to service processor) two for reliability same hardware as "data mover" OS is Linux instead of DART dependability - redundant fans, power, batteries, power cords - each data mover can reach all Symmetrix disks - each data mover has two ethernet cards to talk to each control station - each data mover has at least two client interfaces - data movers can act as standby spares - space for a redundant control station Symmetrix service processor also reports Celerra errors
- EMC Symmetrix and Celerra Performance and Availability focus on dependability, but also has leading performance 1.) cache fault different director processors discover fault over time after fault repair, I/O rate increases above steady state to catch up 2.) lock fault response time suffers because of rogue director holding lock nothing to do but wait and suffer through it
7.13 Another View: Sanyo VPC-SX500 Digital Camera 760
CCD -> 4mb frame buffer -> 8mb Flash (or 340mb IBM Microdrive) 19 fine or 31 normal quality images on 8mb flash CompactFlash PCMCIA-ATA card interface ATA interface makes it look like a disk allows easy interface to existing systems with ATA controllers allows easy yield improvements by treating problems as bad disk blocks SOC (system on a chip) lowers battery requirement from 4 AA to 2 AA batteries buses 10 bits from CCD 32 bits internally 16 bits to video d/2 16 bits to SDRAM 16 bits to smart media, serial, irda, pcmcia, audio, ...
7.14 Fallacies and Pitfalls 763
- Fallacy The rated mean time to failure of disks is 1,200,000 hours or almost 140 years, so disks practically never fail. This industry deception assumes replacement every 5 years at "life of disk" Looked at another way, about 5.5% of disks will fail in the five year period 1 in 20 chance it will be your disk :)
- Fallacy Components fail fast. Tertiary disk project many intermittent failures operator had to decide to "terminate" component
- Fallacy Computer systems achieve 99.999% availability ("five nines") as advertised excludes operator error, application faults, and environmental faults excludes scheduled downtime Microsoft and the five nines their own website was down for 22 hours in January 2001 needs 250 years to get back to five nines. well managed servers in 2001 get 99% to 99.9% availability
- Pitfall Where a function is implemented affects its reliability. software RAID culture of release and patch compatibility issues between versions hardware RAID more testing before initial release because of high cost to fix more independent of OS versions
- Fallacy Semiconductor memory will soon replace magnetic disks in desktop and server computer systems. first edition memory on course to wipe out tapes in between disk industry realized they needed to match DRAM growth current edition disk wiping out tapes
- Fallacy Since head-disk assemblies of disks are the same technology independent of disk interface, the disk interface matters little in price. 1.) SCSI controllers sold in lower volume => less competition 2.) SCSI disks often have higher RPM and lower seek 3.)
SCSI controllers sold in lower volume => higher cost because of "manufacturing learning curve"
- Fallacy The time of an average seek of a disk in a computer system is the time for a seek of one-third the number of cylinders 1.) seek time not linear with distance accelerate, decelerate, settle time also some waiting to control vibration 2.) average assumes no locality
7.15 Concluding Remarks 769
dependability reliability and availability maintainability
7.16 Historical Perspective and References 770
- Magnetic Storage tape invented to record sound, competitive in 1941 ENIAC 1947 used mag tape for digital data 1970s reel to reel for removable media 1980s IBM 3480 tapes 9840 cartridge in PowderHorn serpentine recording zigzag pattern (back and forth, reversing head each time) disk 1953 IBM Almaden (San Jose, CA) concept to prototype for magnetic disk in 24 months RAMAC-350 50 platters, 24 inch each, 5MB, access time 1 second later innovations head flies on cushion of air removable disk "winchester" disk integrated circuits sealed enclosure "30-30" (two 30MB spindles) led to "winchester" nickname magnetoresistive heads (enabling current densities) competition keeps gap between research and development short future market quiet drives for PVRs small drives for portables versus flash for portables a little bit of both
- RAID EMC had been a DRAM manufacturer for IBM computers switched from DRAM to RAID because of business issues
- I/O Buses and Controllers intelligent devices offload tasks and queuing to devices (example SCSI disks) disk bus standards SCSI (small computer systems interface) FC-AL (fiberchannel arbitrated loop) I/O bus standards PDP-11 Unibus PCI, PCI-X, Infiniband interrupts from anomalies to async I/O DMA IBM 360 channel = I/O bus
Exercises 778
Chapter 8 Interconnection Networks and Clusters
8.1 Introduction 788
networks (aka communication subnets) end systems (aka hosts) internetworking architecture and networks 1.) moore's law allows networks to be used within computer 2.)
almost all computers are connected to networks cluster naming wide area network (WAN) (aka long-haul network) ATM local area network (LAN) Ethernet storage area network (SAN) (aka system area network) Infiniband
Terms from Figure 8.2 adaptive routing picking path based on measure of delay of outgoing links ATM attenuation loss of signal over distance back pressure flow control wires to tell sender to stop or slow down bandwidth base station wireless to wired bisection bandwidth bit error rate (BER) typically in per million bits blade blocking contention bridge OSI layer 2 carrier sensing detect collisions (Ethernet and wireless (aloha)) category 5 wire (cat5) channel wireless channels like in 802.11 checksum circuit switching cluster coaxial cable collision collision detection colocation site communication subnets credit-based flow control sort of like leases, get bandwidth allotment ahead of time, need to check to get more cut-through routing start transmitting based on header, before all data received message may get strung out, but often tail catches up with head within switch destination-based routing deterministic routing end systems end-to-end argument Ethernet fat tree extra links at each level so bandwidth between levels is constant FC-AL fiber channel arbitrated loop (SAN) frequency-division multiplexing share medium using different frequencies, not time slicing full duplex two way communication header host hub OSI layer 1 Infiniband SAN interference internetworking IP OSI layer 3 iSCSI SCSI over IP SAN using Ethernet LAN message multimode fiber cheap optical fiber that reduces bandwidth and distance for cost multistage switch OSI layers overhead packet switching vs circuit switching payload peer-to-peer protocol peer-to-peer wireless protocol rack unit 1U = 1.7" single slot in standard 19" VME rack 44U in standard 6 foot rack receiver overhead router OSI layer 3 SAN storage area network (aka system area network) sender overhead shadow fading wireless signal blocked by objects signal-to-noise ratio (SNR) simplex one way communication (half duplex: both directions, but only one at a time) single-mode fiber more expensive (and narrower) fiber with more bandwidth and distance source-based routing store-and-forward TCP OSI layer 4 throughput time of flight trailer transmission time transport latency twisted pairs two wires twisted together to reduce electrical interference virtual circuit WAN wavelength division multiplexing send several signals on one fiber with different wavelengths of light window wireless network wormhole routing similar to cut-through routing difference is tail of message does not catch up with head
8.2 A Simple Network 793
paraphrased quote: bandwidth can be cured with money latency is harder because of speed of light message header payload trailer checksum protocol single vs multiple messages in flight byte order duplicates ordering flow control performance terms bandwidth aka throughput bits/second time of flight time from sender to receiver transmission time time from first bit to last bit of the message (message size / bandwidth) transport latency = time of flight + transmission time sender overhead hw and sw cost to send message receiver overhead usually higher than sender interrupt costs total latency = sender overhead + time of flight + message-size/bandwidth + receiver overhead ~= overhead + message-size/bandwidth effective bandwidth = message-size/total latency message size message size is important to get full benefit of fast networks unfortunately, most apps send many small messages request/reply (ack) messages are more common than data
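a sketch of the latency/bandwidth bookkeeping above (link speed, overhead, and message sizes are invented):

    def effective_bandwidth(msg_bytes, link_bits_per_sec, overhead_sec, time_of_flight_sec=0.0):
        # total latency = overhead + time of flight + message size / bandwidth
        total_latency = overhead_sec + time_of_flight_sec + (msg_bytes * 8) / link_bits_per_sec
        return (msg_bytes * 8) / total_latency            # bits/sec actually delivered

    for size in (64, 1500, 65536):                        # hypothetical 1 Gbit/s link, 10 us overhead
        print(size, round(effective_bandwidth(size, 1e9, 10e-6) / 1e6), "Mbit/s")
        # 64 B -> ~49, 1500 B -> ~545, 64 KB -> ~981: small messages waste a fast link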
8.3 Interconnection Network Media 802
medium twisted pair Cat1 cable mainstay of phone company slow signal over many kilometers Cat3 10Mb/s Cat5 100Mb/sec <=100 meters coaxial cable 50ohm 10Mb/s over a kilometer T-junction vs vampire tap T requires cut and splice (annoying) vampire tap allows new drops to be added without cut fiber optics simplex vs duplex like twisted pair and coax multimode cheap LED 1000Mb/sec < 1km 100Mb/sec few kilometers single-mode expensive laser diode gigabits/second over hundreds of kilometers phone company backbone attenuation limits length have limits on bending radius glass instead of plastic fiber wavelength division multiplexing multiple wavelengths on same fiber 8 wavelengths for 40Gb/sec (2001) 80 wavelengths for 400Gb/sec (planned) cable vs fiber fiber has to use t-junctions 1.) passive fused into fiber failure means cable replacement 2.) active repeater optical to electrical to optical failure means repeater replacement, but network is down fiber optoelectronic components add significant cost
8.4 Connecting More Than Two Computers 805
- Shared versus Switched Media shared media like a bus inexpensive limited bandwidth arbitration scheme for sharing sharing via arbiter okay for small scale (like bus) need some way to talk to arbiter sharing via carrier sensing (aka collision detection) look before you leap if busy wait until free transmit if detect collision retry random retry delay helps prevent repeated collision exponential backoff helps prevent network overloading (livelock) token passing point-to-point communication switching aka data switching exchanges aka multistage interconnection networks aka interface message processors (IMPs) each node has dedicated line to switch switch has dedicated lines to everyone else tradeoff arbitration for switch overhead moore's law has made switches more attractive single chip 64x64 switch in 2001 shared advantages broadcast sending to all nodes of shared medium is easy multicast just send to all and let some ignore switch advantages higher aggregate bandwidth pairs of nodes can communicate at full link speed simultaneously
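a toy sketch of the collision-retry idea above (not Ethernet's exact slot times or truncated-backoff rules):

    import random

    def send_with_backoff(collided, max_attempts=10):
        # carrier-sense retry loop: double the random backoff window after each collision
        waited_slots = 0
        for attempt in range(max_attempts):
            if not collided(attempt):
                return attempt, waited_slots              # success: attempts used, slots spent waiting
            window = 2 ** min(attempt + 1, 10)            # backoff window doubles, capped
            waited_slots += random.randrange(window)      # wait a random slot count, then retry
        raise RuntimeError("too many collisions, giving up")

    print(send_with_backoff(lambda attempt: attempt < 2))  # first two attempts collide -> e.g. (2, 3)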
- Connection-oriented versus Connectionless Communication connection phone call frequency-division multiplexing each call gets dedicated frequency to use "busy" based on number of connections, not "amount" of usage connectionless postal model circuit switching traditional way to deliver connection-based service packet switching packets (aka frames) queueing theory says cannot use all of bandwidth however uses more of the bandwidth given waste of circuit switching beware connection does not imply circuit switching connectionless does not imply packet switching TCP is a connection-based service built on top of packet switching
- Routing: Delivering Messages shared media routing broadcast switched media routing source-based routing message specifies path to destination virtual circuit establish circuit between source and destination message names circuit to take ATM simpler switch destination-based routing message states destination switch has to decide where to route IP more complex switch approaches: deterministic vs adaptive vs randomized deterministic always takes same path adaptive chooses between paths based on network status (load, failures, congestion, cost) randomized (either deterministic or adaptive) tries to distribute load packet storage in routing store-and-forward routing receive whole packet before acting on it cut-through routing receive header before acting on it room for whole packet if output blocked wormhole routing receive header before acting on it no room for entire packet if output blocked, blocks inbound store-and-forward has higher latency switch latency proportional to packet length * number of switches ordering can be provided at this layer but often managed by higher layer (often in software)
- Congestion Control circuit switched avoid dynamic congestion congestion checks happen at circuit setup time once connection established, dedicated resource packet switches dynamic congestion control packet discarding UDP flow control backpressure separate channel (wires) to let receiver suppress sender supercomputer networks SANs GigE switches via fake collision signals credit-based flow control sender needs "credit" from receiver to send "window" approach is form of credit choke packets overloaded switches return packets to source as "choke packets" allows adaptive sender to reduce percentage of packets sent via outbound link
8.5 Network Topology 814
- Centralized Switch single logical switch multistage switches switches made up of other switches crossbar n^2 switches source routing: list in header destination routing: destination picks outbound by table lookup spanning tree routing: on the fly omega network fewer switches n/2 lg n switches more chance of congestion fat tree more bandwidth on higher levels multiple paths between nodes CM-5
- Distributed Switch switch at each node ring pipelined bus + point-to-point links - bus repeater delays EIB between Cell PPE and SPEs token ring combines ring with arbitration fully connected very expensive 2D grid or mesh 2D torus hypercube tree bisection bandwidth divide nodes approximately in half sum the bandwidth of all lines crossing the dividing line fully connected: (n/2)^2 bus: link speed
  example with 64 nodes:
                   BUS    RING    2D TORUS    6-CUBE    FULLY CONNECTED
  bisection          1       2          16        32               1024
  ports/switch      NA       3           5         7                 64
  lines              1     128         192       256               2080
  implementation issues 1.) transforming 3d into 2d chips and cabinets 2.) switch latency depends on routing pattern, which depends on topology 3.) bandwidth from processor with single port can be limiting factor quality of implementation matters more than topology redundancy matters at all levels: SAN, LAN, WAN
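a small sketch of that bisection bookkeeping (one unit of bandwidth per link; values match the 64-node row above):

    import math

    def bisection_links(topology, n):
        # links crossing a worst-case halving of n nodes
        if topology == "bus":             return 1
        if topology == "ring":            return 2                      # the cut severs the ring in two places
        if topology == "2d-torus":        return 2 * int(math.sqrt(n))  # wraparound doubles the links cut
        if topology == "hypercube":       return n // 2                 # all links of one dimension
        if topology == "fully-connected": return (n // 2) ** 2          # each node on one side linked to each on the other
        raise ValueError(topology)

    for t in ("bus", "ring", "2d-torus", "hypercube", "fully-connected"):
        print(t, bisection_links(t, 64))   # 1, 2, 16, 32, 1024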
8.6 Practical Issues for Commercial Interconnection Networks 821
- Connectivity 100s easier than millions affects protocol complexity
- Connecting the Network to the Computer where to connect? I/O bus most SANs and all LANs and WANs not usually hardware cache coherent memory bus some SANs can be hardware cache coherent DMA good for large messages to use memory bandwidth many small messages perhaps better to use PIO (programmed I/O) since data can come from cache
- Standardization: Cross-Company Interoperability low cost and stability volume and multiple sources for products when to standardize before it's built? useless and omitted features can suppress innovation... WANs no proprietary standards since different organizations with different suppliers must connect LANs Ethernet... SANs some standards, more proprietary: okay since spans fewest machines
- Message Failure Tolerance end-to-end argument... but some lower level network retry as performance optimization
- Node Failure Tolerance WANs have strongest requirement for failure tolerance LANs are close, but rogue machine hosing network can be hunted down easily (one organization) SANs: sacrifice fault tolerance for low-latency. build with more redundancy. interrupt network to add new nodes? WANs: are you crazy LANs: that is why t-junctions sucked SANs: not as much of an issue
8.7 Examples of Interconnection Networks 825
- 10 design decisions 1.) what is the target bandwidth 2.) what is the message format 3.) which media are used 4.) is the network shared or switched 5.) is it connection oriented or connectionless 6.) does it use store-and-forward or cut-through routing 7.) is routing source based, destination based, or virtual-circuit based 8.) what is used for congestion control 9.) what topologies are supported 10.) does it follow a standard
- Ethernet: The Local Area Network growth 1978 10Mb/sec 1994 100Mb/sec GigE (10GigE) coax, cat3, cat5, optical, wireless parallelism hubs OSI level 1 extend Ethernet segment bridges OSI level 2 connect two Ethernet routers (aka gateways) OSI level 3 connect Ethernet to other networks switches GigE+ requirement cut-through routing
- Storage Area Network: Infiniband 2 Gbits/sec per link point-to-point links can be bundled in groups of 4-12 packet switched connectionless switched Cat5 - 17 meters optical - 100 meters backpressure for flow control SCSI command set to talk to storage SAN vs LAN SAN arguments 1.) lower protocol overhead 2.) no protection requirements (behind server firewall) 3.) graceful degradation under congestion LAN arguments 1.) cheaper switches 2.) IP networks allow scaling to many nodes 3.) TCP/IP off-loading engines reduce protocol overhead iSCSI use SCSI over TCP/IP + GigE as SAN solution
- Wide Area Network: ATM stats WAN only (no more LAN dreams) 155Mb/sec baseline with 4x scaling increments: 155, 620, 2480, ... switched connections store-and-forward small, fixed sized packet (48 bytes) credit-based flow control connections and small packets? QoS (Quality of Service) predictability matters for voice traffic, not just bandwidth small packet allows fast routers and switches
- Summary crossover ATM tried and failed to move from WAN to LAN iSCSI tries to move Ethernet from LAN to SAN
8.8 Internetworking 830
protocol families (aka protocol suites) TCP/IP internetworking protocol even used within a LAN such as for NFS layering OSI levels: 1 physical (IEEE 802) hub; 2 data link (Ethernet) bridge, NIC; 3 network (IP) router, switch; 4 transport (TCP) gateway; 5 session (named pipes, RPC) gateway; 6 presentation (format conversion) gateway; 7 application (FTP, DNS, NFS, http) gateway, smart switch logically communication at same level ("peer-to-peer") services of lower levels implement upper levels protocol stack latency an issue in layering however protocol defines interface, not implementation can try and merge levels in implementation
8.9 Crosscutting Issues for Interconnection Networks 834
- Density-Optimized Processors versus SPEC-Optimized Processors colocation charge for network, space, and power SPEC optimization processors haven't cared about space or power use mobile processors designed for small space and lower power better performance/watt better performance/cubic-foot
- Smart Switches versus Smart Interface Cards Ethernet cheap interface card means costly switches Infiniband cheaper switch with more expensive "host" interface trying to have restricted "target" interface to lower cost for disks
- Protection and User Access to the Network Cray fine grained network access to global address space protection by TLB block access OS needs to be involved to check permissions for block transfers adds OS overhead
- Efficient Interface to Memory Hierarchy versus Interconnection Network SPEC CPU reward integration of cache with processor
avoid memory at all costs add more levels of caches do not reward network integration have to travel via memory each cache layer adds more cost
- Compute-Oriented Processors versus Receiver Overhead SPEC doesn't need to take interrupts fancy OOOe processors can take huge penalties on interrupt processing
8.10 Clusters 838
Commodity, Commodity, Commodity
- Performance Challenges of Clusters 1.) I/O bus interconnect versus memory bus interconnection software-based communication vs hardware 2.) division of memory single program cannot use more than cluster's worth of memory in 1995 this was a cost issue, in 2001 the problem is lack of expandability
- Dependability and Scalability Advantage of Clusters replacement hotswapping a machine on a LAN is easier than hot swapping processor in running machine no special machine hardware or OS software needed applications end up designed to tolerate down machines often makes it easy to handle adding machines as well
- Pros and Cons of Cost of Clusters cost of administration cluster proportional to number of nodes multiprocessor closer to single machine price/compute (price/bandwidth) cluster wins on commodity (both compute and interconnect) multiprocessor loses because of low volume falling memory costs have favored clusters
- Shooting for the Best of Both Worlds multiprocessors dynamic partitioning to allow online hardware replacement and software upgrade clusters reduce admin cost by using multiprocessor nodes clusters use SAN approach of isolating storage nodes from compute nodes but GoogleFS and MapReduce don't do this (locality matters as bandwidth requirements could be huge between the two pools of nodes)
- Popularity of Clusters scaling Internet applications scientific computing own your own, versus time-share a super TPC clusters fastest with good cost/performance TPC-C SMP has worst cost/performance NUMA typically not as offensive :) TPC-H general pattern: cluster beats NUMA beats SMP
8.11 Designing a Cluster 843
- Intro similar process to Section 7.11 building a cluster with 32 CPUs, 32GB DRAM, 32 or 64 disks costs larger 1MB L2 cache higher DRAM costs for SMP nodes, especially between 2-way and 8-way (no particularly good reason...)
- First Example: Cost of Cluster Hardware Alternatives with Local Disk fatter nodes fewer switches however, more cost for SMP interconnect less rack space however can't hold as many disks, need expansion unit on cost: 2-way system beats uniprocessor beats 8-way
- Second Example: Using a SAN for Disks cost for FC-AL interface cards brings uniprocessor and 8-way to same cost point 2-way still best option SAN adds $40k-$100k to cost brings dependability: reliability and availability
- Third Example: Accounting for Other Costs TCO (total cost of ownership for 3 years) OS license DB license computer maintenance network maintenance rack space rack power bandwidth operator tapes operator costs start to dominate SAN requires half the operator time, justifying the additional cost 2-way with SAN wins HW is only 1/3 to 1/2 of TCO
- Fourth Example: Cost and Performance of a Cluster for Transaction Processing - disk space: bigger and faster and more of them - RAID: instead of SAN, since TPC-C only counts HW costs, not operator - Memory: max each node out - Processor: max out - PCI: 7 of 12 used for RAID - tape reader, 4 monitors, 4 UPSs: ... - maintenance and spares: spare ethernet switches, host adapters, and cables
- Summary of Examples 1.) HW is only fraction of TCO SAN adds HW cost to lower operator cost 2.)
smaller computers generally cheaper SMP needs large caches to reduce bus demand, increasing cost and limiting clock rate 3.) space and power matter for high end as well as embedded
8.12 Putting It All Together: The Google Cluster of PCs 855
- Intro Google tackled 1.) scale 2.) relevancy 3.) reliability 4.) speed 5.) bandwidth 6.) index freshness
- Description of Google Infrastructure big picture stats December 2000 6000 processors 12000 disks 1 petabyte of storage no RAID, GoogleFS provides data replication 3 data centers (2 NorCal, 1 VA, soon 2 VA) each with OC48 2488Mbit/sec redundant OC12 between sister sites (CA-CA VA-VA) data center network stats 1 monster Cisco 12000 switch to OC48 2 Foundry BigIron 8000 switches for redundant access to racks from Cisco each can handle 128 1Gb/sec lines each rack has 4 1Gb/sec lines, 2 to each Foundry switch 64 racks total capacity per data center rack stats 44U rack = 2x (4U switch + 40x1U PCs) 2x because use both front and back side of rack 4U HP ethernet switch 5 blades with 8 100Mb/sec links for 40 PCs 2 blades with 1 GigE link to 2 Foundry switches 40x1U PCs only ethernet and power ethernet on left, power to strips on right edge of rack cooling via 3" chimney up center to fans on top pc stats standard issue 2 ATA drives 256 MB SDRAM modest Intel processor PC motherboard 1 power supply few fans linux upgrades over time larger disks faster CPUs March-November 2000 533MHz Celeron to 800MHz Pentium III 40-80GB disk with 5400-7200rpm speed memory bus 100-133MHz
- Performance most of bandwidth for indexing, not responding to queries latency means picking right data center based on "geographic" location of user VA for Europe, NorCal for CA
- Cost $4.5 million to $6 million per data center PC $1,300-1,500 HP rack switch $1,500 Foundry switch $100,000 racks $1000-$2000 much better price/performance than IBM TPC-C system from 8.11 TPC-C cost $10.8 million for 1/8 the CPU and 5/8 the DRAM power 4500W in 10 square feet 60amps/rack colocation site comparison 1000W and 20amps
- Reliability software is biggest source of failure 20 machine reboots a day fix most problems no remote reboot call colocation site and have operator push reset button more errors here from wrong button or wrong machine label :) hardware rate 1/10 of software 2-3% PC/year 95% disk and DRAM 5% motherboard, power supply, connectors, etc DRAM 1/3 of failures bits inside the DRAM bits on the bus no ECC at this time ECC going forward disks 1/2 of failures are performance only 28MB/sec disk will only deliver 4MB/sec or even 0.8MB/sec sent back to manufacturer on PC failure: dynamically removed from pool physical replacement once a week repair for later reuse switches HP 200 total 2 or 3 failures Foundry 6 total with 16 blades/switch some DOA but none have died completely in production 2 or 3 blades have failed (out of 96) colocation reliability power once a year power outage even with diesel backup network once a year outage of several hours also "bathtub curve" issues new site lots of people and new equipment cause problems steady state no people or equipment changes, working fine upgrade time people and new equipment return, causing problems people turnover also an issue tolerating colocation reliability multiple sites different network providers backup leased lines between sites
8.13 Another View: Inside a Cell Phone 862
- Background on Wireless Networks terms wireless networks carrier signal bit error rate (BER) signal-to-noise ratio (SNR) challenges path loss - signal drop over distance
shadow fading - physical object interference multipath fading - self-interference because of same signal arriving at slightly different times interference - frequency reuse, adjacent channels, etc issues dynamic network makes routing an issue signal issues from movement power issues from batteries BER 1,000x-1,000,000x of copper wire base station vs peer-to-peer cellular telephony uses base stations non-overlapping hexagonal cells adjacent cells do not reuse frequencies base-stations at hexagon intersections cells have 2-10 mile radius depending on topography, population, ...
- The Cell Phone radio antenna -> RF amp -> mixer -> filter -> demodulator -> decoder move from analog to digital components multiband has led to software programming of DSPs frequency sweep to find best signal every 7 seconds (or more if signal drops below threshold) channels forward path reverse path 4-80 per cell 9600 bits/s variable power to save battery but also frequency interference between cells Nokia example has a Z-80!
- Cell Phone Standards and Evolution AMPS (advanced mobile phone service) mostly analog send 9600b/s signal 5x to deal with interference CDMA (code division multiple access) mostly digital wider frequency band harder to block uses spread spectrum callers share channel convert 9600b/s signal to 1.25Mb/sec stretch bits 11x to deal with interference base station converts back to 9600b/s SNR gradually degrades, adding background noise speech compression and variable data rate to save bandwidth TDMA (time division multiple access) mostly digital GSM (global system for mobile communication) mostly digital
8.14 Fallacies and Pitfalls 867
- Pitfall Using bandwidth as the only measure of network performance latency matters, especially if many small messages overhead matters 155Mb/s ATM only beats 10Mb/s ethernet for NFS traffic by about 1.2x
- Pitfall Ignoring software overhead when determining performance CM-5 sending a message SW 20us HW .5us Intel Paragon SW 250us HW .2us at first SW 25us HW .2us later CM-5 wins with slower HW, because SW cost is lower
- Pitfall Trying to provide features only within the network versus end-to-end beware: errors can happen in node as well as in transmission however: performance improvement, especially latency, of lower-level correction
- Pitfall Relying on TCP/IP for all networks, regardless of latency, bandwidth, or software requirements. significant processor resources to process TCP/IP that grow proportional to network bandwidth 1MHz per 1Mb/sec
8.15 Concluding Remarks 870
hot field dealing with failure is expected switches (vs buses) is area of growth and change SANs, clusters; continuing technology catch-up among SANs, LANs, WANs
8.16 Historical Perspective and References 871
- Wide Area Networks 1970s TCP/IP Vint Cerf and Robert Kahn 1975 100 1983 200 1994 50,000 the web was the killer app 1989 Tim Berners-Lee 1992 UIUC Marc Andreessen Mosaic (later Netscape) 1995 30,000 pages 2001 1.3 billion pages ATM 48-byte packet compromise North America wanted 64-byte packet to match existing equipment Europe wanted 32-byte packet to match existing equipment compromise so that no one wins!
(has an advantage over the other) WANs now all fiber bandwidth is not the precious commodity that packet switching assumed
- Local Area Networks Xerox PARC Alto [Thacker] and Ethernet [Metcalfe and Boggs] desktop and LAN David Boggs, Butler Lampson, Ed McCreight, Bob Sproull, Chuck Thacker 1974 3Mb/sec with 50us latency Boggs was ham radio operator, leading to realization that central arbiter was unneeded DEC, Intel, Xerox came up with 10Mb/sec standard, avoiding slow IEEE effort FDDI IEEE effort led to expensive system that couldn't compete with Ethernet
- Massively Parallel Processors Cosmic Cube ethernet to connect 8086 computers in hypercube SAN for high bandwidth and low latency topology faddish, hypercubes not common today cut-through routing has persisted Dally* ASCI MPPs are basically clusters of SMPs
- Clusters 1960s existed without name 1975 Tandem 16-node cluster 1984 VAX cluster shared I/O devices distributed OS geographic distribution for availability 25,000 clusters sold by 1993 (both Tandem Non-Stop and DEC VAX clusters ended up at Compaq/HP) 1993 NASA Beowulf project to move scientific computing from MPP to clusters 1994 16-node cluster from 486's achieved NASA goal of 1GFLOPS for <$50,000 networking to improve latency (and bandwidth) VI interface standard => Infiniband Network of Workstations (Patterson*) Berkeley Myrinet switches (160MB/sec) 100 Ultra-SPARC record breaking database sort (8.6GB in 1 minute) DES key cracking (3.5hrs for 40-bit key) Inktomi web search engine
- System or Storage Area Networks (SANs) features (original focused on clusters, now storage connection) single room wider bandwidth and faster for less cost than LAN/WAN hardware in-order delivery less switch handshaking no need for fiber source based routing 2001 FC-AL (fiber channel arbitrated loop) October 2000 Infiniband (Intel, HP, IBM, Sun) launched to create point-to-point PCI replacement iSCSI (SCSI over TCP/IP on 1GigE ethernet) TCP/IP offloading
Exercises 877
Appendix A Pipelining: Basic and Intermediate Concepts A.1 Introduction A-2 A.2 The Major Hurdle of Pipelining - Pipeline Hazards A-11 A.3 How Is Pipelining Implemented? A-26 A.4 What Makes Pipelining Hard to Implement?
A-37 A.5 Extending the MIPS Pipeline to Handle Multicycle Operations A-47 A.6 Putting It All Together: The MIPS R4000 Pipeline A-57 A.7 Another View: The MIPS R4300 Pipeline A-66 A.8 Crosscutting Issues A-67 A.9 Fallacies and Pitfalls A-77 A.10 Concluding Remarks A-78 A.11 Historical Perspective and References A-78 Exercises A-81 Appendix B Solutions to Selected Exercises Introduction B-2 B.1 Chapter 1 Solutions B-2 B.2 Chapter 2 Solutions B-7 B.3 Chapter 3 Solutions B-11 B.4 Chapter 4 Solutions B-16 B.5 Chapter 5 Solutions B-21 B.6 Chapter 6 Solutions B-25 B.7 Chapter 7 Solutions B-29 B.8 Chapter 8 Solutions B-30 B.9 Appendix A Solutions B-35 Online Appendices (www.mkp.com/CA3/) Appendix C A Survey of RISC Architectures for Desktop, Server, and Embedded Computers Appendix D An Alternative to RISC: The Intel 80x86 Appendix E Another Alternative to RISC: The VAX Architecture Appendix F The IBM 360/370 Architecture for Mainframe Computers Appendix G Vector Processors Revised by Krste Asanovic Appendix H Computer Arithmetic by David Goldberg Appendix I Implementing Coherence Protocols -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- CS315a Introduction to Parallel Computing: Second Edition by Ananth Grama, George Karypis, Vipin Kumar, Anshul Gupta -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- Contents Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument -- From Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed Argument 3 1.1.3 The Data Communication Argument 4 1.2 Scope of Parallel Computing 4 1.2.1 Applications in Engineering and Design 4 1.2.2 Scientific Applications 5 1.2.3 Commercial Applications 5 1.2.4 Applications in Computer Systems 6 1.3 Organization and Contents of the Text 6 1.4 Bibliographic Remarks 8 Problems 9 CHAPTER 2 Parallel Programming Platforms 11 2.1 Implicit Parallelism: Trends in Microprocessor Architectures 12 2.1.1 Pipelining and Superscalar Execution 12 2.1.2 Very Long Instruction Word Processors 15 2.2 Limitations of Memory System Performance* 16 2.2.1 Improving Effective Memory Latency Using Caches 17 2.2.2 Impact of Memory Bandwidth 18 2.2.3 Alternate Approaches for Hiding Memory Latency 21 2.2.4 Tradeoffs of Multithreading and Prefetching 23 2.3 Dichotomy of Parallel Computing Platforms 24 2.3.1 Control Structure of Parallel Platforms 25 2.3.2 Communication Model of Parallel Platforms 27 2.4 Physical Organization of Parallel Platforms 31 2.4.1 Architecture of an Ideal Parallel Computer 31 2.4.2 Interconnection Networks for Parallel Computers 32 2.4.3 Network Topologies 33 2.4.4 Evaluating Static Interconnection Networks 43 2.4.5 Evaluating Dynamic Interconnection Networks 44 2.4.6 Cache Coherence in Multiprocessor Systems 45 2.5 Communication Costs in Parallel Machines 53 2.5.1 Message Passing Costs in Parallel Computers 53 2.5.2 Communication Costs in Shared-Address-Space Machines 61 2.6 Routing Mechanisms for Interconnection Networks 63 2.7 Impact of Process-Processor Mapping and Mapping Techniques 65 2.7.1 Mapping Techniques for Graphs 66 2.7.2 Cost-Performance Tradeoffs 73 2.8 Bibliographic Remarks 74 Problems 76 CHAPTER 3 Principles of Parallel Algorithm Design 85 3.1 Preliminaries 86 3.1.1 Decomposition, Tasks, and Dependency Graphs 86 3.1.2 Granularity, Concurrency, and Task-Interaction 89 3.1.3 Processes and Mapping 93 3.1.4 Processes versus Processors 94 3.2 Decomposition Techniques 95 3.2.1 Recursive 
Decomposition 95 3.2.2 Data Decomposition 97 3.2.3 Exploratory Decomposition 105 3.2.4 Speculative Decomposition 107 3.2.5 Hybrid Decompositions 109 3.3 Characteristics of Tasks and Interactions 110 3.3.1 Characteristics of Tasks 110 3.3.2 Characteristics of Inter-Task Interactions 112 3.4 Mapping Techniques for Load Balancing 115 3.4.1 Schemes for Static Mapping 117 3.4.2 Schemes for Dynamic Mapping 130 3.5 Methods for Containing Interaction Overheads 132 3.5.1 Maximizing Data Locality 132 3.5.2 Minimizing Contention and Hot Spots 134 3.5.3 Overlapping Computations with Interactions 135 3.5.4 Replicating Data or Computations 136 3.5.5 Using Optimized Collective Interaction Operations 137 3.5.6 Overlapping Interactions with Other Interactions 138 3.6 Parallel Algorithm Models 139 3.6.1 The Data-Parallel Model 139 3.6.2 The Task Graph Model 140 3.6.3 The Work Pool Model 140 3.6.4 The Master-Slave Model 141 3.6.5 The Pipeline or Producer-Consumer Model 141 3.6.6 Hybrid Models 142 3.7 Bibliographic Remarks 142 Problems 143 CHAPTER 4 Basic Communication Operations 147 4.1 One-to-All Broadcast and All-to-One Reduction 149 4.1.1 Ring or Linear Array 149 4.1.2 Mesh 152 4.1.3 Hypercube 153 4.1.4 Balanced Binary Tree 153 4.1.5 Detailed Algorithms 154 4.1.6 Cost Analysis 156 4.2 All-to-All Broadcast and Reduction 157 4.2.1 Linear Array and Ring 158 4.2.2 Mesh 160 4.2.3 Hypercube 161 4.2.4 Cost Analysis 164 4.3 All-Reduce and Prefix-Sum Operations 166 4.4 Scatter and Gather 167 4.5 All-to-All Personalized Communication 170 4.5.1 Ring 173 4.5.2 Mesh 174 4.5.3 Hypercube 175 4.6 Circular Shift 179 4.6.1 Mesh 179 4.6.2 Hypercube 181 4.7 Improving the Speed of Some Communication Operations 184 4.7.1 Splitting and Routing Messages in Parts 184 4.7.2 All-Port Communication 186 4.8 Summary 187 4.9 Bibliographic Remarks 188 Problems 190 CHAPTER 5 Analytical Modeling of Parallel Programs 195 5.1 Sources of Overhead in Parallel Programs 195 5.2 Performance Metrics for Parallel Systems 197 5.2.1 Execution Time 197 5.2.2 Total Parallel Overhead 197 5.2.3 Speedup 198 5.2.4 Efficiency 202 5.2.5 Cost 203 5.3 The Effect of Granularity on Performance 205 5.4 Scalability of Parallel Systems 208 5.4.1 Scaling Characteristics of Parallel Programs 209 5.4.2 The Isoefficiency Metric of Scalability 212 5.4.3 Cost-Optimality and the Isoefficiency Function 217 5.4.4 A Lower Bound on the Isoefficiency Function 217 5.4.5 The Degree of Concurrency and the Isoefficiency Function 218 5.5 Minimum Execution Time and Minimum Cost-Optimal Execution Time 218 5.6 Asymptotic Analysis of Parallel Programs 221 5.7 Other Scalability Metrics 222 5.8 Bibliographic Remarks 226 Problems 228 CHAPTER 6 Programming Using the Message-Passing Paradigm 233 6.1 Principles of Message-Passing Programming 233 6.2 The Building Blocks: Send and Receive Operations 235 6.2.1 Blocking Message Passing Operations 236 6.2.2 Non-Blocking Message Passing Operations 239 6.3 MPI: the Message Passing Interface 240 6.3.1 Starting and Terminating the MPI Library 242 6.3.2 Communicators 242 6.3.3 Getting Information 243 6.3.4 Sending and Receiving Messages 244 6.3.5 Example: Odd-Even Sort 248 6.4 Topologies and Embedding 250 6.4.1 Creating and Using Cartesian Topologies 251 6.4.2 Example: Cannon's Matrix-Matrix Multiplication 253 6.5 Overlapping Communication with Computation 255 6.5.1 Non-Blocking Communication Operations 255 6.6 Collective Communication and Computation Operations 260 6.6.1 Barrier 260 6.6.2 Broadcast 260 6.6.3 Reduction 261 6.6.4 Prefix 263 6.6.5 
Gather 263 6.6.6 Scatter 264 6.6.7 All-to-All 265 6.6.8 Example: One-Dimensional Matrix-Vector Multiplication 266 6.6.9 Example: Single-Source Shortest-Path 268 6.6.10 Example: Sample Sort 270 6.7 Groups and Communicators 272 6.7.1 Example: Two-Dimensional Matrix-Vector Multiplication 274 6.8 Bibliographic Remarks 276 Problems 277 CHAPTER 7 Programming Shared Address Space Platforms 279 7.1 Thread Basics 280 7.2 Why Threads? 281 7.3 The POSIX Thread API 282 7.4 Thread Basics: Creation and Termination 282 7.5 Synchronization Primitives in Pthreads 287 7.5.1 Mutual Exclusion for Shared Variables 287 7.5.2 Condition Variables for Synchronization 294 7.6 Controlling Thread and Synchronization Attributes 298 7.6.1 Attributes Objects for Threads 299 7.6.2 Attributes Objects for Mutexes 300 7.7 Thread Cancellation 301 7.8 Composite Synchronization Constructs 302 7.8.1 Read-Write Locks 302 7.8.2 Barriers 307 7.9 Tips for Designing Asynchronous Programs 310 7.10 OpenMP: a Standard for Directive Based Parallel Programming 311 7.10.1 The OpenMP Programming Model 312 7.10.2 Specifying Concurrent Tasks in OpenMP 315 7.10.3 Synchronization Constructs in OpenMP 322 7.10.4 Data Handling in OpenMP 327 7.10.5 OpenMP Library Functions 328 7.10.6 Environment Variables in OpenMP 330 7.10.7 Explicit Threads versus OpenMP Based Programming 331 7.11 Bibliographic Remarks 332 Problems 332 CHAPTER 8 Dense Matrix Algorithms 337 8.1 Matrix-Vector Multiplication 337 8.1.1 Rowwise 1-D Partitioning 338 8.1.2 2-D Partitioning 341 8.2 Matrix-Matrix Multiplication 345 8.2.1 A Simple Parallel Algorithm 346 8.2.2 Cannon's Algorithm 347 8.2.3 The DNS Algorithm 349 8.3 Solving a System of Linear Equations 352 8.3.1 A Simple Gaussian Elimination Algorithm 353 8.3.2 Gaussian Elimination with Partial Pivoting 366 8.3.3 Solving a Triangular System: Back-Substitution 369 8.3.4 Numerical Considerations in Solving Systems of Linear Equations 370 8.4 Bibliographic Remarks 371 Problems 372 CHAPTER 9 Sorting 379 9.1 Issues in Sorting on Parallel Computers 380 9.1.1 Where the Input and Output Sequences are Stored 380 9.1.2 How Comparisons are Performed 380 9.2 Sorting Networks 382 9.2.1 Bitonic Sort 384 9.2.2 Mapping Bitonic Sort to a Hypercube and a Mesh 387 9.3 Bubble Sort and its Variants 394 9.3.1 Odd-Even Transposition 395 9.3.2 Shellsort 398 9.4 Quicksort 399 9.4.1 Parallelizing Quicksort 401 9.4.2 Parallel Formulation for a CRCW PRAM 402 9.4.3 Parallel Formulation for Practical Architectures 404 9.4.4 Pivot Selection 411 9.5 Bucket and Sample Sort 412 9.6 Other Sorting Algorithms 414 9.6.1 Enumeration Sort 414 9.6.2 Radix Sort 415 9.7 Bibliographic Remarks 416 Problems 419 CHAPTER 10 Graph Algorithms 429 10.1 Definitions and Representation 429 10.2 Minimum Spanning Tree: Prim's Algorithm 432 10.3 Single-Source Shortest Paths: Dijkstra's Algorithm 436 10.4 All-Pairs Shortest Paths 437 10.4.1 Dijkstra's Algorithm 438 10.4.2 Floyd's Algorithm 440 10.4.3 Performance Comparisons 445 10.5 Transitive Closure 445 10.6 Connected Components 446 10.6.1 A Depth-First Search Based Algorithm 446 10.7 Algorithms for Sparse Graphs 450 10.7.1 Finding a Maximal Independent Set 451 10.7.2 Single-Source Shortest Paths 455 10.8 Bibliographic Remarks 462 Problems 465 CHAPTER 11 Search Algorithms for Discrete Optimization Problems 469 11.1 Definitions and Examples 469 11.2 Sequential Search Algorithms 474 11.2.1 Depth-First Search Algorithms 474 11.2.2 Best-First Search Algorithms 478 11.3 Search Overhead Factor 478 11.4 Parallel Depth-First Search 480 
11.4.1 Important Parameters of Parallel DFS 482 11.4.2 A General Framework for Analysis of Parallel DFS 485 11.4.3 Analysis of Load-Balancing Schemes 488 11.4.4 Termination Detection 490 11.4.5 Experimental Results 492 11.4.6 Parallel Formulations of Depth-First Branch-and-Bound Search 495 11.4.7 Parallel Formulations of IDA* 496 11.5 Parallel Best-First Search 496 11.6 Speedup Anomalies in Parallel Search Algorithms 501 11.6.1 Analysis of Average Speedup in Parallel DFS 502 11.7 Bibliographic Remarks 505 Problems 510 CHAPTER 12 Dynamic Programming 515 12.1 Overview of Dynamic Programming 515 12.2 Serial Monadic DP Formulations 518 12.2.1 The Shortest-Path Problem 518 12.2.2 The 0/1 Knapsack Problem 520 12.3 Nonserial Monadic DP Formulations 523 12.3.1 The Longest-Common-Subsequence Problem 523 12.4 Serial Polyadic DP Formulations 526 12.4.1 Floyd's All-Pairs Shortest-Paths Algorithm 526 12.5 Nonserial Polyadic DP Formulations 527 12.5.1 The Optimal Matrix-Parenthesization Problem 527 12.6 Summary and Discussion 530 12.7 Bibliographic Remarks 531 Problems 532 CHAPTER 13 Fast Fourier Transform 537 13.1 The Serial Algorithm 538 13.2 The Binary-Exchange Algorithm 541 13.2.1 A Full Bandwidth Network 541 13.2.2 Limited Bandwidth Network 548 13.2.3 Extra Computations in Parallel FFT 551 13.3 The Transpose Algorithm 553 13.3.1 Two-Dimensional Transpose Algorithm 553 13.3.2 The Generalized Transpose Algorithm 556 13.4 Bibliographic Remarks 560 Problems 562 APPENDIX A Complexity of Functions and Order Analysis 565 A.1 Complexity of Functions 565 A.2 Order Analysis of Functions 566 Bibliography 569 Author Index 611 Subject Index 621 # turn off auto-fill in this file with wide lines. # only works because I have auto-fill on by default, # so specifying it toggles it off...duh. ;;; Local Variables: *** ;;; mode:text *** ;;; mode:auto-fill *** ;;; End: ***