-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Computer Architecture 2004 Notes

HEX LOOKUP TABLE        POWERS OF TWO
 0  0  0000              0      1
 1  1  0001              1      2
 2  2  0010              2      4
 3  3  0011              3      8
 4  4  0100              4     16
 5  5  0101              5     32
 6  6  0110              6     64
 7  7  0111              7    128
 8  8  1000              8    256
 9  9  1001              9    512
10  a  1010             10   1024
11  b  1011             11   2048
12  c  1100             12   4096
13  d  1101             13   8192
14  e  1110             14  16384
15  f  1111             15  32768

VIRTUAL ADDRESS BITS
    virtual address = virtual page number + virtual page offset
    virtual page number = TLB tag + TLB set index
    physical address = physical page number + physical page offset
    physical page offset = virtual page offset

CACHE BITS
    physical address = cache tag + cache set index + cache block offset

TIME - ARITHMETIC    NORMALIZED TIME - GEOMETRIC    RATES - HARMONIC

       Seconds   Instructions   Clock cycles    Seconds
Time = ------- = ------------ * ------------ * -----------
       Program     Program      Instruction   Clock cycle

Formulas...
    CPU-time = (CPU-execution-clock-cycles + Memory-stall-clock-cycles) * Clock-cycle-time
    Memory-stall-clock-cycles = Read-stall-cycles + Write-stall-cycles
    Read-stall-cycles = Reads/Program * Read-miss-rate * Read-miss-penalty
    Write-stall-cycles = (Writes/Program * Write-miss-rate * Write-miss-penalty) + Write-buffer-stalls
    AMAT = Time-for-a-hit + (Miss-rate * Miss-penalty)
    In a write-through cache, Read-miss-rate ~= Write-miss-rate, so
        Memory-stall-clock-cycles = Memory-accesses/Program * Miss-rate * Miss-penalty
        Memory-stall-clock-cycles = Instructions/Program * Misses/Instruction * Miss-penalty
    total = seek-time + rotational-delay + transfer-time + controller-time

------------------------------------------------------------------------
Notes on Patterson and Hennessy
------------------------------------------------------------------------
CHAPTER 1 - Computer Abstractions and Technology

1.1 Introduction

1.2 Below your program
    instructions (machine code)
    assembly language and assemblers
    high-level programming languages and compilers
        1 allow programmer to think in a more natural language (possibly
          domain specific)
        2 improved developer productivity
        3 machine independence
    subroutine libraries for I/O
    evolution in operating systems
    systems software
        operating systems
        assemblers
        compilers
    applications software

1.3 Under the covers
    5 classic components
        1 input devices
            keyboard
            mouse
        2 output devices
            screen
            printer
        1/2 I/O devices
            disk
            network
        3 memory
            main memory
            caches
                data cache
                instruction cache
        4 datapath
        5 control
        4+5 = processor (aka CPU)
    opening the box (other terms)
        motherboard
        integrated circuits or chips (ICs)
        memory
            DRAM - dynamic random access memory
            cache memory
    instruction set architecture (ISA) (aka architecture)
        architecture vs implementation
    a safe place for data
        primary memory
            main memory
            volatile
            typically DRAMs
        secondary memory
            nonvolatile
            typically magnetic disks
        magnetic disks
            arm
            read/write head
        secondary vs primary
            1 nonvolatile
            2 slower
            3 cheaper
    communicating to other computers
        1 communication - shared information
        2 resource sharing - shared resources such as I/O devices
        3 nonlocal access - remote access

1.4 Integrated Circuits: Fueling Innovation
    technology evolution
        generations
            1 vacuum tubes
            2 transistor
            3 integrated circuit
            4 large scale integrated circuit (LSI)
            5 very large scale integrated circuit (VLSI)
    manufacturing process
        semiconductor - silicon ingots sliced into wafers
        wafers naturally have defects
        wafer is diced into dies (aka chips)
        yield is the ratio of good dies to total dies
        good dies are bonded to a package and retested

1.5 Real Stuff: Manufacturing Pentium Chips

1.6 Fallacies and Pitfalls
    Fallacy: Computers have been built in the same, old-fashioned way for
        far too long and this antiquated model of computation is running
        out of steam.
    Pitfall: Ignoring the inexorable progress of hardware when planning a
        new machine.
1.7 Concluding Remarks

1.8 Historical Perspective and Further Reading

1.9 Key Terms

1.10 Exercises

------------------------------------------------------------------------
CHAPTER 2 - The Role of Performance

2.1 Introduction
    some measures of performance
        response time (execution time)
        throughput

                          1
    Performance[x] = ------------
                     Execution[x]

    Performance[x] > Performance[y]
         1                1
    ------------  >  ------------
    Execution[x]     Execution[y]
    Execution[x] < Execution[y]

    Performance[x]       Execution[y]
    -------------- = n = ------------
    Performance[y]       Execution[x]

    improve performance => increase performance => decrease execution time

2.2 Measuring Performance
    execution time
        wall-clock time, response time, elapsed time
            total time to complete a task
        CPU execution time (or CPU time)
            time spent by CPU excluding I/O, waiting for other programs, etc
            user CPU time vs system CPU time
        system performance - elapsed time on unloaded system
        CPU performance - user CPU time
    clock
        clock cycle
        clock period = 1/(clock rate)
    clock and CPU performance
        CPU-time = CPU-clock-cycles * Clock-cycle-time
        CPU-clock-cycles = Instruction-Count * Average-CPI
        CPU-time = Instruction-Count * Average-CPI * Clock-cycle-time
    CPI and IPC
        CPI = Clock cycles Per Instruction
        IPC = Instructions Per Clock cycle
        CPI = 1/IPC
    Big Picture
               Seconds   Instructions   Clock cycles    Seconds
        Time = ------- = ------------ * ------------ * -----------
               Program     Program      Instruction   Clock cycle
        the three components
            Instruction count
                - limited by architecture
                - measure with hardware counters or simulation
            CPI
                - the big variable in implementation
                - varies with memory system, processor structure, mix of
                  instructions
                - varies by application
            Clock period
                - fixed in a specific implementation
                - historically just a fixed documented value but...
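The CPU performance equation above can be exercised with made-up numbers (the instruction count, CPI, and clock rate below are illustrative, not from the book):

```python
instruction_count = 10_000_000     # dynamic instructions executed by the program
average_cpi = 1.5                  # average clock cycles per instruction
clock_rate = 2_000_000_000         # 2 GHz, so clock cycle time = 1/clock_rate

# Time/Program = (Instructions/Program) * (Cycles/Instruction) * (Seconds/Cycle)
cpu_time = instruction_count * average_cpi / clock_rate
assert cpu_time == 0.0075          # seconds
```

Note how any of the three factors can change the result: halving CPI or doubling the clock rate halves CPU time, exactly as the equation predicts.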
2.4 Choosing Programs to Evaluate Performance
    workload
    benchmarks
    reproducibility

2.5 Comparing and Summarizing Performance
    total execution time
    arithmetic mean of total time for average time
    WEIGHTED ARITHMETIC MEAN (vs GEOMETRIC MEAN)

2.6 Real Stuff: The SPEC95 Benchmarks and Performance on Recent Processors
    SPEC ratio (normalize by dividing by Sun SPARCstation 10/40 time)
    SPECfp95 and SPECint95 are the GEOMETRIC MEAN of SPEC ratios
    For a given ISA, three typical sources of improvement:
        1) increase clock rate
        2) improve processor organization to lower the CPI
        3) compiler enhancements that
            - lower instruction count
            - use lower-CPI instructions

2.7 Fallacies and Pitfalls
    Pitfall: Expecting the improvement of one aspect of a machine to
        increase performance by an amount proportional to the size of the
        improvement.
        Amdahl's law - the law of diminishing returns
            The performance enhancement possible with a given improvement
            is limited by the amount that the improved feature is used.
        Corollary: Make the common case fast
    Fallacy: Hardware-independent metrics predict performance.
        Example: using code size as a measure of speed
    Pitfall: Using MIPS as a performance metric.
                          Instruction count
        "native MIPS" = ---------------------
                        Execution time x 10^6
        Problems with MIPS:
            1) specifies instruction execution rate but not the
               capabilities of the instructions. different ISAs matter.
            2) MIPS varies between programs on the same computer
            3) MIPS can vary inversely with performance
    Fallacy: Synthetic benchmarks predict performance.
        Whetstone and Dhrystone have unrealistic patterns that are either
        easily optimized or exploitable by benchmark-specific optimizations.
    Pitfall: Using the arithmetic mean of normalized execution times to
        predict performance.
        Result will depend on choice of reference machine.
        Use geometric mean, not arithmetic mean, for normalized times.
    Fallacy: The geometric mean of execution time ratios is proportional
        to total execution time.
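A tiny numerical example demonstrates this fallacy (the execution times are invented purely for illustration):

```python
# Execution times (seconds) of two programs on machines A and B.
times_a = [2.0, 10.0]
times_b = [4.0, 5.0]

# Normalize B to A, then take the geometric mean of the ratios.
ratios = [b / a for a, b in zip(times_a, times_b)]   # [2.0, 0.5]
geometric_mean = (ratios[0] * ratios[1]) ** 0.5      # 1.0 -> suggests a tie

# ...yet B's total execution time is 25% lower than A's.
assert geometric_mean == 1.0
assert sum(times_b) < sum(times_a)                   # 9.0 < 12.0
```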
        Violates fundamental principle of performance measurement -
        they do not predict execution time.

2.8 Concluding Remarks
    high-performance design
    low-cost design
    cost/performance ratio
    2003 cheat sheet
        TIME - ARITHMETIC   NORMALIZED TIME - GEOMETRIC   RATES - HARMONIC

2.9 Historical Perspective and Further Reading
    "peak MIPS", MOPS, and other FLOPS
    "relative MIPS"
    Quest for the average program
        kernel benchmarks (instead of Whetstone...)
    Quest for a simple program
        using quicksort etc for benchmarking ... bad!
    SPEC
        SPECfp SPECint - CPU oriented
        SPEC SDM = Systems Development Multitasking
        SPEC SFS = System-level File Server
        SPEChpc = high end scientific workloads

2.10 Key Terms

2.11 Exercises

------------------------------------------------------------------------
CHAPTER 3 - Instructions: Language of the Machine

3.1 Introduction
    instructions
    instruction set

3.2 Operations of the Computer Hardware
    DESIGN PRINCIPLE 1: Simplicity favors regularity.
        rationale for a fixed number of arguments for all arithmetic ops

3.3 Operands of the Computer Hardware
    registers
        word = 32 bits, size of a MIPS register
        unlike variables at a higher level, # of registers is limited
        32 registers for MIPS
    DESIGN PRINCIPLE 2: Smaller is faster.
        rationale for a limited number of registers
    load/store
        indexed load "lw $t0, 8($s3)"
        address in bytes
        alignment restriction - can't load from any address, just aligned
        words
    spilling registers
        "store"ing values from registers back to memory to make room
    registers are important because
        1) less time to access
        2) higher throughput than memory (can operate on more than one at
           once)
        registers are faster to access and simpler to use

3.4 Representing Instructions in the Computer
    decimal vs binary
    registers referenced by number 0..31
        NAME       #       Usage            Preserved
        $zero      0       always zero      n/a
        $at        1       assembler only
        $v0..$v1   2..3    value return     no
        $a0..$a3   4..7    arg passing      no
        $t0..$t7   8..15   temporaries      no
        $s0..$s7   16..23  saved            yes
        $t8..$t9   24..25  more temps       no
        $k0..$k1   26..27  OS only
        $gp        28      global pointer   yes
        $sp        29      stack pointer    yes
        $fp        30      frame pointer    yes
        $ra        31      return address   yes
    machine language - numeric version of an instruction
    machine code - numeric version of assembly language
    instruction format - 32 bits for MIPS
        R-type (or R-format) R=Register
            +--+--+--+--+-----+-----+
            |op|rs|rt|rd|shamt|funct|
            +--+--+--+--+-----+-----+
            | 6| 5| 5| 5|  5  |  6  |
            +--+--+--+--+-----+-----+
            op    = opcode
            rs    = register source 1
            rt    = register source 2
            rd    = register destination
            shamt = shift amount (chapter 4)
            funct = function code, selects opcode variant
        DESIGN PRINCIPLE 3: Good design demands good compromises.
            rationale for multiple instruction formats within one 32-bit
            word
        I-type (or I-format) I=Immediate
            +--+--+--+-------+
            |op|rs|rt|address|
            +--+--+--+-------+
            | 6| 5| 5|  16   |
            +--+--+--+-------+
            op      = opcode
            rs      = register source
            rt      = register target
            address = 16-bit offset
    Big Picture
        stored-program concept
            1) instructions are represented by numbers
            2) programs can be stored in memory to be read or written just
               like numbers

3.5 Instructions for Making Decisions
    if-then branches
        beq/bne - conditional branches
        labels - symbolic assembler notation for instruction address
    if-then-else
        j - unconditional branch
    basic block
        sequence of instructions with two properties
            1 no branches except possibly at end
            2 no branch labels except possibly at beginning
        early phase of compilation is breaking a program into basic blocks
    while/for loop
    slt - set if less than
        for expressions like "i < j"

3.6 Supporting Procedures in Computer Hardware
    procedure frame on the stack
        $fp -> saved argument registers (if any)
               saved return address
               saved saved registers (if any)
               local arrays and structures (if any)
        $sp ->
    global pointer
        $gp register used for accessing globals
    PRESERVED          NOT PRESERVED
    $s0-$s7            $t0-$t9
    $sp                $a0-$a3
    $ra                $v0-$v1
    above $sp          below $sp
    $fp, $gp

3.7 Beyond Numbers
    characters
        ASCII encoding - one byte per character
    lb/sb - load/store byte (only uses lower 8 bits of register)

3.8 Other Styles of MIPS Addressing
    two more ways of addressing operands
        1) faster access to small constants
        2) make branches more efficient
    Constant or immediate operands
        addi $sp, $sp, 4    (add immediate)
        slti $t0, $s2, 10   (set less than immediate)
        without immediates, need to load constants with a separate
        instruction
        DESIGN PRINCIPLE 4: Make the common case fast.
            rationale for the immediate format
            52% of gcc arithmetic instructions, 69% of spice
    lui - load upper immediate
        immediate instructions take only a 16-bit immediate value
        to load the 32-bit constant 0xdeadbeef into $t0
            lui $t0, 0xdead
            ori $t0, $t0, 0xbeef    (not addi, it would sign extend)
        assembler uses the reserved $at register along with lui when using
        32-bit constants for load/store addresses etc
    Addressing in branches and jumps
        jump instruction can use a 26-bit word address (28-bit byte
        address):
            J-type (or J-format) J=Jump
            +--+---------+
            |op| address |
            +--+---------+
            | 6|   26    |
            +--+---------+
            op      = opcode
            address = 26-bit offset
        jump only changes the lower 28 bits of the PC, not the upper 4...
        see loader/linker in 3.9
        beq/bne use I-type, so only a 16-bit word address
            this address is actually a word offset from $pc+4
            known as "PC-relative addressing"
        note for both of these that the low-order two zero bits aren't
        stored, which gives the 26+2=28 and 16+2=18
        dealing with larger conditional branch offsets
                beq $s0, $s1, L1
            replace with
                bne $s0, $s1, L2
                j   L1
            L2:
            done automatically by assembler
    MIPS addressing mode summary
        1 register addressing                R-type  add
        2 base or displacement addressing    I-type  ld/st
        3 immediate addressing               I-type  addi
        4 PC-relative addressing             I-type  beq/bne
        5 pseudo-direct addressing           J-type  j/jal

3.9 Starting a Program
    Translation hierarchy
        C program
            COMPILER
        assembly language program
            ASSEMBLER
        object: machine language module + object: library routines
            LINKER
        executable: machine language program
            LOADER
        memory
    Compiler
    Assembler
        handle pseudo-instructions
            "move $t0, $t1" => "add $t0, $t1, $zero"
            "blt" => "slt/bne"
            handle long conditional branches as noted above
            handle loads of 32-bit constants
        handle numeric bases: binary, octal, decimal, hexadecimal
        generate object file (primary purpose of course)
        typical object file format (from unix)
            object file header
                size and position of other parts of file
            text segment
                machine language code
            data segment
                both static data and dynamic data
            relocation information
                list of absolute addresses
            symbol table
                list of undefined labels
            debugging information
                source and data structure information
    Linker
        aka link editor
        allows separate compilation of modules
        three steps
            1 place code and data modules symbolically in memory
            2 determine the address of data and instruction labels
            3 patch both the internal and external references
        much faster to patch (aka edit) code than recompile and reassemble
        produces an executable that is like an object file but typically has
            no unresolved references
            no relocation information
            no symbol table
            no debugging information
        partially linked files, used to create libraries of routines, are
        still object files
    Loader
        executes an executable file
        unix example
            1 Reads the executable file header to determine the size of
              the text and data segments
            2 Creates an address space large enough for the text and data
            3 Copies the instructions and data from the executable file
              into memory
            4 Copies the parameters (if any) to the main program onto the
              stack
            5 Initializes the machine registers and sets the stack pointer
              to the first free location
            6 Jumps to a start-up routine that copies the parameters into
              the argument registers and calls the main routine of the
              program. When the main routine returns, the start-up routine
              terminates the program with an exit system call.

3.10 An Example to Put It All Together
    inlining small functions like swap
    mips compiler always saves space on the stack for arguments
        in the case of varargs, it actually saves them there

3.11 Arrays versus Pointers
    optimizing compilers can make array code as efficient as pointer
    code...
    ...but let's compare the straightforward assembly for both
    memzero example
        array   9 instructions, 7 in loop
        pointer 8 instructions, 4 in loop !!!
3.12 Real Stuff: PowerPC and 80x86 Instructions
    PowerPC
        Extra addressing modes
            Indexed Addressing
                lw $t1, $a0+$s3
                add two registers to create the address
            Update Addressing
                lw $t0, 4($s3)
                after using the address, increment the base by the
                immediate value
        Extra instructions
            load multiple / store multiple
                transfer up to 32 words at a time
                fast copies of memory when used together
                also for save/restore of registers
            counter register
                bc "branch counter"
                decrements ctr register, only branches when ctr matches
                conditions
            condition codes (or flags) for conditional branch
    Intel 80x86
        8080
            8 bit
            registers are special purpose, not general purpose
        8086
            16 bit extension to 8080 (not entirely backward compatible)
        8087
            floating point co-processor for 8086
            60 instructions
            special FP stack, not registers
        80286
            24 bit address space
            memory mapping protection model
        80386
            32 bit extension (32 bit registers and 32 bit address space)
            new instructions and addressing modes - make it more like a
            general purpose register machine
            paging mode, not just segments
        80486, Pentium, Pentium Pro
            3 multiprocessing instructions
            conditional move
        MMX
            57 new SIMD-style instructions using the floating point stack
        registers and data addressing
            about 8 gprs
            arithmetic/logical mostly two address, not three address
            7 addressing modes
        integer operations
            1 data movement - move, push, pop
            2 arithmetic/logical - test and integer and decimal arithmetic
            3 control flow - conditional, unconditional, calls, returns
                control flow uses condition registers (like powerpc)
            4 string instructions - string move and string compare
                8080 heritage - faster not to use them - see fallacy
        instruction encoding
            1-17 bytes in length

3.13 Fallacies and Pitfalls
    Fallacy: More powerful instructions mean higher performance.
        moving data faster with FP registers than move-with-repeat on x86
    Fallacy: Write in assembly language to obtain the highest performance.
    Pitfall: Forgetting that sequential word addresses in machines with
        byte addressing do not differ by one.
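That pitfall is easy to check numerically: with byte addressing, consecutive 32-bit words are 4 bytes apart (the base address below is arbitrary):

```python
base = 0x1000                                  # arbitrary word-aligned base address
word_addresses = [base + 4 * i for i in range(4)]

# Sequential word addresses differ by 4 bytes, not by 1.
assert word_addresses == [0x1000, 0x1004, 0x1008, 0x100C]
assert all(b - a == 4 for a, b in zip(word_addresses, word_addresses[1:]))
```

Writing `lw $t0, 1($s0)` to fetch "the next word" is therefore a bug (and violates the alignment restriction); the offset must be 4.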
    Pitfall: Using a pointer to an automatic variable outside its defining
        function.

3.14 Concluding Remarks
    SUMMARY OF DESIGN PRINCIPLES
        1: Simplicity favors regularity.
        2: Smaller is faster.
        3: Good design demands good compromises.
        4: Make the common case fast.

3.15 Historical Perspective and Further Reading
    accumulator architectures
        just one register
    general purpose register architectures
        register-memory (CISCy)
            - one operand can be in memory
        load-store or register-register (RISCy)
            - both operands in registers
            - need to ld/st from memory
        memory-memory (VAXy)
            - all operands can be in memory (or registers)
    compact code and stack architectures
        memory scarcity encouraged variable width instruction formats
        stack machines have dense encoding
        no registers makes for an easy compiler target - Java VM
    High-Level-Language computer architecture
        doomed by more efficient programming languages and compilers
    RISC architecture
        fixed instruction length
        load-store instruction set
        limited addressing modes
        limited operations
        MIPS, SPARC, PA-RISC, PowerPC, Alpha

3.16 Key Terms

3.17 Exercises

------------------------------------------------------------------------
CHAPTER 4 - Arithmetic for Computers

4.1 Introduction

4.2 Signed and Unsigned Numbers
    binary representation of numbers
        bits are numbered right to left
        least significant bit on right
        most significant bit on left
        overflow - when there are not enough bits to represent a number
    negative representations
        sign and magnitude
            two zeros (+0 and -0)
        one's complement
            negation algorithm: invert all bits
            still two zeros (+0 and -0)
            hardware needs an extra step to subtract
        two's complement
            MSB acts as sign bit
            negation algorithm: invert all bits and add one (unsigned)
        biased notation
            number+bias is always non-negative
            used for floating point
            00..00 is most negative, 11..11 is most positive
    sign extension
        example: when loading a byte into a word register with "lb", the
        upper 24 bits match the sign bit to preserve two's complement for
        numbers.
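The sign-extension rule can be mimicked in a few lines; this is a sketch of the idea, not actual MIPS semantics:

```python
def sign_extend_byte(b):
    """Interpret the low 8 bits of b as a signed byte, replicating the
    sign bit through the upper bits (as MIPS lb does into a register)."""
    b &= 0xFF                          # keep only the byte
    return b - 0x100 if b & 0x80 else b

assert sign_extend_byte(0x7F) == 127   # sign bit clear: value unchanged
assert sign_extend_byte(0xFF) == -1    # sign bit set: upper bits become 1s
assert sign_extend_byte(0x80) == -128
```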
        "lbu" is used for characters, since they don't logically have a
        sign bit.
    addresses
        slt/slti need unsigned variants as well (sltu/sltiu) so we can
        compare addresses, which are logically unsigned
        (note that address offsets are usually signed)

4.3 Addition and Subtraction
    addition - just like decimal
    subtraction - two's complement, just negate and add
    overflow (addition) - only when both numbers have the same sign
        adding two positives and getting a negative
        adding two negatives and getting a positive
        - add/sub cause an exception on overflow
        - MIPS addu/subu variants never overflow
        - C never has overflow exceptions, so it always uses addu/subu
        - in order to recover from the exception, use mfc0 to recover the
          ExceptionPC. the exception handler uses the reserved registers
          $k0..$k1 to avoid disturbing "caller" registers.

4.4 Logical Operations
    shifts sll/srl (use shamt field)
    bitwise logic and/or/andi/ori

4.5 Constructing an Arithmetic Logic Unit
    A 1-Bit ALU
        logical operations
            A and B, A or B, !A, A, B given directly by hardware
            AND, OR, INVERTER, MULTIPLEXOR
        adder
            inputs:  a, b, CarryIn
            outputs: sum, CarryOut
        ALU
            inputs:  a, b, CarryIn, operation
            outputs: result, CarryOut
            operation: selects AND, OR, NOT, A, B, A+B
    A 32-Bit ALU
        at worst, connect 32 1-bit ALUs using "ripple carry"
        subtraction
            need to convert B to two's complement
                - negate B's bits using the inverter
                - add one (use CarryIn): A + !B + CarryIn, with CarryIn=1
            add a BInvert input to the ALU
        MIPS additions
            slt
                need extra output "set" from the adder
                need extra input "less"
                feed MSB "set" to LSB "less" for slt
            overflow bit
                use "set" from the adder plus extra logic
            BNegate
                since BInvert == CarryIn always, just combine them to make
                BNegate
            Zero
                perform A-B and make sure all output bits are zero
                used for conditional branch testing
        Finished ALU
            inputs:  a, b, operation
            outputs: zero, result, overflow, CarryOut
        Carry Lookahead
            our 32-bit ALU is slow, propagating CarryOut through all 32
            1-bit ALUs
            carry-lookahead adder
                generate and propagate - P & G - only scales to 4 inputs
            carry-lookahead unit
                combines the most significant P & G to produce Carry
                inputs for the adders
        Shift
            use a separate barrel shifter
            handles arbitrary rotation in the time of one add, without
            propagation delay

4.6 Multiplication
    multiplicand * multiplier = product
    overflow - addition had 1-bit overflow, mul overflow can be a whole
    word
    1st version
        algorithm
            walk multiplier from LSB to MSB
            if current multiplier bit is on, add multiplicand to product
            shift left multiplicand
        requires 64-bit multiplicand, 64-bit ALU, 64-bit product,
        32-bit multiplier
    2nd version
        algorithm
            walk multiplier from LSB to MSB
            if current multiplier bit is on, add multiplicand to product
            shift right product
        requires 32-bit multiplicand, 32-bit ALU, 64-bit product,
        32-bit multiplier
    final version
        algorithm
            walk multiplier from LSB to MSB
            if current multiplier bit is on, add multiplicand to product
            shift right product (also shifts multiplier, giving the "walk")
        requires 32-bit multiplicand, 32-bit ALU, 64-bit product (shared
        with 32-bit multiplier)
    signed numbers
        just sign extend as we shift right
    Booth's algorithm
        replace some arithmetic with shifts (assuming shifting is faster
        than arithmetic)
        strength reduction
            similar conceptually to using left shift for multiplying by 2
        unfortunately Booth's algorithm sometimes does more arithmetic -
        depends on the input
    MIPS: overflow checking must be done in software

4.7 Division
    Dividend = Quotient * Divisor + Remainder
    Remainder
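The division identity above can be sanity-checked with arbitrary operands (the numbers below are made up):

```python
dividend, divisor = 74, 21
quotient = dividend // divisor      # integer quotient: 3
remainder = dividend % divisor      # what's left over: 11

# Any correct divider must satisfy Dividend = Quotient * Divisor + Remainder.
assert dividend == quotient * divisor + remainder
assert 0 <= remainder < divisor
```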