# Computer Architecture Comprehensive Exam

#### **Exam Instructions**

Answer each of the questions included in the exam. Write all of your answers directly on the examination paper, including any work that you wish to be considered for partial credit. The examination is open-book, and you may make use of the text, handouts, your own course notes, and a calculator.

**On equations:** Wherever possible, make sure to include the equation, the equation rewritten with the numerical values, and the final solution. Partial credit will be weighted appropriately for each component of the problem, and providing more information improves the likelihood that partial credit can be awarded.

**On writing code:** Unless otherwise stated, you are free to use any of the assembly instructions listed in the Appendix at the back of the book, including pseudoinstructions. You do not need to optimize your MIPS code unless specifically instructed to do so.

**On time:** You will have one hour to complete this exam. Budget your time and try to leave some time at the end to go over your work. The point weightings correspond roughly to the time each problem is expected to take.

#### THE STANFORD UNIVERSITY HONOR CODE

The Honor Code is an undertaking of the students, individually and collectively:

- 1. that they will not give or receive aid in examinations; that they will not give or receive unpermitted aid in class work, in the preparation of reports, or in any other work that is to be used by the instructor as the basis of grading;
- 2. that they will do their share and take an active part in seeing to it that others as well as themselves uphold the spirit and letter of the Honor Code.

I acknowledge and accept the Honor Code.

Magic Number: \_

 Score
 Grader

 1. Short Answer
 (15)
 \_\_\_\_\_\_

 2. Pipelining
 (15)
 \_\_\_\_\_\_\_

 3. Memory Heirarchy
 (15)
 \_\_\_\_\_\_\_

 4. Cache Math
 (5)
 \_\_\_\_\_\_\_

 5. MIPS Assembly
 (10)
 \_\_\_\_\_\_\_\_

 TOTAL
 (60)
 \_\_\_\_\_\_\_\_\_

### Problem 1: Short Answer (15 points)

Please provide short, concise answers.

(1) (2 points) Your company, Acme Corp., is deciding between two computer systems to deploy its new killer Road Runner Tracking application. Ben Bitdiddle says that his company's system has the better performance because it has the higher clock speed and the higher IPC. Explain why his logic is flawed.

(2) (2 points) Some RISC architectures require the compiler (or assembly programmer) to guarantee that a register not be accessed for a given number of cycles after it is loaded from memory. Give an advantage and a disadvantage of this design choice.

(3) (2 points) Ben Bitdiddle is writing an optimizing compiler for an architecture that supports virtual memory. He notices that his target processor can execute an unaligned 32 bit load more quickly than an 8 bit load. Is it okay for his compiler to emit 32 bit loads and then ignore the extra 24 bits? Why or why not?

(4) (3 points) Briefly describe the data access pattern of an application for which an LRU (least recently used) cache replacement policy performs worse than a random replacement policy.

(5) (3 points) Briefly describe the data access pattern of an application for which a cache with a random replacement policy performs worse than an LRU replacement policy.

(6) (3 points) Briefly give two ways in which loop unrolling can increase performance and one in which it can decrease performance.

### Problem 2: Pipelining (15 points)

The National Security Agency has designed a new encryption device called the Conundrum, composed of nine combinational modules connected as shown in the diagram below:



The device takes an integer value X and computes an encrypted version C(X). In the diagram above each combinational component is marked with its propagation delay in microseconds; contamination delays are zero for each component. Assume an ideal (zero-delay) register on both the input and the output of the device.

(1) (4 points) What is the latency and throughput of the Conundrum device?

latency  $(\mu s)$  \_\_\_\_\_ throughput  $(1/\mu s)$  \_\_\_\_\_

(2) (9 points) The NSA needs to produce a version of the Conundrum device that has a throughput larger than 1/15 but wants the implementation with smallest latency (cycle time \* pipeline depth) that meets the throughput constraint. Using the diagram below indicate the locations of ideal (zero-delay) registers to create a pipelined implementation that meets these goals. You may use as many registers as necessary. We'll give partial credit for designs with throughput larger than 1/15; full credit for achieving the smallest possible latency. (Extra copies of this diagram can be found on the following page, but only solutions written on the diagram below will be graded).



(3) (2 points) What is the latency of your pipelined implementation?

latency  $(\mu s)$  \_\_\_\_\_ throughput  $(1/\mu s)$  \_\_\_\_\_



(These diagrams are for scratch work; no solution written here will be graded. Record your solution on the previous page.)

## Problem 3: Memory Heirarchy (15 points)

Assume you have a 1 GHz processor with 2-levels of cache and DRAM main memory. The first level cache is split for instructions and data. The system does not use early restart or critical word first, i.e. data blocks must be completely transfered before their results are available. The memory system has the following parameters (Note that here 1KB = 1024 bytes):

|               | Hit Time                              | Miss Rate          | Block Size |
|---------------|---------------------------------------|--------------------|------------|
| Level-1 cache | 1 cycle                               | 6% for data        | 32 bytes   |
|               |                                       | 2% for instruction |            |
| Level-2 cache | 12  cycles + (1  cycle per  64  bits) | 2%                 | 256 bytes  |
| DRAM          | 70ns + (10ns per 8 bytes)             | _                  | _          |

The system includes a TLB with a miss rate of 0.5% for data and never incurs a TLB miss for instructions. The TLB miss penalty is 300 cycles and TLB hits take place in parallel with level-1 cache access. All caches in the system are virtually indexed and physically tagged. Assume that the system never swaps memory out to disk.

(1) (5 points) What is the average memory access time (AMAT) in clock cycles for <u>instructions</u>?

(2) (5 points) What is the AMAT in clock cycles for <u>data</u>? Assume all data accesses are loads.

(3) (5 points) Suppose that we measure the following instruction mix for a program:

Loads: 25%, Stores: 15%, Integer: 30%, Floating-Point: 20%, Branches: 10%

Assume that the processor is using the 5-stage pipeline (base CPI of 1.0). Data hazards cause an average penalty of 0.9 cycles for floating point operations. Integer operations run at maximum throughput. The processor uses the predict-branch-not-taken technique, which turns out to be correct for 80% of the branches. For the remaining branches, there is a 1 cycle stall. What is the average CPI of this program including memory misses (from questions a and b)? Assume that stores have the same AMAT as loads.

#### Problem 4: Cache Math (5 points)

Answers to this question may be in the form of a bare number (decimal or hex),  $2^n$ , or nk (=  $n*2^{10} = n*1024$ ), for example, 65536, 0x10000,  $2^{16}$ , or 64k are all equivalent and acceptable answers.

Consider a 256kb 4-way set associative cache with 256 byte cache lines for a processor that uses 64-bit data words and 48-bit byte addresses. Assume the variable x, of type uint $64_t$ , is stored in memory at location  $0x4A85_B413_A518$ .

(1) (2 points) Fill in the bit ranges in the following diagram. Bit ranges should be inclusive. For example, if a field uses bits 0, 1, 2, and 3, label it 3:0.

| 47 : | :   | :            | : 0          |
|------|-----|--------------|--------------|
| tag  | set | word in line | byte in word |

(2) (3 points) Assume that x is present in the cache, and char\* ptr = 0x4A85\_B400\_0000. Determine if the following accesses will cause a cache miss. For each access circle MISS if it <u>must</u> cause a miss, HIT if it will <u>never</u> cause a miss, or NOT ENOUGH INFO.

| $(ptr + 0x11_A538)$ | MISS | HIT | NOT ENOUGH INFO |
|---------------------|------|-----|-----------------|
| $(ptr + 0x13_A588)$ | MISS | HIT | NOT ENOUGH INFO |
| $(ptr + 0x13_0218)$ | MISS | HIT | NOT ENOUGH INFO |

#### Problem 5: MIPS Assembly (10 points)

Here is mips assembly for a function with the signature (in C):

int f(char\* a, char\* b);

Remember, in a MIPS function call, ra contains the return address, a0 and a1 contain the first and second arguments, respectively, and upon return v0 contains the return value. The s# registers must be preserved across procedure calls. Also remember that standard MIPS has a branch delay slot.

00000000 <f>:

- v1,0(a0) 0: lb 4: lb v0,0(a1) 8: a0,a0,1 addiu c: subu v0,v1,v0 v0,20 <f+0x20> 10: bnez 14: addiu a1,a1,1 18: bnez v1,0 <f> 1c: nop 20: jr ra 24: nop
- (1) (7 points) Explain briefly, in english, what this function does.

(2) (3 points) Give one optimization that can be performed on the assembly code.