Memory Coherence in Shared Virtual Memory Systems

Li, Hudak

 

Context: Multiprocessor, distributed physical memory

Goal: provide shared virtual memory – virtual memory address space that can be accessed by any processor in the multiprocessor system

 

don’t just page memory between a processor’s physical memory and disk, but also across the physical memories of different processors

 

main challenge: memory coherence – ensure that the value a processor reads from an address is always the most recent value written to that address (by any processor).

 

traditionally, programmers had to do explicit message passing between processors.  shared virtual memory requires less programmer intervention (no modifications to programs), but we must pay the overhead of keeping memory coherent.  message passing requires a lot of effort to restructure the application to take advantage of parallelism, and has trouble passing large or hard-to-marshal data structures.  distributed shared memory addresses these problems: the application does not need to be modified specifically for parallel processors, and data structures don’t need to be marshalled – the pages containing the data are simply paged in at whatever processor needs them.  on the other hand, it is probably possible to achieve better performance with hand-coded message passing (although the techniques described in this paper and in Munin significantly close the performance gap between the two approaches).

 

false sharing: two processes are writing to different parts of the same page.  from the standpoint of the DSM, the processes appear to share the page, but they really only write to different memory locations within it.  large page sizes make false sharing more likely, causing unnecessary contention and page invalidations.

 

 

a similar problem is keeping the caches of the processors (multicache coherence) in sync, but it is different because there all processors share one large physical memory and their caches just hold different parts of it; it is usually solved in hardware, which can afford a small time delay between conflicting writes.  that solution is not applicable here: there is no single uniform physical memory, communication between processors has high latency, and a miss must be handled more like a page fault.

 

basic model: each processor has its own memory mapping manager (i.e., TLB + page table) that maps virtual addresses to physical addresses.  if a translation for a given virtual address does not exist in the page table, the processor takes a (read/write) page fault, and the fault handler looks for that page in the memory of some other processor before going to disk.
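
A minimal sketch, in Python, of what the fault handler looks like under this model.  Everything here is my own notation: the page table layout, request_page(), and the access levels are assumptions, not code from the paper.

    ACCESS_NONE, ACCESS_READ, ACCESS_WRITE = 0, 1, 2

    page_table = {}        # vpn -> {"access": level, "frame": local frame number}

    def request_page(vpn, want_write):
        """Hypothetical message layer: ask another processor for a copy of the
        page (and write permission if needed), falling back to disk."""
        raise NotImplementedError

    def page_fault(vpn, is_write):
        needed = ACCESS_WRITE if is_write else ACCESS_READ
        entry = page_table.get(vpn)
        if entry is None or entry["access"] < needed:
            # no translation (or insufficient access): fetch the page from the
            # physical memory of some other processor before going to disk
            frame = request_page(vpn, want_write=is_write)
            page_table[vpn] = {"access": needed, "frame": frame}
        # after the handler returns, the faulting instruction is retried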

 

performance of parallel program depends on: 1) number of concurrent processes, and 2) amount of contention between processes for shared data.

 

multiprocessor characteristics: sending ten bytes between processors takes about the same time as sending a hundred or a thousand (due to software/protocol overhead) -> encourages larger page sizes.  contention & false sharing -> encourage smaller page sizes.  Authors find that 1K is usually good enough to avoid contention.  They suggest avoiding larger sizes, but believe that smaller sizes could work OK as well.

 

How to deal with multiple writers to a page?

1)      write-broadcast.  broadcast the written page to every processor on each write to a memory location.  too expensive to be practical.

2)      invalidation.  allow only one writer at a time.  when a processor wants to write a page it does not “own”, it sends a message to all other processors to invalidate their copies of the page, takes ownership, and then retries the faulting instruction.

 

How to deal with multiple readers of a page?  (a little easier than dealing with multiple writers)  If some processor has write access, downgrade its access to read, and give the requestor a copy of the page.

 

Page ownership: fixed (only the owner of the page is allowed to write to the page, and that owner does not change; other processors that want to write to that page negotiate with the owner) or dynamic (owner of the page changes over time).  fixed page ownership is very expensive.  dynamic ownership strategies are practical.

 

Dynamic Page Ownership Strategies:  Centralized or Distributed.  Centralized = Manager of all pages is centralized.  Distributed = Different processors manage different pages.

 

Manager = the processor responsible for keeping track of who owns the page (the memory address mapping for it).

Owner = the processor currently allowed to write to the page.

 

In multiprocessors, page tables have additional per-page fields (sketched below):
access: read, write, or none

copy set: set of processors that have read copies of the page

lock: synchronizes multiple read/write page faults on the page (by different processes on the same processor, as well as requests arriving from other processors)
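
A sketch of what the per-page entry might look like in Python (field names follow these notes; using threading.Lock to stand in for the paper’s per-page lock is my assumption):

    import threading
    from dataclasses import dataclass, field

    @dataclass
    class PageEntry:
        access: str = "none"                        # "read", "write", or "none"
        copyset: set = field(default_factory=set)   # processor ids with read copies
        # serializes page faults on this page, whether they come from local
        # processes or from requests sent by other processors
        lock: threading.Lock = field(default_factory=threading.Lock)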

 

Three ways to invalidate: individual (send an invalidate request to every processor in the copy set), broadcast, multicast

 

QUESTIONS: WHAT IS THE DIFFERENCE BETWEEN BROADCAST & MULTICAST?  (broadcast = everyone, multicast = selected processors concurrently ?)

 

DIDN’T UNDERSTAND HIS MESSAGE COUNTS ON PAGE 329.

 

Centralized Manager Algorithm:

similar to a monitor: all read/write requests are sent to the central manager, which knows the owner, copy set, and lock for each page.  when the manager gets a read request, it tells the owner to send a copy of the page to the requestor and adds the requestor to the copy set.  when the manager gets a write request, it tells the owner to give the page up; the owner sends invalidations to everyone in the copy set, and the manager changes the owner to the requestor.  (the owner sends a confirmation message back to the manager once the replies to the invalidation messages have been received, so that the manager can safely give ownership to the requestor.)  a rough sketch of the manager side follows the message examples below.

 

            Read Example (4 messages; each request is given a reply)

            requestor: manager.read(p)                          /* lock */
            manager:   owner[p].sendCopy(requestor, p)
            manager:   copyset[p] = copyset[p] + requestor      /* unlock */

            Write Example (4 messages)

            requestor: manager.write(p)                         /* lock */
            manager:   owner[p].giveMeBackThePage(p)
            owner:     for all i in copyset[p], invalidate(i, p)
            manager:   owner[p] = requestor                     /* unlock */
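
A rough Python sketch of the manager side of these two cases (my code, not the paper’s; send() and the blocking confirmation are stand-ins for the real message layer):

    from collections import defaultdict

    owner   = {}                  # page -> processor that owns it
    copyset = defaultdict(set)    # page -> processors holding read copies
    locked  = set()               # pages with a fault currently in progress

    def send(dest, msg, page, extra=None):
        """Hypothetical message primitive; blocks until the reply arrives."""
        raise NotImplementedError

    def handle_read_fault(requestor, p):
        locked.add(p)                                        # lock
        send(owner[p], "send_copy_to", p, requestor)         # owner ships a copy
        copyset[p].add(requestor)
        locked.discard(p)                                    # unlock

    def handle_write_fault(requestor, p):
        locked.add(p)                                        # lock
        # old owner invalidates everyone in the copy set, hands the page to the
        # requestor, and confirms back to the manager (send() blocks until then)
        send(owner[p], "give_page_to", p, (requestor, copyset[p]))
        copyset[p] = set()
        owner[p] = requestor
        locked.discard(p)                                    # unlock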

 

Improved Centralized Manager Algorithm: (eliminates the need for the confirmation message)

manager no longer handles synchronization of page ownership; that responsibility moves to the owners.  the manager just tells the requestor who the owner is for a given page (actually, the manager just forwards the request to the owner).

 

            Read Example (3 messages; manager does not send a reply to the requestor)

            requestor: manager.read(p)
            manager:   owner[p].sendCopy(requestor, p)          /* lock (owner) */
            owner:     copyset[p] = copyset[p] + requestor      /* unlock (owner) */

            Write Example (3 messages)

            requestor: manager.write(p)
            manager:   owner[p].doWrite(requestor, p)           /* lock (owner) */
            owner:     for all i in copyset[p], invalidate(i, p)
            owner:     owner[p] = requestor                     /* unlock (owner) */

 

Fixed Distributed Manager Algorithm

the manager is no longer centralized: the manager for a page is determined by hashing the page address, so every processor can compute which processor manages any given page.  performs better than the centralized manager.
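
The page-to-manager mapping can be as simple as a modulus of the page number (a Python sketch; the actual hash function is not specified in these notes):

    def manager_of(page_number, num_processors):
        # every processor can compute the manager of any page locally,
        # with no messages
        return page_number % num_processors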

 

 

 

 

Broadcast Distributed Manager Algorithm

each processor manages the pages that it owns.  read and write requests are broadcast to all processors.  the algorithm is simple and works when the number of processors (N) <= 4, but does not scale when N > 4 due to the cost of broadcasts.

 

Dynamic Distributed Manager Algorithm

each processor keeps track, in its local page table, of the owner of every page – actually, of the “probable” owner.  if a processor is asked about a page it does not own, it forwards the request to its probable owner, and so on until the request reaches the true owner.  on a read page fault, the owner gives ownership to the requestor and changes its probable owner to the requestor.  on a write page fault, the same thing happens, except that once the requestor gets the page, it sends invalidations to all processors in the copy set (which the owner maintains and hands over with the page).
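
A sketch of the probable-owner forwarding for a write fault (Python; the names, the message layer, and the page-transfer helper are my assumptions):

    MY_ID = 0                 # id of this processor (assumed)
    prob_owner = {}           # page -> this processor's best guess at the owner

    def forward(dest, msg):
        """Hypothetical: send msg to processor dest."""
        raise NotImplementedError

    def ship_page_and_ownership(dest, page):
        """Hypothetical: transfer the page, its copy set, and ownership."""
        raise NotImplementedError

    def handle_write_request(msg):
        p, requestor = msg["page"], msg["from"]
        if prob_owner.get(p, MY_ID) != MY_ID:
            # not the owner: chase the hint, and point our own hint at the
            # requestor, who is about to become the owner
            forward(prob_owner[p], msg)
            prob_owner[p] = requestor
        else:
            # true owner: hand over the page + ownership; the requestor will
            # then invalidate everyone in the copy set it received
            ship_page_and_ownership(requestor, p)
            prob_owner[p] = requestor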

 

This scheme makes analysis easy (the forwarding and updating of probable owners behaves like the find operation in union-find set algorithms).  A page fault reaches the owner in at most N-1 messages.  The worst-case number of messages for locating the owner of a single page K times is O(N + K log N).  More precisely, it is O(p + K log q), where p is the number of processors that have ever used the page and q is the number of contenders for it.  Note that performance does not degrade as the number of processors increases! :)
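
For intuition (my numbers, not the paper’s): with N = 64 processors and K = 100 faults on one page, the worst case is on the order of N + K*log2(N) = 64 + 100*6 = 664 messages to locate the owner, versus K*(N-1) = 6,300 if every fault had to chase the full forwarding chain.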

 

In reality, however, to make the algorithm more efficient, the owner of a page does not change on a read.  On a read, the owner will hand out a copy of the page, and add the requestor to the copy set. 

 

Improvement on Dynamic Distributed Manager Algorithm

 

After every write page fault, the new owner becomes known to all processors that were in the copy set (or is it to all processors?).  However, this information is not disseminated on reads, which can let a long probable-owner chain build up.  The improvement: broadcast the new owner of a page to all processors after every M page faults on it.  (The fault counter is maintained by the owner of the page, and this value is copied to the new owner whenever the owner changes.)  If M = 0, this is equivalent to the broadcast distributed algorithm; if M = N-1, it is equivalent to the unmodified dynamic distributed manager algorithm.  The paper provides tables (from simulations) of the average number of messages required per find, which can be used to pick a good value of M for a given N.  The smaller M is, the fewer messages a find requires (but the more broadcasts are sent).
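
A sketch of the counter that would drive the periodic broadcast (Python; the counter handoff and broadcast_new_owner() are my assumptions about the mechanism):

    M = 32                        # broadcast period; pick using the paper's tables

    def broadcast_new_owner(page, new_owner):
        """Hypothetical: tell every processor to reset prob_owner[page]."""
        raise NotImplementedError

    def on_ownership_transfer(page, new_owner, fault_count):
        # fault_count is handed from the old owner to the new one along with
        # the page; every M faults the true owner is announced to everyone,
        # which collapses any long probable-owner chains
        fault_count[page] = fault_count.get(page, 0) + 1
        if fault_count[page] >= M:
            broadcast_new_owner(page, new_owner)
            fault_count[page] = 0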

 

Distribution of Copy Sets

Copy sets of all processors that have a given page form a tree: the probable owner is the parent link, and the processors in a node’s copy set are its children.  On a read page fault, the requestor just finds some processor that has a copy of the page and gets added to that processor’s copy set (the only difference is that this operation needs a lock).  On a write page fault, the owner sends invalidations to every processor in its copy set, and each processor that receives an invalidation propagates it to the processors in its own copy set.  The invalidation now takes O(log m) time for m read copies, provided the copy-set tree is balanced.
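
A sketch of the fan-out (Python; in the real system each recursive call is a message to another processor, so the depth of the recursion, not the total work, bounds the latency):

    def drop_local_copy(node, page):
        """Hypothetical: clear node's page-table entry for page."""
        raise NotImplementedError

    def propagate_invalidate(node, page, copyset_of):
        # node drops its copy, then forwards the invalidation to the processors
        # in its own copy set (its children in the tree); with m read copies in
        # a balanced tree this takes O(log m) message rounds, because every
        # level of the tree forwards in parallel
        drop_local_copy(node, page)
        for child in copyset_of[node].get(page, set()):
            propagate_invalidate(child, page, copyset_of)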

Experiments and conclusions

 

Experiments run on 4 parallel programs:

1)      Jacobi program for solving 3D PDEs (partial differential equations)

2)      Parallel sorting

3)      Matrix multiply

4)      Parallel dot product

 

The best a parallel program can do is a linear speedup with respect to the number of processors, if the algorithm is perfectly parallelizable.

 

Superlinear speedup was achieved for the 3D PDE because the problem size (N = 50^3) was too big to fit in a single processor’s memory, which caused a large amount of paging between physical memory and disk.  When the parallel version was run, the problem fit in the combined memory of all processors, there was no paging to disk, and superlinear speedup resulted.  This is sort of unfair, though, because the uniprocessor only had 1/N of the available memory.  When the problem size was reduced to N = 40^3, which fits in the physical memory of a single processor, the speedup was sub-linear in the number of processors, as expected.  The sub-linear results were good and comparable to the best existing parallel implementations.

 

Parallel sorting didn’t do very well, although this seemed expected, since “parallel sorting on a loosely coupled multiprocessor is generally difficult.”  (I guess this is due to all the copying and recombination of results that has to take place?)

 

Parallel dot-product was horrible (each processor computed a subset of the sums required for the dot product).  The data set only needs to be read once, so a shared virtual memory system is in general not likely to provide a speedup here.

 

Matrix multiplication did well.

 

 

Conclusions

 

* Central manager alg: straightforward implementation, but becomes bottleneck

* Fixed distributed manager alg: alleviates the bottleneck, but two messages are required per owner lookup.

* Dynamic distributed manager alg: most desirable; average number of messages to locate an owner is just under 2.

* Possible to apply dynamic distributed manager algorithms to multiprocessor.

* Even for parallel programs that share fine-grained data, most memory references were read-only.

* Expect poor performance when there are frequent updates to shared data, or large data sets that are only read once (i.e. dot product)

* Good page size is still up in the air.  Also, how do things play out with >8 processors?