Simple But Effective Techniques for NUMA Memory Management
William J. Bolosky, et al. - 1989
----------------------------------------------------------

Summary: The key NUMA problem is that performance depends heavily on the
extent to which data reside close to the processes that use them; this
paper presents simple techniques for dealing with this problem.

Example for reference: IBM ACE multiprocessor workstation
- Global memory: 2.3 times slower than local memory for fetches and 1.7
  times slower for stores.
- Remote memory: usually slower than global memory.

Page placement strategy (similar to Li's directory-based ownership
protocol used in distributed shared memory):
- Replicate read-only pages on the processors that read them.
- Move written pages to the processors that write them.
- Permanently place a page in global memory if the page is routinely
  written by more than one processor. Mach's policy is to count the
  number of migrations for a page; when a threshold is reached, the
  page is permanently placed in global memory.
- Why not use remote memory? This is appropriate for data used
  frequently by one processor and infrequently by all others. However,
  there is no reasonable way to determine the best location without
  pragmas or special-purpose hardware to measure frequency of reference
  for each processor on all pages.
- Fundamental problem: Without programming pragmas, it is difficult to
  use reference behavior to make a good page placement decision because
  there are two competing desires:
  1. Don't pin pages in global memory too early; transfers in ownership
     can reflect transient behavior.
  2. Pin pages earlier; page migration takes time, so globally shared
     pages should be pinned as soon as possible.

Data replication and migration can be implemented at four possible
levels:
- Hardware (consistent caches): Eliminates software overhead and
  reduces false sharing by performing consistency at the granularity of
  a cache line.
  However, the authors question whether this is feasible and claim that
  it will be expensive.
- Application-specific code: With application-specific information,
  decent performance can be achieved, but this option places a burden
  on the programmer and requires that all applications be modified to
  run on Mach.
- Operating system: Existing parallel programs can be run on Mach
  without modification, and it is easy to make future programs
  portable. This leaves the option of modifying applications open. In
  Mach, the page placement mechanism is placed in machine-dependent
  portions of the virtual memory system.
- Compilers or library routines:

Evaluation:
- Page placement performs well in general, but poorly when pages are
  heavily shared. However, even optimal placement cannot solve this
  problem. Restructuring the program to reduce the amount of
  write/false sharing is the only way to improve performance.
- False sharing occurs when an object that is not itself writably
  shared resides on a writably shared page. It can be reduced by
  separating objects with page-size padding, or by forcing proximity of
  objects that are used together by merging them into a single object.
  Note that some false sharing is created explicitly by application
  source code, while some of it is created implicitly by the compiler
  and loader.
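
The padding remedy for false sharing can be illustrated with a small C
sketch. The struct names and the 4096-byte page size are illustrative
assumptions, not taken from the paper; real code would query the page
size at run time (e.g. with sysconf):

```c
#include <stddef.h>

/* Assumed page size for an ACE-like machine (illustrative only). */
#define PAGE_SIZE 4096

/* BAD: two counters written by different processors land on the same
 * page, so the page is writably shared even though neither counter is
 * itself shared -- classic false sharing. */
struct counters_bad {
    long counter_a;   /* written only by processor A */
    long counter_b;   /* written only by processor B */
};

/* BETTER: page-size alignment keeps each writable object on its own
 * page, so a placement policy can migrate each page to the one
 * processor that actually writes it. */
struct counters_good {
    _Alignas(PAGE_SIZE) long counter_a;
    _Alignas(PAGE_SIZE) long counter_b;
};
```

The cost of this fix is space: each padded object now consumes a full
page, which is why the notes also mention the opposite remedy of
merging objects that are used together.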
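
The replicate/migrate/freeze placement policy summarized in these notes
can be sketched as a per-page state machine driven by write faults. The
field names and the threshold value are assumptions for illustration;
they are not the actual Mach data structures:

```c
/* Assumed migration-count threshold (illustrative, not Mach's value). */
#define MIGRATION_THRESHOLD 4

enum placement { REPLICATED, LOCAL, PINNED_GLOBAL };

struct page_state {
    enum placement where;
    int owner;        /* processor holding the writable copy */
    int migrations;   /* how many times the writable copy has moved */
};

/* Called from the fault handler when processor `cpu` writes the page.
 * Read faults (not shown) would simply replicate a read-only copy on
 * the faulting processor. */
void on_write_fault(struct page_state *p, int cpu)
{
    if (p->where == PINNED_GLOBAL)
        return;  /* frozen: all processors access it in global memory */

    if (p->where == REPLICATED || p->owner != cpu) {
        /* Invalidate replicas and move the page toward the writer. */
        p->migrations++;
        if (p->migrations > MIGRATION_THRESHOLD) {
            /* Routinely written by more than one processor: pin it. */
            p->where = PINNED_GLOBAL;
            return;
        }
        p->where = LOCAL;
        p->owner = cpu;
    }
    /* Repeated writes by the current owner cost nothing. */
}
```

The threshold embodies the tension listed above: a low value pins pages
on transient sharing, a high value pays for extra migrations first.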