Simple But Effective Techniques for NUMA Memory Management
William J. Bolosky, et al. - 1989
----------------------------------------------------------

Summary: The key NUMA problem is that performance depends heavily on the
extent to which data reside close to the processes that use them; this
paper presents simple techniques for dealing with this problem.

Example for reference: IBM ACE multiprocessor workstation
- Global memory: 2.3 times slower than local memory for fetches and 1.7
  times slower for stores.
- Remote memory: usually slower than global memory.

Page placement strategy (similar to Li's directory-based ownership
protocol used in distributed shared memory):
- Replicate read-only pages on the processors that read them.
- Move written pages to the processors that write them.
- Permanently place a page in global memory if the page is routinely
  written by more than one processor. Mach's policy is to count the
  number of migrations for a page; when a threshold is reached, the
  page is permanently placed in global memory.
- Why not use remote memory? This is appropriate for data used
  frequently by one processor and infrequently by all others. However,
  there is no reasonable way to determine the best location without
  pragmas or special-purpose hardware to measure frequency of reference
  for each processor on all pages.
- Fundamental problem: Without programming pragmas, it is difficult to
  use reference behavior to make a good page placement decision because
  there are two competing desires:
  1. Don't pin pages in global memory too early; transfers in ownership
     can reflect transient behavior.
  2. Pin pages earlier; page migration takes time, so globally shared
     pages should be pinned as soon as possible.

Data replication and migration can be implemented at four possible
levels:
- Hardware (consistent caches): Eliminates software overhead and
  reduces false sharing by performing consistency at the granularity of
  a cache line.
  However, the authors question whether this is feasible and claim that
  it will be expensive.
- Application-specific code: With application-specific information,
  decent performance can be achieved, but this option places a burden
  on the programmer and requires that all applications be modified to
  run on Mach.
- Operating system: Existing parallel programs can be run on Mach
  without modification, and it is easy to make future programs
  portable. This leaves the option of modifying applications open. In
  Mach, the page placement mechanism is placed in machine-dependent
  portions of the virtual memory system.
- Compilers or library routines:

Evaluation:
- Page placement performs well in general, but poorly when pages are
  heavily shared. However, even optimal placement cannot solve this
  problem. Restructuring the program to reduce the amount of
  write/false sharing is the only way to improve performance.
- False sharing occurs when an object that is not itself writably
  shared resides on a writably shared page. It can be reduced by
  separating objects with page-size padding, or by forcing proximity of
  objects that are used together by merging them into a single object.
  Note that some false sharing is created explicitly by application
  source code, while some of it is created implicitly by the compiler
  and loader.
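
The padding remedy for false sharing can be illustrated with a small C
sketch. The struct names and the 4096-byte page size are illustrative
assumptions, not taken from the paper; real code would query the page
size at run time (e.g. with sysconf):

```c
#include <stddef.h>

/* Assumed page size for an ACE-like machine (illustrative only). */
#define PAGE_SIZE 4096

/* BAD: two counters written by different processors land on the same
 * page, so the page is writably shared even though neither counter is
 * itself shared -- classic false sharing. */
struct counters_bad {
    long counter_a;   /* written only by processor A */
    long counter_b;   /* written only by processor B */
};

/* BETTER: page-size alignment keeps each writable object on its own
 * page, so a placement policy can migrate each page to the one
 * processor that actually writes it. */
struct counters_good {
    _Alignas(PAGE_SIZE) long counter_a;
    _Alignas(PAGE_SIZE) long counter_b;
};
```

The cost of this fix is space: each padded object now consumes a full
page, which is why the notes also mention the opposite remedy of
merging objects that are used together.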
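
The replicate/migrate/freeze placement policy summarized in these notes
can be sketched as a per-page state machine driven by write faults. The
field names and the threshold value are assumptions for illustration;
they are not the actual Mach data structures:

```c
/* Assumed migration-count threshold (illustrative, not Mach's value). */
#define MIGRATION_THRESHOLD 4

enum placement { REPLICATED, LOCAL, PINNED_GLOBAL };

struct page_state {
    enum placement where;
    int owner;        /* processor holding the writable copy */
    int migrations;   /* how many times the writable copy has moved */
};

/* Called from the fault handler when processor `cpu` writes the page.
 * Read faults (not shown) would simply replicate a read-only copy on
 * the faulting processor. */
void on_write_fault(struct page_state *p, int cpu)
{
    if (p->where == PINNED_GLOBAL)
        return;  /* frozen: all processors access it in global memory */

    if (p->where == REPLICATED || p->owner != cpu) {
        /* Invalidate replicas and move the page toward the writer. */
        p->migrations++;
        if (p->migrations > MIGRATION_THRESHOLD) {
            /* Routinely written by more than one processor: pin it. */
            p->where = PINNED_GLOBAL;
            return;
        }
        p->where = LOCAL;
        p->owner = cpu;
    }
    /* Repeated writes by the current owner cost nothing. */
}
```

The threshold embodies the tension listed above: a low value pins pages
on transient sharing, a high value pays for extra migrations first.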