"Operating system support for improving data locality on CC-NUMA compute servers", by Verghese, Devine, Gupta, and Rosenblum [ASPLOS '96] ------- General ------- Two NUMA types: CC-NUMA (DASH, Alewife) and CC-NOW (D.Flash, Sun's s3.mp) Remote/Local access latency ratio: 3-5x (CC-NUMA) and 10-20x (CC-NOW) IBM ACE NUMA is not CC; Bolosky et al. use page faults to trigger migration/replication, Verghese et al. use cache misses. Big difference: CC-NUMA can cache remote data, so you only incur a loss the 1st time. Goal: minimize runtime of user apps by playing with memory. Specifically: improve data locality. Two things to consider: 1. Access patterns: primarily single-process (migrate w/ process), multiple procs. w/ mostly read (replicate), multiple procs. w/ R/W (neither is good). 2. Cost of page migration/replication = gathering information + kernel overhead + data movement + memory pressure due to replication ------------------ Meat of the matter ------------------ To make policy decisions, they use values of 3 per-page counters: misses, migrations, and writes. To tune policy decisions, they use a counter reset interval and 4 thresholds: trigger, sharing, write, migrate. If any counter reaches the trigger threshold, that page is hot. If miss counter exceeds sharing threshold, that page is shared. If write counter above write threshold, page is to expensive to replicate. If migration counter above threshold, page is to mobile for migration. Write to a replicated page causes a collapse of replicas to one page. Changes needed in the IRIX kernel: support for replicated pages (linked page table), finer grain locking (use per-page locking), and back mappings in page table (a la inverted PT). ----------- Experiments ----------- Used SimOS: through software simulation, allows non-intrusive data collection and complete modelling of the system. Improving data locality is beneficial to CC-NUMA even when remote/local access latency ratio is below 4. When moving to CC-NOW, they were suprised they didn't see major performance benefits. Explanation: controller occupancy and higher migration/replication cost. Main sources of kernel overhead: TLB flushes (have to do it on all processors), page allocation (contention for memlock, which can't be avoided). Data copy was only 10%. Alternative policies: 3 fixed (round-robin, first-touch, post-facto) and 3 dynamic (migration only, replication only, migration + replication). Performance is very sensitive to the trigger threshold (controls policy aggressiveness), but not to sharing threshold, if reasonable (thus, pages are clearly differentiated). End result: a meager 29% maximal improvement on CC-NUMA, and "potentially" (read "hopefuly") better on CC-NOW. ------ Issues ------ Paging kernels, drivers, etc.