"Operating system support for improving data locality on CC-NUMA compute servers", by Verghese, Devine, Gupta, and Rosenblum [ASPLOS '96] ------- General ------- Two NUMA types: CC-NUMA (DASH, Alewife) and CC-NOW (D.Flash, Sun's s3.mp) Remote/Local access latency ratio: 3-5x (CC-NUMA) and 10-20x (CC-NOW) IBM ACE NUMA is not CC; Bolosky et al. use page faults to trigger migration/replication, Verghese et al. use cache misses. Big difference: CC-NUMA can cache remote data, so you only incur a loss the 1st time. Goal: minimize runtime of user apps by playing with memory. Specifically: improve data locality. Two things to consider: 1. Access patterns: primarily single-process (migrate w/ process), multiple procs. w/ mostly read (replicate), multiple procs. w/ R/W (neither is good). 2. Cost of page migration/replication = gathering information + kernel overhead + data movement + memory pressure due to replication ------------------ Meat of the matter ------------------ To make policy decisions, they use values of 3 per-page counters: misses, migrations, and writes. To tune policy decisions, they use a counter reset interval and 4 thresholds: trigger, sharing, write, migrate. If any counter reaches the trigger threshold, that page is hot. If miss counter exceeds sharing threshold, that page is shared. If write counter above write threshold, page is to expensive to replicate. If migration counter above threshold, page is to mobile for migration. Write to a replicated page causes a collapse of replicas to one page. Changes needed in the IRIX kernel: support for replicated pages (linked page table), finer grain locking (use per-page locking), and back mappings in page table (a la inverted PT). ----------- Experiments ----------- Used SimOS: through software simulation, allows non-intrusive data collection and complete modelling of the system. Improving data locality is beneficial to CC-NUMA even when remote/local access latency ratio is below 4. When moving to CC-NOW, they were suprised they didn't see major performance benefits. Explanation: controller occupancy and higher migration/replication cost. Main sources of kernel overhead: TLB flushes (have to do it on all processors), page allocation (contention for memlock, which can't be avoided). Data copy was only 10%. Alternative policies: 3 fixed (round-robin, first-touch, post-facto) and 3 dynamic (migration only, replication only, migration + replication). Performance is very sensitive to the trigger threshold (controls policy aggressiveness), but not to sharing threshold, if reasonable (thus, pages are clearly differentiated). End result: a meager 29% maximal improvement on CC-NUMA, and "potentially" (read "hopefuly") better on CC-NOW. ------ Issues ------ Paging kernels, drivers, etc.