44. First-Touch NUMA
On multi-socket HFT boxes, remote NUMA accesses inflate latency and jitter. You’re initializing per-core buffers and want predictable locality without OS-specific calls.
#include <cstddef>  // std::size_t
struct Buf { int* p; std::size_t n; };  // n = element count, not bytes
// One store per 64-byte cache line: enough to fault in (and place) every backing page.
void touch(int* p, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 64 / sizeof(int)) p[i] = 0;
}
void init(Buf& b) noexcept { touch(b.p, b.n); }
Part 1.
Under Linux's first-touch policy, where and when should init be called to ensure memory is local to the owning thread? What are the latency consequences if it isn't?
Part 2.
(1) Where should init execute relative to thread pinning on NUMA nodes?
(2) Why stride 64 bytes, and how would differing cache-line sizes affect correctness and performance?
(3) How do huge pages change first-touch, TLB pressure, and touch loop design?
(4) Does noexcept here influence inlining or code generation meaningfully?
(5) How can you detect and remediate remote accesses safely at runtime in production?
Answer (Part 1)
Call init from the owning thread, after that thread has been pinned to its target core. Under Linux's default first-touch policy, each page is backed by physical memory on the NUMA node of the CPU that first writes it, so the first write must come from the owner. If setup code (for example, the main thread on another socket) touches the buffer first, the pages stay remote: every access pays an interconnect hop, average latency rises, and tails fatten under memory and interconnect contention.
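A minimal sketch of the ordering, reusing Buf and init from the snippet above and assuming Linux with pthreads; the core id and buffer size are placeholders:
#include <cstddef>
#include <cstdlib>
#include <pthread.h>
#include <sched.h>
#include <thread>

void owner_thread(Buf b, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);  // 1. pin first
    init(b);                                                    // 2. then first-touch
    // ... hot loop runs here, accessing node-local pages ...
}

int main() {
    std::size_t n = std::size_t{1} << 22;  // placeholder size (~16 MiB of ints)
    // For a block this large, malloc typically maps fresh pages that remain
    // unplaced until init writes them from the pinned owner thread.
    Buf b{static_cast<int*>(std::malloc(n * sizeof(int))), n};
    std::thread t(owner_thread, b, /*core=*/3);  // hypothetical core id
    t.join();
    std::free(b.p);
}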
Answer (Part 2)
(1) Execute init after the owning thread has been pinned to its target core/node; the first write then faults each page in and allocates it on that node.
(2) A 64-byte stride means one store per cache line, the minimum work that both faults every page and warms the cache. First-touch placement itself is per page (4 KiB by default), so a different line size does not break placement as long as the stride stays at or below the page size; it only changes how effectively the loop warms lines and how much store traffic it generates. On hardware with a different line size, derive the stride from the actual value rather than hard-coding 64.
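A hedged variant that queries the line size at runtime instead of hard-coding 64; sysconf(_SC_LEVEL1_DCACHE_LINESIZE) is glibc/Linux-specific and the 64-byte fallback is an assumption:
#include <cstddef>
#include <unistd.h>

void touch_dynamic(int* p, std::size_t n) {
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);  // may report 0 or -1 on some systems
    std::size_t stride = (line > 0 ? static_cast<std::size_t>(line) : 64) / sizeof(int);
    for (std::size_t i = 0; i < n; i += stride) p[i] = 0;
}
C++17's std::hardware_destructive_interference_size is a compile-time alternative, though it reflects the build target rather than the machine the binary actually runs on.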
(3) Huge pages (2 MiB or 1 GiB) cover the same buffer with far fewer pages, cutting TLB misses and page-fault count. First-touch still applies per page, so for placement alone one store per huge page suffices; keep the per-line loop only if cache warming is also a goal. With transparent huge pages the kernel may collapse pages after the fact, so explicitly requesting huge pages before touching gives more predictable placement.
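A sketch of an explicit huge-page allocation with a placement-only touch loop, assuming a Linux box with 2 MiB huge pages pre-reserved for MAP_HUGETLB (madvise(MADV_HUGEPAGE) on a normal mapping is the softer alternative):
#include <cstddef>
#include <sys/mman.h>

constexpr std::size_t kHugePage = std::size_t{2} << 20;  // 2 MiB, the common x86-64 huge page

// bytes should be a multiple of kHugePage; returns nullptr if no huge pages are reserved.
int* alloc_huge(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    return p == MAP_FAILED ? nullptr : static_cast<int*>(p);
}

// Placement only: one store per huge page faults it in on the calling thread's node.
void touch_huge_pages(int* p, std::size_t n_ints) {
    char* c = reinterpret_cast<char*>(p);
    for (std::size_t off = 0; off < n_ints * sizeof(int); off += kHugePage)
        c[off] = 0;
}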
(4) Here the effect is minor: touch contains only plain stores, so the compiler can already prove nothing throws, and code generation is essentially unchanged with or without noexcept. In general, noexcept lets the compiler skip unwind bookkeeping and can help inlining when a callee's body is not visible; its main value here is documenting a no-throw, latency-oriented contract (and converting a later accidental throw into std::terminate rather than silent unwinding).
(5) Detect: hardware counters (perf mem, perf c2c, remote-DRAM and uncore events), numastat -p <pid>, and in-process latency probes that flag off-node accesses. Remediate: re-pin threads, migrate pages with move_pages()/migrate_pages(), or rebalance buffer ownership; schedule migrations in quiet periods, since moving pages stalls accesses to them.
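One low-overhead production probe is to ask the kernel where a sample of pages actually resides using move_pages() in query mode (nodes == nullptr); a sketch using <numaif.h> (link with -lnuma), where the 4 KiB page size and the sampling stride are assumptions:
#include <cstddef>
#include <numaif.h>

// Count sampled pages of [p, p + bytes) that are not on expected_node.
int count_remote_pages(void* p, std::size_t bytes, int expected_node) {
    constexpr std::size_t kPage = 4096;
    int remote = 0;
    for (std::size_t i = 0; i < bytes / kPage; i += 16) {  // sample every 16th page
        void* page = static_cast<char*>(p) + i * kPage;
        int status = -1;
        // nodes == nullptr turns move_pages into a pure query: status gets the node id
        // (or a negative errno, e.g. for a page that was never touched).
        if (move_pages(0, 1, &page, nullptr, &status, 0) == 0 &&
            status >= 0 && status != expected_node)
            ++remote;
    }
    return remote;
}
If the count is nonzero, the same syscall with a non-null nodes array and MPOL_MF_MOVE migrates the offending pages in place.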