Plast interview

31. Cache-Local Access

In a market data hot path, we repeatedly aggregate the quantities at selected levels. Cache behavior often dominates latency more than instruction count does. Consider how layout and access patterns affect throughput.

struct Tick { int px; int qty; };  // AoS: price and quantity interleaved
struct Book { Tick t[64]; };

// Sum the quantities at the n selected levels; idx holds indices into b->t.
int hot(const Book* b, const int* idx, int n) {
  int s = 0;
  for (int i = 0; i < n; ++i) s += b->t[idx[i]].qty;
  return s;
}

Part 1.

What is the primary cache inefficiency here, and how would you redesign layout/access to improve locality and throughput while preserving semantics?

Part 2.

(1) AoS vs. SoA when accessing only qty: how do cache line utilization and bandwidth compare?

(2) When does alignas(64) on Book help, and when can it waste cache or regress?

(3) How do index patterns impact prefetcher effectiveness and TLB misses in this loop?

(4) Is software prefetch useful here, and if so, where would you issue it relative to i?

(5) Any aliasing constraints between b and idx you’d assert to aid vectorization?

Answer (Part 1)

The indirection t[idx[i]] produces scattered accesses that defeat spatial locality, and the AoS layout also pulls in px bytes we never use. Prefer SoA (e.g., int qty[64]; int px[64];) so the summation touches only qty, and, when the reduction is commutative, batch or reorder indices to group accesses by cache line. If order must be preserved, prefetch future &qty[idx[i+K]] a few iterations ahead and process in small tiles to raise the L1 hit rate. Align the qty array to cache-line boundaries and size tiles to cache lines to reduce conflict and capacity misses.
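
A minimal sketch of the SoA redesign, assuming the same 64-level book; BookSoA and hot_soa are illustrative names, not part of the original interface:

struct BookSoA {
  int qty[64];  // hot field, contiguous: one 64-byte line holds 16 quantities
  int px[64];   // cold field, never touched by the sum
};

int hot_soa(const BookSoA* b, const int* idx, int n) {
  int s = 0;
  for (int i = 0; i < n; ++i) s += b->qty[idx[i]];  // loads only qty lines
  return s;
}

Even with a random idx, each miss now brings in 16 candidate quantities instead of 8, and no px bytes.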

Answer (Part 2)

(1) SoA improves line utilization: with 4-byte ints, a Tick is 8 bytes, so a 64-byte line delivers only 8 qty values and wastes half its bytes on unused px; in SoA the same line carries 16 qty values. That halves the lines fetched for qty, reducing bandwidth and aiding vectorization.

(2) alignas(64) helps when it prevents a Book from straddling cache lines or sharing a line with an unrelated, frequently written object (false sharing). Over-alignment can waste memory: padding inserted to reach the alignment inflates the footprint and can increase conflict and capacity misses (see the alignment sketch after this list).

(3) Sequential or small-stride index patterns let hardware prefetchers stream the data; random patterns defeat them, causing LLC misses and, once the touched footprint spans many pages, extra TLB misses (a reordering sketch follows the list).

(4) Yes, when the loop is memory-latency-bound: prefetch qty[idx[i+K]] with K tuned per machine (typically ~8–32 iterations ahead), guarding the tail so you don't prefetch past the arrays or pollute the cache (prefetch sketch below).

(5) Strict aliasing does not help here: Book's members are ints, so the compiler cannot assume idx never points into the book. This read-only loop vectorizes anyway (there are no stores), but in variants that also write, __restrict qualifiers on the pointers let the compiler reorder memory operations and emit gathers (see the __restrict sketch below).
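
For (2), a minimal sketch of the alignment trade-off; AlignedBook is an illustrative name, and the sizes assume a 4-byte int:

struct Tick { int px; int qty; };                // as above, 8 bytes
struct alignas(64) AlignedBook { Tick t[64]; };  // starts on a line boundary

// Book is 64 * 8 = 512 bytes, already a multiple of 64, so alignas adds no
// padding here. By contrast, alignas(64) on Tick itself would pad every
// element to 64 bytes: an 8x footprint blowup in any array of Ticks.
static_assert(alignof(AlignedBook) == 64, "line-aligned");
static_assert(sizeof(AlignedBook) == 512, "no padding added");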
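
For (3), one mitigation when the reduction is commutative, sketched with a hypothetical helper: sort each index batch once so the gather walks memory roughly line by line (worth it only if the batch is reused enough to amortize the sort):

#include <algorithm>
#include <vector>

// Reorder the index batch so accesses become near-sequential, restoring
// hardware prefetcher effectiveness and cutting TLB misses; the sum is
// unchanged because integer addition is commutative and associative.
void make_prefetch_friendly(std::vector<int>& idx) {
  std::sort(idx.begin(), idx.end());
}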
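
For (4), a minimal sketch using GCC/Clang's __builtin_prefetch on the SoA layout above; K = 16 is an illustrative distance to tune on the target:

int hot_prefetch(const int* qty, const int* idx, int n) {
  constexpr int K = 16;  // prefetch distance; illustrative, tune per machine
  int s = 0;
  for (int i = 0; i < n; ++i) {
    if (i + K < n)
      __builtin_prefetch(&qty[idx[i + K]]);  // hint: pull a future line early
    s += qty[idx[i]];
  }
  return s;
}

The idx[i + K] load itself is cheap because idx is read sequentially and stays resident in L1.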
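
For (5), a minimal sketch of the qualified signature; __restrict is the GCC/Clang/MSVC extension spelling (standard restrict exists only in C), and hot_scaled with its output array is purely illustrative:

// The qualifiers promise the compiler that out, qty, and idx never overlap,
// so this store-bearing variant can still be vectorized with gathers.
void hot_scaled(int* __restrict out, const int* __restrict qty,
                const int* __restrict idx, int n) {
  for (int i = 0; i < n; ++i)
    out[i] = qty[idx[i]] * 2;  // illustrative: gather, scale, store
}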