Preparing Perfect Interview

In an HFT order-book path, a hot loop transforms millions of integers per second. Indirect calls, blocked inlining, and mispredicted branches often dominate the cost; use perf or VTune to identify and quantify them.

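#include <cstddef>  // std::size_t

struct Handler { virtual int f(int) noexcept = 0; };

// Sums h.f(p[i]) over n elements; each iteration makes one virtual call.
int sum(Handler& h, const int* p, std::size_t n) {
    int s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += h.f(p[i]);
    return s;
}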

Part 1.

Identify the likely bottleneck in sum. Propose a minimal change enabling inlining/vectorization and state how you'd validate it with perf/VTune.

Part 2.

(1) What obstructs vectorization and instruction level parallelism in this loop?

(2) How could you restructure the API to allow devirtualization without losing extensibility?

(3) Which perf or VTune metrics confirm reduced indirect branch cost?

(4) Would noexcept on f impact codegen or inlining decisions here?

(5) How would LTO or PGO change call targets and pipeline utilization?

Answer

Answer (Part 1)

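The per-iteration virtual call is an indirect call through the vtable. Because the compiler cannot see the callee, it cannot inline f, and an opaque call in the loop body blocks vectorization and makes unrolling unprofitable. The indirect branch itself is usually predicted well when one handler type dominates, but it mispredicts when targets vary. Replace it with a direct call path: change the API to a templated sum<H>(H& h, ...) or a callable object so the callee is known at compile time, or call through a concrete type marked final. Note that final only helps when the static type at the call site is the final class; through a Handler& the compiler still needs LTO or whole-program analysis to prove the target. Validate with perf/VTune by observing higher IPC, a top-down shift toward Retiring, and fewer indirect branch mispredicts, and confirm vectorization in the compiler’s optimization report (for example -Rpass=loop-vectorize in Clang or -fopt-info-vec in GCC) or in the disassembly.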
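
The templated variant mentioned above can be sketched as follows. This is a minimal illustration, not a drop-in replacement: the names sum_static and AddOne are hypothetical and do not appear in the original question.

```cpp
#include <cstddef>

// Hypothetical concrete handler (illustrative name, not from the question).
// f is non-virtual, so a call through AddOne is a direct call.
struct AddOne {
    int f(int x) const noexcept { return x + 1; }
};

// Templated variant of sum: the handler type H is a compile-time parameter,
// so h.f(...) resolves statically and the optimizer is free to inline it
// and then unroll or vectorize the reduction.
template <class H>
int sum_static(const H& h, const int* p, std::size_t n) {
    int s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += h.f(p[i]);
    return s;
}
```

A call such as sum_static(AddOne{}, data, n) instantiates one loop per handler type. The trade-off is code size and the loss of runtime polymorphism at this call site.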

Answer (Part 2)

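(1) The indirect virtual call prevents inlining, and a loop containing an opaque call cannot be vectorized; the compiler must also assume f has side effects, so the calls stay strictly ordered. The recurrence on s is a minor limit by comparison: once f is inlined, integer addition can be reassociated, so the compiler can unroll with multiple accumulators or vectorize the reduction.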

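(2) Use templates/CRTP or pass a concrete functor so the type is known at compile time. To keep runtime extensibility, move the virtual boundary outside the loop, for example one virtual call per batch instead of one per element. Marking concrete classes final and enabling LTO helps where the call is made through the concrete type or the compiler can prove the dynamic type.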
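
One way to keep a virtual interface while removing the per-element indirect call is to make the virtual function operate on a whole batch and let a CRTP base implement the loop. The sketch below assumes this design; BatchHandler, BatchHandlerImpl, Doubler, and process are hypothetical names introduced only for illustration.

```cpp
#include <cstddef>

// Hypothetical batch interface: one virtual call per batch, not per element.
struct BatchHandler {
    virtual ~BatchHandler() = default;
    virtual int sum(const int* p, std::size_t n) const noexcept = 0;
};

// CRTP helper: writes the batch loop once. Derived::f is resolved statically,
// so the call inside the loop is direct and inlinable.
template <class Derived>
struct BatchHandlerImpl : BatchHandler {
    int sum(const int* p, std::size_t n) const noexcept final {
        const Derived& self = static_cast<const Derived&>(*this);
        int s = 0;
        for (std::size_t i = 0; i < n; ++i)
            s += self.f(p[i]);
        return s;
    }
};

// New handlers are still added by writing a class; only f is required.
struct Doubler final : BatchHandlerImpl<Doubler> {
    int f(int x) const noexcept { return 2 * x; }
};

// Callers depend only on the abstract interface.
int process(const BatchHandler& h, const int* p, std::size_t n) {
    return h.sum(p, n);  // single indirect call for the whole batch
}
```

The indirect call cost is amortized over n elements, and plugins can still be loaded at runtime because process only sees BatchHandler.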

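(3) perf: a lower branch-misses/branches ratio, higher IPC (instructions per cycle), and fewer front-end stalls (on Intel, idq_uops_not_delivered.core). Where the CPU exposes them, events that count indirect-branch mispredicts separately are the most direct signal. VTune: lower Bad Speculation in the top-down view and fewer indirect branch mispredicts.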

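(4) noexcept lets callers omit exception-handling cleanup paths around the call and can nudge inlining heuristics slightly; with table-based unwinding the runtime gain is small, and sum has no destructors to run anyway. It doesn’t remove the indirect call; devirtualization is still required.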

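(5) LTO exposes concrete types and the class hierarchy across TUs, enabling devirtualization and then inlining when the compiler can prove a single target. PGO records which targets each indirect call site actually hits, enabling guarded (speculative) devirtualization: a check for the hot target, a direct inlinable call on the fast path, and the indirect call as fallback. PGO also improves code layout and branch hints, which raises pipeline utilization.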
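
Guarded devirtualization can be approximated by hand to see what the transformation buys. The sketch below is an illustration under assumptions, not what any particular compiler emits: real PGO-driven promotion typically compares the call target at the call site, whereas this version hoists a typeid check out of the loop and requires RTTI. AddOne, Negate, and sum_guarded are hypothetical names; Handler gains a virtual destructor here for safe polymorphic use.

```cpp
#include <cstddef>
#include <typeinfo>

struct Handler {
    virtual ~Handler() = default;
    virtual int f(int) noexcept = 0;
};

// Hypothetical "hot" implementation that a profile would point to.
struct AddOne final : Handler {
    int f(int x) noexcept override { return x + 1; }
};

// Hypothetical "cold" implementation that takes the fallback path.
struct Negate final : Handler {
    int f(int x) noexcept override { return -x; }
};

int sum_guarded(Handler& h, const int* p, std::size_t n) {
    int s = 0;
    if (typeid(h) == typeid(AddOne)) {
        // Fast path: the static type is a final class, so hot.f is a
        // direct call the optimizer can inline and vectorize.
        AddOne& hot = static_cast<AddOne&>(h);
        for (std::size_t i = 0; i < n; ++i)
            s += hot.f(p[i]);
    } else {
        // Fallback: ordinary virtual dispatch for every other handler.
        for (std::size_t i = 0; i < n; ++i)
            s += h.f(p[i]);
    }
    return s;
}
```

The guard costs one comparison per batch, so the fast path pays off whenever the profiled type really dominates.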

37. Devirtualize Loop
