Benchmarks
==========

All benchmarks were run on an Intel Xeon Gold 6234 (8 physical cores, 16 logical
processors via HyperThreading) under WSL2.  Example data shipped with the package
was used throughout: the IVUS rest/stress pullbacks for the step-size benchmark
and the OCT pullback (280 frames) for the parallelization benchmark.

.. _benchmark-algorithm:

1. Algorithmic improvement: bruteforce vs. optimized
-----------------------------------------------------

The optimized alignment algorithm uses a coarse-to-fine hierarchical search
instead of evaluating every candidate angle exhaustively.  The effect is small
at coarse step sizes (few angles to evaluate) but grows rapidly as the step size
decreases, because the number of candidate angles scales as
:math:`n = 2 \times \text{range} / \text{step}`.

**Test setup** — ``from_file_full`` on the IVUS rest/stress example data,
``range_rotation_deg = 90°``, ``write_obj = False``, ``smooth = False``,
``postprocessing = False``.  Three repetitions per condition; median wall time
reported.

.. figure:: ../benchmarks/results/bruteforce_stepsize.png
   :name: fig-benchmark-stepsize
   :alt: Bruteforce vs. optimized alignment wall time and speedup across step sizes
   :align: center
   :width: 900px

   Wall time (left, log-log) and speedup factor (right) of the optimized
   algorithm over bruteforce as a function of the rotation step size.
   The O(n) reference line confirms the linear scaling of bruteforce with the
   number of candidate angles; the optimized search is sub-linear.

At step sizes of 1° and above the difference is modest (< 2x).  Below 1° the
gap widens substantially: at **0.1°** the optimized algorithm is **5.5x faster**
and at **0.05°** the advantage grows to **10.3x** (6.25 s vs. 64.4 s).  This is
the practically relevant regime: fine step sizes are required for high-accuracy
alignment of OCT data and dense IVUS pullbacks.

.. _benchmark-parallelization:

2. parallelization scaling
--------------------------

The second benchmark tests how much additional speed is gained by increasing
the number of CPU cores, using ``from_array_single`` on the OCT dataset
(280 frames, ``step_rotation_deg = 0.01°``, ``range_rotation_deg = 6°``).
Each core count was run in a fresh subprocess so that rayon's global thread
pool re-initialises from ``RAYON_NUM_THREADS``.

.. list-table:: Median wall time (s) across CPU core counts
   :header-rows: 1
   :widths: 12 20 20 18 18

   * - Cores
     - Bruteforce (s)
     - optimized (s)
     - Alg. speedup
     - Core scaling (opt.)
   * - 2
     - 92.36
     - 10.08
     - 9.2x
     - 1.00x (baseline)
   * - 4
     - 46.78
     - 5.56
     - 8.4x
     - 1.81x
   * - 8
     - 24.27
     - 3.49
     - 7.0x
     - 2.89x
   * - 12
     - 16.74
     - 2.64
     - 6.3x
     - 3.82x
   * - 16
     - 14.15
     - 2.40
     - 5.9x
     - 4.20x

.. figure:: ../benchmarks/results/cpu_scaling.png
   :name: fig-benchmark-cpuscaling
   :alt: Bruteforce vs. optimized alignment speedup across differnet cpu cores
   :align: center
   :width: 900px

**Key observations**

* Parallelizing the angle search inside ``search_range`` (rather than the
  point-rotation loop) provides enough rayon tasks per frame to utilise cores
  effectively: bruteforce scales **6.5x** from 2 to 16 cores, optimized
  scales **4.2x** — both close to practical expectations under Amdahl's law
  given the sequential frame-dependency chain.

* The previous 8-core anomaly (WSL2 HyperThreading interference) has
  disappeared.  With hundreds of angle-evaluation tasks per frame, rayon
  keeps all workers busy and the idle HT sibling effect is negligible.

* The optimized algorithm remains **5.9-9.2x faster** than bruteforce at
  every core count, with the gap slightly narrowing at higher core counts
  because bruteforce has more angles to parallelize and therefore scales
  more aggressively.

* The two gains **compound**: relative to bruteforce at 2 cores (92.4 s),
  the optimized algorithm at 16 cores (2.40 s) achieves a combined
  **38.5x speedup** — roughly 9x from the algorithm and 4x from parallelization.

**Conclusion** — algorithm choice and hardware scaling are now both meaningful
levers.  For the best achievable throughput, use the optimized algorithm on as
many cores as available; for rapid prototyping where accuracy matters less,
coarser step sizes reduce runtime regardless of core count.