Header-only C++20 asynchronous sample rate converter (ASRC) for the
near-unity case: two audio clock domains at nominally the same rate (e.g.
48 kHz ↔ 48 kHz) sourced from independent oscillators, each within a few
hundred ppm and drifting slowly. One thread pushes input samples at the input
clock, another pulls output samples at the output clock; the converter
absorbs the rate mismatch transparently — including the whole-sample
slips that occur roughly once every 1/ppm samples.
- Header-only, no dependencies,
cmake≥ 3.24, GCC 11+/Clang 14+/MSVC 19.30+ - Real-time safe audio path:
push()/pull()arenoexcept, lock-free and allocation-free; all allocation and filter design happen in the constructor - Measured quality (default balanced preset, +200 ppm offset, THD+N-style residual): 135 dB SNR at 997 Hz, 112 dB at 12 kHz, 105 dB at 19.5 kHz
- ~1.5 ms designed latency with the default configuration at 48 kHz (24-frame filter group delay + 48-frame FIFO setpoint)
add_subdirectory(SampleRateTap) # or FetchContent
target_link_libraries(app PRIVATE SampleRateTap::SampleRateTap)#include <srt/srt.hpp>
srt::Config cfg;
cfg.sampleRateHz = 48000.0;
cfg.channels = 2;
srt::AsyncSampleRateConverter asrc(cfg); // allocates + designs filter; may throw
// Input-device thread (input clock):
asrc.push(inputInterleaved, frames); // noexcept, lock-free
// Output-device thread (output clock):
asrc.pull(outputInterleaved, frames); // noexcept, lock-free; silence
// until filled/locked
srt::Status st = asrc.status(); // any thread: state, ppm, fill,
// underruns/overruns/resyncsexamples/drifting_clocks.cpp runs two real threads 500 ppm apart and shows
the lock acquisition and rate estimate. For a visual tour — lock, measured
transparency vs. a naive FIFO, spectrograms, latency, drift tracking,
dropout recovery — see
notebooks/asrc_demo.ipynb, which drives the
library through its C ABI (-DSRT_BUILD_CAPI=ON, tools/capi/) via ctypes
(Python needs numpy and matplotlib; the comparison notebook below
additionally needs the samplerate and soxr packages; the first cell
builds the shared library if missing). A second notebook,
notebooks/asrc_block_size_study.ipynb,
measures how processing block size (32 / 64 / 240 frames) trades latency
against servo observability — including per-impulse latency-breathing
measurements and a calibrated FM/wideband quality decomposition.
For real hardware there are three more entry points:
examples/alsa_bridge.cpp (two ALSA devices on their real crystals — the
hardware testing Setup 1 harness, with CSV
telemetry and post-ASRC capture), examples/pico2_cyccnt/ (flashable
RP2350 firmware measuring real cycles per block against the QEMU
instruction baselines), and examples/pico2_dualcore/ (the
one-clock-domain-per-core RP2350 deployment, self-validating).
Consuming the library: add_subdirectory or FetchContent only —
there are no install/package rules yet. Version 0.1.0 (SRT_VERSION_* in
srt/srt.hpp, srt_version() over the C ABI); pre-1.0, the API may
still change between versions.
The design follows the classic commercial-ASRC architecture (AD1896-style polyphase FIR + clock servo), specialized for the near-unity regime where the conversion degenerates into a creeping fractional delay.
Datapath. A Kaiser-windowed sinc prototype is designed at construction
and decomposed into L polyphase branches of T taps (default 256 × 48,
120 dB stopband, flat to 20 kHz). Each output sample evaluates one branch
pair: coefficients are linearly interpolated between the two phases adjacent
to the fractional position μ, with the dot product accumulated in double.
The table stores an extra row (phase 0 advanced one tap) so the μ wrap
1.0 → 0.0 with a one-sample window shift — the whole-sample slip — is
exactly continuous and branch-free.
Phase accumulator. The fractional position lives in an unsigned Q0.64 integer that accumulates only the rate deviation ε per output sample (converted from the servo's double once per block); the unity part of the ratio is the integer window advance, and whole-sample slips are detected by 64-bit wraparound. The per-sample path is integer-only — no doubles — which keeps it cheap on DSPs without double-precision FPUs. Resolution is 2⁻⁶⁴ samples, far below the ~8 ps jitter budget for 120 dB transparency at 20 kHz.
Clock servo. A lock-free SPSC FIFO sits between the domains and its
occupancy is the phase detector of a type-2 (PI) loop whose output ε̂ drives
the phase increment — a PLL in which the FIFO comparison is the phase
detector and the μ accumulator is the NCO. Gains derive from the standard
2nd-order matching: Kp = 2ζωₙ/fs, Ki = ωₙ²/fs. The loop runs in three
stages with the integrator (the ppm estimate) handed across transitions:
| Stage | Bandwidth | Error smoothing | Role |
|---|---|---|---|
| Acquire | 10 Hz | 1-pole, 50 Hz | fast pull-in (locks in ~1 s) |
| Track | 1 Hz | 1-pole, 5 Hz | robust lock; terminal stage for coarse block transfer |
| Quiet | 0.05 Hz | 3-pole cascade, 0.5 Hz | steady state for fine-grained transfer |
Why three stages. The FIFO count is quantized to whole frames (or whole
push blocks), so the occupancy observable carries a deterministic sawtooth at
the beat frequency ppm × pushRate. Whatever the loop passes into ε̂
frequency-modulates the audio. The Quiet stage rejects a one-frame sawtooth
to roughly −120 dBc equivalent at 20 kHz while still tracking a 1 ppm/s
oscillator drift ramp with under half a frame of standing error. With coarse
blocks (e.g. ≥32-frame callbacks) that level of quiet is information-
theoretically unavailable from counts alone, so the servo deliberately stays
in Track, where the block beat is phase-tracked mostly as benign latency
breathing, the remainder as cent-scale low-rate FM (measured in
notebooks/asrc_block_size_study.ipynb:
~0.9 cents rms / 61 dB wideband at 32-frame blocks, ~1.3 cents rms / 53 dB
at 5 ms blocks). Promotion to Quiet is gated on the
cascade-smoothed error, which is exactly the discriminator between the two
regimes.
Multichannel. channels is a runtime count with no architectural limit
(mono through 7.1.4 and beyond). Use one converter instance per clock
domain, not per channel group: every channel of an instance is resampled
at literally the same fractional position each frame, so inter-channel
phase coherence is exact by construction — no skew to budget for in
surround imaging or microphone-array processing (relevant when an AVB
stream bundles reference mics with the program feed; AVB Class A's 8-frame
packets are also fine-grained enough for the Quiet servo stage). Per frame
the coefficient blend is computed once and shared, so N channels cost
blend + N×dot, not N×(blend + dot). Channel independence is enforced
by test at 12 and 16 channels (distinct tone per channel; crosstalk
bounded below −100 dB float / −72 dB Q15), including a Track-stage
variant that runs on the emulated embedded targets.
Under/overrun policy. pull() always fills its buffer: silence while
filling or after an underrun (the converter then refills and re-locks,
keeping the ppm estimate). If the consumer stalls until the high watermark,
the converter discards down to the setpoint and counts a resync. push()
into a full FIFO drops the newest frames and counts an overrun.
latency = targetLatencyFrames + (L·T − 1)/(2L) [input frames]
= 48 + ~24 ≈ 72 frames ≈ 1.5 ms at 48 kHz (defaults)
designedLatencySeconds() reports the figure; the FIFO term breathes by a
fraction of the block size as the servo tracks drift. The filter is linear
phase. For lower latency use FilterSpec::fast() (~16-frame group delay)
and a smaller targetLatencyFrames.
The setpoint must exceed the pull block size — a pull synthesizes from
frames already buffered, so a setpoint at or below the callback size is
infeasible and would drain into a permanent dropout cycle. The converter
enforces this automatically: when it observes pull blocks larger than the
configured setpoint it raises the effective setpoint (block + ~half-block
margin, bounded by FIFO capacity) and reports the value in
Status::effectiveTargetLatencyFrames; latency follows the raised
setpoint. Callbacks above ~340 frames also need fifoFrames sized
explicitly. The setpoint must additionally stay above the peak occupancy
excursion of your push/pull jitter, as before.
From the test suite (deterministic two-clock simulation, +200 ppm offset, sample-granular transfer, 0.5 FS sine, 1 s analysis window after settling):
| Preset | 997 Hz | 6 kHz | 12 kHz | 19.5 kHz | group delay |
|---|---|---|---|---|---|
balanced() (L=256, T=48) |
135 dB | 120 dB | 112 dB | 105 dB | 0.50 ms |
transparent() (L=512, T=80) |
133 dB | — | — | 108 dB | 0.83 ms |
AES17-style THD+N measured under identical conditions against libsamplerate, soxr and hardware datasheet figures: docs/COMPARISON.md (−132 dB THD+N / 149 dB DR at the 24-bit interface, servo in the loop; notebooks/asrc_comparison.ipynb).
The high-frequency residual is dominated by the linear interpolation between
adjacent phase-table rows (≈ −12 dB per doubling of L, +12 dB per octave of
signal frequency). Servo lock from a cold start takes ~1 s; a 0 → 300 ppm
drift ramp at 10 ppm/s is tracked without unlocking.
The same structure holds at other deployment rates, e.g. 16 kHz for
reference-microphone processing — but FilterSpec band edges and
ServoConfig bandwidths are absolute Hz designed for ~48 kHz, and running
another rate with unscaled defaults silently costs quality (measured:
~32 dB at 16 kHz). Start any non-48 kHz deployment from
srt::Config::forSampleRate(rateHz), which rescales both (plus the servo
hold times); FilterSpec::scaledTo / ServoConfig::scaledTo exist for
custom presets. Measured through that factory
(tests/test_asrc_quality_16k.cpp), 16 kHz matches the 48 kHz
normalized-frequency structure: 136.6 dB at 333 Hz and 106.5 dB at 6.5 kHz,
within ~1 dB of the 48 kHz tones at the same f/fs. Interpolation noise
depends only on f/fs; group delay at the same tap count stays ~24 input
samples and therefore triples in milliseconds (1.5 ms vs 0.5 ms).
CI builds and tests every push on:
- Linux (GCC, Clang) and macOS (AppleClang) with warnings as errors, Windows (MSVC /W4)
- AddressSanitizer + UBSan and ThreadSanitizer over the full suite
- Qualcomm Hexagon (hexagon-unknown-linux-musl, the open-source clang
toolchain from quic/toolchain_for_hexagon) cross-compiled and executed
under
qemu-hexagonuser-mode emulation — validating the library on a 32-bit audio DSP target:size_twidth, atomics lowering, musl libc and soft-float doubles. Emulation proves correctness, not performance; Hexagon and HiFi-class DSPs have no double-precision FPU, so the float datapath's double accumulation runs soft-float there (the Q15/Q31 fixed-point traits are the performance-appropriate path for such targets). Cycle accuracy requires the vendor simulator. - Performance gating on both DSP targets: fixed workloads run under
QEMU with an instruction-counting plugin and are compared against
committed baselines (
bench/baselines.json) at ±3% — a hot-path regression on Hexagon, Cortex-M55 or Cortex-M33 fails CI. See docs/PERFORMANCE.md. - Arm Cortex-M33 (Raspberry Pi Pico 2 / RP2350 class), bare metal on
QEMU's MPS2+ AN505 model, sharing the Armv8-M platform layer below. The
M33 has no FP64 and no Helium, and the instruction baselines make the
consequences concrete: the float datapath costs ~19× the M55's
instructions (soft-double accumulation) — on Pico-class parts use
Q15/Q31. The instruction baselines suggest 48 kHz Q15 mono fits a
150 MHz core and stereo wants the
fast()preset or the RP2350's second core — instruction counts are not cycle counts, so treat these as budgets pending real-silicon validation:examples/pico2_cyccnt/is a flashable DWT.CYCCNT harness built to measure exactly this, andexamples/pico2_dualcore/validates the one-clock-domain-per-core deployment shape. - Arm Cortex-M55, bare metal (newlib + semihosting, no OS/threads),
executed on QEMU's MPS3 AN547 board model via
qemu-system-arm. The platform layer lives inplatform/mps3_an547/(linker script + minimal startup: vector table, FPU/MVE enable, 64-bit atomic helpers) with the toolchain filecmake/arm-cortex-m55-mps3.cmake; the on-target run covers the polyphase kernel, the fixed-point datapaths and the end-to-end converter (seetests/bare_metal_main.cppfor the emulation-sized filter).
A separate workflow (.github/workflows/ci-arm64.yml, manual + weekly)
runs the suite natively on GitHub's ubuntu-24.04-arm runners, including
the ring-buffer stress under ThreadSanitizer on genuinely weakly-ordered
hardware — coverage x86 TSan and QEMU cannot provide. It is kept out of
per-push CI because arm64 hosted runners are not available on every plan
for private repositories.
For Tensilica HiFi4/HiFi5 the audio ISA, xt-clang compiler and xt-run
instruction-set simulator are proprietary Cadence tools, so they cannot run
in public CI; .github/workflows/ci.yml contains a commented self-hosted
runner job template (hifi-iss) that drops in once a runner with a Cadence
license is available. cmake/hexagon-linux-musl.cmake shows the general
cross + emulator pattern (CMAKE_CROSSCOMPILING_EMULATOR makes ctest run
every test through the emulator transparently).
Everything above runs on emulated or simulated clocks. For validating against real independent oscillators with commodity hardware (a Pi with two USB audio dongles, a Pi + Pico 2, two Pis over Ethernet), see docs/HARDWARE_TESTING.md.
Methodology, optimization roadmap and regression gating live in
docs/PERFORMANCE.md. Build the benchmarks with
-DSRT_BUILD_BENCHMARKS=ON (host only). A measured computational
head-to-head against libsamplerate and soxr — host wall-clock and embedded
instruction counts (-DSRT_BUILD_COMPARE_BENCH=ON, SRT_ICOUNT_COMPARE) —
lives in docs/COMPARISON.md.
Executed instructions per fixed workload (bench/icount/), measured under QEMU with a counting plugin — deterministic, and gated in CI at ±3% against bench/baselines.json:
| Workload | Cortex-M33 | Cortex-M55 | Hexagon |
|---|---|---|---|
kernel_float |
1,897,321,329 | 99,468,474 | 339,027,222 |
kernel_q15 |
587,096,252 | 181,994,196 | 102,819,852 |
kernel_q31 |
634,168,961 | 210,789,622 | 110,455,141 |
pipeline12_q15 |
962,613,655 | 387,876,968 | 378,858,793 |
pipeline_float |
1,856,735,553 | 92,751,177 | 335,912,671 |
pipeline_q15 |
484,146,844 | 127,446,817 | 119,847,854 |
pipeline_q31 |
566,751,937 | 162,708,581 | 120,694,199 |
Indicative numbers from a shared machine (Intel(R) Xeon(R) Processor @ 2.80GHz, 2026-06-12); regenerate with scripts/update_perf_docs.py. Items are output samples (kernel) or frames (pipeline); ×realtime is per 48 kHz stream.
| Benchmark | ns/item | ×realtime @48k |
|---|---|---|
BM_Kernel_Float_Fast |
48.9 | 426× |
BM_Kernel_Float_Balanced |
69.0 | 302× |
BM_Kernel_Float_Transparent |
109.1 | 191× |
BM_Kernel_Q15_Balanced |
45.2 | 461× |
BM_Kernel_Q31_Balanced |
62.9 | 331× |
BM_Pipeline_Float_Balanced_1ch |
67.7 | 308× |
BM_Pipeline_Float_Balanced_2ch |
107.8 | 193× |
BM_Pipeline_Float_Balanced_8ch |
336.2 | 62× |
BM_Pipeline_Q15_Balanced_2ch |
56.0 | 372× |
BM_Pipeline_Q31_Balanced_2ch |
120.8 | 173× |
BM_Pipeline_Float_Transparent_2ch |
164.2 | 127× |
BM_Pipeline_Float_Balanced_12ch |
511.8 | 41× |
BM_Pipeline_Q15_Balanced_12ch |
189.1 | 110× |
BM_Pipeline_Float_Balanced_16ch |
649.6 | 32× |
BM_Pipeline_Q15_Balanced_16ch |
241.8 | 86× |
The datapath is templated on the sample type via srt::SampleTraits
(include/srt/sample_traits.hpp). Three formats are provided:
| Type | Alias | Format | Measured SNR (997 Hz / 19.5 kHz, half scale, +200 ppm) |
|---|---|---|---|
float |
AsyncSampleRateConverter |
float I/O, double accumulation | 135 dB / 105 dB |
std::int32_t |
AsyncSampleRateConverterQ31 |
Q31 I/O, Q1.30 coeffs, int64 accumulation, saturating | 133 dB / 105 dB |
std::int16_t |
AsyncSampleRateConverterQ15 |
Q15 I/O, Q1.14 coeffs, int64 accumulation, saturating | 77 dB (format-limited) |
The fixed-point datapaths have integer-only inner loops (the μ blend factor is converted once per output sample), making them the appropriate choice for DSPs and MCUs without double-precision FPUs. Q31 is bit-for-bit as quiet as the float path; Q15's floor is the 16-bit format itself. The servo and the filter design always run in double (control path / one-time init, a handful of operations per block).
- Near-unity ratios only (±
maxDeviationPpm, default 1000 ppm). No 44.1 ↔ 48 kHz conversion. - The rate estimate is derived from FIFO counts only. With block-quantized
transfer its instantaneous value wobbles at the block-beat frequency
(see
Status.ppmvs. a few seconds of averaging), and ultra-quiet servo operation requires fine-grained transfer. A future version may accept per-block timestamps for sub-sample phase observation. - One producer thread and one consumer thread; construction/destruction must not overlap either.
MIT (see LICENSE). All code implements long-published methods: Kaiser
window design (Kaiser 1974), band-limited interpolation (J. O. Smith,
CCRMA), polyphase decomposition and the harris length estimate, and textbook
2nd-order PLL servo design. No third-party source was copied. GoogleTest
(BSD-3) is fetched for tests only and is not part of the shipped headers.