Skip to content

tap/SampleRateTap

Repository files navigation

SampleRateTap

CI License: MIT C++20

Header-only C++20 asynchronous sample rate converter (ASRC) for the near-unity case: two audio clock domains at nominally the same rate (e.g. 48 kHz ↔ 48 kHz) sourced from independent oscillators, each within a few hundred ppm and drifting slowly. One thread pushes input samples at the input clock, another pulls output samples at the output clock; the converter absorbs the rate mismatch transparently — including the whole-sample slips that occur roughly once every 1/ppm samples.

  • Header-only, no dependencies, cmake ≥ 3.24, GCC 11+/Clang 14+/MSVC 19.30+
  • Real-time safe audio path: push()/pull() are noexcept, lock-free and allocation-free; all allocation and filter design happen in the constructor
  • Measured quality (default balanced preset, +200 ppm offset, THD+N-style residual): 135 dB SNR at 997 Hz, 112 dB at 12 kHz, 105 dB at 19.5 kHz
  • ~1.5 ms designed latency with the default configuration at 48 kHz (24-frame filter group delay + 48-frame FIFO setpoint)

Quick start

add_subdirectory(SampleRateTap)            # or FetchContent
target_link_libraries(app PRIVATE SampleRateTap::SampleRateTap)
#include <srt/srt.hpp>

srt::Config cfg;
cfg.sampleRateHz = 48000.0;
cfg.channels = 2;
srt::AsyncSampleRateConverter asrc(cfg);   // allocates + designs filter; may throw

// Input-device thread (input clock):
asrc.push(inputInterleaved, frames);       // noexcept, lock-free

// Output-device thread (output clock):
asrc.pull(outputInterleaved, frames);      // noexcept, lock-free; silence
                                           // until filled/locked

srt::Status st = asrc.status();            // any thread: state, ppm, fill,
                                           // underruns/overruns/resyncs

examples/drifting_clocks.cpp runs two real threads 500 ppm apart and shows the lock acquisition and rate estimate. For a visual tour — lock, measured transparency vs. a naive FIFO, spectrograms, latency, drift tracking, dropout recovery — see notebooks/asrc_demo.ipynb, which drives the library through its C ABI (-DSRT_BUILD_CAPI=ON, tools/capi/) via ctypes (Python needs numpy and matplotlib; the comparison notebook below additionally needs the samplerate and soxr packages; the first cell builds the shared library if missing). A second notebook, notebooks/asrc_block_size_study.ipynb, measures how processing block size (32 / 64 / 240 frames) trades latency against servo observability — including per-impulse latency-breathing measurements and a calibrated FM/wideband quality decomposition.

For real hardware there are three more entry points: examples/alsa_bridge.cpp (two ALSA devices on their real crystals — the hardware testing Setup 1 harness, with CSV telemetry and post-ASRC capture), examples/pico2_cyccnt/ (flashable RP2350 firmware measuring real cycles per block against the QEMU instruction baselines), and examples/pico2_dualcore/ (the one-clock-domain-per-core RP2350 deployment, self-validating).

Consuming the library: add_subdirectory or FetchContent only — there are no install/package rules yet. Version 0.1.0 (SRT_VERSION_* in srt/srt.hpp, srt_version() over the C ABI); pre-1.0, the API may still change between versions.

How it works

The design follows the classic commercial-ASRC architecture (AD1896-style polyphase FIR + clock servo), specialized for the near-unity regime where the conversion degenerates into a creeping fractional delay.

Datapath. A Kaiser-windowed sinc prototype is designed at construction and decomposed into L polyphase branches of T taps (default 256 × 48, 120 dB stopband, flat to 20 kHz). Each output sample evaluates one branch pair: coefficients are linearly interpolated between the two phases adjacent to the fractional position μ, with the dot product accumulated in double. The table stores an extra row (phase 0 advanced one tap) so the μ wrap 1.0 → 0.0 with a one-sample window shift — the whole-sample slip — is exactly continuous and branch-free.

Phase accumulator. The fractional position lives in an unsigned Q0.64 integer that accumulates only the rate deviation ε per output sample (converted from the servo's double once per block); the unity part of the ratio is the integer window advance, and whole-sample slips are detected by 64-bit wraparound. The per-sample path is integer-only — no doubles — which keeps it cheap on DSPs without double-precision FPUs. Resolution is 2⁻⁶⁴ samples, far below the ~8 ps jitter budget for 120 dB transparency at 20 kHz.

Clock servo. A lock-free SPSC FIFO sits between the domains and its occupancy is the phase detector of a type-2 (PI) loop whose output ε̂ drives the phase increment — a PLL in which the FIFO comparison is the phase detector and the μ accumulator is the NCO. Gains derive from the standard 2nd-order matching: Kp = 2ζωₙ/fs, Ki = ωₙ²/fs. The loop runs in three stages with the integrator (the ppm estimate) handed across transitions:

Stage Bandwidth Error smoothing Role
Acquire 10 Hz 1-pole, 50 Hz fast pull-in (locks in ~1 s)
Track 1 Hz 1-pole, 5 Hz robust lock; terminal stage for coarse block transfer
Quiet 0.05 Hz 3-pole cascade, 0.5 Hz steady state for fine-grained transfer

Why three stages. The FIFO count is quantized to whole frames (or whole push blocks), so the occupancy observable carries a deterministic sawtooth at the beat frequency ppm × pushRate. Whatever the loop passes into ε̂ frequency-modulates the audio. The Quiet stage rejects a one-frame sawtooth to roughly −120 dBc equivalent at 20 kHz while still tracking a 1 ppm/s oscillator drift ramp with under half a frame of standing error. With coarse blocks (e.g. ≥32-frame callbacks) that level of quiet is information- theoretically unavailable from counts alone, so the servo deliberately stays in Track, where the block beat is phase-tracked mostly as benign latency breathing, the remainder as cent-scale low-rate FM (measured in notebooks/asrc_block_size_study.ipynb: ~0.9 cents rms / 61 dB wideband at 32-frame blocks, ~1.3 cents rms / 53 dB at 5 ms blocks). Promotion to Quiet is gated on the cascade-smoothed error, which is exactly the discriminator between the two regimes.

Multichannel. channels is a runtime count with no architectural limit (mono through 7.1.4 and beyond). Use one converter instance per clock domain, not per channel group: every channel of an instance is resampled at literally the same fractional position each frame, so inter-channel phase coherence is exact by construction — no skew to budget for in surround imaging or microphone-array processing (relevant when an AVB stream bundles reference mics with the program feed; AVB Class A's 8-frame packets are also fine-grained enough for the Quiet servo stage). Per frame the coefficient blend is computed once and shared, so N channels cost blend + N×dot, not N×(blend + dot). Channel independence is enforced by test at 12 and 16 channels (distinct tone per channel; crosstalk bounded below −100 dB float / −72 dB Q15), including a Track-stage variant that runs on the emulated embedded targets.

Under/overrun policy. pull() always fills its buffer: silence while filling or after an underrun (the converter then refills and re-locks, keeping the ppm estimate). If the consumer stalls until the high watermark, the converter discards down to the setpoint and counts a resync. push() into a full FIFO drops the newest frames and counts an overrun.

Latency

latency = targetLatencyFrames + (L·T − 1)/(2L)      [input frames]
        = 48 + ~24 ≈ 72 frames ≈ 1.5 ms at 48 kHz   (defaults)

designedLatencySeconds() reports the figure; the FIFO term breathes by a fraction of the block size as the servo tracks drift. The filter is linear phase. For lower latency use FilterSpec::fast() (~16-frame group delay) and a smaller targetLatencyFrames.

The setpoint must exceed the pull block size — a pull synthesizes from frames already buffered, so a setpoint at or below the callback size is infeasible and would drain into a permanent dropout cycle. The converter enforces this automatically: when it observes pull blocks larger than the configured setpoint it raises the effective setpoint (block + ~half-block margin, bounded by FIFO capacity) and reports the value in Status::effectiveTargetLatencyFrames; latency follows the raised setpoint. Callbacks above ~340 frames also need fifoFrames sized explicitly. The setpoint must additionally stay above the peak occupancy excursion of your push/pull jitter, as before.

Measured performance

From the test suite (deterministic two-clock simulation, +200 ppm offset, sample-granular transfer, 0.5 FS sine, 1 s analysis window after settling):

Preset 997 Hz 6 kHz 12 kHz 19.5 kHz group delay
balanced() (L=256, T=48) 135 dB 120 dB 112 dB 105 dB 0.50 ms
transparent() (L=512, T=80) 133 dB 108 dB 0.83 ms

AES17-style THD+N measured under identical conditions against libsamplerate, soxr and hardware datasheet figures: docs/COMPARISON.md (−132 dB THD+N / 149 dB DR at the 24-bit interface, servo in the loop; notebooks/asrc_comparison.ipynb).

The high-frequency residual is dominated by the linear interpolation between adjacent phase-table rows (≈ −12 dB per doubling of L, +12 dB per octave of signal frequency). Servo lock from a cold start takes ~1 s; a 0 → 300 ppm drift ramp at 10 ppm/s is tracked without unlocking.

The same structure holds at other deployment rates, e.g. 16 kHz for reference-microphone processing — but FilterSpec band edges and ServoConfig bandwidths are absolute Hz designed for ~48 kHz, and running another rate with unscaled defaults silently costs quality (measured: ~32 dB at 16 kHz). Start any non-48 kHz deployment from srt::Config::forSampleRate(rateHz), which rescales both (plus the servo hold times); FilterSpec::scaledTo / ServoConfig::scaledTo exist for custom presets. Measured through that factory (tests/test_asrc_quality_16k.cpp), 16 kHz matches the 48 kHz normalized-frequency structure: 136.6 dB at 333 Hz and 106.5 dB at 6.5 kHz, within ~1 dB of the 48 kHz tones at the same f/fs. Interpolation noise depends only on f/fs; group delay at the same tap count stays ~24 input samples and therefore triples in milliseconds (1.5 ms vs 0.5 ms).

Platform support

CI builds and tests every push on:

  • Linux (GCC, Clang) and macOS (AppleClang) with warnings as errors, Windows (MSVC /W4)
  • AddressSanitizer + UBSan and ThreadSanitizer over the full suite
  • Qualcomm Hexagon (hexagon-unknown-linux-musl, the open-source clang toolchain from quic/toolchain_for_hexagon) cross-compiled and executed under qemu-hexagon user-mode emulation — validating the library on a 32-bit audio DSP target: size_t width, atomics lowering, musl libc and soft-float doubles. Emulation proves correctness, not performance; Hexagon and HiFi-class DSPs have no double-precision FPU, so the float datapath's double accumulation runs soft-float there (the Q15/Q31 fixed-point traits are the performance-appropriate path for such targets). Cycle accuracy requires the vendor simulator.
  • Performance gating on both DSP targets: fixed workloads run under QEMU with an instruction-counting plugin and are compared against committed baselines (bench/baselines.json) at ±3% — a hot-path regression on Hexagon, Cortex-M55 or Cortex-M33 fails CI. See docs/PERFORMANCE.md.
  • Arm Cortex-M33 (Raspberry Pi Pico 2 / RP2350 class), bare metal on QEMU's MPS2+ AN505 model, sharing the Armv8-M platform layer below. The M33 has no FP64 and no Helium, and the instruction baselines make the consequences concrete: the float datapath costs ~19× the M55's instructions (soft-double accumulation) — on Pico-class parts use Q15/Q31. The instruction baselines suggest 48 kHz Q15 mono fits a 150 MHz core and stereo wants the fast() preset or the RP2350's second core — instruction counts are not cycle counts, so treat these as budgets pending real-silicon validation: examples/pico2_cyccnt/ is a flashable DWT.CYCCNT harness built to measure exactly this, and examples/pico2_dualcore/ validates the one-clock-domain-per-core deployment shape.
  • Arm Cortex-M55, bare metal (newlib + semihosting, no OS/threads), executed on QEMU's MPS3 AN547 board model via qemu-system-arm. The platform layer lives in platform/mps3_an547/ (linker script + minimal startup: vector table, FPU/MVE enable, 64-bit atomic helpers) with the toolchain file cmake/arm-cortex-m55-mps3.cmake; the on-target run covers the polyphase kernel, the fixed-point datapaths and the end-to-end converter (see tests/bare_metal_main.cpp for the emulation-sized filter).

A separate workflow (.github/workflows/ci-arm64.yml, manual + weekly) runs the suite natively on GitHub's ubuntu-24.04-arm runners, including the ring-buffer stress under ThreadSanitizer on genuinely weakly-ordered hardware — coverage x86 TSan and QEMU cannot provide. It is kept out of per-push CI because arm64 hosted runners are not available on every plan for private repositories.

For Tensilica HiFi4/HiFi5 the audio ISA, xt-clang compiler and xt-run instruction-set simulator are proprietary Cadence tools, so they cannot run in public CI; .github/workflows/ci.yml contains a commented self-hosted runner job template (hifi-iss) that drops in once a runner with a Cadence license is available. cmake/hexagon-linux-musl.cmake shows the general cross + emulator pattern (CMAKE_CROSSCOMPILING_EMULATOR makes ctest run every test through the emulator transparently).

Everything above runs on emulated or simulated clocks. For validating against real independent oscillators with commodity hardware (a Pi with two USB audio dongles, a Pi + Pico 2, two Pis over Ethernet), see docs/HARDWARE_TESTING.md.

Performance

Methodology, optimization roadmap and regression gating live in docs/PERFORMANCE.md. Build the benchmarks with -DSRT_BUILD_BENCHMARKS=ON (host only). A measured computational head-to-head against libsamplerate and soxr — host wall-clock and embedded instruction counts (-DSRT_BUILD_COMPARE_BENCH=ON, SRT_ICOUNT_COMPARE) — lives in docs/COMPARISON.md.

Executed instructions per fixed workload (bench/icount/), measured under QEMU with a counting plugin — deterministic, and gated in CI at ±3% against bench/baselines.json:

Workload Cortex-M33 Cortex-M55 Hexagon
kernel_float 1,897,321,329 99,468,474 339,027,222
kernel_q15 587,096,252 181,994,196 102,819,852
kernel_q31 634,168,961 210,789,622 110,455,141
pipeline12_q15 962,613,655 387,876,968 378,858,793
pipeline_float 1,856,735,553 92,751,177 335,912,671
pipeline_q15 484,146,844 127,446,817 119,847,854
pipeline_q31 566,751,937 162,708,581 120,694,199

Indicative numbers from a shared machine (Intel(R) Xeon(R) Processor @ 2.80GHz, 2026-06-12); regenerate with scripts/update_perf_docs.py. Items are output samples (kernel) or frames (pipeline); ×realtime is per 48 kHz stream.

Benchmark ns/item ×realtime @48k
BM_Kernel_Float_Fast 48.9 426×
BM_Kernel_Float_Balanced 69.0 302×
BM_Kernel_Float_Transparent 109.1 191×
BM_Kernel_Q15_Balanced 45.2 461×
BM_Kernel_Q31_Balanced 62.9 331×
BM_Pipeline_Float_Balanced_1ch 67.7 308×
BM_Pipeline_Float_Balanced_2ch 107.8 193×
BM_Pipeline_Float_Balanced_8ch 336.2 62×
BM_Pipeline_Q15_Balanced_2ch 56.0 372×
BM_Pipeline_Q31_Balanced_2ch 120.8 173×
BM_Pipeline_Float_Transparent_2ch 164.2 127×
BM_Pipeline_Float_Balanced_12ch 511.8 41×
BM_Pipeline_Q15_Balanced_12ch 189.1 110×
BM_Pipeline_Float_Balanced_16ch 649.6 32×
BM_Pipeline_Q15_Balanced_16ch 241.8 86×

Sample types

The datapath is templated on the sample type via srt::SampleTraits (include/srt/sample_traits.hpp). Three formats are provided:

Type Alias Format Measured SNR (997 Hz / 19.5 kHz, half scale, +200 ppm)
float AsyncSampleRateConverter float I/O, double accumulation 135 dB / 105 dB
std::int32_t AsyncSampleRateConverterQ31 Q31 I/O, Q1.30 coeffs, int64 accumulation, saturating 133 dB / 105 dB
std::int16_t AsyncSampleRateConverterQ15 Q15 I/O, Q1.14 coeffs, int64 accumulation, saturating 77 dB (format-limited)

The fixed-point datapaths have integer-only inner loops (the μ blend factor is converted once per output sample), making them the appropriate choice for DSPs and MCUs without double-precision FPUs. Q31 is bit-for-bit as quiet as the float path; Q15's floor is the 16-bit format itself. The servo and the filter design always run in double (control path / one-time init, a handful of operations per block).

Limitations

  • Near-unity ratios only (±maxDeviationPpm, default 1000 ppm). No 44.1 ↔ 48 kHz conversion.
  • The rate estimate is derived from FIFO counts only. With block-quantized transfer its instantaneous value wobbles at the block-beat frequency (see Status.ppm vs. a few seconds of averaging), and ultra-quiet servo operation requires fine-grained transfer. A future version may accept per-block timestamps for sub-sample phase observation.
  • One producer thread and one consumer thread; construction/destruction must not overlap either.

Provenance and license

MIT (see LICENSE). All code implements long-published methods: Kaiser window design (Kaiser 1974), band-limited interpolation (J. O. Smith, CCRMA), polyphase decomposition and the harris length estimate, and textbook 2nd-order PLL servo design. No third-party source was copied. GoogleTest (BSD-3) is fetched for tests only and is not part of the shipped headers.

About

ASRC

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors