faster-whisper's device="auto" picks CUDA when ctranslate2's wheel
ships CUDA shared libs, even on hosts without the NVIDIA runtime
(libcublas.so.12 / libcudnn*). On those hosts the model often loads
fine but transcribe() fails at first dlopen, and the broken model
stays cached in the module-global — every subsequent voice message
in the gateway process fails identically until restart.
- Add _load_local_whisper_model() wrapper: try auto, catch missing-lib
errors, retry on device=cpu compute_type=int8.
- Wrap transcribe() with the same fallback: evict cached model, reload
on CPU, retry once. Required because the dlopen failure only surfaces
at first kernel launch, not at model construction.
- Narrow marker list (libcublas, libcudnn, libcudart, 'cannot be loaded',
'no kernel image is available', 'no CUDA-capable device', driver
mismatch). Deliberately excludes 'CUDA out of memory' and similar —
those are real runtime failures that should surface, not be silently
retried on CPU.
- Tests for load-time fallback, runtime fallback (with cached-model
eviction verified), and the OOM non-fallback path.
Reported via Telegram voice-message dumps on WSL2 hosts where libcublas
isn't installed by default.