Vosk vs Whisper Local: The Ultimate 2026 Guide to Self-Hosted Speech Recognition (STT)

Some time ago, a client asked me to enhance their system so that their Asterisk could request certain personal data by voice and perform some critical tasks based on the input. The first thing to understand is that the data being entered matters: the system can't afford to misrecognize a last name, an ID number, or a credit card number. Fortunately, ID numbers, IBAN codes, and credit card numbers carry built-in checksums that detect when a single digit has been misrecognized. Even so, having to repeat an entire number because the caller didn't enunciate clearly, or because there's so much background noise it can't be understood properly, quickly becomes a problem.
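As an aside, the credit card check mentioned above is the Luhn checksum, and it really does catch any single misrecognized digit. A minimal sketch in Python (the number below is a standard Luhn test number, not a real card):

```python
def luhn_ok(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9      # same as summing the two digits of d
        total += d
    return total % 10 == 0

print(luhn_ok("4539148803436467"))  # True: valid checksum
print(luhn_ok("4539148803436468"))  # False: one misheard digit is caught
```

This is why the IVR can reject a misheard number immediately instead of silently charging the wrong card.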

Faced with such a need, sooner or later you reach the million-dollar question: Vosk or Whisper local? Both are speech recognition (STT, Speech-to-Text) engines you can run on your own server, without sending audio to Google, OpenAI, or Amazon. Both are free. And both solve different problems.

This guide is the result of several months putting Vosk and Whisper into production within telephony projects, LLM-powered agents, and real-time transcription. If you come in with the idea of “always Whisper because it’s the new thing”, after reading this, maybe I can get you to think about it a bit more carefully.

What are Vosk and Whisper

Vosk is an offline, lightweight, real-time speech recognition engine based on Kaldi. It runs on CPU, on Raspberry Pi, on mobile devices, and even on embedded devices. It has models for more than 20 languages (Spanish included, of course) and very low latency. (we already explained What Vosk is)

Whisper is the multilingual speech recognition model that OpenAI released as open source (MIT license) in 2022. It’s based on transformers, is enormously more accurate than Vosk under difficult conditions, supports 99 languages, and offers translation to English. The downside: it needs a GPU if you want reasonable speed, and the larger models take up several GB. (we already explained What Whisper is)

Vosk vs Whisper local: quick comparison

| Aspect | Vosk | Whisper local |
|---|---|---|
| Minimum hardware | Modest CPU, 500 MB RAM | GPU recommended (CPU possible but slow) |
| Model size | 40 MB – 1.8 GB depending on language/accuracy | 75 MB (tiny) to 3 GB (large-v3) |
| Real-time streaming | Yes, native | Not native (forks available: faster-whisper, whisper.cpp, whisper-streaming) |
| Accuracy on clean audio | Good | Excellent |
| Accuracy with noise / accents | Average | Notably superior |
| Supported languages | ~20 including Spanish | 99 languages |
| Automatic punctuation | No | Yes |
| Diarization (speaker separation) | No (requires external tool) | Not native (combine with pyannote) |
| License | Apache 2.0 | MIT |
| Typical use case | Real-time VoIP, IoT, voice commands | Recording transcription, subtitles, podcasts |

When to choose Vosk

Vosk is the right choice when what you need is real-time, low-resource STT. Cases where it fits like a glove:

  • Live transcription of VoIP calls inside an Asterisk, FreeSWITCH, or Kamailio PBX. Latency matters more than absolute accuracy.
  • Embedded voice agents on devices without a GPU (Raspberry Pi, mini-PCs).
  • Voice commands in applications where the vocabulary is limited and known.
  • IoT and offline devices: home automation, kiosks, vehicles.
  • Browsers: Vosk has a WebAssembly build that runs inside the browser.

If you need to recognize something over the phone with a closed grammar (a list of keywords), Vosk is practically your solution unless you have a dedicated GPU and are using a well-optimized faster-whisper.

I’ve tested some “flavors” of Whisper for this and on CPU it takes too long (even more so if you have to record the audio, detect the silence at the end, pass it to the transcriber, and recognize what was said).
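For the closed-grammar case, Vosk accepts a JSON list of allowed words as a third constructor argument, which drastically improves accuracy on keyword menus. A hedged sketch (the keyword list is made up for illustration; the actual recognizer construction lives inside a helper because it needs the model files on disk):

```python
import json

# Hypothetical keyword list for an IVR menu; "[unk]" absorbs anything
# outside the grammar instead of forcing a wrong match onto a keyword.
KEYWORDS = ["uno", "dos", "tres", "cuatro", "operador", "[unk]"]
grammar = json.dumps(KEYWORDS)

def make_keyword_recognizer(model_path: str, sample_rate: int = 16000):
    """Build a Vosk recognizer restricted to KEYWORDS.

    Requires `pip install vosk` and the model directory on disk.
    """
    from vosk import Model, KaldiRecognizer
    return KaldiRecognizer(Model(model_path), sample_rate, grammar)

print(grammar)
```

With the grammar in place, anything the caller says is matched against those few words (or `[unk]`), which is exactly what you want for a DTMF-replacement menu.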

When to choose Whisper local

Whisper shines when accuracy is paramount and you can afford some latency or have a GPU available:

  • Transcription of meeting recordings, podcasts, interviews, completed calls.
  • Automatic video subtitling.
  • Audio with noise, strong accents, or multilingual content (a mix of English and Spanish, for example).
  • Post-call analysis where the audio is already complete and latency doesn’t matter.
  • Automatic translation to English integrated into the transcription.

If you only need to transcribe individual audio files or you have a GPU, don’t even consider Vosk: Whisper is going to give you clearly better results.

Important: Whisper local = faster-whisper, whisper.cpp and friends

When we talk about Whisper local, in 2026 nobody runs OpenAI's original Whisper as is. Instead, everyone uses one of the improved variants (faster, more complete, with diarization…)

The implementations actually used in production are:

faster-whisper

A reimplementation in CTranslate2. It’s between 4 and 5 times faster than the original Whisper and consumes less memory. It supports quantization (int8, int8_float16) which allows running the large model on modest GPUs or even on CPU with reasonable times. It’s the default option for servers with GPU.

whisper.cpp

A C/C++ port with zero Python dependencies. It compiles anywhere (Linux, macOS, Windows, embedded), takes advantage of Metal on Apple Silicon, CUDA on NVIDIA, and Vulkan/OpenCL on others. If you want Whisper on pure CPU or on a Mac without installing Python, whisper.cpp is the answer.

whisper-streaming / WhisperLive

Forks that implement true streaming on top of Whisper by splitting the audio into overlapping windows and re-running the model while carrying the previous context forward. Latency is still worse than Vosk, but it's already in the usable range (1–2 seconds) for many cases. Even so, 1 or 2 seconds per recognition in an IVR is still too much.

Distil-Whisper

A version distilled by Hugging Face: 6 times faster than whisper-large-v3 with minimal loss of accuracy in English. For massive transcription in English it’s hard to beat. For Spanish, no idea, I haven’t tested it.

There are other interesting variants I encourage you to try; there's even software not based on Whisper at all with very good marketing. But without a GPU card, everything I've found is unviable.

Some code: How to install Vosk step by step (Linux)

Probably the simplest installation in the STT ecosystem:

pip install vosk
# Download small Spanish model (~40 MB)
wget https://alphacephei.com/vosk/models/vosk-model-small-es-0.42.zip
unzip vosk-model-small-es-0.42.zip

And a minimal example transcribing a WAV file:

import wave, json
from vosk import Model, KaldiRecognizer

model = Model("vosk-model-small-es-0.42")
wf = wave.open("audio.wav", "rb")
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if not data:
        break
    if rec.AcceptWaveform(data):          # True when an utterance is complete
        print(json.loads(rec.Result())["text"])
print(json.loads(rec.FinalResult())["text"])  # flush the last partial result

The audio must be mono WAV at 16 kHz. For integration with Asterisk, the native SLIN16 format is exactly that, so it connects trivially.
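If you're not sure a file meets those requirements, the stdlib wave module is enough to check before feeding it to Vosk. A small sketch (file names are made up; the second half just writes a silent WAV in the right format to verify the checker against itself):

```python
import wave

def is_vosk_ready(path: str) -> bool:
    """True if the WAV is mono, 16-bit PCM at 16 kHz: what Vosk expects."""
    with wave.open(path, "rb") as wf:
        return (wf.getnchannels() == 1
                and wf.getsampwidth() == 2
                and wf.getframerate() == 16000)

# Quick self-check: half a second of silence in the expected format.
with wave.open("check.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)      # 2 bytes per sample = 16-bit PCM
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 8000)

print(is_vosk_ready("check.wav"))  # True
```

Anything that fails the check can be converted first (ffmpeg or sox do it in one command) rather than letting Vosk silently produce garbage.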

How to install Whisper local with faster-whisper

The most recommended way to run Whisper in 2026:

pip install faster-whisper

Transcription example (the commented line inside shows the int8 setup, which runs even on a decent CPU):

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
# For CPU: device="cpu", compute_type="int8"

segments, info = model.transcribe("audio.wav", language="es")
for s in segments:
    print(f"[{s.start:.2f}s -> {s.end:.2f}s] {s.text}")

The model is downloaded automatically on the first run. Reference sizes: tiny 75 MB, base 145 MB, small 480 MB, medium 1.5 GB, large-v3 3 GB.

Speech To Text STT in VoIP: integrating Vosk or Whisper with Asterisk

The most interesting thing is connecting an ASR (STT) to an Asterisk to transcribe calls in real time. These days, depending on the version of Asterisk, there are three possibilities:

Option 1: ARI + external audio

Asterisk sends the audio to an external service (over UDP or WebSocket, forking the channel with chan_external_media through ARI), and that service runs the STT model. It's the cleanest option and the one that scales best. The downside: it requires a fair amount of Python or Go code in the middle.

There’s plenty of documentation on the Internet about this approach, and indeed, for a long time, this was the “best way” to do it.
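The receiving side of that middle service can be tiny. A sketch of the UDP end (by default chan_external_media wraps the slin16 audio in RTP, whose fixed header is 12 bytes; the self-test at the bottom fakes one packet locally, so ports and payloads are made up):

```python
import socket, threading

def strip_rtp(pkt: bytes) -> bytes:
    """Drop the fixed 12-byte RTP header, leaving raw slin16 audio."""
    return pkt[12:]

def receive_audio(sock: socket.socket, n_packets: int, out: list) -> None:
    """Collect n UDP packets from an already-bound socket."""
    for _ in range(n_packets):
        pkt, _addr = sock.recvfrom(4096)
        out.append(strip_rtp(pkt))      # ready for rec.AcceptWaveform()

# Local self-test with one fake RTP packet (12-byte header + 4 audio bytes).
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))           # bind first so no packet is lost
port = server.getsockname()[1]

received = []
t = threading.Thread(target=receive_audio, args=(server, 1, received))
t.start()

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.sendto(b"\x80" * 12 + b"\x01\x02\x03\x04", ("127.0.0.1", port))
t.join()
client.close()
server.close()
print(received)  # [b'\x01\x02\x03\x04']
```

In production you'd replace the list with calls into a Vosk recognizer (or buffer for faster-whisper), and handle sequence numbers and packet loss instead of trusting UDP.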

Option 2: AGI with recording + asynchronous transcription

You record the call and, when hanging up (h extension), you launch an AGI that transcribes the WAV with Whisper. It’s not real time but it’s trivial to set up and sufficient for post-call analysis and automatic summarization.

This is the most compatible approach: it works on any version of Asterisk. CPU-based transcription is resource-hungry, though, so be careful: on a box with limited resources, transcribing audio can degrade the quality of ongoing calls.
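A hedged dialplan sketch of this approach (the context name, paths, and the transcribe.py script are made up for illustration; MixMonitor and the h extension are standard Asterisk):

```
[record-and-transcribe]
exten => s,1,Answer()
 same => n,MixMonitor(/var/spool/asterisk/monitor/${UNIQUEID}.wav)
 same => n,Dial(PJSIP/100)
 same => n,Hangup()

; The h extension fires after hangup: launch the transcription in the
; background so it never blocks the channel teardown.
exten => h,1,System(nohup python3 /opt/stt/transcribe.py /var/spool/asterisk/monitor/${UNIQUEID}.wav &)
```

Niceing the transcription process (or queueing it for off-peak hours) is an easy way to avoid the resource problem mentioned above.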

Option 3: AudioFork + WebSocket

Since Asterisk 18, AudioFork allows sending live audio over WebSocket to a server that processes it with Vosk or Whisper streaming.

This is probably the best way to do it, or at least the most modern. These days even ChatGPT supports a direct WebSocket connection from Asterisk via this system.
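For reference, a minimal dialplan sketch of this option (the WebSocket URL is made up; AudioFork comes from the third-party app_audiofork module, so check that it's loaded before relying on it):

```
exten => s,1,Answer()
 same => n,AudioFork(ws://127.0.0.1:2700/stream)   ; stream live audio out
 same => n,Dial(PJSIP/100)
```

On the other end of that WebSocket you put the same kind of service as in Option 1: Vosk for sub-second partials, or a streaming Whisper fork if you can afford the latency.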

Audio cleanup: DeepFilterNet before STT

The big problem is usually noise: many people use speech-to-text from a mobile phone, on the street, with a hands-free kit… all of which introduces a brutal amount of noise and hurts recognition of anything. There's nothing worse than running a demo under those conditions only to have it fail miserably.

A trick that dramatically improves the accuracy of any STT, especially Vosk in noisy environments: pass the audio through DeepFilterNet first. It’s a real-time neural denoiser that cleans up audio (background noise, echo, wind) without touching the voice. Combined with Vosk, it gets quite close to Whisper’s accuracy under real-world telephony conditions.

Privacy: why doing STT locally matters

Since we deal with security certifications, privacy is a topic I take quite seriously, and there are three reasons to keep audio from leaving Europe and to avoid relying on third parties, all of which make running STT on your own server a far better option:

  • GDPR and LOPD-GDD: sending recordings of your customers’ calls to a foreign provider means international data transfers, data-processing agreements, and dependence on third parties. Local STT avoids all of this completely.
  • Cost: beyond a certain volume, cloud APIs become extremely expensive. (and when I say extremely expensive, it really is extremely expensive). A local Whisper on a dedicated GPU pays off the hardware in a few months if you transcribe thousands of minutes a day.
  • Latency: the round trip to an external server easily adds 200-500 ms. Now that everyone wants to build conversational voice agents, that destroys the experience.

Frequently Asked Questions about Vosk and Whisper

Can Whisper be used offline for free?

Yes. The Whisper models are open source (MIT), you download them once and from then on they run offline without sending anything to OpenAI or any server. The OpenAI API (paid) is independent and optional.

Is Vosk better than Whisper?

Better for different cases. Vosk is better in latency, modest hardware, and streaming. Whisper is better in accuracy, languages, and noise robustness. The choice depends on the use case, not on which is “better” in the abstract.

Which is the fastest Whisper model?

If you mean the original Whisper: tiny (75 MB). If what you want is practical speed with good accuracy, faster-whisper with int8 quantization on the medium or large-v3 model is the best compromise on a modest GPU.

Does Vosk support Spanish?

Yes. There are Spanish models of various sizes from 40 MB (vosk-model-small-es-0.42) up to 1.4 GB for maximum accuracy. It works reasonably well with Spanish from Spain and Latin America.

Can Whisper transcribe in real time?

The original Whisper wasn’t designed for streaming, but there are forks (whisper-streaming, WhisperLive) that allow it with 1-2 seconds of latency. For latencies below 500 ms, Vosk is better.

How much GPU do I need for Whisper large?

With faster-whisper in float16, a GPU with 6 GB of VRAM (RTX 3060, RTX 4060) runs large-v3 with plenty of room. With int8 quantization it drops to about 3 GB and even fits on older GPUs or laptops.

Can Vosk and Whisper be combined?

Yes, and sometimes it’s the most practical option. Vosk in streaming to show text live during the call, and Whisper at the end over the complete recording to produce the definitive transcription with punctuation. The best of both worlds.
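That combination boils down to a very small orchestration pattern. A sketch with placeholder callables (fast_stt standing in for Vosk partials, accurate_stt for a final faster-whisper pass; all the names and dummy engines are made up):

```python
def live_then_final(chunks, fast_stt, accurate_stt, show):
    """Stream cheap partial results during the call, then rewrite the
    transcript with the accurate engine once the audio is complete."""
    audio = b""
    for chunk in chunks:
        audio += chunk
        show(fast_stt(chunk))      # e.g. Vosk partials, sub-second latency
    return accurate_stt(audio)     # e.g. faster-whisper on the whole call

# Dummy engines just to show the control flow.
partials = []
final = live_then_final(
    chunks=[b"hola ", b"mundo"],
    fast_stt=lambda c: c.decode().strip(),
    accurate_stt=lambda a: a.decode().title(),
    show=partials.append,
)
print(partials)  # ['hola', 'mundo']
print(final)     # 'Hola Mundo'
```

The live text keeps the caller (or the agent dashboard) informed, and the definitive transcript, with punctuation, overwrites it when the call ends.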

Do you have an STT use case that doesn’t fit into the above? Share it in the comments. Every real case is an opportunity to learn something new about these models.
