ENGINEERING

We Replaced Browser Speech Recognition with OpenAI Whisper. Here’s Why.

Andres Muguira · February 17, 2026 · 6 min read
OpenAI · Whisper · Voice Input · Speech to Text

Voice input in a CRM should be simple. Click a microphone button, talk, and see accurate text appear. In practice, browser-native speech recognition made this nearly impossible to ship as a reliable feature. After months of fighting browser inconsistencies, we ripped out the Web Speech API entirely and replaced it with OpenAI's Whisper model running on our backend. The result was transformative — and the implementation was surprisingly straightforward.

The Web Speech API Was Unreliable

The Web Speech API (specifically window.webkitSpeechRecognition) seemed like the obvious choice when we first built voice input for SalesSheet. It is a browser-native API, requires no backend, and streams results in real time. On paper, it checked every box. In practice, it was a minefield.

Chrome had the best implementation, which makes sense since Google powers the speech recognition backend. But even Chrome's version had problems: it would silently stop listening after 60 seconds, punctuation was almost nonexistent, and any background noise — an air conditioner, a coffee shop, a coworker on a call — would cause the recognizer to output gibberish. Firefox's implementation was incomplete and inconsistent across versions. Safari on both macOS and iOS barely worked at all, often failing silently without triggering any error callback.

The cross-browser inconsistency was the real killer. We could not ship a feature that worked beautifully on Chrome, partially on Firefox, and not at all on Safari. Our users are salespeople working across devices and browsers. A voice input feature that only works sometimes is worse than no voice input at all, because users lose trust in it and stop trying.

Web Speech API vs. OpenAI Whisper transcription quality

Why We Chose OpenAI Whisper

We evaluated several alternatives before landing on OpenAI. Deepgram and AssemblyAI both offer excellent speech-to-text APIs with competitive accuracy. Google Cloud Speech-to-Text was another option, and we briefly considered running the open-source Whisper model ourselves on a GPU server. Each option had tradeoffs in latency, cost, and accuracy.

We chose OpenAI's gpt-4o-mini-transcribe model for three reasons. First, accuracy — it handles accented English, Spanish, Portuguese, and mixed-language input with remarkable precision, which matters because our users sell internationally. Second, it produces naturally punctuated output with proper capitalization, paragraph breaks, and even formatted numbers. Third, we were already using OpenAI for our AI assistant, so adding another API call to our existing integration was trivial from an infrastructure perspective.

The moment we switched from Web Speech API to Whisper, our voice transcription went from a beta experiment to a production feature. The accuracy difference was not incremental — it was categorical.

The Implementation: MediaRecorder to Supabase to OpenAI

The architecture is straightforward. On the client side, clicking the microphone button starts a recording session using the browser's MediaRecorder API. We request audio in audio/webm;codecs=opus format, which gives us good compression without sacrificing quality. The MediaRecorder API, unlike the Web Speech API, works consistently across Chrome, Firefox, Safari, and all modern mobile browsers. That alone solved our cross-browser problem.
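A minimal sketch of that client-side flow, not SalesSheet's actual code: the `pickMimeType` fallback list and the `startRecording` helper are our illustration, with Opus-in-WebM preferred as described above.

```typescript
// Pick the first container/codec the browser can actually record.
// The candidate order is illustrative: Opus-in-WebM first, then fallbacks.
export function pickMimeType(
  isSupported: (type: string) => boolean,
  candidates: string[] = ["audio/webm;codecs=opus", "audio/webm", "audio/mp4"],
): string | undefined {
  return candidates.find(isSupported);
}

// Browser-only: start a MediaRecorder session and expose a stop handle.
// The mic button's second click calls stop(), and `done` resolves with
// the final audio blob ready to POST to the transcription endpoint.
export function startRecording(stream: MediaStream) {
  const mimeType = pickMimeType((t) => MediaRecorder.isTypeSupported(t));
  const recorder = new MediaRecorder(stream, mimeType ? { mimeType } : undefined);
  const chunks: BlobPart[] = [];
  recorder.ondataavailable = (e) => chunks.push(e.data);
  const done = new Promise<Blob>((resolve) => {
    recorder.onstop = () => resolve(new Blob(chunks, { type: mimeType }));
  });
  recorder.start();
  return { stop: () => recorder.stop(), done };
}
```

Because `pickMimeType` takes the support check as a parameter, the same preference logic runs identically on every browser, with only the capability probe differing.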

When the user clicks the mic button again to stop recording, we collect the audio blob and POST it to a Supabase Edge Function called transcribe-audio. The Edge Function receives the audio blob, forwards it to OpenAI's transcription endpoint with the gpt-4o-mini-transcribe model specified, and returns the resulting text. The entire round trip — upload, transcription, response — typically completes in 1 to 2 seconds for a 30-second recording, and under 3 seconds for recordings up to 2 minutes.
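The server side can be sketched as a Deno-style Edge Function, assuming the names above (`transcribe-audio`, gpt-4o-mini-transcribe) and OpenAI's standard `/v1/audio/transcriptions` endpoint; the helper and error handling are our own simplification, not the production code.

```typescript
const OPENAI_URL = "https://api.openai.com/v1/audio/transcriptions";

// Build the multipart form OpenAI's transcription endpoint expects:
// the audio file plus the model name from the article.
export function buildTranscriptionForm(audio: Blob): FormData {
  const form = new FormData();
  form.append("file", audio, "recording.webm");
  form.append("model", "gpt-4o-mini-transcribe");
  return form;
}

// Simplified request handler: receive the blob, forward it, return text.
async function handler(req: Request): Promise<Response> {
  const audio = await req.blob();
  const upstream = await fetch(OPENAI_URL, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${(globalThis as any).Deno?.env.get("OPENAI_API_KEY")}`,
    },
    body: buildTranscriptionForm(audio),
  });
  if (!upstream.ok) return new Response("transcription failed", { status: 502 });
  const { text } = await upstream.json();
  return new Response(JSON.stringify({ text }), {
    headers: { "Content-Type": "application/json" },
  });
}

// Register the handler only when running under the Deno runtime.
(globalThis as any).Deno?.serve?.(handler);
```

Note that `Content-Type` is deliberately not set by hand on the upstream request: `fetch` derives the correct multipart boundary from the `FormData` body.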

We chose to route through a Supabase Edge Function rather than calling OpenAI directly from the browser for two reasons. First, it keeps our API key on the server side where it belongs. Embedding an OpenAI API key in client-side JavaScript would be a security disaster. Second, the Edge Function lets us add middleware logic: we validate the audio format, enforce a maximum recording duration of 5 minutes, check the user's subscription tier, and log usage for billing purposes.
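Those middleware checks can be sketched as a small guard. The 5-minute cap comes from the paragraph above; the byte ceiling, allow-list, and error strings are assumptions for illustration.

```typescript
const MAX_SECONDS = 5 * 60;             // maximum recording duration (from above)
const MAX_BYTES = 10 * 1024 * 1024;     // assumed safety ceiling on upload size
const ALLOWED_TYPES = ["audio/webm", "audio/mp4", "audio/ogg"]; // assumed allow-list

// Returns an error message for a bad upload, or null if the request may proceed.
export function validateUpload(
  contentType: string,
  sizeBytes: number,
  durationSeconds: number,
): string | null {
  if (!ALLOWED_TYPES.some((t) => contentType.startsWith(t))) {
    return "unsupported audio format";
  }
  if (durationSeconds > MAX_SECONDS) {
    return "recording exceeds the 5 minute limit";
  }
  if (sizeBytes > MAX_BYTES) {
    return "payload too large";
  }
  return null;
}
```

Running the guard before the OpenAI call means a malformed or oversized upload fails fast, without spending a transcription request.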

Microphone button states: idle, recording, done

Latency Optimization

Two seconds feels instant for a voice transcription, but we worked to get it there. The Supabase Edge Function runs on Deno Deploy, which means it executes at the edge location closest to the user. For most of our users in the Americas, that means the function runs in a US data center with sub-50ms latency to both the user and OpenAI's API servers.

We also optimized the audio payload. By using Opus codec compression inside the WebM container, a 30-second voice recording is typically around 50-80 KB rather than the several megabytes you would get with uncompressed WAV. Smaller payload means faster upload, especially on mobile networks where bandwidth is constrained. We experimented with chunked streaming — sending audio segments as they are recorded — but the added complexity was not worth the marginal latency improvement for recordings under 2 minutes.
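The payload arithmetic works out if you assume Opus encodes speech at roughly 16 kbps (our assumed bitrate, not a measured figure):

```typescript
// Estimated compressed size in bytes for a recording at a given bitrate (kbps).
export function estimateAudioBytes(seconds: number, kbps: number): number {
  return (seconds * kbps * 1000) / 8;
}

// Uncompressed PCM size: 16-bit mono WAV at a given sample rate.
export function wavBytes(seconds: number, sampleRate = 44100, bytesPerSample = 2): number {
  return seconds * sampleRate * bytesPerSample;
}
```

At ~16 kbps, 30 seconds of Opus is about 60 KB, in line with the 50-80 KB range above, while the same 30 seconds of 16-bit 44.1 kHz mono WAV is roughly 2.6 MB, around 40x larger.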

One subtle optimization: we pre-warm the Edge Function with a lightweight health check on page load, so the Deno runtime is already initialized when the user clicks the mic button. Cold starts on Deno Deploy are fast (under 100ms), but eliminating even that small delay makes the experience feel more responsive.

Where Voice Input Works in SalesSheet

Voice input is available in three key areas of the app, each serving a different use case. The first and most popular is the AI chat input. Users can speak naturally — "show me all deals closing this month over fifty thousand" or "create a new contact named Sarah Chen at Stripe" — and the transcription feeds directly into the AI assistant. This is especially useful on mobile, where typing long natural language queries on a small keyboard is painful.

The second integration point is the note composer. When you are on a contact record and want to log a quick note after a meeting, you can tap the mic and dictate instead of typing. The transcription includes proper punctuation and paragraph breaks, so the note is ready to save without editing. For salespeople who are between meetings or driving, this turns a task they would skip into one they actually do.

The third is the email body composer. Dictating an email draft and then reviewing it before sending is significantly faster than typing from scratch, especially for longer follow-up emails where you need to reference specific discussion points. The Voice DNA feature can then refine the dictated text to match your personal writing style.

Privacy and Data Handling

Voice data is sensitive. We treat audio recordings with the same care as any other user data in SalesSheet. The audio blob is transmitted over HTTPS to our Edge Function, forwarded to OpenAI for transcription, and then immediately discarded. We do not store the audio recording anywhere — not in our database, not in cloud storage, not in logs. The only thing that persists is the resulting text, which lives in whatever context the user created it in (a note, an email draft, a chat message).

OpenAI's API data usage policy states that API inputs are not used to train their models, which was an important factor in our provider selection. For users in regulated industries who are concerned about any third-party processing, we document exactly what data flows where in our privacy policy. We have also explored the possibility of running the open-source Whisper model on our own infrastructure for enterprise customers who require zero third-party data processing, though this is not yet available.

What We Learned

The biggest lesson from this migration was that browser-native APIs are not always the right choice, even when they seem like the obvious one. The Web Speech API saved us from running backend infrastructure, but at the cost of reliability, accuracy, and cross-browser support. Moving to a server-side model added a small amount of latency and a per-request cost, but gave us a feature we could actually ship to every user with confidence.

Looking ahead, we are watching OpenAI's real-time speech API closely. The ability to stream audio to the model and get back streaming transcription would enable live dictation with word-by-word display, similar to what the Web Speech API promised but never reliably delivered. We are also exploring voice commands beyond text input — imagine saying "call John and schedule a follow-up for Thursday" and having the CRM execute both actions. The foundation is there. The AI-native architecture we built makes adding these capabilities a matter of connecting existing pieces rather than building from scratch.

Try SalesSheet Free

No credit card required. Start selling smarter today.

Start Free Trial