PRODUCT

Every Call, Every Language: Multi-Language Transcription

Andres Muguira · February 26, 2026 · 7 min read
Calling · Transcription · LATAM · Multilingual

The LATAM Problem

SalesSheet was born in Latin America. Our earliest customers were sales teams in Mexico, Colombia, and Chile. They sold across borders -- a rep in Mexico City calling a prospect in Bogotá, following up with a lead in São Paulo, then joining a group call with the US headquarters. Three languages in a single afternoon. Sometimes two languages in a single call.

Every CRM with built-in calling offered transcription. None of them handled this reality. The typical setup required users to select the call language before dialing. If a call started in Spanish and switched to English when the American VP joined, the transcription model -- locked to Spanish -- would produce gibberish for the English portions. For bilingual reps who code-switch mid-sentence (a completely normal behavior in LATAM sales), the transcriptions were useless.

A transcription tool that makes you choose the language before the call starts does not understand how multilingual teams actually work. Languages switch mid-conversation, sometimes mid-sentence. Your transcription needs to keep up.
In-call screen with live Spanish transcription and automatic language detection

Auto-Detection: No Language Selection Required

SalesSheet's calling feature does not ask you to select a language. When a call ends, the recording is sent to our transcription pipeline, which runs automatic language detection on short audio segments (5-10 seconds each). Each segment is independently classified, so a call that starts in Spanish, switches to English at minute 3, and returns to Spanish at minute 7 produces a correct transcript throughout.
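The segment-level approach can be sketched as follows. This is a simplified illustration, not our production pipeline: `classify` is a hypothetical stand-in for the detection model, and the fixed 8-second window approximates the 5-10 second segments described above.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start_s: float   # segment start time in seconds
    end_s: float     # segment end time in seconds
    language: str    # detected language code, e.g. "es" or "en"

def detect_languages(total_s: float, classify, window_s: float = 8.0):
    """Split a call into fixed windows and classify each independently,
    so a mid-call language switch only affects the segments after it."""
    segments = []
    t = 0.0
    while t < total_s:
        end = min(t + window_s, total_s)
        segments.append(Segment(t, end, classify(t, end)))
        t = end
    return segments

# Toy classifier mimicking the example above: Spanish, then English
# from minute 3 to minute 7, then Spanish again.
def toy_classify(start_s, end_s):
    return "en" if 180 <= start_s < 420 else "es"

segments = detect_languages(600, toy_classify)
```

Because each window is classified independently, no single wrong guess can lock the whole call to one language -- the worst case is a few seconds of mis-transcribed audio around a switch point.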

The language detection uses a lightweight classification model that runs ahead of the transcription model. It analyzes phonetic patterns, frequency distributions, and prosodic features (rhythm and intonation) to identify the language with high confidence. For the top 20 languages, detection accuracy exceeds 97%. For less common languages, it is above 92%.

99+ Languages Supported

Our transcription model supports over 99 languages. The most commonly used in SalesSheet are Spanish, English, and Portuguese.

But the long tail matters too. We have customers transcribing calls in Hindi, Thai, Vietnamese, Swahili, and Tagalog. Every language gets the same treatment: auto-detection, speaker separation, and AI-generated summaries.

Speaker Separation (Diarization)

Knowing what was said is only useful if you know who said it. Speaker diarization identifies which parts of the audio belong to which speaker. In a two-person call, the transcript labels each utterance as either "You" (the SalesSheet user) or the contact's name (pulled from the CRM record). In a multi-party call, speakers are labeled as Speaker 1, Speaker 2, etc., with the SalesSheet user always identified by name.
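The labeling rules above can be expressed as a small mapping step. This is a hypothetical helper, not our actual code; it assumes the diarizer has already matched one raw speaker id to the SalesSheet user's voice, and it labels the user "You" in multi-party calls as well.

```python
def label_speakers(speaker_ids, user_id, contact_name=None):
    """Map raw diarization speaker ids to display labels.

    Two-person call with a known contact: the other speaker gets the
    CRM contact's name. Otherwise, other speakers are numbered.
    """
    others = sorted(sid for sid in speaker_ids if sid != user_id)
    labels = {user_id: "You"}
    if len(others) == 1 and contact_name:
        labels[others[0]] = contact_name      # two-person call: CRM name
    else:
        for i, sid in enumerate(others, start=1):
            labels[sid] = f"Speaker {i}"      # multi-party: numbered
    return labels
```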

Diarization is harder than it sounds. People interrupt each other. They talk over each other. They pause mid-sentence and someone else fills the gap. Our diarization model uses a combination of voice embedding (a mathematical fingerprint of each speaker's voice characteristics) and turn-taking analysis (patterns of who speaks after whom) to maintain accurate speaker attribution even during cross-talk.

Full transcript with speaker labels (You in teal, Carlos in blue) and Spanish language badge

The Cross-Talk Challenge

Cross-talk -- when two people speak simultaneously -- is the hardest scenario for both transcription and diarization. Our approach handles it in two steps. First, a source separation model isolates the overlapping voices into separate audio streams. Then each stream is transcribed and attributed independently. The result is two overlapping transcript segments with correct speaker labels and timestamps. In the UI, cross-talk segments are displayed as parallel speech bubbles, making it clear that both speakers were talking at the same time.
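The two-step flow can be sketched like this. `separate` and `transcribe` are hypothetical stand-ins for the source-separation and ASR models; the point is the shape of the pipeline, not the models themselves.

```python
def transcribe_crosstalk(mixed_audio, separate, transcribe):
    """Two-step cross-talk handling: isolate overlapping voices into
    separate streams, then transcribe and attribute each independently."""
    streams = separate(mixed_audio)            # one audio stream per speaker
    segments = []
    for speaker, stream in enumerate(streams, start=1):
        for text, start_s, end_s in transcribe(stream):
            segments.append({
                "speaker": f"Speaker {speaker}",
                "text": text,
                "start_s": start_s,
                "end_s": end_s,
            })
    # Sort by timestamp so overlapping segments sit side by side,
    # ready to render as parallel speech bubbles.
    return sorted(segments, key=lambda s: s["start_s"])

# Toy stand-ins for the two models.
def toy_separate(audio):
    return ["stream_a", "stream_b"]

def toy_transcribe(stream):
    return [(f"words from {stream}", 0.0, 2.0)]

result = transcribe_crosstalk("mixed", toy_separate, toy_transcribe)
```

Because each separated stream is transcribed on its own, both overlapping segments keep correct timestamps even though they cover the same span of the recording.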

AI-Generated Call Summaries

A raw transcript of a 30-minute sales call is thousands of words long. Nobody reads that. What reps want is a summary: what was discussed, what was agreed, what the next steps are. SalesSheet automatically generates a structured summary for every transcribed call, covering the key topics discussed, the action items agreed, and the impact on the deal.

The summary is generated in the same language as the majority of the call. For mixed-language calls, the user can choose which language the summary should be written in. A common pattern for LATAM teams: the call was in Spanish, but the summary is in English so the US-based manager can read it without translation.
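The language choice reduces to a simple rule: explicit user preference first, otherwise the language with the most speaking time. A minimal sketch, assuming the call has already been reduced to `(language, duration)` pairs:

```python
def summary_language(segments, override=None):
    """Pick the summary language: the user's explicit choice if given,
    otherwise the language covering the most speaking time.

    `segments` is a list of (language_code, duration_s) pairs.
    """
    if override:
        return override
    totals = {}
    for lang, duration in segments:
        totals[lang] = totals.get(lang, 0.0) + duration
    return max(totals, key=totals.get)
```

For the LATAM pattern described above, a mostly-Spanish call defaults to a Spanish summary, and passing `override="en"` produces the English summary for the US-based manager.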

AI-generated summary of a Spanish call with key topics, action items, and deal impact
The call summary is more valuable than the full transcript for 90% of use cases. Reps read the summary. Managers read the summary. The full transcript exists for when someone says "wait, what exactly did they say about the timeline?"

Handling Regional Accents and Dialects

Spanish is not one language. The Spanish spoken in Mexico City sounds different from the Spanish spoken in Buenos Aires, which sounds different from the Spanish spoken in Madrid. The same is true for Portuguese (Brazilian vs. European), English (American vs. British vs. Indian), and many other languages.

Our transcription model handles regional variation in two ways. First, the model is trained on a diverse corpus that includes recordings from multiple regions for each language. It does not assume all Spanish sounds the same. Second, when the auto-detection identifies the specific regional variant (e.g., Mexican Spanish vs. Argentine Spanish), it applies region-specific vocabulary adjustments. In Argentine Spanish, "vos" is used instead of "tú." In Mexican Spanish, "ahorita" means "right now" (or "sometime in the indefinite future," depending on tone). The model knows these differences.

Code-Switching

Code-switching is when a bilingual speaker switches between languages within a single conversation, sometimes within a single sentence. "Let me send you the proposal, pero primero necesito confirmar los números con mi equipo" ("but first I need to confirm the numbers with my team"). This is extremely common in LATAM sales calls where the rep and prospect share both Spanish and English.

Our pipeline handles code-switching at the segment level. When the language detection model identifies a switch point, it splits the audio at that boundary and routes each segment to the appropriate language model. The result is a transcript that seamlessly moves between languages, with each portion accurately transcribed in its original language. We do not translate the code-switched segments -- we transcribe them as spoken, because the language mixing is intentional and meaningful.
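The routing step is conceptually simple: each detected segment goes to the model for its language, and the results are stitched back together in order. A sketch, with `models` as a hypothetical map from language code to a transcription callable:

```python
def route_segments(detected, models):
    """Route each detected segment to the matching language model and
    reassemble the transcript in original order.

    `detected` is a list of (language_code, audio_chunk) pairs produced
    by the detection pass; `models` maps language codes to callables.
    """
    transcript = []
    for lang, chunk in detected:
        transcriber = models[lang]           # per-language ASR model
        transcript.append((lang, transcriber(chunk)))
    return transcript

# Toy models that just tag their input, to show the data flow.
models = {"es": lambda c: f"es:{c}", "en": lambda c: f"en:{c}"}
detected = [("es", "hola"), ("en", "hello"), ("es", "adios")]
transcript = route_segments(detected, models)
```

Note that nothing here translates: each chunk is transcribed in the language it was spoken in, preserving the mixing.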

Searchable Transcripts

Every transcript is full-text indexed and searchable from the AI assistant and the global search bar. You can search across all your calls with queries like "calls where we discussed pricing" or "find the call where Sarah mentioned their budget." The search works across all languages -- you can search for a Spanish word and find it in a Spanish call transcript, even if your interface language is English.

The search index also powers the AI assistant's call-related tools. When you ask the AI "what did the Acme team say about the timeline?", the assistant searches your call transcripts for relevant mentions, extracts the specific quotes, and presents them with timestamps so you can jump to that exact moment in the recording.
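At its core, cross-language search works because the index stores words as spoken, regardless of the interface language. A toy inverted index illustrates the idea; the real index also handles stemming, phrases, and the semantic queries shown above.

```python
from collections import defaultdict

def build_index(transcripts):
    """Build a toy inverted index: word -> set of call ids.

    `transcripts` maps call ids to transcript text. Words are indexed
    in their original language, so a Spanish query matches Spanish calls.
    """
    index = defaultdict(set)
    for call_id, text in transcripts.items():
        for word in text.lower().split():
            index[word.strip(".,?!")].add(call_id)
    return index

def search(index, word):
    """Return the ids of calls whose transcript contains the word."""
    return sorted(index.get(word.lower(), set()))

transcripts = {
    "call-1": "Hablamos del presupuesto y los plazos.",
    "call-2": "We discussed pricing and the timeline.",
}
index = build_index(transcripts)
```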

Performance and Processing Time

Call transcription happens asynchronously after the call ends. The processing pipeline runs in three stages:

  1. Language detection -- 2-5 seconds regardless of call length
  2. Diarization -- approximately 0.3x real-time (a 30-minute call takes about 9 minutes to diarize)
  3. Transcription -- approximately 0.1x real-time (a 30-minute call takes about 3 minutes to transcribe)
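The stage figures above compose into a simple estimate, assuming the stages run sequentially:

```python
def processing_time_s(call_minutes, detection_s=5,
                      diarize_rate=0.3, transcribe_rate=0.1):
    """Estimate total processing time from the per-stage figures:
    fixed-cost language detection, then diarization at 0.3x real time,
    then transcription at 0.1x real time, run one after another."""
    call_s = call_minutes * 60
    return detection_s + diarize_rate * call_s + transcribe_rate * call_s

# 30-minute call: 5 + 540 + 180 = 725 s (about 12 minutes)
# 15-minute call: 5 + 270 + 90  = 365 s (about 6 minutes)
```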

Total processing time for a typical 15-minute sales call is about six minutes. The transcript and summary are usually ready by the time the rep finishes writing their post-call notes. For most users, the experience feels nearly instant -- by the time they navigate to the call record, the transcript is already there.

Multi-language transcription is not a checkbox feature for us. It is a reflection of who our users are: multilingual sales teams working across borders, switching languages as naturally as they switch browser tabs. Building transcription that works for them meant building transcription that works like they work -- automatically, across languages, without asking them to think about it.
