Tutorial 24: Call Recording Transcription with Faster-Whisper
Batch transcription of ViciDial call recordings using Faster-Whisper (OpenAI Whisper optimized with CTranslate2) for speech-to-text at scale — on CPU, without cloud APIs.
Table of Contents
- Introduction
- Architecture Overview
- Prerequisites
- Installation
- Understanding ViciDial Recording Files
- Single-File Transcription
- Batch Transcription Script
- Database-Driven Batch Transcription
- Output Formats
- Integration with ViciDial
- Performance Tuning
- Production Deployment
- Troubleshooting
Introduction
Call centers generate thousands of recordings daily. Manually reviewing them is impossible at scale. Automatic transcription unlocks:
- Quality assurance — search transcripts for compliance phrases, forbidden words, or script adherence
- Agent coaching — identify calls where agents struggled, deviated from scripts, or handled objections well
- Customer insights — aggregate what callers ask about, common complaints, sentiment patterns
- Dispute resolution — quickly find what was said without listening to the full recording
- AI pipeline input — feed transcripts into LLMs for summarization, classification, or workflow automation
OpenAI's Whisper is among the most accurate open-source speech recognition models available. Faster-Whisper is a reimplementation using CTranslate2 that runs up to 4x faster than the original while using less memory, making it practical to transcribe thousands of calls on a standard server CPU without a GPU or cloud API costs.
What we built: A system that transcribes 500 inbound call recordings (15+ hours of audio) entirely on CPU, producing individual text files, structured JSON metadata, and a combined markdown document — with resume support so it can survive interruptions.
Architecture Overview
ViciDial Server
├── /var/spool/asterisk/monitorDONE/ ← Raw WAV recordings
│ └── MP3/ ← Converted MP3 recordings
│ └── YYYYMMDD-HHMMSS_phone-all.mp3
│
├── MariaDB (asterisk database)
│ ├── recording_log ← filename, vicidial_id, location, start_time
│ ├── vicidial_closer_log ← inbound call metadata (agent, status, duration)
│ └── vicidial_log ← outbound call metadata
│
├── Python 3.11 + faster-whisper
│ ├── transcribe_single.py ← One-off transcription
│ ├── transcribe_batch.py ← Batch from file list
│ └── transcribe_db.py ← Database-driven with metadata
│
└── Output
├── individual .txt files ← One per recording
├── _summary.json ← Structured metadata + stats
└── _all_transcriptions.md ← Combined readable document
Processing flow:
- Query the ViciDial database to select recordings by date range, campaign, agent, or call status
- Verify the corresponding MP3/WAV files exist on disk
- Load the Faster-Whisper model once (the expensive step)
- Transcribe each file sequentially, saving results after every file (crash-safe)
- Produce individual text files, a JSON summary, and optionally a combined document
Prerequisites
| Component | Version | Purpose |
|---|---|---|
| Python | 3.11+ | Runtime (3.11 recommended for performance) |
| faster-whisper | 1.2.x | Whisper inference engine |
| CTranslate2 | 4.x | Optimized transformer inference |
| FFmpeg | 4.x+ | Audio format conversion |
| MariaDB client | Any | Database queries for recording metadata |
| Server RAM | 4GB+ minimum | Model loading (small model ~2GB) |
| Disk space | ~5GB | Model cache + transcription output |
Hardware reality check: The "small" model on CPU with int8 quantization transcribes roughly 1 minute of audio in 15-25 seconds on a modern server (8+ cores). A 500-call batch of 15 hours of audio takes approximately 4-6 hours of wall time. This is practical for nightly batch jobs.
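The wall-time estimate above is simple arithmetic; a tiny helper makes batch planning explicit (the 15-25 s per audio-minute figure is an assumption that varies with hardware and model choice):

```python
def estimate_batch_hours(audio_minutes, sec_per_audio_min=20):
    """Estimate wall-clock hours to transcribe a batch on CPU.

    sec_per_audio_min: seconds of processing per minute of audio;
    roughly 15-25 for the 'small' model with int8 on 8+ cores.
    """
    return audio_minutes * sec_per_audio_min / 3600

# 15 hours of audio (900 minutes) at 20 s per audio-minute:
print(round(estimate_batch_hours(900), 1))  # 5.0
```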
Installation
Step 1: Install Python 3.11
If your server runs CentOS 7 or an older distribution, Python 3.11 may not be in the default repositories.
# Debian/Ubuntu
apt update && apt install -y python3.11 python3.11-venv python3.11-dev
# CentOS 7 / RHEL 7 (from source)
yum install -y gcc openssl-devel bzip2-devel libffi-devel zlib-devel xz-devel
cd /usr/src
wget https://www.python.org/ftp/python/3.11.8/Python-3.11.8.tgz
tar xzf Python-3.11.8.tgz
cd Python-3.11.8
./configure --enable-optimizations
make altinstall # 'altinstall' avoids overwriting system python
# openSUSE
zypper install python311 python311-devel
Step 2: Install FFmpeg
Faster-Whisper uses FFmpeg internally (via the av library) to decode audio files.
# Debian/Ubuntu
apt install -y ffmpeg
# CentOS 7
yum install -y epel-release
yum install -y ffmpeg ffmpeg-devel
# openSUSE
zypper install ffmpeg
Step 3: Install Faster-Whisper
pip3.11 install faster-whisper
This pulls in the key dependencies:
- `ctranslate2` — optimized C++ inference engine
- `huggingface-hub` — model download from Hugging Face
- `av` — FFmpeg Python bindings for audio decoding
- `onnxruntime` — used for VAD (voice activity detection)
- `tokenizers` — text tokenization
Step 4: Install MySQL/MariaDB Client Library
Required only if you plan to query ViciDial's database for recording metadata.
pip3.11 install mysql-connector-python
Step 5: Verify Installation
python3.11 -c "
from faster_whisper import WhisperModel
print('faster-whisper imported successfully')
import ctranslate2
print(f'CTranslate2 version: {ctranslate2.__version__}')
"
Step 6: Pre-Download the Model
The first transcription triggers a model download (~1GB for "small"). Pre-download it to avoid delays:
python3.11 -c "
from faster_whisper import WhisperModel
print('Downloading small model...')
model = WhisperModel('small', device='cpu', compute_type='int8')
print('Model ready.')
"
Models are cached in ~/.cache/huggingface/hub/.
Understanding ViciDial Recording Files
File Naming Convention
ViciDial recordings follow this naming pattern:
YYYYMMDD-HHMMSS_AGENT_PHONENUMBER-all.ext
Examples:
20260313094154_1004_88447787563580-all.wav ← Raw WAV
20260130130931_process_447418315755-all.mp3 ← Converted MP3
- YYYYMMDD-HHMMSS — recording start timestamp (some setups write it without the hyphen, as in the examples above)
- Middle field — extension/agent number or "process" for auto-processed files
- Phone number — the external phone number (caller or dialed)
- -all suffix — indicates both sides of the conversation (mixed)
- .wav — raw uncompressed recording
- .mp3 — compressed version (created by ViciDial's audio conversion process)
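Downstream scripts often need the timestamp, agent, and phone number separately. A small parser sketch for the layout above (the field names are my own, and the date-time hyphen is treated as optional to match both forms shown):

```python
import re

# Matches e.g. 20260130130931_process_447418315755-all.mp3
# or 20260216-153000_447974040560_1002-all.wav style names with
# an underscore-delimited middle field and trailing "-all.<ext>".
PATTERN = re.compile(
    r"^(?P<date>\d{8})-?(?P<time>\d{6})"   # start timestamp, hyphen optional
    r"_(?P<channel>[^_]+)"                 # agent/extension or "process"
    r"_(?P<phone>\d+)"                     # external phone number
    r"-all\.(?P<ext>wav|mp3)$"
)

def parse_recording_name(fname):
    """Return a dict of filename fields, or None if the name doesn't match."""
    m = PATTERN.match(fname)
    return m.groupdict() if m else None

info = parse_recording_name("20260130130931_process_447418315755-all.mp3")
print(info["phone"])  # 447418315755
```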
File Locations
/var/spool/asterisk/monitorDONE/ ← Raw WAV files
/var/spool/asterisk/monitorDONE/MP3/ ← Converted MP3 files
/var/spool/asterisk/monitorDONE/ORIG/ ← Original pre-mix files (optional)
/var/spool/asterisk/monitorDONE/FTP/ ← Files pending FTP transfer
Use MP3 files for transcription — they are smaller (faster to read) and Faster-Whisper handles them natively. WAV files work too but are 5-10x larger.
Database Tables
The recording_log table links recordings to call metadata:
-- Find a recording by filename
SELECT recording_id, filename, location, start_time, length_in_sec, vicidial_id
FROM recording_log
WHERE filename LIKE '%447418315755%';
-- The vicidial_id links to either:
-- vicidial_closer_log.closecallid (inbound calls)
-- vicidial_log.uniqueid (outbound calls)
Single-File Transcription
Start with the simplest case: transcribe one recording file.
Basic Script
#!/usr/bin/env python3.11
"""Transcribe a single call recording."""
import sys
from faster_whisper import WhisperModel
if len(sys.argv) < 2:
print("Usage: python3.11 transcribe_single.py <audio_file>")
sys.exit(1)
audio_file = sys.argv[1]
# Load model — 'small' is the sweet spot for call center audio
# CPU + int8 quantization keeps memory usage around 1-2 GB
print("Loading model...", flush=True)
model = WhisperModel("small", device="cpu", compute_type="int8")
# Transcribe with English language forced (skip detection overhead)
print(f"Transcribing: {audio_file}", flush=True)
segments, info = model.transcribe(audio_file, language="en", beam_size=5)
print(f"Duration: {round(info.duration)}s")
print(f"Language confidence: {round(info.language_probability, 3)}")
print(f"---")
for segment in segments:
print(f"[{segment.start:.1f}s -> {segment.end:.1f}s] {segment.text.strip()}")
Run it:
python3.11 transcribe_single.py /var/spool/asterisk/monitorDONE/MP3/20260130130931_process_447418315755-all.mp3
Output:
Loading model...
Transcribing: /var/spool/asterisk/monitorDONE/MP3/20260130130931_process_447418315755-all.mp3
Duration: 246s
Language confidence: 0.997
---
[0.0s -> 4.2s] Call from Google. Hello, how can I help you?
[4.2s -> 8.8s] Hi, I called a little bit earlier. I'm a social worker looking for an estimate...
[8.8s -> 13.5s] ...for a client's plumbing and I've got the postcode now.
...
Key Parameters Explained
model = WhisperModel(
"small", # Model size (see Performance Tuning section)
device="cpu", # "cpu" or "cuda" (GPU)
compute_type="int8" # Quantization: "int8" (fastest CPU), "float16" (GPU), "float32" (most accurate)
)
segments, info = model.transcribe(
audio_file,
language="en", # Force English (faster than auto-detect)
beam_size=5, # Search width — 5 is good balance of speed/accuracy
vad_filter=True, # Optional: skip silence (faster for calls with holds)
vad_parameters=dict(
min_silence_duration_ms=500 # Minimum silence to split on
)
)
Batch Transcription Script
For processing multiple files from a list, with resume support and progress tracking.
transcribe_batch.py
#!/usr/bin/env python3.11
"""
Batch transcribe call recordings from a file list.
Resume-safe: skips files that have already been transcribed.
Saves individual .txt files + a summary JSON.
"""
import sys
import os
import json
import time
from faster_whisper import WhisperModel
# === Configuration ===
INPUT_LIST = "/path/to/recording_list.txt" # One file path per line
OUTPUT_DIR = "/path/to/transcriptions" # Output directory
SUMMARY_FILE = os.path.join(OUTPUT_DIR, "_summary.json")
# === Setup ===
os.makedirs(OUTPUT_DIR, exist_ok=True)
# Read the list of files to transcribe
with open(INPUT_LIST) as f:
files = [line.strip() for line in f if line.strip()]
print(f"Found {len(files)} files in list", flush=True)
# Check for already-completed files (resume support)
already_done = set()
existing_results = []
if os.path.exists(SUMMARY_FILE):
with open(SUMMARY_FILE) as f:
existing_results = json.load(f)
already_done = {r["file"] for r in existing_results}
print(f"Resuming: {len(already_done)} already done, "
f"{len(files) - len(already_done)} remaining", flush=True)
remaining = [fp for fp in files if os.path.basename(fp) not in already_done]
if not remaining:
print("All files already transcribed!", flush=True)
sys.exit(0)
# === Load Model ===
print("Loading whisper model (small, CPU, int8)...", flush=True)
model = WhisperModel("small", device="cpu", compute_type="int8")
print("Model loaded.", flush=True)
# === Transcribe ===
start_time = time.time()
results = list(existing_results) # Carry forward previous results
for i, filepath in enumerate(remaining):
if not os.path.exists(filepath):
print(f"[{i+1}/{len(remaining)}] SKIP (not found): {filepath}", flush=True)
continue
fname = os.path.basename(filepath)
# Create output .txt filename (replace audio extension with .txt)
txt_name = fname.rsplit(".", 1)[0] + ".txt"
txt_path = os.path.join(OUTPUT_DIR, txt_name)
print(f"[{i+1}/{len(remaining)}] Transcribing: {fname}...", flush=True)
try:
segments, info = model.transcribe(filepath, language="en", beam_size=5)
# Collect all segment text
text_parts = []
for segment in segments:
text_parts.append(segment.text.strip())
full_text = " ".join(text_parts)
duration = round(info.duration, 1)
# Save individual transcript file
with open(txt_path, "w") as f:
f.write(f"File: {fname}\n")
f.write(f"Duration: {duration}s ({round(duration / 60, 1)} min)\n")
f.write(f"Characters: {len(full_text)}\n")
f.write(f"---\n\n")
f.write(full_text)
# Add to results
results.append({
"file": fname,
"duration_sec": duration,
"duration_min": round(duration / 60, 1),
"language_prob": round(info.language_probability, 3),
"chars": len(full_text),
"txt_file": txt_name
})
# Save summary after EVERY file (crash-safe)
with open(SUMMARY_FILE, "w") as f:
json.dump(results, f, indent=2, ensure_ascii=False)
# Progress and ETA calculation
elapsed = time.time() - start_time
avg_per_file = elapsed / (i + 1)
eta_seconds = avg_per_file * (len(remaining) - i - 1)
print(f" Done ({round(duration)}s audio, {len(full_text)} chars) "
f"| ETA: {round(eta_seconds / 60)}m remaining", flush=True)
except Exception as e:
print(f" ERROR: {e}", flush=True)
# === Final Report ===
total_duration = sum(r["duration_sec"] for r in results)
total_chars = sum(r["chars"] for r in results)
elapsed = time.time() - start_time
print(f"\n{'=' * 60}", flush=True)
print(f"DONE! {len(results)} files transcribed", flush=True)
print(f"Total audio: {round(total_duration / 3600, 1)} hours", flush=True)
print(f"Total text: {total_chars:,} characters", flush=True)
print(f"Processing time: {round(elapsed / 3600, 1)} hours", flush=True)
print(f"Output: {OUTPUT_DIR}", flush=True)
Creating the File List
Generate a list of recordings to transcribe:
# All MP3 files from the last 7 days
find /var/spool/asterisk/monitorDONE/MP3/ -name "*.mp3" -mtime -7 > /root/recent_recordings.txt
# All recordings from a specific date
ls /var/spool/asterisk/monitorDONE/MP3/20260301* > /root/march1_recordings.txt
# Count files
wc -l /root/recent_recordings.txt
Running the Batch
# Run at low CPU priority so it doesn't affect call processing
nice -n 19 python3.11 transcribe_batch.py
# Or run in the background with nohup
nohup nice -n 19 python3.11 transcribe_batch.py > /root/transcribe.log 2>&1 &
# Monitor progress
tail -f /root/transcribe.log
Database-Driven Batch Transcription
The most powerful approach: query ViciDial's database to select exactly the calls you want, then transcribe them with full metadata attached to each transcript.
transcribe_db.py
#!/usr/bin/env python3.11
"""
Database-driven batch transcription.
Queries ViciDial to select calls by date/campaign/agent/status,
verifies MP3 files exist, transcribes with full metadata.
Resume-safe. Produces .txt files + JSON summary + combined markdown.
"""
import sys
import os
import json
import time
import argparse
import mysql.connector
from faster_whisper import WhisperModel
# === Configuration ===
DB_HOST = "127.0.0.1"
DB_USER = "your_db_user" # Use a read-only user
DB_PASS = "your_db_password"
DB_NAME = "asterisk"
MP3_BASE = "/var/spool/asterisk/monitorDONE/MP3"
# === Argument Parsing ===
parser = argparse.ArgumentParser(description="Transcribe ViciDial call recordings")
parser.add_argument("--output-dir", default="/root/transcriptions",
help="Output directory for transcripts")
parser.add_argument("--start-date", default="2026-01-01",
help="Start date (YYYY-MM-DD)")
parser.add_argument("--end-date", default=None,
help="End date (YYYY-MM-DD), defaults to now")
parser.add_argument("--campaign", default=None,
help="Filter by campaign ID")
parser.add_argument("--agent", default=None,
help="Filter by agent user ID")
parser.add_argument("--min-duration", type=int, default=30,
help="Minimum call duration in seconds (default: 30)")
parser.add_argument("--max-duration", type=int, default=600,
help="Maximum call duration in seconds (default: 600)")
parser.add_argument("--limit", type=int, default=100,
help="Maximum number of calls to transcribe")
parser.add_argument("--call-type", choices=["inbound", "outbound", "both"],
default="inbound", help="Call type to transcribe")
parser.add_argument("--model", default="small",
help="Whisper model size (tiny/base/small/medium/large-v3)")
parser.add_argument("--language", default="en",
help="Language code (en, it, es, etc.) or 'auto' for detection")
args = parser.parse_args()
OUTPUT_DIR = args.output_dir
SUMMARY_FILE = os.path.join(OUTPUT_DIR, "_summary.json")
COMBINED_MD = os.path.join(OUTPUT_DIR, "_all_transcriptions.md")
LOG_FILE = os.path.join(OUTPUT_DIR, "_transcribe.log")
os.makedirs(OUTPUT_DIR, exist_ok=True)
def log(msg):
"""Print and log a timestamped message."""
line = f"[{time.strftime('%Y-%m-%d %H:%M:%S')}] {msg}"
print(line, flush=True)
with open(LOG_FILE, "a") as f:
f.write(line + "\n")
# === Step 1: Query Database ===
log("Connecting to ViciDial database...")
db = mysql.connector.connect(
host=DB_HOST, user=DB_USER, password=DB_PASS, database=DB_NAME
)
cursor = db.cursor(dictionary=True)
if args.call_type in ("inbound", "both"):
# Build inbound query
query = """
SELECT cl.closecallid AS call_id,
'inbound' AS call_type,
cl.call_date,
cl.length_in_sec,
cl.status,
cl.phone_number,
cl.campaign_id,
cl.queue_seconds,
cl.user AS agent,
cl.term_reason,
rl.filename,
rl.location
FROM vicidial_closer_log cl
JOIN recording_log rl ON rl.vicidial_id = cl.closecallid
WHERE cl.call_date >= %s
AND cl.length_in_sec >= %s
AND cl.length_in_sec <= %s
AND cl.status NOT IN ('DROP', 'TIMEOT', 'NANQUE', 'AFTHRS', 'QVMAIL')
"""
params = [args.start_date, args.min_duration, args.max_duration]
if args.end_date:
query += " AND cl.call_date <= %s"
params.append(args.end_date + " 23:59:59")
if args.campaign:
query += " AND cl.campaign_id = %s"
params.append(args.campaign)
if args.agent:
query += " AND cl.user = %s"
params.append(args.agent)
query += " ORDER BY cl.call_date DESC LIMIT %s"
params.append(args.limit)
    cursor.execute(query, params)
    calls = cursor.fetchall()
else:
    calls = []

if args.call_type in ("outbound", "both"):
    query = """
    SELECT vl.uniqueid AS call_id,
           'outbound' AS call_type,
           vl.call_date,
           vl.length_in_sec,
           vl.status,
           vl.phone_number,
           vl.campaign_id,
           0 AS queue_seconds,
           vl.user AS agent,
           vl.term_reason,
           rl.filename,
           rl.location
    FROM vicidial_log vl
    JOIN recording_log rl ON rl.vicidial_id = vl.uniqueid
    WHERE vl.call_date >= %s
      AND vl.length_in_sec >= %s
      AND vl.length_in_sec <= %s
      AND vl.status NOT IN ('DROP', 'NA', 'B', 'DC', 'N')
    """
    params = [args.start_date, args.min_duration, args.max_duration]
    if args.end_date:
        query += " AND vl.call_date <= %s"
        params.append(args.end_date + " 23:59:59")
    if args.campaign:
        query += " AND vl.campaign_id = %s"
        params.append(args.campaign)
    if args.agent:
        query += " AND vl.user = %s"
        params.append(args.agent)
    query += " ORDER BY vl.call_date DESC LIMIT %s"
    params.append(args.limit)
    cursor.execute(query, params)
    # extend, so --call-type both keeps the inbound rows as well
    # (note: --limit applies per query, so "both" can return up to 2x limit)
    calls.extend(cursor.fetchall())
cursor.close()
db.close()
log(f"Query returned {len(calls)} calls")
# === Step 2: Verify MP3 Files Exist ===
verified = []
missing = 0
for c in calls:
mp3_path = f"{MP3_BASE}/{c['filename']}-all.mp3"
if os.path.exists(mp3_path):
c["mp3_path"] = mp3_path
verified.append(c)
else:
missing += 1
log(f"Verified {len(verified)} MP3 files ({missing} missing)")
if not verified:
log("No recordings found. Exiting.")
sys.exit(1)
# === Step 3: Check Resume State ===
already_done = {}
existing_results = []
if os.path.exists(SUMMARY_FILE):
with open(SUMMARY_FILE) as f:
existing_results = json.load(f)
already_done = {r["file"]: r for r in existing_results}
log(f"Resuming: {len(already_done)} done, "
f"{len(verified) - len(already_done)} remaining")
remaining = [c for c in verified
if os.path.basename(c["mp3_path"]) not in already_done]
if not remaining:
log("All calls already transcribed!")
sys.exit(0)
# === Step 4: Load Whisper Model ===
lang_str = args.language if args.language != "auto" else "auto-detect"
log(f"Loading faster-whisper model ({args.model}, CPU, int8, lang={lang_str})...")
model = WhisperModel(args.model, device="cpu", compute_type="int8")
log("Model loaded. Starting transcription...")
# === Step 5: Transcribe ===
start_time = time.time()
results = list(existing_results)
errors = 0
for i, call in enumerate(remaining):
mp3_path = call["mp3_path"]
fname = os.path.basename(mp3_path)
txt_name = fname.replace(".mp3", ".txt")
txt_path = os.path.join(OUTPUT_DIR, txt_name)
log(f"[{i+1}/{len(remaining)}] {fname} "
f"({call['length_in_sec']}s, status={call['status']})...")
try:
# Transcribe — force language or auto-detect
transcribe_kwargs = {"beam_size": 5}
if args.language != "auto":
transcribe_kwargs["language"] = args.language
segments, info = model.transcribe(mp3_path, **transcribe_kwargs)
text_parts = []
for segment in segments:
text_parts.append(segment.text.strip())
full_text = " ".join(text_parts)
duration = round(info.duration, 1)
# Save individual transcript with metadata header
with open(txt_path, "w") as f:
f.write(f"File: {fname}\n")
f.write(f"Date: {call['call_date']}\n")
f.write(f"Duration: {duration}s ({round(duration / 60, 1)} min)\n")
f.write(f"Caller: {call['phone_number']}\n")
f.write(f"Agent: {call['agent']}\n")
f.write(f"Campaign: {call['campaign_id']}\n")
f.write(f"Status: {call['status']}\n")
f.write(f"Term Reason: {call['term_reason']}\n")
f.write(f"Queue Time: {call['queue_seconds']}s\n")
f.write(f"Detected Language: {info.language} "
f"({round(info.language_probability, 3)})\n")
f.write(f"---\n\n")
f.write(full_text)
record = {
"file": fname,
"call_date": str(call["call_date"]),
"call_type": call["call_type"],
"duration_sec": duration,
"duration_min": round(duration / 60, 1),
"phone_number": str(call["phone_number"]),
"agent": str(call["agent"]),
"campaign": str(call["campaign_id"]),
"status": str(call["status"]),
"term_reason": str(call["term_reason"]),
"queue_seconds": int(call["queue_seconds"]) if call["queue_seconds"] else 0,
"detected_language": info.language,
"language_probability": round(info.language_probability, 3),
"chars": len(full_text),
"txt_file": txt_name,
"transcript": full_text
}
results.append(record)
# Crash-safe: save after every file
with open(SUMMARY_FILE, "w") as f:
json.dump(results, f, indent=2, ensure_ascii=False, default=str)
elapsed = time.time() - start_time
avg = elapsed / (i + 1)
eta = avg * (len(remaining) - i - 1)
log(f" OK ({round(duration)}s audio, {len(full_text)} chars) "
f"| ETA: {round(eta / 60)}m")
except Exception as e:
errors += 1
log(f" ERROR: {e}")
# === Step 6: Build Combined Markdown ===
log("Building combined markdown file...")
with open(COMBINED_MD, "w") as f:
f.write(f"# Call Transcriptions\n\n")
f.write(f"Generated: {time.strftime('%Y-%m-%d %H:%M:%S')}\n")
f.write(f"Date range: {args.start_date} to {args.end_date or 'present'}\n")
if args.campaign:
f.write(f"Campaign: {args.campaign}\n")
if args.agent:
f.write(f"Agent: {args.agent}\n")
f.write(f"Total calls: {len(results)}\n")
total_dur = sum(r["duration_sec"] for r in results)
f.write(f"Total audio: {round(total_dur / 3600, 1)} hours\n\n")
f.write("---\n\n")
for j, r in enumerate(results):
f.write(f"## Call {j+1}: {r['file']}\n\n")
f.write(f"- **Date**: {r['call_date']}\n")
f.write(f"- **Type**: {r['call_type']}\n")
f.write(f"- **Duration**: {r['duration_min']} min\n")
f.write(f"- **Caller**: {r['phone_number']}\n")
f.write(f"- **Agent**: {r['agent']}\n")
f.write(f"- **Campaign**: {r['campaign']}\n")
f.write(f"- **Status**: {r['status']}\n")
f.write(f"- **Term Reason**: {r['term_reason']}\n")
f.write(f"- **Queue Time**: {r['queue_seconds']}s\n\n")
f.write(f"### Transcript\n\n")
f.write(f"{r.get('transcript', '(no transcript)')}\n\n")
f.write("---\n\n")
# === Final Report ===
elapsed_total = time.time() - start_time
total_dur = sum(r["duration_sec"] for r in results)
total_chars = sum(r["chars"] for r in results)
log("")
log("=" * 60)
log(f"COMPLETE! {len(results)} calls transcribed ({errors} errors)")
log(f"Total audio: {round(total_dur / 3600, 1)} hours")
log(f"Total text: {total_chars:,} characters")
log(f"Processing time: {round(elapsed_total / 3600, 1)} hours")
log(f"Speed ratio: {round(total_dur / elapsed_total, 1)}x "
f"({'faster' if total_dur > elapsed_total else 'slower'} than real-time)")
log(f"Output: {OUTPUT_DIR}")
log(f"Summary: {SUMMARY_FILE}")
log(f"Combined: {COMBINED_MD}")
Usage Examples
# Transcribe last 100 inbound calls from all campaigns
python3.11 transcribe_db.py --start-date 2026-03-01 --limit 100
# Transcribe outbound calls for a specific campaign
python3.11 transcribe_db.py --call-type outbound --campaign ukcamp --limit 50
# Transcribe a specific agent's calls
python3.11 transcribe_db.py --agent 1042 --start-date 2026-03-01 --limit 200
# Transcribe with a larger model for better accuracy
python3.11 transcribe_db.py --model medium --limit 50
# Auto-detect language (useful for multilingual call centers)
python3.11 transcribe_db.py --language auto --limit 100
# Run at low priority in background
nohup nice -n 19 python3.11 transcribe_db.py \
--output-dir /root/transcriptions_march \
--start-date 2026-03-01 \
--limit 500 \
> /root/transcribe_march.log 2>&1 &
Output Formats
The scripts produce three complementary output formats.
Individual Text Files (.txt)
One file per recording with a metadata header and the full transcript:
File: 20260216-153000_447974040560-all.mp3
Date: 2026-02-16 15:30:00
Duration: 245.9s (4.1 min)
Caller: 447974040560
Agent: 1002
Campaign: doppia
Status: A
Term Reason: AGENT
Queue Time: 0.00s
Detected Language: en (0.997)
---
Call from Google. Hello, how can I help you? Hi, I called a little bit
earlier. I'm a social worker looking for an estimate for a client's
plumbing and I've got the postcode now...
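Because the header follows a fixed `Key: value` layout with a `---` separator, downstream tools can recover the metadata without re-querying the database. A minimal parser sketch for that format:

```python
def parse_transcript_file(text):
    """Split a transcript file into (metadata dict, body text).

    Assumes the 'Key: value' header lines and '---' separator
    written by the transcription scripts.
    """
    header, _, body = text.partition("---")
    meta = {}
    for line in header.splitlines():
        if ": " in line:
            key, _, value = line.partition(": ")
            meta[key.strip()] = value.strip()
    return meta, body.strip()

sample = "File: a.mp3\nAgent: 1002\n---\n\nHello, how can I help you?"
meta, body = parse_transcript_file(sample)
print(meta["Agent"], body)  # 1002 Hello, how can I help you?
```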
Summary JSON (_summary.json)
Structured metadata for programmatic access, searchable and filterable:
[
{
"file": "20260216-153000_447974040560-all.mp3",
"call_date": "2026-02-16 15:30:00",
"call_type": "inbound",
"duration_sec": 245.9,
"duration_min": 4.1,
"phone_number": "447974040560",
"agent": "1002",
"campaign": "doppia",
"status": "A",
"term_reason": "AGENT",
"queue_seconds": 0,
"detected_language": "en",
"language_probability": 0.997,
"chars": 2847,
"txt_file": "20260216-153000_447974040560-all.txt",
"transcript": "Call from Google. Hello, how can I help you?..."
}
]
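Because the summary is plain JSON, ad-hoc reporting needs no database at all. For example, total transcribed minutes per agent (field names as in the record above):

```python
import json
from collections import defaultdict

def minutes_per_agent(summary_path):
    """Sum transcribed audio minutes per agent from a _summary.json file."""
    with open(summary_path) as f:
        records = json.load(f)
    totals = defaultdict(float)
    for r in records:
        totals[r["agent"]] += r["duration_min"]
    return dict(totals)
```

Similar one-liners can filter by campaign, status, or `chars` (useful for spotting near-empty transcripts that may indicate dead air).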
SRT Subtitle Format
For cases where you need timed subtitles (e.g., video review tools, quality assurance interfaces), add SRT output to your transcription:
def save_as_srt(segments, output_path):
"""Save transcription segments in SRT subtitle format."""
with open(output_path, "w") as f:
for i, segment in enumerate(segments, 1):
start = format_timestamp_srt(segment.start)
end = format_timestamp_srt(segment.end)
f.write(f"{i}\n")
f.write(f"{start} --> {end}\n")
f.write(f"{segment.text.strip()}\n\n")
def format_timestamp_srt(seconds):
"""Convert seconds to SRT timestamp format (HH:MM:SS,mmm)."""
hours = int(seconds // 3600)
minutes = int((seconds % 3600) // 60)
secs = int(seconds % 60)
millis = int((seconds % 1) * 1000)
return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"
# Usage within transcription loop:
segments, info = model.transcribe(filepath, language="en", beam_size=5)
segments_list = list(segments) # Consume the generator once
# Save SRT
srt_path = os.path.join(OUTPUT_DIR, fname.replace(".mp3", ".srt"))
save_as_srt(segments_list, srt_path)
# Also build plain text from the consumed segments
full_text = " ".join(s.text.strip() for s in segments_list)
Example SRT output:
1
00:00:00,000 --> 00:00:04,200
Call from Google. Hello, how can I help you?
2
00:00:04,200 --> 00:00:08,800
Hi, I called a little bit earlier. I'm a social worker looking for an estimate.
3
00:00:08,800 --> 00:00:13,500
For a client's plumbing and I've got the postcode now.
Timestamped JSON
For detailed segment-level data with timing information:
def save_segments_json(segments_list, info, output_path):
"""Save detailed segment-level transcription as JSON."""
data = {
"duration": round(info.duration, 1),
"language": info.language,
"language_probability": round(info.language_probability, 3),
"segments": [
{
"id": i,
"start": round(seg.start, 2),
"end": round(seg.end, 2),
"text": seg.text.strip(),
"avg_logprob": round(seg.avg_logprob, 3),
"no_speech_prob": round(seg.no_speech_prob, 3)
}
for i, seg in enumerate(segments_list)
]
}
with open(output_path, "w") as f:
json.dump(data, f, indent=2, ensure_ascii=False)
This gives you confidence scores per segment (avg_logprob closer to 0 = more confident, no_speech_prob closer to 1 = likely silence/noise).
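Those scores make it easy to flag dubious output for manual review. A filter sketch over the segment dicts saved above (the thresholds are starting-point assumptions to tune, not values from the faster-whisper docs):

```python
def flag_suspect_segments(segments, min_logprob=-0.8, max_no_speech=0.6):
    """Return segments likely to be misrecognized or non-speech.

    min_logprob / max_no_speech are assumed starting thresholds;
    tune them against a sample of your own recordings.
    """
    return [
        s for s in segments
        if s["avg_logprob"] < min_logprob or s["no_speech_prob"] > max_no_speech
    ]

segs = [
    {"id": 0, "text": "Hello", "avg_logprob": -0.2, "no_speech_prob": 0.01},
    {"id": 1, "text": "???", "avg_logprob": -1.4, "no_speech_prob": 0.12},
]
print([s["id"] for s in flag_suspect_segments(segs)])  # [1]
```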
Integration with ViciDial
Finding Recordings by Various Criteria
Use these SQL queries to build targeted transcription jobs.
By Date Range
SELECT rl.filename, rl.start_time, rl.length_in_sec
FROM recording_log rl
WHERE rl.start_time >= '2026-03-01 00:00:00'
AND rl.start_time < '2026-03-08 00:00:00'
AND rl.length_in_sec > 30
ORDER BY rl.start_time;
By Agent (with Call Details)
SELECT cl.call_date, cl.phone_number, cl.length_in_sec,
cl.status, cl.campaign_id, rl.filename
FROM vicidial_closer_log cl
JOIN recording_log rl ON rl.vicidial_id = cl.closecallid
WHERE cl.user = '1042'
AND cl.call_date >= '2026-03-01'
AND cl.length_in_sec > 30
ORDER BY cl.call_date DESC;
By Campaign
SELECT cl.call_date, cl.phone_number, cl.user,
cl.length_in_sec, cl.status, rl.filename
FROM vicidial_closer_log cl
JOIN recording_log rl ON rl.vicidial_id = cl.closecallid
WHERE cl.campaign_id = 'doppia'
AND cl.call_date >= '2026-03-01'
AND cl.length_in_sec BETWEEN 30 AND 600
ORDER BY cl.call_date DESC
LIMIT 200;
By Call Status (e.g., Only Answered Calls)
-- Common inbound statuses:
-- A = answered (agent disposition)
-- SALE = sale made
-- NI = not interested
-- CB = callback scheduled
-- DROP = dropped (no agent answered) — typically exclude
-- TIMEOT = timeout — typically exclude
SELECT cl.call_date, cl.phone_number, cl.user, cl.status, rl.filename
FROM vicidial_closer_log cl
JOIN recording_log rl ON rl.vicidial_id = cl.closecallid
WHERE cl.call_date >= '2026-03-01'
AND cl.status IN ('A', 'SALE', 'NI', 'CB')
AND cl.length_in_sec > 30
ORDER BY cl.call_date DESC;
Outbound Calls
SELECT vl.call_date, vl.phone_number, vl.user,
vl.length_in_sec, vl.status, rl.filename
FROM vicidial_log vl
JOIN recording_log rl ON rl.vicidial_id = vl.uniqueid
WHERE vl.campaign_id = 'ukcamp'
AND vl.call_date >= '2026-03-01'
AND vl.length_in_sec > 30
AND vl.status NOT IN ('NA', 'B', 'DC', 'N', 'DROP')
ORDER BY vl.call_date DESC
LIMIT 100;
Building the File Path
ViciDial's recording_log.filename contains the base name. The actual MP3 file is:
mp3_path = f"/var/spool/asterisk/monitorDONE/MP3/{row['filename']}-all.mp3"
Always verify the file exists before adding it to the transcription queue — recordings may have been cleaned up by retention scripts.
Linking Transcripts Back to ViciDial
After transcription, you can load the JSON summary into a database table for searching:
CREATE TABLE call_transcriptions (
id INT AUTO_INCREMENT PRIMARY KEY,
recording_filename VARCHAR(255) NOT NULL,
call_date DATETIME,
call_type ENUM('inbound', 'outbound'),
phone_number VARCHAR(20),
agent VARCHAR(20),
campaign VARCHAR(50),
status VARCHAR(10),
duration_sec DECIMAL(8,1),
transcript TEXT,
chars INT,
language VARCHAR(5),
language_probability DECIMAL(4,3),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_call_date (call_date),
INDEX idx_agent (agent),
INDEX idx_campaign (campaign),
FULLTEXT INDEX idx_transcript (transcript)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
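A loader sketch for that table: it maps `_summary.json` records to parameter tuples and uses `mysql-connector-python` placeholders (the column order matches the schema above; adjust if yours differs):

```python
import json

# Columns match the call_transcriptions schema above; created_at defaults.
INSERT_SQL = """
INSERT INTO call_transcriptions
    (recording_filename, call_date, call_type, phone_number, agent,
     campaign, status, duration_sec, transcript, chars, language,
     language_probability)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
"""

def summary_to_rows(summary_path):
    """Map _summary.json records to parameter tuples for INSERT_SQL."""
    with open(summary_path) as f:
        records = json.load(f)
    return [
        (r["file"], r["call_date"], r["call_type"], r["phone_number"],
         r["agent"], r["campaign"], r["status"], r["duration_sec"],
         r["transcript"], r["chars"], r["detected_language"],
         r["language_probability"])
        for r in records
    ]

# Then, with an open mysql.connector connection `db`:
# cursor = db.cursor()
# cursor.executemany(INSERT_SQL, summary_to_rows("/root/transcriptions/_summary.json"))
# db.commit()
```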
Then search transcripts with SQL:
-- Find calls where "refund" was mentioned
SELECT call_date, agent, phone_number, duration_sec,
SUBSTRING(transcript, 1, 200) AS preview
FROM call_transcriptions
WHERE MATCH(transcript) AGAINST('refund' IN BOOLEAN MODE)
ORDER BY call_date DESC;
-- Find calls where agent said "I don't know"
SELECT call_date, agent, phone_number
FROM call_transcriptions
WHERE transcript LIKE '%I don''t know%'
AND call_date >= '2026-03-01';
Performance Tuning
Model Selection
Faster-Whisper supports all Whisper model sizes. Choose based on your accuracy/speed tradeoff:
| Model | Parameters | VRAM/RAM | Speed (CPU int8) | English Accuracy | Best For |
|---|---|---|---|---|---|
| `tiny` | 39M | ~400 MB | ~6x real-time | Basic | Quick scanning, keyword detection |
| `base` | 74M | ~500 MB | ~4x real-time | Good | Rough transcripts, high volume |
| `small` | 244M | ~1.5 GB | ~1.5-2x real-time | Very good | Production sweet spot |
| `medium` | 769M | ~3 GB | ~0.5x real-time | Excellent | High-accuracy needs |
| `large-v3` | 1.5B | ~5 GB | ~0.2x real-time | Best | Critical/legal recordings |
Recommendation: Use small for call center transcription. It handles accents well (UK, Indian, European English) and runs fast enough for large batches on CPU. Use medium only when you need higher accuracy on difficult audio (heavy accents, background noise, multiple speakers talking over each other).
CPU vs GPU
# CPU with int8 quantization (no GPU required)
model = WhisperModel("small", device="cpu", compute_type="int8")
# GPU with float16 (requires NVIDIA GPU + CUDA)
model = WhisperModel("small", device="cuda", compute_type="float16")
# GPU with int8 (fastest GPU option)
model = WhisperModel("small", device="cuda", compute_type="int8_float16")
GPU is 5-10x faster than CPU, but most ViciDial servers do not have GPUs. CPU with int8 is the practical choice for on-premises call center servers.
Beam Size
The beam_size parameter controls the search width during decoding:
# beam_size=1: Greedy decoding (fastest, least accurate)
segments, info = model.transcribe(path, language="en", beam_size=1)
# beam_size=5: Good balance (default recommendation)
segments, info = model.transcribe(path, language="en", beam_size=5)
# beam_size=10: More thorough search (slower, marginally better)
segments, info = model.transcribe(path, language="en", beam_size=10)
For call center audio, beam_size=5 is the sweet spot. Going higher than 5 gives diminishing returns and costs ~40% more processing time.
Voice Activity Detection (VAD)
Enable VAD filtering to skip silent sections (common in calls with hold time):
segments, info = model.transcribe(
filepath,
language="en",
beam_size=5,
vad_filter=True,
vad_parameters=dict(
threshold=0.5, # Speech detection sensitivity (0-1)
min_speech_duration_ms=250, # Ignore speech shorter than this
min_silence_duration_ms=500, # Split on silences longer than this
speech_pad_ms=200 # Padding around detected speech
)
)
VAD can speed up transcription by 20-40% on calls with significant silence, hold time, or ringing.
Language Detection vs Forcing
# Force English (fastest — skips 30-second detection scan)
segments, info = model.transcribe(filepath, language="en")
# Auto-detect language (adds ~2-5 seconds per file)
segments, info = model.transcribe(filepath)
# info.language returns detected code ("en", "it", "es", etc.)
# info.language_probability returns confidence (0.0 to 1.0)
If your call center handles a single language, always force it. For multilingual environments, auto-detection works well but adds overhead.
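In a mixed environment you can also auto-detect and flag suspicious files for manual review. A minimal sketch — the 0.7 confidence threshold and the review workflow are assumptions, and `info` stands in for the second value returned by `model.transcribe()`:

```python
from types import SimpleNamespace

def check_language(info, expected="en", min_prob=0.7):
    """Flag a transcription whose detected language or confidence looks off.

    `info` is the info object returned by model.transcribe(); the 0.7
    threshold is an arbitrary starting point, not a library default.
    """
    if info.language != expected or info.language_probability < min_prob:
        return f"review: detected {info.language} (p={info.language_probability:.2f})"
    return None

# Usage with a stubbed result object:
print(check_language(SimpleNamespace(language="it", language_probability=0.93)))
```

Files that come back flagged can be queued for a second pass without the `language` parameter, or for human review.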
Parallel Processing
For servers with many CPU cores, you can run multiple transcription processes:
# Split file list into chunks
split -n l/4 recording_list.txt chunk_
# Run 4 parallel transcription processes
for chunk in chunk_*; do
    nice -n 19 python3.11 transcribe_batch.py --input "$chunk" \
        --output-dir "/root/transcriptions_$(basename "$chunk")" &
done
wait
echo "All chunks complete"
Note: Each process loads its own copy of the model (~1.5 GB for small), so 4 processes need ~6 GB RAM. Monitor with htop.
Memory Optimization
For very large batches on memory-constrained servers:
import gc
from faster_whisper import WhisperModel

# Process in chunks — reload the model per chunk and release it afterwards
for chunk_start in range(0, len(files), 100):
    chunk = files[chunk_start:chunk_start + 100]
    model = WhisperModel("small", device="cpu", compute_type="int8")
    for filepath in chunk:
        # ... transcribe ...
        pass
    del model
    gc.collect()  # Force garbage collection between chunks
Production Deployment
Cron-Based Nightly Transcription
Set up automatic transcription of the previous day's calls:
#!/bin/bash
# /root/scripts/nightly_transcribe.sh
# Transcribe yesterday's inbound calls
# Runs via cron at 02:00 when call volume is lowest
YESTERDAY=$(date -d 'yesterday' +%Y-%m-%d)
OUTPUT_DIR="/root/transcriptions/${YESTERDAY}"
nice -n 19 /usr/local/bin/python3.11 /root/scripts/transcribe_db.py \
--start-date "$YESTERDAY" \
--end-date "$YESTERDAY" \
--call-type inbound \
--min-duration 30 \
--max-duration 600 \
--limit 1000 \
--output-dir "$OUTPUT_DIR" \
>> /var/log/transcribe.log 2>&1
# Report completion
TOTAL=$(jq length "${OUTPUT_DIR}/_summary.json" 2>/dev/null || echo 0)
echo "[$(date)] Transcribed ${TOTAL} calls from ${YESTERDAY}" >> /var/log/transcribe.log
Add to crontab:
0 2 * * * /root/scripts/nightly_transcribe.sh
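Because the nightly job appends to /var/log/transcribe.log indefinitely, a logrotate rule keeps it bounded. A sketch — the weekly schedule and 8-rotation retention are arbitrary choices:

```
# /etc/logrotate.d/transcribe
/var/log/transcribe.log {
    weekly
    rotate 8
    compress
    missingok
    notifempty
}
```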
Monitoring Transcription Jobs
# Check if a transcription is running
pgrep -af transcribe
# Check progress from the summary file
python3.11 -c "
import json
d = json.load(open('/root/transcriptions/_summary.json'))
total_sec = sum(r['duration_sec'] for r in d)
print(f'Transcribed: {len(d)} calls')
print(f'Total audio: {round(total_sec/3600, 1)} hours')
print(f'Total text: {sum(r[\"chars\"] for r in d):,} characters')
"
# Check the log
tail -20 /root/transcriptions/_transcribe.log
Disk Space Management
Transcription output is relatively small compared to audio:
- 500 calls, 15 hours of audio produces roughly:
- ~3 MB of text files
- ~5 MB JSON summary (with transcripts embedded)
- ~2 MB combined markdown
- Total: ~10 MB vs ~2 GB of MP3 audio
Set up cleanup for old transcriptions:
# Keep transcriptions for 90 days
find /root/transcriptions/ -name "*.txt" -mtime +90 -delete
find /root/transcriptions/ -name "_summary.json" -mtime +90 -delete
Speaker Considerations
Whisper transcribes all audio as a single stream — it does not perform speaker diarization (identifying who said what). For call center recordings, this means the agent and caller text is mixed together.
Strategies for handling this:
- Rely on context clues — In practice, conversational flow makes it clear who is speaking. Phrases like "How can I help you?" are clearly the agent, while "I'm calling about..." is the caller.
- Use separate channel recordings — If ViciDial records agent and caller on separate channels (for example via MixMonitor's `r` and `t` options), transcribe each channel individually and merge with speaker labels.
- Post-process with an LLM — Feed the raw transcript to a language model with a prompt like: "This is a call center transcript. Label each sentence as AGENT or CALLER based on context."
- Use a dedicated diarization tool — Libraries like `pyannote-audio` can identify speakers before transcription:
# Speaker diarization (requires separate installation and a Hugging Face
# access token, since the pretrained pipeline is gated)
# pip install pyannote.audio
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
diarization = pipeline(audio_path)
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"[{turn.start:.1f}s - {turn.end:.1f}s] {speaker}")
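Diarization turns can then be merged with Faster-Whisper's timestamped segments by time overlap. A naive sketch — the `(start, end, text)` tuples are a simplification; real code would read `segment.start`/`segment.end` from Faster-Whisper and the turn objects from pyannote directly:

```python
def label_segments(segments, turns):
    """Assign each (start, end, text) transcript segment the speaker whose
    diarization turn overlaps it the most; "UNKNOWN" if nothing overlaps."""
    labeled = []
    for seg_start, seg_end, text in segments:
        best, best_overlap = "UNKNOWN", 0.0
        for turn_start, turn_end, speaker in turns:
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        labeled.append((best, text))
    return labeled

# Example with hand-made timestamps:
segments = [(0.0, 2.0, "How can I help you?"),
            (2.0, 5.0, "I'm calling about my order.")]
turns = [(0.0, 2.1, "SPEAKER_00"), (2.1, 5.0, "SPEAKER_01")]
for speaker, text in label_segments(segments, turns):
    print(f"{speaker}: {text}")
```

This greedy overlap heuristic ignores cross-talk (both speakers inside one segment); for call audio with heavy overlap, the separate-channel approach above is more reliable.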
Troubleshooting
Common Issues
"Model not found" error:
The model files are downloaded on first use. Ensure internet access is available,
or pre-download:
python3.11 -c "from faster_whisper import WhisperModel; WhisperModel('small')"
Models are cached at: ~/.cache/huggingface/hub/
Out of memory (OOM):
Use a smaller model or reduce batch processing parallelism.
Model RAM requirements: tiny ~400MB, base ~500MB, small ~1.5GB, medium ~3GB
Monitor with: watch -n 5 free -h
"Could not load audio" error:
Ensure FFmpeg is installed: ffmpeg -version
Test the file: ffmpeg -i /path/to/file.mp3 -f null -
If corrupt, the file may have been truncated during recording or conversion.
Very slow transcription:
Check CPU load — other processes may be competing.
Use 'nice -n 19' to lower priority.
Verify compute_type="int8" is set (not "float32").
Check model size — "medium" is 3-4x slower than "small".
Poor accuracy on accented English:
Switch from "tiny" or "base" to "small" or "medium" model.
The "small" model handles UK, Indian, and European accents well.
Forcing language="en" can sometimes hurt if speakers mix languages —
try removing the language parameter to let Whisper auto-detect.
Empty or garbled transcriptions:
Check audio quality: ffmpeg -i file.mp3 -af volumedetect -f null -
Very quiet recordings (below -40dB) or recordings with only
DTMF/hold music will produce poor results.
Enable VAD filtering to skip silence: vad_filter=True
Verifying Output Quality
Spot-check a sample of transcriptions against the actual recordings:
# Pick 5 random transcripts and listen to the originals
python3.11 -c "
import json, random
d = json.load(open('/root/transcriptions/_summary.json'))
sample = random.sample(d, min(5, len(d)))
for r in sample:
print(f'{r[\"file\"]} ({r[\"duration_min\"]}min): {r[\"transcript\"][:100]}...')
"
Compare the first 30 seconds of text against what you hear. The small model typically achieves 90-95% word accuracy on clear call center audio.
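To put a number on a spot-check, hand-correct one transcript and compute word error rate (WER) against the model's output. A standard word-level Levenshtein implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("thanks for calling how can i help you",
          "thanks for calling how can help you"))  # one dropped word -> 0.125
```

A WER of 0.05-0.10 corresponds to the 90-95% word accuracy figure above.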
Summary
| Component | Value |
|---|---|
| Model | faster-whisper 1.2.x with CTranslate2 |
| Recommended model size | small (244M parameters) |
| Compute | CPU with int8 quantization |
| Speed | ~1.5-2x real-time on modern server CPU |
| Accuracy | 90-95% on clear English call audio |
| RAM usage | ~1.5 GB (small model) |
| Cost | Zero (fully self-hosted, no API fees) |
| Resume support | Yes (crash-safe, saves after each file) |
| Output formats | Plain text, JSON, SRT, Markdown |
This system processes hundreds of calls per night on standard server hardware, turning audio into searchable, analyzable text — ready for quality assurance, compliance checks, agent coaching, or AI pipeline input — all without sending a single recording to an external API.