Tutorial 24: Call Recording Transcription with Faster-Whisper

Batch transcription of ViciDial call recordings using Faster-Whisper (OpenAI Whisper optimized with CTranslate2) for speech-to-text at scale — on CPU, without cloud APIs.


Table of Contents

  1. Introduction
  2. Architecture Overview
  3. Prerequisites
  4. Installation
  5. Understanding ViciDial Recording Files
  6. Single-File Transcription
  7. Batch Transcription Script
  8. Database-Driven Batch Transcription
  9. Output Formats
  10. Integration with ViciDial
  11. Performance Tuning
  12. Production Deployment
  13. Troubleshooting

Introduction

Call centers generate thousands of recordings daily — far more than anyone can review manually. Automatic transcription turns that audio into text you can search, audit, and analyze at scale.

OpenAI's Whisper is among the strongest open-source speech recognition models available. Faster-Whisper is a reimplementation built on CTranslate2 that runs up to 4x faster than the original while using less memory — making it practical to transcribe thousands of calls on a standard server CPU without any GPU or cloud API costs.

What we built: A system that transcribes 500 inbound call recordings (15+ hours of audio) entirely on CPU, producing individual text files, structured JSON metadata, and a combined markdown document — with resume support so it can survive interruptions.


Architecture Overview

ViciDial Server
├── /var/spool/asterisk/monitorDONE/     ← Raw WAV recordings
│   └── MP3/                              ← Converted MP3 recordings
│       └── YYYYMMDD-HHMMSS_phone-all.mp3
│
├── MariaDB (asterisk database)
│   ├── recording_log          ← filename, vicidial_id, location, start_time
│   ├── vicidial_closer_log    ← inbound call metadata (agent, status, duration)
│   └── vicidial_log           ← outbound call metadata
│
├── Python 3.11 + faster-whisper
│   ├── transcribe_single.py   ← One-off transcription
│   ├── transcribe_batch.py    ← Batch from file list
│   └── transcribe_db.py       ← Database-driven with metadata
│
└── Output
    ├── individual .txt files   ← One per recording
    ├── _summary.json           ← Structured metadata + stats
    └── _all_transcriptions.md  ← Combined readable document

Processing flow:

  1. Query the ViciDial database to select recordings by date range, campaign, agent, or call status
  2. Verify the corresponding MP3/WAV files exist on disk
  3. Load the Faster-Whisper model once (the expensive step)
  4. Transcribe each file sequentially, saving results after every file (crash-safe)
  5. Produce individual text files, a JSON summary, and optionally a combined document

Prerequisites

| Component | Version | Purpose |
| --- | --- | --- |
| Python | 3.11+ | Runtime (3.11 recommended for performance) |
| faster-whisper | 1.2.x | Whisper inference engine |
| CTranslate2 | 4.x | Optimized transformer inference |
| FFmpeg | 4.x+ | Audio format conversion |
| MariaDB client | Any | Database queries for recording metadata |
| Server RAM | 4GB+ minimum | Model loading (small model ~2GB) |
| Disk space | ~5GB | Model cache + transcription output |

Hardware reality check: The "small" model on CPU with int8 quantization transcribes roughly 1 minute of audio in 15-25 seconds on a modern server (8+ cores). A 500-call batch of 15 hours of audio takes approximately 4-6 hours of wall time. This is practical for nightly batch jobs.
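The arithmetic behind that estimate is worth a quick sanity check. A small helper (our own, not part of the scripts below) makes it explicit:

```python
def batch_hours(audio_hours, sec_per_audio_min):
    """Wall-clock hours to transcribe `audio_hours` of audio when each
    minute of audio takes `sec_per_audio_min` seconds to process."""
    return audio_hours * 60 * sec_per_audio_min / 3600.0

# 15 hours of audio at 15-25 seconds per audio minute:
low = batch_hours(15, 15)    # 3.75 hours
high = batch_hours(15, 25)   # 6.25 hours
```

So the 4-6 hour figure for a 500-call, 15-hour batch is consistent with the per-minute throughput.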


Installation

Step 1: Install Python 3.11

If your server runs CentOS 7 or an older distribution, Python 3.11 may not be in the default repositories.

# Debian/Ubuntu
apt update && apt install -y python3.11 python3.11-venv python3.11-dev

# CentOS 7 / RHEL 7 (from source)
yum install -y gcc openssl-devel bzip2-devel libffi-devel zlib-devel xz-devel
cd /usr/src
wget https://www.python.org/ftp/python/3.11.8/Python-3.11.8.tgz
tar xzf Python-3.11.8.tgz
cd Python-3.11.8
./configure --enable-optimizations
make altinstall  # 'altinstall' avoids overwriting system python

# openSUSE
zypper install python311 python311-devel

Step 2: Install FFmpeg

Faster-Whisper uses FFmpeg internally (via the av library) to decode audio files.

# Debian/Ubuntu
apt install -y ffmpeg

# CentOS 7
yum install -y epel-release
yum install -y ffmpeg ffmpeg-devel

# openSUSE
zypper install ffmpeg

Step 3: Install Faster-Whisper

pip3.11 install faster-whisper

This pulls in the key dependencies, including ctranslate2 (the inference engine), av (FFmpeg bindings for audio decoding), tokenizers, huggingface_hub (model downloads), and onnxruntime (used by the VAD filter).

Step 4: Install MySQL/MariaDB Client Library

Required only if you plan to query ViciDial's database for recording metadata.

pip3.11 install mysql-connector-python

Step 5: Verify Installation

python3.11 -c "
from faster_whisper import WhisperModel
print('faster-whisper imported successfully')
import ctranslate2
print(f'CTranslate2 version: {ctranslate2.__version__}')
"

Step 6: Pre-Download the Model

The first transcription triggers a model download (~1GB for "small"). Pre-download it to avoid delays:

python3.11 -c "
from faster_whisper import WhisperModel
print('Downloading small model...')
model = WhisperModel('small', device='cpu', compute_type='int8')
print('Model ready.')
"

Models are cached in ~/.cache/huggingface/hub/.
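If you script deployments, you can check for a cached copy before triggering a download. A sketch, assuming the default cache location and the Systran repository naming that faster-whisper uses for its converted models:

```python
import os

def model_is_cached(size, cache_dir="~/.cache/huggingface/hub"):
    """True if the converted Whisper model already sits in the local HF cache."""
    # Repo directory naming assumption: models--Systran--faster-whisper-<size>
    repo = f"models--Systran--faster-whisper-{size}"
    return os.path.isdir(os.path.join(os.path.expanduser(cache_dir), repo))
```

Call `model_is_cached("small")` in a provisioning script and run the pre-download step above only when it returns False.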


Understanding ViciDial Recording Files

File Naming Convention

ViciDial recordings follow this naming pattern (depending on version and configuration, the timestamp may appear with or without the dash, as the examples below show):

YYYYMMDD-HHMMSS_CALLERID_DIALEDNUMBER-all.ext

Examples:

20260313094154_1004_88447787563580-all.wav     ← Raw WAV
20260130130931_process_447418315755-all.mp3     ← Converted MP3
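When you need basic metadata without a database round-trip, the parts can be pulled straight from the filename. A sketch assuming the pattern above — `parse_recording_name` is our helper, not a ViciDial utility, and the regex only tolerates the timestamp-with-or-without-dash variants shown here:

```python
import re
from datetime import datetime

# Matches TIMESTAMP_CALLERID_DIALEDNUMBER-all.(wav|mp3)
FNAME_RE = re.compile(
    r"^(?P<ts>\d{8}-?\d{6})"      # timestamp, dash optional
    r"_(?P<caller>[^_]+)"          # caller ID / channel tag
    r"_(?P<dialed>\d+)"            # dialed number
    r"-all\.(?P<ext>wav|mp3)$"
)

def parse_recording_name(fname):
    """Split a recording filename into its parts, or None if it doesn't match."""
    m = FNAME_RE.match(fname)
    if not m:
        return None
    ts = m.group("ts").replace("-", "")
    return {
        "start_time": datetime.strptime(ts, "%Y%m%d%H%M%S"),
        "caller_id": m.group("caller"),
        "dialed_number": m.group("dialed"),
        "ext": m.group("ext"),
    }
```

Adjust the regex if your server's naming scheme differs (e.g. recordings with only two underscore-separated fields).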

File Locations

/var/spool/asterisk/monitorDONE/          ← Raw WAV files
/var/spool/asterisk/monitorDONE/MP3/      ← Converted MP3 files
/var/spool/asterisk/monitorDONE/ORIG/     ← Original pre-mix files (optional)
/var/spool/asterisk/monitorDONE/FTP/      ← Files pending FTP transfer

Use MP3 files for transcription — they are smaller (faster to read) and Faster-Whisper handles them natively. WAV files work too but are 5-10x larger.

Database Tables

The recording_log table links recordings to call metadata:

-- Find a recording by filename
SELECT recording_id, filename, location, start_time, length_in_sec, vicidial_id
FROM recording_log
WHERE filename LIKE '%447418315755%';

-- The vicidial_id links to either:
--   vicidial_closer_log.closecallid (inbound calls)
--   vicidial_log.uniqueid (outbound calls)

Single-File Transcription

Start with the simplest case: transcribe one recording file.

Basic Script

#!/usr/bin/env python3.11
"""Transcribe a single call recording."""
import sys
from faster_whisper import WhisperModel

if len(sys.argv) < 2:
    print("Usage: python3.11 transcribe_single.py <audio_file>")
    sys.exit(1)

audio_file = sys.argv[1]

# Load model — 'small' is the sweet spot for call center audio
# CPU + int8 quantization keeps memory usage around 1-2 GB
print("Loading model...", flush=True)
model = WhisperModel("small", device="cpu", compute_type="int8")

# Transcribe with English language forced (skip detection overhead)
print(f"Transcribing: {audio_file}", flush=True)
segments, info = model.transcribe(audio_file, language="en", beam_size=5)

print(f"Duration: {round(info.duration)}s")
print(f"Language confidence: {round(info.language_probability, 3)}")
print(f"---")

for segment in segments:
    print(f"[{segment.start:.1f}s -> {segment.end:.1f}s] {segment.text.strip()}")

Run it:

python3.11 transcribe_single.py /var/spool/asterisk/monitorDONE/MP3/20260130130931_process_447418315755-all.mp3

Output:

Loading model...
Transcribing: /var/spool/asterisk/monitorDONE/MP3/20260130130931_process_447418315755-all.mp3
Duration: 246s
Language confidence: 0.997
---
[0.0s -> 4.2s] Call from Google. Hello, how can I help you?
[4.2s -> 8.8s] Hi, I called a little bit earlier. I'm a social worker looking for an estimate...
[8.8s -> 13.5s] ...for a client's plumbing and I've got the postcode now.
...

Key Parameters Explained

model = WhisperModel(
    "small",           # Model size (see Performance Tuning section)
    device="cpu",      # "cpu" or "cuda" (GPU)
    compute_type="int8"  # Quantization: "int8" (fastest CPU), "float16" (GPU), "float32" (most accurate)
)

segments, info = model.transcribe(
    audio_file,
    language="en",    # Force English (faster than auto-detect)
    beam_size=5,      # Search width — 5 is good balance of speed/accuracy
    vad_filter=True,  # Optional: skip silence (faster for calls with holds)
    vad_parameters=dict(
        min_silence_duration_ms=500  # Minimum silence to split on
    )
)

Batch Transcription Script

For processing multiple files from a list, with resume support and progress tracking.

transcribe_batch.py

#!/usr/bin/env python3.11
"""
Batch transcribe call recordings from a file list.
Resume-safe: skips files that have already been transcribed.
Saves individual .txt files + a summary JSON.
"""
import sys
import os
import json
import time
from faster_whisper import WhisperModel

# === Configuration ===
INPUT_LIST = "/path/to/recording_list.txt"    # One file path per line
OUTPUT_DIR = "/path/to/transcriptions"         # Output directory
SUMMARY_FILE = os.path.join(OUTPUT_DIR, "_summary.json")

# === Setup ===
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Read the list of files to transcribe
with open(INPUT_LIST) as f:
    files = [line.strip() for line in f if line.strip()]

print(f"Found {len(files)} files in list", flush=True)

# Check for already-completed files (resume support)
already_done = set()
existing_results = []
if os.path.exists(SUMMARY_FILE):
    with open(SUMMARY_FILE) as f:
        existing_results = json.load(f)
    already_done = {r["file"] for r in existing_results}
    print(f"Resuming: {len(already_done)} already done, "
          f"{len(files) - len(already_done)} remaining", flush=True)

remaining = [fp for fp in files if os.path.basename(fp) not in already_done]

if not remaining:
    print("All files already transcribed!", flush=True)
    sys.exit(0)

# === Load Model ===
print("Loading whisper model (small, CPU, int8)...", flush=True)
model = WhisperModel("small", device="cpu", compute_type="int8")
print("Model loaded.", flush=True)

# === Transcribe ===
start_time = time.time()
results = list(existing_results)  # Carry forward previous results

for i, filepath in enumerate(remaining):
    if not os.path.exists(filepath):
        print(f"[{i+1}/{len(remaining)}] SKIP (not found): {filepath}", flush=True)
        continue

    fname = os.path.basename(filepath)
    # Create output .txt filename (replace audio extension with .txt)
    txt_name = fname.rsplit(".", 1)[0] + ".txt"
    txt_path = os.path.join(OUTPUT_DIR, txt_name)

    print(f"[{i+1}/{len(remaining)}] Transcribing: {fname}...", flush=True)

    try:
        segments, info = model.transcribe(filepath, language="en", beam_size=5)

        # Collect all segment text
        text_parts = []
        for segment in segments:
            text_parts.append(segment.text.strip())

        full_text = " ".join(text_parts)
        duration = round(info.duration, 1)

        # Save individual transcript file
        with open(txt_path, "w") as f:
            f.write(f"File: {fname}\n")
            f.write(f"Duration: {duration}s ({round(duration / 60, 1)} min)\n")
            f.write(f"Characters: {len(full_text)}\n")
            f.write(f"---\n\n")
            f.write(full_text)

        # Add to results
        results.append({
            "file": fname,
            "duration_sec": duration,
            "duration_min": round(duration / 60, 1),
            "language_prob": round(info.language_probability, 3),
            "chars": len(full_text),
            "txt_file": txt_name
        })

        # Save summary after EVERY file (crash-safe)
        with open(SUMMARY_FILE, "w") as f:
            json.dump(results, f, indent=2, ensure_ascii=False)

        # Progress and ETA calculation
        elapsed = time.time() - start_time
        avg_per_file = elapsed / (i + 1)
        eta_seconds = avg_per_file * (len(remaining) - i - 1)
        print(f"  Done ({round(duration)}s audio, {len(full_text)} chars) "
              f"| ETA: {round(eta_seconds / 60)}m remaining", flush=True)

    except Exception as e:
        print(f"  ERROR: {e}", flush=True)

# === Final Report ===
total_duration = sum(r["duration_sec"] for r in results)
total_chars = sum(r["chars"] for r in results)
elapsed = time.time() - start_time

print(f"\n{'=' * 60}", flush=True)
print(f"DONE! {len(results)} files transcribed", flush=True)
print(f"Total audio: {round(total_duration / 3600, 1)} hours", flush=True)
print(f"Total text: {total_chars:,} characters", flush=True)
print(f"Processing time: {round(elapsed / 3600, 1)} hours", flush=True)
print(f"Output: {OUTPUT_DIR}", flush=True)

Creating the File List

Generate a list of recordings to transcribe:

# All MP3 files from the last 7 days
find /var/spool/asterisk/monitorDONE/MP3/ -name "*.mp3" -mtime -7 > /root/recent_recordings.txt

# All recordings from a specific date
ls /var/spool/asterisk/monitorDONE/MP3/20260301* > /root/march1_recordings.txt

# Count files
wc -l /root/recent_recordings.txt

Running the Batch

# Run at low CPU priority so it doesn't affect call processing
nice -n 19 python3.11 transcribe_batch.py

# Or run in the background with nohup
nohup nice -n 19 python3.11 transcribe_batch.py > /root/transcribe.log 2>&1 &

# Monitor progress
tail -f /root/transcribe.log

Database-Driven Batch Transcription

The most powerful approach: query ViciDial's database to select exactly the calls you want, then transcribe them with full metadata attached to each transcript.

transcribe_db.py

#!/usr/bin/env python3.11
"""
Database-driven batch transcription.
Queries ViciDial to select calls by date/campaign/agent/status,
verifies MP3 files exist, transcribes with full metadata.
Resume-safe. Produces .txt files + JSON summary + combined markdown.
"""
import sys
import os
import json
import time
import argparse
import mysql.connector
from faster_whisper import WhisperModel

# === Configuration ===
DB_HOST = "127.0.0.1"
DB_USER = "your_db_user"           # Use a read-only user
DB_PASS = "your_db_password"
DB_NAME = "asterisk"
MP3_BASE = "/var/spool/asterisk/monitorDONE/MP3"

# === Argument Parsing ===
parser = argparse.ArgumentParser(description="Transcribe ViciDial call recordings")
parser.add_argument("--output-dir", default="/root/transcriptions",
                    help="Output directory for transcripts")
parser.add_argument("--start-date", default="2026-01-01",
                    help="Start date (YYYY-MM-DD)")
parser.add_argument("--end-date", default=None,
                    help="End date (YYYY-MM-DD), defaults to now")
parser.add_argument("--campaign", default=None,
                    help="Filter by campaign ID")
parser.add_argument("--agent", default=None,
                    help="Filter by agent user ID")
parser.add_argument("--min-duration", type=int, default=30,
                    help="Minimum call duration in seconds (default: 30)")
parser.add_argument("--max-duration", type=int, default=600,
                    help="Maximum call duration in seconds (default: 600)")
parser.add_argument("--limit", type=int, default=100,
                    help="Maximum number of calls to transcribe")
parser.add_argument("--call-type", choices=["inbound", "outbound", "both"],
                    default="inbound", help="Call type to transcribe")
parser.add_argument("--model", default="small",
                    help="Whisper model size (tiny/base/small/medium/large-v3)")
parser.add_argument("--language", default="en",
                    help="Language code (en, it, es, etc.) or 'auto' for detection")
args = parser.parse_args()

OUTPUT_DIR = args.output_dir
SUMMARY_FILE = os.path.join(OUTPUT_DIR, "_summary.json")
COMBINED_MD = os.path.join(OUTPUT_DIR, "_all_transcriptions.md")
LOG_FILE = os.path.join(OUTPUT_DIR, "_transcribe.log")

os.makedirs(OUTPUT_DIR, exist_ok=True)


def log(msg):
    """Print and log a timestamped message."""
    line = f"[{time.strftime('%Y-%m-%d %H:%M:%S')}] {msg}"
    print(line, flush=True)
    with open(LOG_FILE, "a") as f:
        f.write(line + "\n")


# === Step 1: Query Database ===
log("Connecting to ViciDial database...")
db = mysql.connector.connect(
    host=DB_HOST, user=DB_USER, password=DB_PASS, database=DB_NAME
)
cursor = db.cursor(dictionary=True)

if args.call_type in ("inbound", "both"):
    # Build inbound query
    query = """
        SELECT cl.closecallid AS call_id,
               'inbound' AS call_type,
               cl.call_date,
               cl.length_in_sec,
               cl.status,
               cl.phone_number,
               cl.campaign_id,
               cl.queue_seconds,
               cl.user AS agent,
               cl.term_reason,
               rl.filename,
               rl.location
        FROM vicidial_closer_log cl
        JOIN recording_log rl ON rl.vicidial_id = cl.closecallid
        WHERE cl.call_date >= %s
          AND cl.length_in_sec >= %s
          AND cl.length_in_sec <= %s
          AND cl.status NOT IN ('DROP', 'TIMEOT', 'NANQUE', 'AFTHRS', 'QVMAIL')
    """
    params = [args.start_date, args.min_duration, args.max_duration]

    if args.end_date:
        query += " AND cl.call_date <= %s"
        params.append(args.end_date + " 23:59:59")
    if args.campaign:
        query += " AND cl.campaign_id = %s"
        params.append(args.campaign)
    if args.agent:
        query += " AND cl.user = %s"
        params.append(args.agent)

    query += " ORDER BY cl.call_date DESC LIMIT %s"
    params.append(args.limit)

    cursor.execute(query, params)
    calls = cursor.fetchall()
else:
    calls = []

# A separate "if" (not elif) so --call-type both also collects outbound calls
if args.call_type in ("outbound", "both"):
    query = """
        SELECT vl.uniqueid AS call_id,
               'outbound' AS call_type,
               vl.call_date,
               vl.length_in_sec,
               vl.status,
               vl.phone_number,
               vl.campaign_id,
               0 AS queue_seconds,
               vl.user AS agent,
               vl.term_reason,
               rl.filename,
               rl.location
        FROM vicidial_log vl
        JOIN recording_log rl ON rl.vicidial_id = vl.uniqueid
        WHERE vl.call_date >= %s
          AND vl.length_in_sec >= %s
          AND vl.length_in_sec <= %s
          AND vl.status NOT IN ('DROP', 'NA', 'B', 'DC', 'N')
    """
    params = [args.start_date, args.min_duration, args.max_duration]

    if args.end_date:
        query += " AND vl.call_date <= %s"
        params.append(args.end_date + " 23:59:59")
    if args.campaign:
        query += " AND vl.campaign_id = %s"
        params.append(args.campaign)
    if args.agent:
        query += " AND vl.user = %s"
        params.append(args.agent)

    query += " ORDER BY vl.call_date DESC LIMIT %s"
    params.append(args.limit)

    cursor.execute(query, params)
    calls.extend(cursor.fetchall())  # extend, so both call types are kept

cursor.close()
db.close()
log(f"Query returned {len(calls)} calls")

# === Step 2: Verify MP3 Files Exist ===
verified = []
missing = 0
for c in calls:
    mp3_path = f"{MP3_BASE}/{c['filename']}-all.mp3"
    if os.path.exists(mp3_path):
        c["mp3_path"] = mp3_path
        verified.append(c)
    else:
        missing += 1

log(f"Verified {len(verified)} MP3 files ({missing} missing)")

if not verified:
    log("No recordings found. Exiting.")
    sys.exit(1)

# === Step 3: Check Resume State ===
already_done = {}
existing_results = []
if os.path.exists(SUMMARY_FILE):
    with open(SUMMARY_FILE) as f:
        existing_results = json.load(f)
    already_done = {r["file"]: r for r in existing_results}
    log(f"Resuming: {len(already_done)} done, "
        f"{len(verified) - len(already_done)} remaining")

remaining = [c for c in verified
             if os.path.basename(c["mp3_path"]) not in already_done]

if not remaining:
    log("All calls already transcribed!")
    sys.exit(0)

# === Step 4: Load Whisper Model ===
lang_str = args.language if args.language != "auto" else "auto-detect"
log(f"Loading faster-whisper model ({args.model}, CPU, int8, lang={lang_str})...")
model = WhisperModel(args.model, device="cpu", compute_type="int8")
log("Model loaded. Starting transcription...")

# === Step 5: Transcribe ===
start_time = time.time()
results = list(existing_results)
errors = 0

for i, call in enumerate(remaining):
    mp3_path = call["mp3_path"]
    fname = os.path.basename(mp3_path)
    txt_name = fname.replace(".mp3", ".txt")
    txt_path = os.path.join(OUTPUT_DIR, txt_name)

    log(f"[{i+1}/{len(remaining)}] {fname} "
        f"({call['length_in_sec']}s, status={call['status']})...")

    try:
        # Transcribe — force language or auto-detect
        transcribe_kwargs = {"beam_size": 5}
        if args.language != "auto":
            transcribe_kwargs["language"] = args.language

        segments, info = model.transcribe(mp3_path, **transcribe_kwargs)

        text_parts = []
        for segment in segments:
            text_parts.append(segment.text.strip())

        full_text = " ".join(text_parts)
        duration = round(info.duration, 1)

        # Save individual transcript with metadata header
        with open(txt_path, "w") as f:
            f.write(f"File: {fname}\n")
            f.write(f"Date: {call['call_date']}\n")
            f.write(f"Duration: {duration}s ({round(duration / 60, 1)} min)\n")
            f.write(f"Caller: {call['phone_number']}\n")
            f.write(f"Agent: {call['agent']}\n")
            f.write(f"Campaign: {call['campaign_id']}\n")
            f.write(f"Status: {call['status']}\n")
            f.write(f"Term Reason: {call['term_reason']}\n")
            f.write(f"Queue Time: {call['queue_seconds']}s\n")
            f.write(f"Detected Language: {info.language} "
                    f"({round(info.language_probability, 3)})\n")
            f.write(f"---\n\n")
            f.write(full_text)

        record = {
            "file": fname,
            "call_date": str(call["call_date"]),
            "call_type": call["call_type"],
            "duration_sec": duration,
            "duration_min": round(duration / 60, 1),
            "phone_number": str(call["phone_number"]),
            "agent": str(call["agent"]),
            "campaign": str(call["campaign_id"]),
            "status": str(call["status"]),
            "term_reason": str(call["term_reason"]),
            "queue_seconds": int(call["queue_seconds"]) if call["queue_seconds"] else 0,
            "detected_language": info.language,
            "language_probability": round(info.language_probability, 3),
            "chars": len(full_text),
            "txt_file": txt_name,
            "transcript": full_text
        }
        results.append(record)

        # Crash-safe: save after every file
        with open(SUMMARY_FILE, "w") as f:
            json.dump(results, f, indent=2, ensure_ascii=False, default=str)

        elapsed = time.time() - start_time
        avg = elapsed / (i + 1)
        eta = avg * (len(remaining) - i - 1)
        log(f"  OK ({round(duration)}s audio, {len(full_text)} chars) "
            f"| ETA: {round(eta / 60)}m")

    except Exception as e:
        errors += 1
        log(f"  ERROR: {e}")

# === Step 6: Build Combined Markdown ===
log("Building combined markdown file...")
with open(COMBINED_MD, "w") as f:
    f.write(f"# Call Transcriptions\n\n")
    f.write(f"Generated: {time.strftime('%Y-%m-%d %H:%M:%S')}\n")
    f.write(f"Date range: {args.start_date} to {args.end_date or 'present'}\n")
    if args.campaign:
        f.write(f"Campaign: {args.campaign}\n")
    if args.agent:
        f.write(f"Agent: {args.agent}\n")
    f.write(f"Total calls: {len(results)}\n")
    total_dur = sum(r["duration_sec"] for r in results)
    f.write(f"Total audio: {round(total_dur / 3600, 1)} hours\n\n")
    f.write("---\n\n")

    for j, r in enumerate(results):
        f.write(f"## Call {j+1}: {r['file']}\n\n")
        f.write(f"- **Date**: {r['call_date']}\n")
        f.write(f"- **Type**: {r['call_type']}\n")
        f.write(f"- **Duration**: {r['duration_min']} min\n")
        f.write(f"- **Caller**: {r['phone_number']}\n")
        f.write(f"- **Agent**: {r['agent']}\n")
        f.write(f"- **Campaign**: {r['campaign']}\n")
        f.write(f"- **Status**: {r['status']}\n")
        f.write(f"- **Term Reason**: {r['term_reason']}\n")
        f.write(f"- **Queue Time**: {r['queue_seconds']}s\n\n")
        f.write(f"### Transcript\n\n")
        f.write(f"{r.get('transcript', '(no transcript)')}\n\n")
        f.write("---\n\n")

# === Final Report ===
elapsed_total = time.time() - start_time
total_dur = sum(r["duration_sec"] for r in results)
total_chars = sum(r["chars"] for r in results)
log("")
log("=" * 60)
log(f"COMPLETE! {len(results)} calls transcribed ({errors} errors)")
log(f"Total audio: {round(total_dur / 3600, 1)} hours")
log(f"Total text: {total_chars:,} characters")
log(f"Processing time: {round(elapsed_total / 3600, 1)} hours")
log(f"Speed ratio: {round(total_dur / elapsed_total, 1)}x "
    f"({'faster' if total_dur > elapsed_total else 'slower'} than real-time)")
log(f"Output: {OUTPUT_DIR}")
log(f"Summary: {SUMMARY_FILE}")
log(f"Combined: {COMBINED_MD}")

Usage Examples

# Transcribe last 100 inbound calls from all campaigns
python3.11 transcribe_db.py --start-date 2026-03-01 --limit 100

# Transcribe outbound calls for a specific campaign
python3.11 transcribe_db.py --call-type outbound --campaign ukcamp --limit 50

# Transcribe a specific agent's calls
python3.11 transcribe_db.py --agent 1042 --start-date 2026-03-01 --limit 200

# Transcribe with a larger model for better accuracy
python3.11 transcribe_db.py --model medium --limit 50

# Auto-detect language (useful for multilingual call centers)
python3.11 transcribe_db.py --language auto --limit 100

# Run at low priority in background
nohup nice -n 19 python3.11 transcribe_db.py \
    --output-dir /root/transcriptions_march \
    --start-date 2026-03-01 \
    --limit 500 \
    > /root/transcribe_march.log 2>&1 &

Output Formats

The scripts produce three complementary output formats.

Individual Text Files (.txt)

One file per recording with a metadata header and the full transcript:

File: 20260216-153000_447974040560-all.mp3
Date: 2026-02-16 15:30:00
Duration: 245.9s (4.1 min)
Caller: 447974040560
Agent: 1002
Campaign: doppia
Status: A
Term Reason: AGENT
Queue Time: 0.00s
Detected Language: en (0.997)
---

Call from Google. Hello, how can I help you? Hi, I called a little bit
earlier. I'm a social worker looking for an estimate for a client's
plumbing and I've got the postcode now...

Summary JSON (_summary.json)

Structured metadata for programmatic access, searchable and filterable:

[
  {
    "file": "20260216-153000_447974040560-all.mp3",
    "call_date": "2026-02-16 15:30:00",
    "call_type": "inbound",
    "duration_sec": 245.9,
    "duration_min": 4.1,
    "phone_number": "447974040560",
    "agent": "1002",
    "campaign": "doppia",
    "status": "A",
    "term_reason": "AGENT",
    "queue_seconds": 0,
    "detected_language": "en",
    "language_probability": 0.997,
    "chars": 2847,
    "txt_file": "20260216-153000_447974040560-all.txt",
    "transcript": "Call from Google. Hello, how can I help you?..."
  }
]

SRT Subtitle Format

For cases where you need timed subtitles (e.g., video review tools, quality assurance interfaces), add SRT output to your transcription:

def save_as_srt(segments, output_path):
    """Save transcription segments in SRT subtitle format."""
    with open(output_path, "w") as f:
        for i, segment in enumerate(segments, 1):
            start = format_timestamp_srt(segment.start)
            end = format_timestamp_srt(segment.end)
            f.write(f"{i}\n")
            f.write(f"{start} --> {end}\n")
            f.write(f"{segment.text.strip()}\n\n")


def format_timestamp_srt(seconds):
    """Convert seconds to SRT timestamp format (HH:MM:SS,mmm)."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


# Usage within transcription loop:
segments, info = model.transcribe(filepath, language="en", beam_size=5)
segments_list = list(segments)  # Consume the generator once

# Save SRT
srt_path = os.path.join(OUTPUT_DIR, fname.replace(".mp3", ".srt"))
save_as_srt(segments_list, srt_path)

# Also build plain text from the consumed segments
full_text = " ".join(s.text.strip() for s in segments_list)

Example SRT output:

1
00:00:00,000 --> 00:00:04,200
Call from Google. Hello, how can I help you?

2
00:00:04,200 --> 00:00:08,800
Hi, I called a little bit earlier. I'm a social worker looking for an estimate.

3
00:00:08,800 --> 00:00:13,500
For a client's plumbing and I've got the postcode now.

Timestamped JSON

For detailed segment-level data with timing information:

def save_segments_json(segments_list, info, output_path):
    """Save detailed segment-level transcription as JSON."""
    data = {
        "duration": round(info.duration, 1),
        "language": info.language,
        "language_probability": round(info.language_probability, 3),
        "segments": [
            {
                "id": i,
                "start": round(seg.start, 2),
                "end": round(seg.end, 2),
                "text": seg.text.strip(),
                "avg_logprob": round(seg.avg_logprob, 3),
                "no_speech_prob": round(seg.no_speech_prob, 3)
            }
            for i, seg in enumerate(segments_list)
        ]
    }
    with open(output_path, "w") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

This gives you confidence scores per segment (avg_logprob closer to 0 = more confident, no_speech_prob closer to 1 = likely silence/noise).
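Those per-segment scores are useful for triage. A sketch that consumes the `segments` list from the JSON above and flags entries worth a human listen — the thresholds are illustrative starting points, not tuned values:

```python
def flag_low_confidence(segments, logprob_floor=-1.0, no_speech_ceiling=0.6):
    """Return (segment_id, reason) pairs for segments that look unreliable."""
    flagged = []
    for seg in segments:
        if seg["no_speech_prob"] >= no_speech_ceiling:
            # High no_speech_prob: Whisper thinks this span is silence/noise
            flagged.append((seg["id"], "likely silence/noise"))
        elif seg["avg_logprob"] <= logprob_floor:
            # Very negative avg_logprob: the decoder was guessing
            flagged.append((seg["id"], "low decoder confidence"))
    return flagged
```

Feed the flagged IDs back into your QA workflow, or re-run just those recordings with a larger model.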


Integration with ViciDial

Finding Recordings by Various Criteria

Use these SQL queries to build targeted transcription jobs.

By Date Range

SELECT rl.filename, rl.start_time, rl.length_in_sec
FROM recording_log rl
WHERE rl.start_time >= '2026-03-01 00:00:00'
  AND rl.start_time < '2026-03-08 00:00:00'
  AND rl.length_in_sec > 30
ORDER BY rl.start_time;

By Agent (with Call Details)

SELECT cl.call_date, cl.phone_number, cl.length_in_sec,
       cl.status, cl.campaign_id, rl.filename
FROM vicidial_closer_log cl
JOIN recording_log rl ON rl.vicidial_id = cl.closecallid
WHERE cl.user = '1042'
  AND cl.call_date >= '2026-03-01'
  AND cl.length_in_sec > 30
ORDER BY cl.call_date DESC;

By Campaign

SELECT cl.call_date, cl.phone_number, cl.user,
       cl.length_in_sec, cl.status, rl.filename
FROM vicidial_closer_log cl
JOIN recording_log rl ON rl.vicidial_id = cl.closecallid
WHERE cl.campaign_id = 'doppia'
  AND cl.call_date >= '2026-03-01'
  AND cl.length_in_sec BETWEEN 30 AND 600
ORDER BY cl.call_date DESC
LIMIT 200;

By Call Status (e.g., Only Answered Calls)

-- Common inbound statuses:
-- A     = answered (agent disposition)
-- SALE  = sale made
-- NI    = not interested
-- CB    = callback scheduled
-- DROP  = dropped (no agent answered) — typically exclude
-- TIMEOT = timeout — typically exclude

SELECT cl.call_date, cl.phone_number, cl.user, cl.status, rl.filename
FROM vicidial_closer_log cl
JOIN recording_log rl ON rl.vicidial_id = cl.closecallid
WHERE cl.call_date >= '2026-03-01'
  AND cl.status IN ('A', 'SALE', 'NI', 'CB')
  AND cl.length_in_sec > 30
ORDER BY cl.call_date DESC;

Outbound Calls

SELECT vl.call_date, vl.phone_number, vl.user,
       vl.length_in_sec, vl.status, rl.filename
FROM vicidial_log vl
JOIN recording_log rl ON rl.vicidial_id = vl.uniqueid
WHERE vl.campaign_id = 'ukcamp'
  AND vl.call_date >= '2026-03-01'
  AND vl.length_in_sec > 30
  AND vl.status NOT IN ('NA', 'B', 'DC', 'N', 'DROP')
ORDER BY vl.call_date DESC
LIMIT 100;

Building the File Path

ViciDial's recording_log.filename contains the base name. The actual MP3 file is:

# row is a dict from a recording_log query (as in transcribe_db.py)
mp3_path = f"/var/spool/asterisk/monitorDONE/MP3/{row['filename']}-all.mp3"

Always verify the file exists before adding it to the transcription queue — recordings may have been cleaned up by retention scripts.

Linking Transcripts Back to ViciDial

After transcription, you can load the JSON summary into a database table for searching:

CREATE TABLE call_transcriptions (
    id INT AUTO_INCREMENT PRIMARY KEY,
    recording_filename VARCHAR(255) NOT NULL,
    call_date DATETIME,
    call_type ENUM('inbound', 'outbound'),
    phone_number VARCHAR(20),
    agent VARCHAR(20),
    campaign VARCHAR(50),
    status VARCHAR(10),
    duration_sec DECIMAL(8,1),
    transcript TEXT,
    chars INT,
    language VARCHAR(5),
    language_probability DECIMAL(4,3),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_call_date (call_date),
    INDEX idx_agent (agent),
    INDEX idx_campaign (campaign),
    FULLTEXT INDEX idx_transcript (transcript)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
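Loading the summary into that table can be sketched as follows. This assumes each `_summary.json` record carries `file`, `duration_sec`, `chars`, and `transcript` keys (adjust to your actual summary schema); the `placeholder` argument lets the same code run against sqlite3 (`?`) for testing or a MariaDB driver such as pymysql (`%s`) in production:

```python
import json

def rows_from_summary(summary_path):
    """Map _summary.json records to call_transcriptions column tuples."""
    with open(summary_path) as f:
        records = json.load(f)
    return [(r["file"], r["duration_sec"], r["chars"], r["transcript"])
            for r in records]

def load_summary(conn, summary_path, placeholder="%s"):
    """Bulk-insert rows via any DB-API connection; returns row count."""
    rows = rows_from_summary(summary_path)
    sql = ("INSERT INTO call_transcriptions "
           "(recording_filename, duration_sec, chars, transcript) "
           "VALUES ({p}, {p}, {p}, {p})".format(p=placeholder))
    cur = conn.cursor()
    cur.executemany(sql, rows)
    conn.commit()
    return len(rows)
```

The remaining columns (agent, campaign, status, and so on) come from the same database queries used to build the file list, so in practice you would join them in before inserting.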

Then search transcripts with SQL:

-- Find calls where "refund" was mentioned
SELECT call_date, agent, phone_number, duration_sec,
       SUBSTRING(transcript, 1, 200) AS preview
FROM call_transcriptions
WHERE MATCH(transcript) AGAINST('refund' IN BOOLEAN MODE)
ORDER BY call_date DESC;

-- Find calls where agent said "I don't know"
SELECT call_date, agent, phone_number
FROM call_transcriptions
WHERE transcript LIKE '%I don''t know%'
  AND call_date >= '2026-03-01';

Performance Tuning

Model Selection

Faster-Whisper supports all Whisper model sizes. Choose based on your accuracy/speed tradeoff:

| Model | Parameters | VRAM/RAM | Speed (CPU int8) | English Accuracy | Best For |
|----------|------------|----------|-------------------|------------------|-----------------------------------|
| tiny | 39M | ~400 MB | ~6x real-time | Basic | Quick scanning, keyword detection |
| base | 74M | ~500 MB | ~4x real-time | Good | Rough transcripts, high volume |
| small | 244M | ~1.5 GB | ~1.5-2x real-time | Very good | Production sweet spot |
| medium | 769M | ~3 GB | ~0.5x real-time | Excellent | High-accuracy needs |
| large-v3 | 1.5B | ~5 GB | ~0.2x real-time | Best | Critical/legal recordings |

Recommendation: Use small for call center transcription. It handles accents well (UK, Indian, European English) and runs fast enough for large batches on CPU. Use medium only when you need higher accuracy on difficult audio (heavy accents, background noise, multiple speakers talking over each other).

CPU vs GPU

# CPU with int8 quantization (no GPU required)
model = WhisperModel("small", device="cpu", compute_type="int8")

# GPU with float16 (requires NVIDIA GPU + CUDA)
model = WhisperModel("small", device="cuda", compute_type="float16")

# GPU with int8 (fastest GPU option)
model = WhisperModel("small", device="cuda", compute_type="int8_float16")

GPU is 5-10x faster than CPU, but most ViciDial servers do not have GPUs. CPU with int8 is the practical choice for on-premises call center servers.

Beam Size

The beam_size parameter controls the search width during decoding:

# beam_size=1: Greedy decoding (fastest, least accurate)
segments, info = model.transcribe(path, language="en", beam_size=1)

# beam_size=5: Good balance (default recommendation)
segments, info = model.transcribe(path, language="en", beam_size=5)

# beam_size=10: More thorough search (slower, marginally better)
segments, info = model.transcribe(path, language="en", beam_size=10)

For call center audio, beam_size=5 is the sweet spot. Going higher than 5 gives diminishing returns and costs ~40% more processing time.

Voice Activity Detection (VAD)

Enable VAD filtering to skip silent sections (common in calls with hold time):

segments, info = model.transcribe(
    filepath,
    language="en",
    beam_size=5,
    vad_filter=True,
    vad_parameters=dict(
        threshold=0.5,                  # Speech detection sensitivity (0-1)
        min_speech_duration_ms=250,     # Ignore speech shorter than this
        min_silence_duration_ms=500,    # Split on silences longer than this
        speech_pad_ms=200               # Padding around detected speech
    )
)

VAD can speed up transcription by 20-40% on calls with significant silence, hold time, or ringing.

Language Detection vs Forcing

# Force English (fastest — skips 30-second detection scan)
segments, info = model.transcribe(filepath, language="en")

# Auto-detect language (adds ~2-5 seconds per file)
segments, info = model.transcribe(filepath)
# info.language returns detected code ("en", "it", "es", etc.)
# info.language_probability returns confidence (0.0 to 1.0)

If your call center handles a single language, always force it. For multilingual environments, auto-detection works well but adds overhead.

Parallel Processing

For servers with many CPU cores, you can run multiple transcription processes:

# Split file list into chunks
split -n l/4 recording_list.txt chunk_

# Run 4 parallel transcription processes
for chunk in chunk_*; do
    nice -n 19 python3.11 transcribe_batch.py --input "$chunk" \
        --output-dir /root/transcriptions_$(basename $chunk) &
done
wait
echo "All chunks complete"

Note: Each process loads its own copy of the model (~1.5 GB for small), so 4 processes need ~6 GB RAM. Monitor with htop.
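When the chunks finish, each output directory holds its own summary. Merging them back into a single file can be sketched like this (assuming the per-chunk directories follow the `transcriptions_chunk_*` naming from the loop above and each contains a `_summary.json` array):

```python
import glob
import json

def merge_summaries(pattern="/root/transcriptions_chunk_*/_summary.json",
                    out_path="/root/transcriptions/_summary.json"):
    """Concatenate per-chunk summary arrays into one JSON file."""
    merged = []
    for path in sorted(glob.glob(pattern)):  # sorted for deterministic order
        with open(path) as f:
            merged.extend(json.load(f))
    with open(out_path, "w") as f:
        json.dump(merged, f, indent=2)
    return len(merged)
```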

Memory Optimization

For very large batches on memory-constrained servers:

import gc

# Process in chunks — unload model between chunks if needed
for chunk_start in range(0, len(files), 100):
    chunk = files[chunk_start:chunk_start + 100]
    model = WhisperModel("small", device="cpu", compute_type="int8")

    for filepath in chunk:
        # ... transcribe ...
        pass

    del model
    gc.collect()  # Force garbage collection between chunks

Production Deployment

Cron-Based Nightly Transcription

Set up automatic transcription of the previous day's calls:

#!/bin/bash
# /root/scripts/nightly_transcribe.sh
# Transcribe yesterday's inbound calls
# Runs via cron at 02:00 when call volume is lowest

YESTERDAY=$(date -d 'yesterday' +%Y-%m-%d)
OUTPUT_DIR="/root/transcriptions/${YESTERDAY}"

nice -n 19 /usr/local/bin/python3.11 /root/scripts/transcribe_db.py \
    --start-date "$YESTERDAY" \
    --end-date "$YESTERDAY" \
    --call-type inbound \
    --min-duration 30 \
    --max-duration 600 \
    --limit 1000 \
    --output-dir "$OUTPUT_DIR" \
    >> /var/log/transcribe.log 2>&1

# Report completion
TOTAL=$(jq length "${OUTPUT_DIR}/_summary.json" 2>/dev/null || echo 0)
echo "[$(date)] Transcribed ${TOTAL} calls from ${YESTERDAY}" >> /var/log/transcribe.log

Add to crontab:

0 2 * * * /root/scripts/nightly_transcribe.sh

Monitoring Transcription Jobs

# Check if a transcription is running
pgrep -af transcribe

# Check progress from the summary file
python3.11 -c "
import json
d = json.load(open('/root/transcriptions/_summary.json'))
total_sec = sum(r['duration_sec'] for r in d)
print(f'Transcribed: {len(d)} calls')
print(f'Total audio: {round(total_sec/3600, 1)} hours')
print(f'Total text: {sum(r[\"chars\"] for r in d):,} characters')
"

# Check the log
tail -20 /root/transcriptions/_transcribe.log

Disk Space Management

Transcription output is small relative to the audio itself: an hour of speech yields only tens of kilobytes of plain text, versus tens of megabytes of MP3. The concern is not individual files but old batches accumulating over months.

Set up cleanup for old transcriptions:

# Keep transcriptions for 90 days
find /root/transcriptions/ -name "*.txt" -mtime +90 -delete
find /root/transcriptions/ -name "_summary.json" -mtime +90 -delete
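If you would rather keep old transcripts than delete them, plain text compresses extremely well. A sketch that gzips instead of deleting (the 90-day threshold mirrors the find commands above):

```python
import gzip
import os
import shutil
import time

def gzip_old_transcripts(root, max_age_days=90):
    """Replace .txt transcripts older than max_age_days with .txt.gz copies."""
    cutoff = time.time() - max_age_days * 86400
    compressed = 0
    for dirpath, _, names in os.walk(root):
        for name in names:
            if not name.endswith(".txt"):
                continue
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
                    shutil.copyfileobj(src, dst)
                os.remove(path)
                compressed += 1
    return compressed
```

Note that gzipped transcripts are no longer greppable directly; use `zgrep` or decompress on demand.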

Speaker Considerations

Whisper transcribes all audio as a single stream — it does not perform speaker diarization (identifying who said what). For call center recordings, this means the agent and caller text is mixed together.

Strategies for handling this:

  1. Rely on context clues — In practice, conversational flow makes it clear who is speaking. Phrases like "How can I help you?" are clearly the agent, while "I'm calling about..." is the caller.

  2. Use separate channel recordings — If ViciDial records agent and caller on separate channels (MixMon r and t options), transcribe each channel individually and merge with speaker labels.

  3. Post-process with an LLM — Feed the raw transcript to a language model with a prompt like: "This is a call center transcript. Label each sentence as AGENT or CALLER based on context."

  4. Use a dedicated diarization tool — Libraries like pyannote-audio can identify speakers before transcription:

# Speaker diarization (requires separate installation)
# pip install pyannote.audio
from pyannote.audio import Pipeline

diarization = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
result = diarization(audio_path)

for turn, _, speaker in result.itertracks(yield_label=True):
    print(f"[{turn.start:.1f}s - {turn.end:.1f}s] {speaker}")
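To combine diarization with Faster-Whisper output, a common approach is to label each transcribed segment with the speaker whose diarization turn overlaps it most. A sketch, assuming segments and turns have already been flattened into `(start, end, text)` and `(start, end, speaker)` tuples rather than the library objects:

```python
def label_segments(segments, turns):
    """Assign each Whisper segment the diarization speaker with the
    largest time overlap.

    segments: [(start_sec, end_sec, text), ...]
    turns:    [(start_sec, end_sec, speaker_label), ...]
    """
    labeled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for t_start, t_end, speaker in turns:
            overlap = min(seg_end, t_end) - max(seg_start, t_start)
            if overlap > best_overlap:  # negative overlap = no intersection
                best_speaker, best_overlap = speaker, overlap
        labeled.append((best_speaker, text))
    return labeled
```

Segments that fall entirely in silence (no overlapping turn) keep the "UNKNOWN" label, which is itself a useful signal for hold music or dead air.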

Troubleshooting

Common Issues

"Model not found" error:

The model files are downloaded on first use. Ensure internet access is available,
or pre-download:
    python3.11 -c "from faster_whisper import WhisperModel; WhisperModel('small')"
Models are cached at: ~/.cache/huggingface/hub/

Out of memory (OOM):

Use a smaller model or reduce batch processing parallelism.
Model RAM requirements: tiny ~400MB, base ~500MB, small ~1.5GB, medium ~3GB
Monitor with: watch -n 5 free -h

"Could not load audio" error:

Ensure FFmpeg is installed: ffmpeg -version
Test the file: ffmpeg -i /path/to/file.mp3 -f null -
If corrupt, the file may have been truncated during recording or conversion.

Very slow transcription:

Check CPU load — other processes may be competing.
Use 'nice -n 19' to lower priority.
Verify compute_type="int8" is set (not "float32").
Check model size — "medium" is 3-4x slower than "small".

Poor accuracy on accented English:

Switch from "tiny" or "base" to "small" or "medium" model.
The "small" model handles UK, Indian, and European accents well.
Forcing language="en" can sometimes hurt if speakers mix languages —
try removing the language parameter to let Whisper auto-detect.

Empty or garbled transcriptions:

Check audio quality: ffmpeg -i file.mp3 -af volumedetect -f null -
Very quiet recordings (below -40dB) or recordings with only
DTMF/hold music will produce poor results.
Enable VAD filtering to skip silence: vad_filter=True

Verifying Output Quality

Spot-check a sample of transcriptions against the actual recordings:

# Pick 5 random transcripts and listen to the originals
python3.11 -c "
import json, random
d = json.load(open('/root/transcriptions/_summary.json'))
sample = random.sample(d, min(5, len(d)))
for r in sample:
    print(f'{r[\"file\"]} ({r[\"duration_min\"]}min): {r[\"transcript\"][:100]}...')
"

Compare the first 30 seconds of text against what you hear. The small model typically achieves 90-95% word accuracy on clear call center audio.


Summary

| Component | Value |
|------------------------|-----------------------------------------------|
| Model | faster-whisper 1.2.x with CTranslate2 |
| Recommended model size | small (244M parameters) |
| Compute | CPU with int8 quantization |
| Speed | ~1.5-2x real-time on modern server CPU |
| Accuracy | 90-95% on clear English call audio |
| RAM usage | ~1.5 GB (small model) |
| Cost | Zero (fully self-hosted, no API fees) |
| Resume support | Yes (crash-safe, saves after each file) |
| Output formats | Plain text, JSON, SRT, Markdown |

This system processes hundreds of calls per night on standard server hardware, turning audio into searchable, analyzable text — ready for quality assurance, compliance checks, agent coaching, or AI pipeline input — all without sending a single recording to an external API.

Need expert help with your setup?

VoIP infrastructure consulting, AI voice agent integration, monitoring stacks, scaling — I've done it all in production.

Get a Free Consultation