Building a Custom Prometheus Exporter for Asterisk/ViciDial
A production-grade Python exporter that bridges Asterisk AMI and ViciDial MySQL into Prometheus metrics for real-time VoIP monitoring.
Table of Contents
- Why Build a Custom Exporter?
- Architecture Overview
- Prerequisites
- Project Structure
- The Complete Exporter
- Configuration and Startup
- Asterisk CLI Integration
- SIP Peer Metrics
- Channel and Call Metrics
- RTP Quality Metrics
- Transcoding Detection
- ViciDial Agent Metrics (MySQL)
- Queue Depth Monitoring
- Fail2ban Security Metrics
- Recording Integrity Checks
- Uptime and Conference Tracking
- SIP Phone Fleet Tracking (3-State Model)
- Metric Exposition and HTTP Server
- Metric Reference
- Systemd Service Configuration
- Prometheus Scrape Configuration
- Grafana Dashboard Panels
- Automated Deployment
- Production Tips
- Troubleshooting
- Extending the Exporter
1. Why Build a Custom Exporter?
If you run a VoIP call center on Asterisk with ViciDial, you already know that node_exporter alone tells you almost nothing useful. It will tell you that CPU is at 40% and disk is 60% full --- but it cannot tell you:
- How many SIP trunks are currently reachable, and what their latency is
- How many agents are logged in, on calls, or sitting in pause
- Whether RTP streams are suffering from jitter or packet loss right now
- How many calls are queued waiting for an available agent
- Whether recordings are being generated for every call (compliance requirement)
- Which codec mismatches are causing unnecessary transcoding load
- How many IPs fail2ban has currently banned
These are the metrics that matter at 2 AM when calls start dropping. Standard exporters do not understand Asterisk's internal state, ViciDial's MySQL schema, or the relationship between SIP registration and call routing. You need a custom exporter that speaks both Asterisk CLI and ViciDial SQL.
What We Are Building
A single Python process that runs on each VoIP server, collecting metrics from three sources:
| Source | Method | Metrics |
|---|---|---|
| Asterisk | CLI commands via `asterisk -rx` | SIP peers, channels, RTP stats, codecs, transcoding, uptime |
| ViciDial MySQL | SQL queries via mysql-connector | Agent states, queue depth, recording integrity, campaign stats |
| System tools | `fail2ban-client` | Active bans per jail |
It exposes everything on a single /metrics HTTP endpoint in Prometheus text format, scraped every 15 seconds by a central Prometheus instance.
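A scrape returns plain text in the Prometheus exposition format. A trimmed example (the metric names are the real ones defined later in this article; the values and peer name are illustrative):

```
# HELP asterisk_sip_peer_up SIP peer reachability (1=up, 0=down)
# TYPE asterisk_sip_peer_up gauge
asterisk_sip_peer_up{server="server1",peer="protech"} 1
asterisk_sip_peer_latency_ms{server="server1",peer="protech"} 23
asterisk_active_calls{server="server1"} 42
asterisk_agents_logged_in{server="server1"} 35
```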
Why Not Use Existing Asterisk Exporters?
There are a few community Asterisk exporters (notably asterisk_exporter from digium-cloud and various Go-based ones). They all share the same limitations:
- AMI-only --- They connect via the Asterisk Manager Interface (TCP socket), which requires opening another port, managing AMI credentials, and dealing with connection lifecycle. Our approach uses `asterisk -rx` CLI commands, which are simpler, require no additional ports, and work identically whether Asterisk is version 11 or 20.
- No ViciDial awareness --- No existing exporter knows about `vicidial_live_agents`, `vicidial_auto_calls`, or `recording_log`. ViciDial stores its operational state in MySQL, not in Asterisk.
- No fleet tracking --- In a multi-site call center, you need to track SIP phone registration states across sites (e.g., "are all 40 London Zoiper phones online?"). This requires cross-referencing `sip show peers` with `core show channels` to distinguish offline/idle/in-call.
- No security metrics --- VoIP servers are constant targets for SIP scanning and brute-force attacks. Monitoring fail2ban ban counts alongside SIP metrics gives you a single pane of glass.
2. Architecture Overview
+------------------+ +------------------+ +------------------+
| VoIP Server A | | VoIP Server B | | VoIP Server C |
| | | | | |
| asterisk_exporter| | asterisk_exporter| | asterisk_exporter|
| :9101 | | :9101 | | :9101 |
| node_exporter | | node_exporter | | node_exporter |
| :9100 | | :9100 | | :9100 |
+--------+---------+ +--------+---------+ +--------+---------+
| | |
+------------+------------+------------+------------+
|
+-------v--------+
| Prometheus |
| scrape /15s |
| retention: 30d |
+-------+--------+
|
+-------v--------+
| Grafana |
| dashboards |
| alerts |
+----------------+
Each VoIP server runs the exporter as a systemd service alongside node_exporter. A central Prometheus instance scrapes both endpoints. Grafana queries Prometheus for visualization and alerting.
The exporter is intentionally stateless --- it collects fresh data on every scrape request. This means:
- No persistent state to corrupt
- No risk of stale cached data
- Instant recovery after restart
- Each scrape reflects the current moment
3. Prerequisites
On Each VoIP Server
# Python 3.6+ (3.11 recommended)
python3 --version
# mysql-connector-python
python3 -m pip install mysql-connector-python
# Verify the import works
python3 -c "import mysql.connector; print('OK')"
MySQL User for the Exporter
Create a read-only MySQL user. The exporter only needs SELECT on a handful of ViciDial tables:
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'YOUR_EXPORTER_PASSWORD';
GRANT SELECT ON asterisk.vicidial_live_agents TO 'exporter'@'localhost';
GRANT SELECT ON asterisk.vicidial_auto_calls TO 'exporter'@'localhost';
GRANT SELECT ON asterisk.vicidial_closer_log TO 'exporter'@'localhost';
GRANT SELECT ON asterisk.recording_log TO 'exporter'@'localhost';
FLUSH PRIVILEGES;
Security note: Never use the ViciDial `cron` user or any user with write privileges. The exporter should be read-only by design. Use a dedicated user with minimal grants.
Asterisk CLI Access
The exporter runs as root (or the asterisk user) and calls asterisk -rx "<command>" directly. No AMI port, no AMI credentials, no TCP socket management. The asterisk binary must be in the system PATH.
Verify:
asterisk -rx "core show version"
asterisk -rx "sip show peers"
asterisk -rx "core show channels"
On the Central Monitoring Server
- Prometheus (v2.x+) with network access to port 9101 on each VoIP server
- Grafana (v10+) with Prometheus datasource configured
4. Project Structure
/opt/asterisk_exporter/
├── asterisk_exporter.py # The exporter (single file)
└── README.md # Optional: notes for your team
/etc/systemd/system/
└── asterisk_exporter.service # Systemd unit file
One file. No virtual environments, no dependency hell, no framework overhead. The only external dependency is mysql-connector-python. The HTTP server uses Python's built-in http.server.
5. The Complete Exporter
5.1 Configuration and Startup
The exporter is configured entirely via environment variables, set in the systemd service file. No config files to manage.
#!/usr/bin/env python3
"""
Asterisk/ViciDial Prometheus Exporter
Queries Asterisk CLI + ViciDial MySQL to expose VoIP metrics.
Runs on each monitored server, listens on :9101.
"""
import http.server
import subprocess
import re
import os
import time
import mysql.connector
from mysql.connector import Error
# ─── Configuration via environment variables ─────────────────────────
LISTEN_PORT = int(os.environ.get("EXPORTER_PORT", 9101))
MYSQL_HOST = os.environ.get("MYSQL_HOST", "localhost")
MYSQL_USER = os.environ.get("MYSQL_USER", "exporter")
MYSQL_PASS = os.environ.get("MYSQL_PASS", "YOUR_EXPORTER_PASSWORD")
MYSQL_DB = os.environ.get("MYSQL_DB", "asterisk")
SERVER_LABEL = os.environ.get("SERVER_LABEL", "server1")
# Optional: SIP phone fleet range (e.g., "1031-1070")
PHONE_FLEET_RANGE = os.environ.get("PHONE_FLEET_RANGE", "")
Design decisions:
- Environment variables over config files --- Systemd makes this trivial with `Environment=` directives, and it keeps the exporter a single portable file.
- `SERVER_LABEL` --- Every metric carries a `server` label so you can aggregate across a multi-server fleet in Prometheus without relying on `instance` labels (which contain IPs and break when servers move).
- `PHONE_FLEET_RANGE` --- Optional. Only enable on servers that have a known range of SIP phone extensions to track (e.g., a block of Zoiper softphones allocated to a specific office).
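For illustration, these variables map onto `Environment=` directives in the systemd unit (covered in full later); the values below are placeholders:

```ini
[Service]
Environment=EXPORTER_PORT=9101
Environment=MYSQL_HOST=localhost
Environment=MYSQL_USER=exporter
Environment=MYSQL_PASS=YOUR_EXPORTER_PASSWORD
Environment=SERVER_LABEL=server1
Environment=PHONE_FLEET_RANGE=1031-1070
```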
5.2 Asterisk CLI Integration
Rather than connecting to the Asterisk Manager Interface (AMI) over a TCP socket, we execute CLI commands directly. This is simpler, more reliable, and avoids AMI credential management.
def run_ast_cmd(cmd):
"""Run an Asterisk CLI command and return stdout.
Uses subprocess with a 10-second timeout to prevent hangs
if Asterisk is unresponsive. Returns empty string on any failure.
"""
try:
result = subprocess.run(
["asterisk", "-rx", cmd],
capture_output=True,
text=True,
timeout=10
)
return result.stdout
except Exception:
return ""
Why CLI over AMI?
| Aspect | CLI (`asterisk -rx`) | AMI (TCP socket) |
|---|---|---|
| Authentication | None (runs as root/asterisk user) | Requires AMI user/secret in manager.conf |
| Port requirements | None | TCP 5038 (another port to firewall) |
| Connection management | None (stateless per-call) | Must handle reconnects, keepalives |
| Asterisk version compat | Works on Asterisk 11 through 21+ | AMI protocol varies across versions |
| Output parsing | Text-based, grep-friendly | Event-based, more complex parsing |
| Performance | Fork per command (~5ms each) | Single persistent connection |
The CLI approach is slightly less efficient (one fork per command), but for a 15-second scrape interval collecting 6-8 commands, the total overhead is under 100ms. For a call center server already handling hundreds of calls, this is negligible.
5.3 SIP Peer Metrics
SIP trunks are the connection between your Asterisk server and your carriers. If a trunk goes down, outbound calls stop. Monitoring trunk status and latency is the single most important VoIP metric.
def collect_sip_peers():
"""Parse 'sip show peers' for status and latency.
Asterisk output format:
protech/protech 185.X.X.X D 5060 OK (23 ms)
mutitel_de/mutite 148.X.X.X D 5060 OK (45 ms)
1031/1031 10.X.X.X D 5060 Unspecified
We extract:
- peer name (before the /)
- IP address
- status (OK, UNREACHABLE, LAGGED, UNKNOWN)
- latency in ms (from the parenthetical)
"""
metrics = []
output = run_ast_cmd("sip show peers")
for line in output.splitlines():
m = re.match(
r'^(\S+)\s+(\d+\.\d+\.\d+\.\d+)\s+\S+\s+\S+\s+(\S+)\s+(\S+)',
line
)
if m:
peer = m.group(1).split('/')[0]
status_str = m.group(3)
latency_str = m.group(4)
# Binary up/down for simple alerting
is_up = 1 if status_str == "OK" else 0
metrics.append(
f'asterisk_sip_peer_up{{server="{SERVER_LABEL}",'
f'peer="{peer}"}} {is_up}'
)
# Full status string for detailed dashboards
metrics.append(
f'asterisk_sip_peer_status{{server="{SERVER_LABEL}",'
f'peer="{peer}",status="{status_str}"}} 1'
)
# Latency in ms (only present when peer responds to OPTIONS)
lat_match = re.search(r'(\d+)', latency_str)
if lat_match:
metrics.append(
f'asterisk_sip_peer_latency_ms{{server="{SERVER_LABEL}",'
f'peer="{peer}"}} {lat_match.group(1)}'
)
return metrics
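You can sanity-check the parsing regex offline against a captured line before deploying. The peer name and TEST-NET IP below are made up, but the line follows the `sip show peers` format shown in the docstring:

```python
import re

# Hypothetical captured line in the 'sip show peers' format above
line = "trunk_a/trunk_a          203.0.113.10    D   5060   OK (23 ms)"

m = re.match(r'^(\S+)\s+(\d+\.\d+\.\d+\.\d+)\s+\S+\s+\S+\s+(\S+)\s+(\S+)', line)
peer = m.group(1).split('/')[0]             # name before the slash
status = m.group(3)                         # "OK"
lat = re.search(r'(\d+)', m.group(4))       # group(4) is "(23", digits extracted
print(peer, status, lat.group(1))           # -> trunk_a OK 23
```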
Why two status metrics?
- `asterisk_sip_peer_up` (0 or 1) is perfect for alerting rules: `asterisk_sip_peer_up == 0` fires immediately.
- `asterisk_sip_peer_status` with a `status` label lets you build state-timeline panels in Grafana showing transitions between OK/LAGGED/UNREACHABLE over time.
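A trunk-down alert built on the binary metric might look like this (the rule name and `for:` duration are suggestions, not part of the exporter):

```yaml
- alert: SipPeerDown
  expr: asterisk_sip_peer_up == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.server }}: SIP peer {{ $labels.peer }} is down"
```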
5.4 Channel and Call Metrics
Active channels and calls are the heartbeat of your Asterisk server. A sudden drop to zero means something is very wrong. A sudden spike might mean a toll fraud attack.
def collect_channels():
"""Parse 'core show channels' for active call count and codec info.
The last line of 'core show channels' output looks like:
5 active channels
2 active calls
28 calls processed
We also parse 'sip show channelstats' to count channels by codec,
which helps identify transcoding overhead.
"""
metrics = []
output = run_ast_cmd("core show channels")
# Extract totals from the summary line
m = re.search(r'(\d+) active channel', output)
channels = int(m.group(1)) if m else 0
m2 = re.search(r'(\d+) active call', output)
calls = int(m2.group(1)) if m2 else 0
metrics.append(
f'asterisk_active_channels{{server="{SERVER_LABEL}"}} {channels}'
)
metrics.append(
f'asterisk_active_calls{{server="{SERVER_LABEL}"}} {calls}'
)
# Count channels by codec from channelstats
codec_counts = {}
stats_output = run_ast_cmd("sip show channelstats")
for line in stats_output.splitlines():
parts = line.split()
if len(parts) >= 12:
codec = parts[11] if len(parts) > 11 else "unknown"
if codec in ("alaw", "ulaw", "g722", "g729", "gsm", "opus"):
codec_counts[codec] = codec_counts.get(codec, 0) + 1
for codec, count in codec_counts.items():
metrics.append(
f'asterisk_channels_by_codec{{server="{SERVER_LABEL}",'
f'codec="{codec}"}} {count}'
)
return metrics
The channels vs. calls distinction matters. Each call typically uses 2 channels (one inbound leg, one outbound leg or agent leg). If you see 10 active calls but 25 active channels, you may have conference bridges or call recordings consuming extra channels.
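As a quick sanity check in Grafana, you can chart the channel-to-call ratio; values well above 2 suggest extra legs from conferences or recordings. A possible PromQL expression (`clamp_min` avoids division by zero when no calls are active):

```promql
asterisk_active_channels / clamp_min(asterisk_active_calls, 1)
```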
5.5 RTP Quality Metrics
This is where VoIP monitoring gets serious. RTP (Real-time Transport Protocol) carries the actual audio. Jitter, packet loss, and round-trip time directly determine call quality.
def collect_rtp_stats():
"""Parse 'sip show channelstats' for RTP quality metrics.
Asterisk output columns:
Peer Recv-Count Recv-Lost Recv-Loss% Recv-Jitter
Send-Count Send-Lost Send-Loss% Send-Jitter RTT
We extract per-peer:
- Receive packet loss percentage
- Receive jitter (in ms)
- Round-trip time (in ms)
These are the three key indicators of audio quality.
"""
metrics = []
output = run_ast_cmd("sip show channelstats")
for line in output.splitlines():
parts = line.split()
if len(parts) >= 10 and parts[0] != "Peer":
try:
peer = parts[0]
# Parse receive-side loss percentage
recv_loss_pct = (
float(parts[3].rstrip('%'))
if '%' in parts[3] else 0
)
# Parse receive-side jitter
recv_jitter = (
float(parts[4])
if parts[4].replace('.', '').isdigit() else 0
)
                # Parse round-trip time (column 9 in the layout above;
                # column 7 is Send-Loss%, which would never parse as a number)
                rtt = (
                    float(parts[9])
                    if len(parts) > 9
                    and parts[9].replace('.', '').isdigit()
                    else 0
                )
metrics.append(
f'asterisk_rtp_packet_loss_percent'
f'{{server="{SERVER_LABEL}",peer="{peer}"}} '
f'{recv_loss_pct}'
)
metrics.append(
f'asterisk_rtp_jitter_ms'
f'{{server="{SERVER_LABEL}",peer="{peer}"}} '
f'{recv_jitter}'
)
if rtt > 0:
metrics.append(
f'asterisk_rtp_rtt_ms'
f'{{server="{SERVER_LABEL}",peer="{peer}"}} '
f'{rtt}'
)
except (ValueError, IndexError):
continue
return metrics
Quality thresholds for VoIP:
| Metric | Good | Acceptable | Poor |
|---|---|---|---|
| Packet Loss | < 0.5% | 0.5-2% | > 2% |
| Jitter | < 20ms | 20-50ms | > 50ms |
| RTT | < 150ms | 150-300ms | > 300ms |
These numbers translate directly into Grafana gauge thresholds and Prometheus alert rules.
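Mapped to the "Poor" column above, a pair of alert rules might look like this (the thresholds and `for:` durations are starting points; tune them for your carriers):

```yaml
- alert: HighRtpPacketLoss
  expr: asterisk_rtp_packet_loss_percent > 2
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.server }}: peer {{ $labels.peer }} losing {{ $value }}% of RTP packets"
- alert: HighRtpJitter
  expr: asterisk_rtp_jitter_ms > 50
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.server }}: peer {{ $labels.peer }} jitter at {{ $value }}ms"
```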
5.6 Transcoding Detection
Transcoding (converting between codecs in real-time) consumes significant CPU. In a call center, unexpected transcoding usually means a misconfigured trunk or phone that negotiated the wrong codec.
def collect_transcoding():
"""Detect active transcoding by inspecting each SIP channel.
For each active SIP channel, we run 'core show channel <chan>' and
check the ReadTranscode/WriteTranscode fields. If either is "Yes",
Asterisk is converting audio in real-time.
We also extract the codec mismatch pairs (e.g., "ulaw->alaw")
and the globally allowed codecs from sip.conf.
"""
metrics = []
transcoding_count = 0
codec_mismatch_pairs = {}
# Get list of active SIP channels
output = run_ast_cmd("core show channels verbose")
sip_channels = []
for line in output.splitlines():
m = re.match(r'^(SIP/\S+)', line)
if m:
sip_channels.append(m.group(1))
for chan in sip_channels:
ch_output = run_ast_cmd(f"core show channel {chan}")
native = ""
read_tc = False
write_tc = False
read_path = ""
write_path = ""
peer_name = (
chan.split("/")[1].split("-")[0]
if "/" in chan else chan
)
for ch_line in ch_output.splitlines():
ch_line = ch_line.strip()
if ch_line.startswith("NativeFormats:"):
m = re.search(r'\(([^)]+)\)', ch_line)
if m:
native = m.group(1)
elif ch_line.startswith("ReadTranscode:"):
if "Yes" in ch_line:
read_tc = True
read_path = ch_line.split("Yes")[-1].strip()
elif ch_line.startswith("WriteTranscode:"):
if "Yes" in ch_line:
write_tc = True
write_path = ch_line.split("Yes")[-1].strip()
if native:
metrics.append(
f'asterisk_channel_native_codec'
f'{{server="{SERVER_LABEL}",channel="{peer_name}",'
f'codec="{native}"}} 1'
)
if read_tc or write_tc:
transcoding_count += 1
direction = "read" if read_tc else "write"
metrics.append(
f'asterisk_channel_transcoding'
f'{{server="{SERVER_LABEL}",channel="{peer_name}",'
f'codec="{native}",direction="{direction}"}} 1'
)
# Extract codec pairs from transcoding path
# e.g., "(alaw@8000)->(slin@8000)->(ulaw@8000)"
for path in [read_path, write_path]:
codecs_in_path = re.findall(r'(\w+)@\d+', path)
if len(codecs_in_path) >= 2:
src = codecs_in_path[0]
dst = codecs_in_path[-1]
if (src != dst
and src != "slin"
and dst != "slin"):
pair = f"{src}->{dst}"
codec_mismatch_pairs[pair] = (
codec_mismatch_pairs.get(pair, 0) + 1
)
metrics.append(
f'asterisk_transcoding_channels'
f'{{server="{SERVER_LABEL}"}} {transcoding_count}'
)
for pair, count in codec_mismatch_pairs.items():
metrics.append(
f'asterisk_codec_mismatch'
f'{{server="{SERVER_LABEL}",pair="{pair}"}} {count}'
)
# Export globally allowed codecs from sip.conf
settings = run_ast_cmd("sip show settings")
for line in settings.splitlines():
if "Codecs:" in line:
m = re.search(r'\(([^)]+)\)', line)
if m:
for codec in m.group(1).split("|"):
codec = codec.strip()
if codec:
metrics.append(
f'asterisk_sip_allowed_codec'
f'{{server="{SERVER_LABEL}",'
f'codec="{codec}"}} 1'
)
break
return metrics
Performance note: This function runs `core show channel <chan>` for each active SIP channel, which means N+1 subprocess calls during busy periods. In practice, even with 50 active calls, the total time is under 500ms. If your server handles 200+ concurrent calls, consider sampling instead of inspecting every channel.
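One possible sampling approach is sketched below. This is not part of the exporter above, and the 50-channel cap is an arbitrary choice; note that if you sample, `asterisk_transcoding_channels` becomes an estimate (scaling the sampled count by `len(channels) / limit` keeps it roughly comparable):

```python
import random

def sample_channels(channels, limit=50):
    """Inspect every channel up to `limit`; above that, inspect a
    random subset so scrape time stays bounded on busy servers."""
    if len(channels) <= limit:
        return channels
    return random.sample(channels, limit)

# With 200 active channels, only 50 get the expensive
# 'core show channel' treatment on this scrape.
subset = sample_channels([f"SIP/10{i}-0001" for i in range(200)])
print(len(subset))  # -> 50
```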
5.7 ViciDial Agent Metrics (MySQL)
This is where the exporter goes beyond what any Asterisk-only tool can provide. ViciDial stores agent state in MySQL, not in Asterisk. The vicidial_live_agents table is the source of truth for who is logged in and what they are doing.
def get_mysql_connection():
"""Get a MySQL connection with a 5-second timeout.
Returns None on failure rather than raising --- the exporter
should always return partial metrics rather than crashing.
"""
try:
return mysql.connector.connect(
host=MYSQL_HOST,
user=MYSQL_USER,
password=MYSQL_PASS,
database=MYSQL_DB,
connect_timeout=5
)
except Error:
return None
def collect_vicidial_agents():
"""Query ViciDial MySQL for agent states and queue depth.
Key table: vicidial_live_agents
- status: READY, INCALL, PAUSED, CLOSER, QUEUE, DISPO
- user: agent login ID
- pause_code: reason for pause (LUNCH, BREAK, etc.)
- last_state_change: timestamp of last status transition
We collect:
1. Aggregate counts by status (for overview dashboards)
2. Per-agent status with duration (for supervisor views)
3. Queue depth per campaign/ingroup
"""
metrics = []
conn = get_mysql_connection()
if not conn:
return metrics
try:
cursor = conn.cursor(dictionary=True)
# ── Aggregate agent counts by status ──
cursor.execute("""
SELECT status, COUNT(*) as cnt
FROM vicidial_live_agents
WHERE server_ip != ''
GROUP BY status
""")
logged_in = 0
incall = 0
paused = 0
waiting = 0
for row in cursor.fetchall():
s = row['status']
c = row['cnt']
logged_in += c
if s == 'INCALL':
incall = c
elif s == 'PAUSED':
paused = c
elif s in ('READY', 'CLOSER'):
waiting += c
metrics.append(
f'asterisk_agents_logged_in'
f'{{server="{SERVER_LABEL}"}} {logged_in}'
)
metrics.append(
f'asterisk_agents_incall'
f'{{server="{SERVER_LABEL}"}} {incall}'
)
metrics.append(
f'asterisk_agents_paused'
f'{{server="{SERVER_LABEL}"}} {paused}'
)
metrics.append(
f'asterisk_agents_waiting'
f'{{server="{SERVER_LABEL}"}} {waiting}'
)
# ── Per-agent status with duration ──
cursor.execute("""
SELECT user, status, pause_code,
TIMESTAMPDIFF(SECOND, last_state_change, NOW())
AS state_duration
FROM vicidial_live_agents
WHERE server_ip != ''
""")
for row in cursor.fetchall():
user = row['user']
status = row['status']
duration = row['state_duration'] or 0
metrics.append(
f'asterisk_agent_status'
f'{{server="{SERVER_LABEL}",agent="{user}",'
f'status="{status}"}} 1'
)
if status == 'INCALL':
metrics.append(
f'asterisk_agent_incall_duration_seconds'
f'{{server="{SERVER_LABEL}",agent="{user}"}} '
f'{duration}'
)
elif status == 'PAUSED':
pause_code = row['pause_code'] or 'NONE'
metrics.append(
f'asterisk_agent_pause_duration_seconds'
f'{{server="{SERVER_LABEL}",agent="{user}"}} '
f'{duration}'
)
metrics.append(
f'asterisk_agent_pause_code'
f'{{server="{SERVER_LABEL}",agent="{user}",'
f'pause_code="{pause_code}"}} 1'
)
cursor.close()
except Exception:
pass
finally:
try:
conn.close()
except Exception:
pass
return metrics
Why a new connection every scrape?
We open a fresh MySQL connection on each /metrics request and close it immediately after. This avoids:
- Stale connections --- MySQL's `wait_timeout` (default 28800s) will kill idle persistent connections, requiring reconnect logic
- Connection pool complexity --- For a 15-second scrape interval with queries that take <50ms, pooling adds complexity with zero benefit
- Resource leaks --- No long-lived connections to leak if the exporter encounters an error
The tradeoff is one TCP handshake per scrape (adds ~1ms on localhost). Completely negligible.
5.8 Queue Depth Monitoring
Calls waiting in queue is a critical real-time metric. If queue depth climbs, either agents are overwhelmed or something is preventing call distribution.
# ── Queue depth by campaign/ingroup ──
# (continued inside collect_vicidial_agents)
cursor.execute("""
SELECT campaign_id, COUNT(*) as cnt
FROM vicidial_auto_calls
WHERE status = 'LIVE'
GROUP BY campaign_id
""")
for row in cursor.fetchall():
ingroup = row['campaign_id']
cnt = row['cnt']
metrics.append(
f'asterisk_queue_depth'
f'{{server="{SERVER_LABEL}",ingroup="{ingroup}"}} {cnt}'
)
The vicidial_auto_calls table is ViciDial's real-time call routing table. Rows with status = 'LIVE' are calls that have been answered by the carrier but not yet connected to an agent.
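A queue-depth alert built on this metric might look like the following (the threshold of 10 is illustrative; size it per campaign):

```yaml
- alert: QueueBacklog
  expr: asterisk_queue_depth > 10
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.server }}: {{ $value }} calls waiting in {{ $labels.ingroup }}"
```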
5.9 Fail2ban Security Metrics
VoIP servers are under constant SIP scanning and registration brute-force attacks. Monitoring fail2ban activity alongside call metrics lets you correlate security events with call quality issues.
def collect_fail2ban():
"""Parse fail2ban-client for ban counts per jail.
Typical jails on a VoIP server:
- asterisk: SIP authentication failures
- apache-auth: Web interface brute force
- sshd: SSH brute force
We collect:
- Current active bans (gauge --- can go down as bans expire)
- Total historical bans (counter --- only goes up)
"""
metrics = []
try:
result = subprocess.run(
["fail2ban-client", "status"],
capture_output=True, text=True, timeout=5
)
jails = re.findall(r'Jail list:\s*(.*)', result.stdout)
if jails:
for jail in jails[0].split(','):
jail = jail.strip()
if not jail:
continue
jr = subprocess.run(
["fail2ban-client", "status", jail],
capture_output=True, text=True, timeout=5
)
banned = re.search(
r'Currently banned:\s+(\d+)', jr.stdout
)
total = re.search(
r'Total banned:\s+(\d+)', jr.stdout
)
if banned:
metrics.append(
f'asterisk_fail2ban_active_bans'
f'{{server="{SERVER_LABEL}",jail="{jail}"}} '
f'{banned.group(1)}'
)
if total:
metrics.append(
f'asterisk_fail2ban_bans_total'
f'{{server="{SERVER_LABEL}",jail="{jail}"}} '
f'{total.group(1)}'
)
except Exception:
pass
return metrics
Note: If fail2ban is not installed or not running, this function silently returns empty metrics. The exporter never crashes due to missing optional components.
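The regexes can be checked against captured `fail2ban-client` output. The sample strings below mimic the tree-style output of recent fail2ban versions, but the exact layout varies by version, so verify against your own servers:

```python
import re

# Hypothetical captured output of 'fail2ban-client status' and
# 'fail2ban-client status asterisk' (layout varies by version)
status = "Status\n|- Number of jail:\t2\n`- Jail list:\tasterisk, sshd"
jail_status = (
    "Status for the jail: asterisk\n"
    "`- Actions\n"
    "   |- Currently banned:\t7\n"
    "   |- Total banned:\t154\n"
)

jails = re.findall(r'Jail list:\s*(.*)', status)
names = [j.strip() for j in jails[0].split(',')]
banned = re.search(r'Currently banned:\s+(\d+)', jail_status)
total = re.search(r'Total banned:\s+(\d+)', jail_status)
print(names, banned.group(1), total.group(1))  # -> ['asterisk', 'sshd'] 7 154
```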
5.10 Recording Integrity Checks
In many jurisdictions, call centers must record all calls. This metric detects when calls complete but no recording is generated --- a compliance problem that can go unnoticed for days without monitoring.
def collect_recordings():
"""Check for calls without recordings in the last hour.
Joins vicidial_closer_log (completed calls) against
recording_log (actual recordings). Any inbound call longer
than 10 seconds that has no matching recording is flagged.
"""
metrics = []
conn = get_mysql_connection()
if not conn:
return metrics
try:
cursor = conn.cursor(dictionary=True)
cursor.execute("""
SELECT COUNT(*) as missing
FROM vicidial_closer_log cl
LEFT JOIN recording_log rl
ON rl.vicidial_id = cl.closecallid
AND rl.start_time >= DATE_SUB(NOW(), INTERVAL 1 HOUR)
WHERE cl.call_date >= DATE_SUB(NOW(), INTERVAL 1 HOUR)
AND cl.length_in_sec > 10
AND rl.recording_id IS NULL
""")
row = cursor.fetchone()
missing = row['missing'] if row else 0
metrics.append(
f'asterisk_recordings_missing'
f'{{server="{SERVER_LABEL}"}} {missing}'
)
cursor.close()
except Exception:
pass
finally:
try:
conn.close()
except Exception:
pass
return metrics
Alert rule example:
- alert: MissingRecordings
expr: asterisk_recordings_missing > 5
for: 10m
labels:
severity: warning
annotations:
summary: "{{ $labels.server }}: {{ $value }} calls missing recordings"
5.11 Uptime and Conference Tracking
Simple but useful metrics for operations dashboards.
def collect_uptime():
"""Get Asterisk uptime in seconds."""
metrics = []
output = run_ast_cmd("core show uptime seconds")
m = re.search(r'System uptime:\s+(\d+)', output)
if m:
metrics.append(
f'asterisk_uptime_seconds'
f'{{server="{SERVER_LABEL}"}} {m.group(1)}'
)
return metrics
def collect_confbridge():
"""Count active ConfBridge/MeetMe conferences.
ViciDial uses conference bridges for agent-call connections.
The count reflects active call legs being mixed.
"""
metrics = []
output = run_ast_cmd("confbridge list")
count = 0
for line in output.splitlines():
if re.match(r'^\d+', line):
count += 1
metrics.append(
f'asterisk_confbridge_count'
f'{{server="{SERVER_LABEL}"}} {count}'
)
return metrics
5.12 SIP Phone Fleet Tracking (3-State Model)
This is a feature unique to our exporter. In a call center with a known block of SIP phone extensions (e.g., 40 Zoiper softphones in a London office registered as extensions 1031-1070), you want to know each phone's state at a glance:
- State 0 (Offline): Phone is not registered. Agent has not started their softphone, or there is a network issue.
- State 1 (Idle): Phone is registered but not on a call. Agent is logged in but waiting.
- State 2 (In Call): Phone is registered and has an active call channel.
This requires cross-referencing two Asterisk data sources: sip show peers (registration) and core show channels verbose (active calls).
def collect_phone_fleet():
"""Track a block of SIP phones with 3-state monitoring.
Combines:
1. 'sip show peers' --- registration status and latency
2. 'core show channels verbose' --- which peers have active calls
State model:
0 = offline (not registered)
1 = idle (registered, no active call)
2 = in_call (registered, has active channel)
Also emits aggregate counts:
- phones_registered: how many of the fleet are online
- phones_incall: how many are currently on calls
- phones_total: total phones in the fleet
"""
metrics = []
if not PHONE_FLEET_RANGE:
return metrics
try:
start, end = PHONE_FLEET_RANGE.split("-")
phone_range = set(
str(i) for i in range(int(start), int(end) + 1)
)
except (ValueError, TypeError):
return metrics
# 1) Parse sip show peers for registration status
output = run_ast_cmd("sip show peers")
peer_info = {}
for line in output.splitlines():
parts = line.split()
if not parts:
continue
peer = parts[0].split('/')[0]
if peer not in phone_range:
continue
if "(Unspecified)" in line or "UNKNOWN" in line:
peer_info[peer] = None # Not registered
else:
ip_m = re.search(r'(\d+\.\d+\.\d+\.\d+)', line)
lat_m = re.search(r'\((\d+)\s*ms\)', line)
            status = (
                "OK" if "OK" in line
                # chan_sip prints "LAGGED" in caps (see 5.3)
                else ("LAGGED" if "LAGGED" in line else "OTHER")
            )
peer_info[peer] = {
"ip": ip_m.group(1) if ip_m else "",
"status": status,
"latency": int(lat_m.group(1)) if lat_m else 0,
}
# 2) Check active channels to find peers currently in calls
chan_output = run_ast_cmd("core show channels verbose")
peers_in_call = set()
for line in chan_output.splitlines():
m = re.match(r'^SIP/(\d+)-', line)
if m and m.group(1) in phone_range:
peers_in_call.add(m.group(1))
# 3) Build per-phone state metric
total = len(phone_range)
reg_count = 0
incall_count = 0
for peer in sorted(phone_range, key=int):
info = peer_info.get(peer)
        registered = (
            info is not None
            and info.get("status") in ("OK", "LAGGED")
        )
in_call = peer in peers_in_call
if registered:
reg_count += 1
ip = info["ip"]
if in_call:
state = 2
incall_count += 1
else:
state = 1
else:
state = 0
ip = ""
metrics.append(
f'asterisk_phone_fleet_state'
f'{{server="{SERVER_LABEL}",peer="{peer}",'
f'ip="{ip}"}} {state}'
)
if registered and info.get("latency"):
metrics.append(
f'asterisk_phone_fleet_latency_ms'
f'{{server="{SERVER_LABEL}",peer="{peer}"}} '
f'{info["latency"]}'
)
# Aggregate counts
metrics.append(
f'asterisk_phone_fleet_registered'
f'{{server="{SERVER_LABEL}"}} {reg_count}'
)
metrics.append(
f'asterisk_phone_fleet_incall'
f'{{server="{SERVER_LABEL}"}} {incall_count}'
)
metrics.append(
f'asterisk_phone_fleet_total'
f'{{server="{SERVER_LABEL}"}} {total}'
)
return metrics
Grafana state-timeline panel works beautifully with this metric. Set value mappings: 0 = red ("Offline"), 1 = yellow ("Idle"), 2 = green ("In Call"). You get a real-time heatmap of your entire phone fleet.
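For an at-a-glance offline count next to the timeline, a PromQL stat panel works well; the comparison filter keeps only phones currently in state 0:

```promql
count by (server) (asterisk_phone_fleet_state == 0)
```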
5.13 Metric Exposition and HTTP Server
All the collector functions are combined into a single /metrics endpoint using Python's built-in HTTP server. No Flask, no FastAPI, no external web framework required.
def collect_all():
"""Collect all metrics and format as Prometheus text exposition.
Each metric family gets a HELP and TYPE declaration,
followed by the actual metric lines from each collector.
"""
lines = [
# ── SIP Peers ──
"# HELP asterisk_sip_peer_up SIP peer reachability (1=up, 0=down)",
"# TYPE asterisk_sip_peer_up gauge",
"# HELP asterisk_sip_peer_latency_ms SIP peer qualify latency in ms",
"# TYPE asterisk_sip_peer_latency_ms gauge",
# ── Channels & Calls ──
"# HELP asterisk_active_calls Number of active calls",
"# TYPE asterisk_active_calls gauge",
"# HELP asterisk_active_channels Number of active channels",
"# TYPE asterisk_active_channels gauge",
"# HELP asterisk_channels_by_codec Channel count per codec",
"# TYPE asterisk_channels_by_codec gauge",
# ── RTP Quality ──
"# HELP asterisk_rtp_packet_loss_percent RTP packet loss percentage",
"# TYPE asterisk_rtp_packet_loss_percent gauge",
"# HELP asterisk_rtp_jitter_ms RTP jitter in ms",
"# TYPE asterisk_rtp_jitter_ms gauge",
"# HELP asterisk_rtp_rtt_ms RTP round trip time in ms",
"# TYPE asterisk_rtp_rtt_ms gauge",
# ── Transcoding ──
"# HELP asterisk_transcoding_channels Channels actively transcoding",
"# TYPE asterisk_transcoding_channels gauge",
"# HELP asterisk_channel_transcoding Channel is transcoding (1=yes)",
"# TYPE asterisk_channel_transcoding gauge",
"# HELP asterisk_codec_mismatch Active codec mismatch pairs",
"# TYPE asterisk_codec_mismatch gauge",
"# HELP asterisk_sip_allowed_codec Globally allowed codec",
"# TYPE asterisk_sip_allowed_codec gauge",
# ── Agents ──
"# HELP asterisk_agents_logged_in Number of agents logged in",
"# TYPE asterisk_agents_logged_in gauge",
"# HELP asterisk_agents_incall Number of agents in call",
"# TYPE asterisk_agents_incall gauge",
"# HELP asterisk_agents_paused Number of agents paused",
"# TYPE asterisk_agents_paused gauge",
"# HELP asterisk_agents_waiting Number of agents ready/waiting",
"# TYPE asterisk_agents_waiting gauge",
"# HELP asterisk_agent_incall_duration_seconds Per-agent in-call time",
"# TYPE asterisk_agent_incall_duration_seconds gauge",
"# HELP asterisk_agent_pause_duration_seconds Per-agent pause time",
"# TYPE asterisk_agent_pause_duration_seconds gauge",
"# HELP asterisk_queue_depth Calls waiting in queue per ingroup",
"# TYPE asterisk_queue_depth gauge",
# ── Security ──
"# HELP asterisk_fail2ban_active_bans Current fail2ban active bans",
"# TYPE asterisk_fail2ban_active_bans gauge",
"# HELP asterisk_fail2ban_bans_total Total fail2ban bans",
"# TYPE asterisk_fail2ban_bans_total counter",
# ── Operations ──
"# HELP asterisk_recordings_missing CDR entries without recordings",
"# TYPE asterisk_recordings_missing gauge",
"# HELP asterisk_uptime_seconds Asterisk system uptime",
"# TYPE asterisk_uptime_seconds gauge",
"# HELP asterisk_confbridge_count Active ConfBridge conferences",
"# TYPE asterisk_confbridge_count gauge",
# ── Phone Fleet ──
"# HELP asterisk_phone_fleet_state Phone state (0=offline, 1=idle, 2=in_call)",
"# TYPE asterisk_phone_fleet_state gauge",
"# HELP asterisk_phone_fleet_latency_ms Phone SIP latency in ms",
"# TYPE asterisk_phone_fleet_latency_ms gauge",
"# HELP asterisk_phone_fleet_registered Registered phone count",
"# TYPE asterisk_phone_fleet_registered gauge",
"# HELP asterisk_phone_fleet_incall Phones currently in a call",
"# TYPE asterisk_phone_fleet_incall gauge",
"# HELP asterisk_phone_fleet_total Total phones in fleet",
"# TYPE asterisk_phone_fleet_total gauge",
"", # Blank line before metric data
]
# Run all collectors
lines.extend(collect_sip_peers())
lines.extend(collect_phone_fleet())
lines.extend(collect_channels())
lines.extend(collect_rtp_stats())
lines.extend(collect_uptime())
lines.extend(collect_confbridge())
lines.extend(collect_vicidial_agents())
lines.extend(collect_fail2ban())
lines.extend(collect_transcoding())
lines.extend(collect_recordings())
return "\n".join(lines) + "\n"
class MetricsHandler(http.server.BaseHTTPRequestHandler):
"""HTTP handler that serves Prometheus metrics on /metrics."""
def do_GET(self):
if self.path == "/metrics":
body = collect_all()
self.send_response(200)
self.send_header(
"Content-Type", "text/plain; charset=utf-8"
)
self.end_headers()
self.wfile.write(body.encode())
else:
# Landing page with link to metrics
self.send_response(200)
self.send_header("Content-Type", "text/html")
self.end_headers()
self.wfile.write(
b"<html><body>"
b"<h2>Asterisk/ViciDial Exporter</h2>"
b"<a href='/metrics'>Metrics</a>"
b"</body></html>"
)
def log_message(self, format, *args):
"""Suppress per-request logging to avoid log noise."""
pass
if __name__ == "__main__":
server = http.server.HTTPServer(
("0.0.0.0", LISTEN_PORT), MetricsHandler
)
print(f"asterisk_exporter listening on :{LISTEN_PORT}")
server.serve_forever()
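One caveat of the stock `http.server.HTTPServer`: it handles one request at a time, so a scrape that stalls on a hung CLI command blocks the next scrape behind it. On Python 3.7+ there is a drop-in threaded variant. A minimal sketch (not part of the original exporter; `HealthHandler` here is a hypothetical stand-in for `MetricsHandler`):

```python
import http.server
import threading
import urllib.request

class HealthHandler(http.server.BaseHTTPRequestHandler):
    """Minimal stand-in for MetricsHandler, for illustration only."""
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.end_headers()
        self.wfile.write(b"ok\n")

    def log_message(self, fmt, *args):
        pass  # same per-request log suppression as the real handler

# ThreadingHTTPServer (Python 3.7+) serves each request in its own thread,
# so one slow scrape cannot block the next. Port 0 = pick any free port.
server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), HealthHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

body = urllib.request.urlopen(f"http://127.0.0.1:{port}/metrics").read()
server.shutdown()
```

Swapping `HTTPServer` for `ThreadingHTTPServer` in the `__main__` block is a one-line change if overlapping scrapes ever become a problem.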
Why manual text format instead of prometheus_client?
The prometheus_client Python library is the "official" way to write exporters. We chose manual text format for these reasons:
- Zero additional dependencies --- prometheus_client is another pip package to install and maintain. Our exporter has exactly one external dependency (mysql-connector-python).
- Full control over metric lifecycle --- With prometheus_client, once you create a metric with a label set, it persists until you explicitly remove it. When an agent logs out, their metric would remain with the last known value. With manual text format, metrics simply disappear when the agent is gone --- exactly what we want.
- Simpler mental model --- Each scrape is a fresh rendering of the current state. No stateful metric objects, no label management, no registry cleanup.
- Easier debugging --- curl localhost:9101/metrics shows you exactly what Prometheus will see. No hidden state.
The tradeoff is that we must manually write # HELP and # TYPE declarations. This is a one-time cost at the top of collect_all().
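One subtlety of hand-rolled exposition: label values containing backslashes, double quotes, or newlines must be escaped per the text format spec. The exporter as shown skips this, which is fine while peer and agent names stay alphanumeric; a helper like the following (an assumption, not in the original code) makes it safe in general:

```python
def escape_label_value(value: str) -> str:
    """Escape a label value per the Prometheus text exposition format."""
    return (
        value.replace("\\", "\\\\")  # backslash first so later escapes survive
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )

def metric_line(name: str, labels: dict, value) -> str:
    """Render one exposition line: name{k="v",...} value"""
    pairs = ",".join(
        f'{k}="{escape_label_value(str(v))}"' for k, v in labels.items()
    )
    return f"{name}{{{pairs}}} {value}"
```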
6. Metric Reference
Complete list of metrics exposed by the exporter:
Asterisk Core Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| asterisk_active_calls | gauge | server | Current active calls |
| asterisk_active_channels | gauge | server | Current active channels (typically 2x calls) |
| asterisk_uptime_seconds | gauge | server | Asterisk process uptime |
| asterisk_confbridge_count | gauge | server | Active conference bridge rooms |
SIP Peer Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| asterisk_sip_peer_up | gauge | server, peer | 1 if peer responds to OPTIONS, 0 otherwise |
| asterisk_sip_peer_status | gauge | server, peer, status | Status string (OK/UNREACHABLE/LAGGED) |
| asterisk_sip_peer_latency_ms | gauge | server, peer | Round-trip latency of SIP OPTIONS qualify |
RTP Quality Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| asterisk_rtp_packet_loss_percent | gauge | server, peer | Receive-side packet loss % |
| asterisk_rtp_jitter_ms | gauge | server, peer | Receive-side jitter in milliseconds |
| asterisk_rtp_rtt_ms | gauge | server, peer | Round-trip time in milliseconds |
Codec/Transcoding Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| asterisk_channels_by_codec | gauge | server, codec | Active channel count per codec |
| asterisk_transcoding_channels | gauge | server | Total channels doing codec conversion |
| asterisk_channel_transcoding | gauge | server, channel, codec, direction | Per-channel transcoding indicator |
| asterisk_channel_native_codec | gauge | server, channel, codec | Native codec of each channel |
| asterisk_codec_mismatch | gauge | server, pair | Count of active mismatched codec pairs |
| asterisk_sip_allowed_codec | gauge | server, codec | Codecs allowed in sip.conf |
ViciDial Agent Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| asterisk_agents_logged_in | gauge | server | Total agents logged into ViciDial |
| asterisk_agents_incall | gauge | server | Agents currently handling a call |
| asterisk_agents_paused | gauge | server | Agents in pause state |
| asterisk_agents_waiting | gauge | server | Agents in READY/CLOSER (available) |
| asterisk_agent_status | gauge | server, agent, status | Per-agent current status |
| asterisk_agent_incall_duration_seconds | gauge | server, agent | How long agent has been on current call |
| asterisk_agent_pause_duration_seconds | gauge | server, agent | How long agent has been paused |
| asterisk_agent_pause_code | gauge | server, agent, pause_code | Agent's current pause reason |
| asterisk_queue_depth | gauge | server, ingroup | Calls waiting per campaign/ingroup |
Security Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| asterisk_fail2ban_active_bans | gauge | server, jail | Currently banned IPs per jail |
| asterisk_fail2ban_bans_total | counter | server, jail | Cumulative bans since fail2ban start |
Operations Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| asterisk_recordings_missing | gauge | server | Calls without recordings in last hour |
Phone Fleet Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| asterisk_phone_fleet_state | gauge | server, peer, ip | 0=offline, 1=idle, 2=in_call |
| asterisk_phone_fleet_latency_ms | gauge | server, peer | SIP qualify latency for phone |
| asterisk_phone_fleet_registered | gauge | server | Count of registered phones |
| asterisk_phone_fleet_incall | gauge | server | Count of phones on active calls |
| asterisk_phone_fleet_total | gauge | server | Total phones in configured fleet |
7. Systemd Service Configuration
The exporter runs as a systemd service with environment-based configuration. Create this file on each VoIP server:
# /etc/systemd/system/asterisk_exporter.service
[Unit]
Description=Asterisk/ViciDial Prometheus Exporter
After=network.target mariadb.service asterisk.service
Wants=mariadb.service
[Service]
Type=simple
ExecStart=/usr/bin/python3 /opt/asterisk_exporter/asterisk_exporter.py
Restart=always
RestartSec=10
# ── Configuration ──
Environment=EXPORTER_PORT=9101
Environment=MYSQL_HOST=localhost
Environment=MYSQL_USER=exporter
Environment=MYSQL_PASS=YOUR_EXPORTER_PASSWORD
Environment=MYSQL_DB=asterisk
Environment=SERVER_LABEL=server1
# Optional: SIP phone fleet monitoring (empty = disabled)
# Environment=PHONE_FLEET_RANGE=1031-1070
[Install]
WantedBy=multi-user.target
Enable and Start
# Copy the exporter script
mkdir -p /opt/asterisk_exporter
cp asterisk_exporter.py /opt/asterisk_exporter/
chmod +x /opt/asterisk_exporter/asterisk_exporter.py
# Enable and start the service
systemctl daemon-reload
systemctl enable asterisk_exporter
systemctl start asterisk_exporter
# Verify it is running
systemctl status asterisk_exporter
curl -s localhost:9101/metrics | head -20
Per-Server Configuration
Each server needs its own SERVER_LABEL and optionally its own phone fleet range. Customize the Environment= lines in the service file:
| Server | SERVER_LABEL | PHONE_FLEET_RANGE | Notes |
|---|---|---|---|
| UK Primary | uk-primary | 1031-1070 | 40 London Zoiper phones |
| Romania | romania | | No phone fleet tracking |
| France | france | | No phone fleet tracking |
| Italy | italy | | No phone fleet tracking |
Service Behavior
- Restart=always --- If the exporter crashes or is killed, systemd restarts it after 10 seconds.
- After=mariadb.service asterisk.service --- Starts after MySQL and Asterisk, ensuring both are available when the exporter begins collecting.
- Wants=mariadb.service --- Declares a soft dependency. If MySQL is down, the exporter still starts and returns Asterisk-only metrics.
- Type=simple --- The exporter is a long-running foreground process. Systemd considers it started as soon as the main process has been launched.
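On the exporter side, these Environment= lines surface through os.environ at startup. A minimal sketch of the pattern (the real parsing lives in the exporter's configuration section earlier in the article; the defaults and the helper name here are assumptions):

```python
import os

def fleet_from_range(spec: str) -> list:
    """Parse a PHONE_FLEET_RANGE like '1031-1070' into extension numbers.
    An empty or unset value disables fleet tracking entirely."""
    if not spec:
        return []
    start, end = (int(x) for x in spec.split("-", 1))
    return list(range(start, end + 1))

# Environment-driven configuration, mirroring the unit file above.
LISTEN_PORT = int(os.environ.get("EXPORTER_PORT", "9101"))
SERVER_LABEL = os.environ.get("SERVER_LABEL", "server1")
PHONE_FLEET = fleet_from_range(os.environ.get("PHONE_FLEET_RANGE", ""))
```

Because the optional variable defaults to an empty string, servers without a phone fleet simply skip that collector.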
8. Prometheus Scrape Configuration
On your central Prometheus server, add the exporter targets alongside the standard node_exporter targets:
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
# ── System metrics (CPU, RAM, disk, network) ──
- job_name: "node"
static_configs:
- targets: ["voip-server-1:9100"]
labels:
server: "uk-primary"
- targets: ["voip-server-2:9100"]
labels:
server: "romania"
- targets: ["voip-server-3:9100"]
labels:
server: "france"
# ── VoIP & ViciDial metrics (our custom exporter) ──
- job_name: "asterisk"
scrape_interval: 15s
static_configs:
- targets: ["voip-server-1:9101"]
labels:
server: "uk-primary"
- targets: ["voip-server-2:9101"]
labels:
server: "romania"
- targets: ["voip-server-3:9101"]
labels:
server: "france"
Why 15-Second Scrape Interval?
- Too fast (5s): Each scrape runs 6-8 Asterisk CLI commands and 2-3 MySQL queries. At 5-second intervals, you are running CLI commands almost continuously. On a server handling 100+ calls, the subprocess overhead becomes measurable.
- Too slow (60s): You miss short-lived events. A trunk that goes UNREACHABLE for 30 seconds and recovers would not show up. Queue depth spikes that resolve within a minute would be invisible.
- 15 seconds: The sweet spot. Low overhead, high resolution. You catch events that last more than ~30 seconds (two consecutive scrapes), which covers all operationally significant incidents.
Label Alignment
Notice that the server label in Prometheus matches the SERVER_LABEL environment variable in the exporter. This means you can join node_exporter metrics with asterisk_exporter metrics using the server label:
# CPU usage alongside active calls for the same server
node_cpu_seconds_total{server="uk-primary", mode="idle"}
asterisk_active_calls{server="uk-primary"}
Firewall Rules
Each VoIP server must allow the Prometheus server to reach port 9101:
# On each VoIP server (replace PROMETHEUS_IP with your actual IP)
iptables -I INPUT -s PROMETHEUS_IP -p tcp --dport 9101 -j ACCEPT
iptables -I INPUT -s PROMETHEUS_IP -p tcp --dport 9100 -j ACCEPT
# Persist the rules
iptables-save > /etc/sysconfig/iptables # CentOS/openSUSE
# or
netfilter-persistent save # Debian/Ubuntu
9. Grafana Dashboard Panels
Here are production-tested panel configurations for the key metrics. These examples target Grafana 10+; settings are listed in shorthand rather than as raw panel JSON.
9.1 Fleet Overview --- Stat Panels
A top row of stat panels showing the most critical numbers at a glance.
Active Calls (all servers)
Panel type: Stat
Query: sum(asterisk_active_calls)
Thresholds: 0=green, 50=yellow, 100=red
Unit: none
Title: "Total Active Calls"
Agents Logged In
Panel type: Stat
Query: sum(asterisk_agents_logged_in)
Thresholds: 0=red, 1=yellow, 5=green
Unit: none
Title: "Agents Online"
SIP Trunks Down
Panel type: Stat
Query: count(asterisk_sip_peer_up == 0) or vector(0)
  (the "or vector(0)" pins the stat to 0 when all trunks are up; a bare count() returns no data)
Thresholds: 0=green, 1=yellow, 2=red
Unit: none
Title: "Trunks Down"
Color mode: Background
Queue Depth
Panel type: Stat
Query: sum(asterisk_queue_depth)
Thresholds: 0=green, 5=yellow, 10=red
Unit: none
Title: "Calls in Queue"
9.2 Calls and Agents --- Time Series
A multi-line time series showing calls and agent counts over time.
Panel type: Time Series
Queries:
A: asterisk_active_calls{server=~"$server"} (legend: "{{server}} calls")
B: asterisk_active_channels{server=~"$server"} (legend: "{{server}} channels")
C: asterisk_agents_incall{server=~"$server"} (legend: "{{server}} agents in-call")
D: asterisk_agents_waiting{server=~"$server"} (legend: "{{server}} agents waiting")
Axis: Left Y = count
Fill opacity: 10
Line width: 2
9.3 SIP Trunk Status --- State Timeline
A state-timeline panel showing each trunk's status over time, with color coding.
Panel type: State Timeline
Query: asterisk_sip_peer_up{server=~"$server"}
Legend: {{peer}}
Value mappings:
0 = "DOWN" (red)
1 = "UP" (green)
Show values: Always
Merge equal consecutive values: true
9.4 Trunk Latency --- Time Series
Panel type: Time Series
Query: asterisk_sip_peer_latency_ms{server=~"$server"}
Legend: {{peer}}
Unit: ms
Thresholds:
Line at 50ms (yellow)
Line at 100ms (red)
9.5 RTP Quality --- Three Gauges
Three gauge panels for the three RTP quality indicators.
Packet Loss
Panel type: Gauge
Query: max(asterisk_rtp_packet_loss_percent{server=~"$server"})
Unit: percent (0-100)
Min: 0, Max: 10
Thresholds: 0=green, 0.5=yellow, 2=red
Title: "Max Packet Loss"
Jitter
Panel type: Gauge
Query: max(asterisk_rtp_jitter_ms{server=~"$server"})
Unit: ms
Min: 0, Max: 100
Thresholds: 0=green, 20=yellow, 50=red
Title: "Max Jitter"
Round-Trip Time
Panel type: Gauge
Query: max(asterisk_rtp_rtt_ms{server=~"$server"})
Unit: ms
Min: 0, Max: 500
Thresholds: 0=green, 150=yellow, 300=red
Title: "Max RTT"
9.6 Agent Status --- Bar Gauge
A horizontal bar gauge showing per-agent in-call duration, highlighting agents who have been on unusually long calls.
Panel type: Bar Gauge
Query: asterisk_agent_incall_duration_seconds{server=~"$server"}
Legend: {{agent}}
Unit: seconds (s)
Orientation: Horizontal
Sort: Descending
Thresholds:
0=green (normal call)
600=yellow (10 min --- getting long)
1800=red (30 min --- unusually long)
9.7 Codec Distribution --- Pie Chart
Panel type: Pie Chart
Query: asterisk_channels_by_codec{server=~"$server"}
Legend: {{codec}}
Pie type: Donut
Title: "Active Codecs"
9.8 Phone Fleet --- State Timeline
For servers with phone fleet tracking enabled:
Panel type: State Timeline
Query: asterisk_phone_fleet_state{server="uk-primary"}
Legend: Ext {{peer}}
Value mappings:
0 = "Offline" (red)
1 = "Idle" (yellow)
2 = "In Call" (green)
Row height: 20
Show values: Auto
9.9 Fail2ban --- Time Series
Panel type: Time Series
Query: asterisk_fail2ban_active_bans{server=~"$server"}
Legend: {{server}} - {{jail}}
Unit: none
Title: "Active Fail2ban Bans"
Overrides:
Fill opacity: 20 (to make ban spikes visually prominent)
9.10 Missing Recordings --- Stat
Panel type: Stat
Query: asterisk_recordings_missing{server=~"$server"}
Thresholds: 0=green, 1=yellow, 5=red
Title: "Missing Recordings (1h)"
9.11 Suggested Dashboard Layout
Row 1: Fleet Overview (collapsed=no)
[Active Calls] [Agents Online] [Trunks Down] [Queue Depth] [Missing Recordings]
Row 2: Calls & Agents (collapsed=no)
[Calls + Agents time series, full width]
Row 3: SIP Trunks (collapsed=yes)
[Trunk status state-timeline] [Trunk latency time series]
Row 4: RTP Quality (collapsed=yes)
[Packet Loss gauge] [Jitter gauge] [RTT gauge]
[Packet Loss time series, full width]
Row 5: Agents Detail (collapsed=yes)
[In-call duration bar gauge] [Pause duration bar gauge]
[Pause codes table]
Row 6: Codecs & Transcoding (collapsed=yes)
[Codec pie chart] [Transcoding channels stat] [Codec mismatches table]
Row 7: Phone Fleet (collapsed=yes)
[Fleet state timeline, full width]
[Registered count] [In-call count] [Total count]
Row 8: Security (collapsed=yes)
[Fail2ban bans time series] [Ban totals table]
10. Automated Deployment
For deploying the exporter to multiple servers, use a shell script that handles the full lifecycle: binary detection, Python dependency installation, service file creation, and startup.
#!/bin/bash
# deploy-exporter.sh --- Deploy asterisk_exporter to a remote VoIP server
# Usage: ./deploy-exporter.sh <server_ip> <ssh_port> <server_label>
set -e
SERVER_IP="${1:?Usage: $0 <server_ip> <ssh_port> <server_label>}"
SSH_PORT="${2:-22}"
SERVER_LABEL="${3:?Provide server label}"
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
SSH_CMD="ssh -o StrictHostKeyChecking=no -p ${SSH_PORT} root@${SERVER_IP}"
echo "=== Deploying asterisk_exporter to ${SERVER_LABEL} ==="
# Create target directory
${SSH_CMD} "mkdir -p /opt/asterisk_exporter"
# Copy the exporter script
scp -o StrictHostKeyChecking=no -P ${SSH_PORT} \
${SCRIPT_DIR}/asterisk_exporter.py \
root@${SERVER_IP}:/opt/asterisk_exporter/
# Install dependencies and create systemd service
${SSH_CMD} bash << REMOTEOF
set -e
# Find Python 3
PYTHON_BIN=""
for p in python3.11 python3.9 python3.6 python3; do
if command -v \$p &>/dev/null; then
PYTHON_BIN=\$(command -v \$p)
break
fi
done
if [ -z "\$PYTHON_BIN" ]; then
echo "ERROR: No Python 3 found"
exit 1
fi
echo "Using Python: \$PYTHON_BIN"
# Install mysql-connector-python
\$PYTHON_BIN -m pip install mysql-connector-python 2>/dev/null \
|| \$PYTHON_BIN -m pip install "mysql-connector-python<8.1" 2>/dev/null \
|| true
# Verify import works
\$PYTHON_BIN -c "import mysql.connector; print('mysql-connector OK')"
chmod +x /opt/asterisk_exporter/asterisk_exporter.py
# Create systemd service
cat > /etc/systemd/system/asterisk_exporter.service << SVCFILE
[Unit]
Description=Asterisk/ViciDial Prometheus Exporter
After=network.target mariadb.service asterisk.service
Wants=mariadb.service
[Service]
Type=simple
ExecStart=\$PYTHON_BIN /opt/asterisk_exporter/asterisk_exporter.py
Restart=always
RestartSec=10
Environment=EXPORTER_PORT=9101
Environment=MYSQL_HOST=localhost
Environment=MYSQL_USER=exporter
Environment=MYSQL_PASS=YOUR_EXPORTER_PASSWORD
Environment=MYSQL_DB=asterisk
Environment=SERVER_LABEL=${SERVER_LABEL}
[Install]
WantedBy=multi-user.target
SVCFILE
systemctl daemon-reload
systemctl enable asterisk_exporter
systemctl restart asterisk_exporter
echo "asterisk_exporter started on :9101"
REMOTEOF
echo "=== Done: ${SERVER_LABEL} ==="
The script is idempotent --- running it again on the same server updates the exporter script and restarts the service without breaking anything.
11. Production Tips
Metric Cardinality
Cardinality (the total number of unique time series) is the primary scaling concern for Prometheus. Each unique combination of metric name + label values creates one time series.
Watch out for:
- Per-agent metrics with the agent label: 50 agents x 3 metrics = 150 series. Acceptable.
- Per-channel metrics (transcoding, codecs): With 100 concurrent calls, this could create 200+ series that appear and disappear every few minutes. Prometheus handles this fine, but Grafana queries over long time ranges will slow down.
- Per-peer RTP metrics: Each active call generates jitter/loss/rtt metrics. These are inherently high-cardinality but also inherently short-lived.
Mitigation strategies:
- Do not add labels you do not query. If you never filter by ip in Grafana, remove the ip label from phone fleet metrics.
- Use aggregate metrics where possible. asterisk_transcoding_channels (a single number) is more useful for alerting than 50 individual asterisk_channel_transcoding metrics.
- Set Prometheus retention appropriately. 30 days is a good default; going to 90 days with high-cardinality VoIP metrics will consume significant disk.
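To audit your own cardinality, count the series in a live scrape. This small helper (an assumption, not part of the exporter) parses exposition text and tallies series per metric family:

```python
from collections import Counter

def series_per_metric(exposition_text: str) -> Counter:
    """Count time series per metric family in Prometheus exposition text."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        # The metric name ends at the first "{" or space
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts
```

Feed it the output of `curl localhost:9101/metrics` and call `.most_common(5)` to see the worst offenders.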
Error Handling Philosophy
Every collector function follows the same pattern: catch exceptions, return empty metrics, never crash.
# Pattern used throughout the exporter:
def collect_something():
metrics = []
try:
# ... do work ...
except Exception:
pass # Return partial/empty metrics
return metrics
This is intentional. If MySQL is down, you still get Asterisk metrics. If Asterisk is restarting, you still get MySQL metrics. If fail2ban is not installed, everything else still works. The exporter degrades gracefully rather than failing completely.
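If the try/except scaffold feels repetitive, the same philosophy factors into a decorator (a sketch; the exporter as written repeats the pattern inline):

```python
import functools

def safe_collector(func):
    """Wrap a collector so any failure yields an empty metric list
    instead of crashing the whole scrape."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            return []  # degrade gracefully: this family just disappears
    return wrapper

@safe_collector
def collect_flaky():
    raise RuntimeError("MySQL is down")
```

Here `collect_flaky()` returns `[]` instead of raising, so a `collect_all()` using decorated collectors keeps serving every other metric family.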
Scrape Timeout
Prometheus's default scrape timeout is 10 seconds. Our exporter runs 6-8 CLI commands (each with a 10-second timeout) and 2-3 MySQL queries (5-second connection timeout). In the worst case, a single scrape could take several seconds.
If you see context deadline exceeded errors in Prometheus, increase the scrape timeout:
- job_name: "asterisk"
scrape_interval: 15s
scrape_timeout: 12s # Default is 10s
static_configs:
- targets: [...]
MySQL Connection Safety
The exporter's MySQL user should be strictly read-only with a query timeout:
-- Create user with 5-second query timeout
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'YOUR_EXPORTER_PASSWORD';
-- Grant SELECT only on required tables
GRANT SELECT ON asterisk.vicidial_live_agents TO 'exporter'@'localhost';
GRANT SELECT ON asterisk.vicidial_auto_calls TO 'exporter'@'localhost';
GRANT SELECT ON asterisk.vicidial_closer_log TO 'exporter'@'localhost';
GRANT SELECT ON asterisk.recording_log TO 'exporter'@'localhost';
-- Set max execution time in seconds (MariaDB 10.1+)
-- This prevents a runaway query from blocking the table.
-- NOTE: SET GLOBAL applies to every connection on the server,
-- not just the exporter user --- scope it deliberately.
SET GLOBAL max_statement_time = 5;
FLUSH PRIVILEGES;
Log Suppression
The log_message override in MetricsHandler suppresses per-request HTTP logging:
def log_message(self, format, *args):
pass
Without this, every 15-second Prometheus scrape generates a log line. Over 24 hours, that is 5,760 lines of GET /metrics 200 --- pure noise. Suppress it.
12. Troubleshooting
Exporter Is Not Starting
# Check systemd status
systemctl status asterisk_exporter
# Check logs
journalctl -u asterisk_exporter --no-pager -n 50
# Common issues:
# 1. Python not found --- check ExecStart path in service file
# 2. mysql-connector not installed --- run: python3 -m pip install mysql-connector-python
# 3. Port 9101 already in use --- check: ss -tlnp | grep 9101
Prometheus Shows "Target Down"
# From the Prometheus server, verify network connectivity
curl -s http://VOIP_SERVER_IP:9101/metrics | head -5
# If connection refused:
# 1. Exporter not running: ssh to server, check systemctl status
# 2. Firewall blocking: iptables -L -n | grep 9101
# 3. Wrong port: check EXPORTER_PORT in service file
# If timeout:
# 1. Network issue between Prometheus and VoIP server
# 2. Exporter hung: restart with systemctl restart asterisk_exporter
MySQL Metrics Missing (Asterisk Metrics Present)
# Test MySQL connectivity from the server itself
mysql -u exporter -p'YOUR_EXPORTER_PASSWORD' -e "SELECT 1" asterisk
# Common issues:
# 1. Wrong credentials in Environment= lines
# 2. User lacks SELECT privilege: SHOW GRANTS FOR 'exporter'@'localhost';
# 3. MySQL not running: systemctl status mariadb
# 4. Database name wrong: MYSQL_DB should be "asterisk" for ViciDial
Asterisk Metrics Missing (MySQL Metrics Present)
# Test Asterisk CLI access
asterisk -rx "core show version"
# Common issues:
# 1. Asterisk not running: systemctl status asterisk
# 2. Permission denied: exporter must run as root or asterisk user
# 3. asterisk binary not in PATH: which asterisk
Metrics Look Stale or Frozen
# Check the scrape duration
curl -s localhost:9101/metrics | wc -l
# If very few lines, some collectors are silently failing
# Time a scrape manually
time curl -s localhost:9101/metrics > /dev/null
# Should be < 3 seconds. If > 10s, a CLI command or MySQL query is hanging
# Check if Asterisk is responsive
time asterisk -rx "sip show peers"
# Should return in < 1 second
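To bisect a slow scrape down to the guilty collector, time each one from a Python shell. A hypothetical debugging helper (pass it the collector functions called in collect_all()):

```python
import time

def time_collectors(collectors):
    """Run each collector once; return {name: seconds}, slowest first."""
    timings = {}
    for func in collectors:
        start = time.monotonic()
        try:
            func()
        except Exception:
            pass  # a failing collector still gets timed
        timings[func.__name__] = time.monotonic() - start
    return dict(sorted(timings.items(), key=lambda kv: -kv[1]))
```

For example, `time_collectors([collect_sip_peers, collect_rtp_stats])` immediately shows which family is eating the scrape budget.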
High CPU from the Exporter
The transcoding detection feature runs core show channel for every active SIP channel. With 100+ concurrent calls, this means 100+ subprocess forks every 15 seconds.
If CPU is a concern, disable transcoding detection by commenting out the collect_transcoding() call in collect_all():
# lines.extend(collect_transcoding()) # Disabled: too many subprocess forks
"Too Many Open Files" Errors
Each subprocess fork opens file descriptors. On systems with low ulimit -n (default 1024), heavy scrape activity can exhaust them.
Fix in the systemd service file:
[Service]
LimitNOFILE=4096
13. Extending the Exporter
The modular structure (one function per metric family) makes it easy to add new collectors.
Adding a New Metric
- Write a collector function that returns a list of metric strings:
def collect_my_custom_metric():
    metrics = []
    value = 0  # ... gather data here ...
    metrics.append(
        f'asterisk_my_metric{{server="{SERVER_LABEL}"}} {value}'
    )
    return metrics
- Add # HELP and # TYPE declarations to collect_all():
"# HELP asterisk_my_metric Description of what it measures",
"# TYPE asterisk_my_metric gauge",
- Call the collector in collect_all():
lines.extend(collect_my_custom_metric())
- Restart the exporter:
systemctl restart asterisk_exporter
Ideas for Additional Metrics
| Metric | Source | Value |
|---|---|---|
| asterisk_calls_today_total | MySQL vicidial_closer_log | Daily call volume counter |
| asterisk_avg_wait_time_seconds | MySQL vicidial_closer_log | Average queue wait time |
| asterisk_dialplan_errors | asterisk -rx "dialplan show" | Dialplan syntax errors |
| asterisk_sip_registry_status | asterisk -rx "sip show registry" | Outbound registration status |
| asterisk_dahdi_spans_up | asterisk -rx "dahdi show status" | DAHDI/ISDN span status |
| asterisk_disk_recordings_gb | du -s /var/spool/asterisk/monitor | Recording storage usage |
| vicidial_campaign_calls_waiting | MySQL vicidial_auto_calls | Per-campaign queue depth |
| vicidial_list_penetration | MySQL vicidial_list | % of leads called per list |
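As a taste of how small one of these collectors can be: the recording storage metric needs no CLI at all. A sketch (the metric name comes from the table above; the default path and function name are assumptions):

```python
import os

def collect_recording_storage(path="/var/spool/asterisk/monitor",
                              server_label="server1"):
    """Emit recording storage usage in GB by walking the directory tree."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total_bytes += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file removed mid-walk; skip it
    gb = total_bytes / (1024 ** 3)
    return [
        f'asterisk_disk_recordings_gb{{server="{server_label}"}} {gb:.2f}'
    ]
```

Walking a large recording directory can itself be slow; in production you would likely cache the result or sample it less often than every scrape.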
Converting to AMI (If Needed)
If you outgrow the CLI approach and need lower-overhead collection (e.g., 200+ concurrent calls with transcoding detection), you can convert individual collectors to use AMI. The Asterisk Manager Interface uses a persistent TCP connection:
import socket
class AMIConnection:
def __init__(self, host, port, username, secret):
self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
self.sock.settimeout(10)
self.sock.connect((host, port))
self._read_until("Asterisk Call Manager")
self._send_action({
"Action": "Login",
"Username": username,
"Secret": secret,
})
response = self._read_response()
if "Success" not in response.get("Response", ""):
raise Exception(f"AMI login failed: {response}")
def _send_action(self, action):
msg = "\r\n".join(
f"{k}: {v}" for k, v in action.items()
) + "\r\n\r\n"
self.sock.sendall(msg.encode())
    def _read_until(self, marker):
        data = b""
        while marker.encode() not in data:
            chunk = self.sock.recv(4096)
            if not chunk:  # socket closed by Asterisk; avoid a busy loop
                raise ConnectionError("AMI connection closed")
            data += chunk
        return data.decode()
def _read_response(self):
data = self._read_until("\r\n\r\n")
result = {}
for line in data.strip().split("\r\n"):
if ": " in line:
key, val = line.split(": ", 1)
result[key] = val
return result
    def command(self, cmd):
        # Note: CLI output arrives as "Response: Follows" with a free-form
        # body terminated by "--END COMMAND--" (newer Asterisk uses
        # "Output:" headers instead), so parsing it may need
        # _read_until("--END COMMAND--") rather than the simple
        # key/value parse in _read_response.
        self._send_action({
            "Action": "Command",
            "Command": cmd,
        })
        return self._read_response()
def close(self):
self._send_action({"Action": "Logoff"})
self.sock.close()
For AMI to work, you need to configure /etc/asterisk/manager.conf:
[general]
enabled = yes
port = 5038
bindaddr = 127.0.0.1 ; Only localhost --- never expose AMI externally
[exporter]
secret = YOUR_AMI_SECRET
deny = 0.0.0.0/0.0.0.0
permit = 127.0.0.1/255.255.255.255 ; exactly one host, matching bindaddr
read = system,call,agent
write = command
Our recommendation: Stick with CLI unless you have a measured performance problem. The CLI approach has zero authentication surface, zero network exposure, and works identically across all Asterisk versions from 11 to 21.
Summary
What we built:
- A single-file Python exporter (~500 lines) that bridges Asterisk CLI and ViciDial MySQL into Prometheus metrics
- 30+ metric families covering SIP trunks, call quality, agent states, queue depth, security, and recording integrity
- Systemd service with environment-based configuration, auto-restart, and graceful dependency handling
- Prometheus scrape config for multi-server fleet monitoring at 15-second resolution
- Grafana dashboard panels for every metric family, from fleet overview stats to per-phone state timelines
- Automated deployment script for rolling out to new servers in minutes
The exporter runs in production across multiple VoIP servers, each handling hundreds of concurrent calls, scraped every 15 seconds. Total overhead per scrape: under 100ms of subprocess time and one MySQL connection lasting <50ms.
This is the monitoring stack that tells you a trunk went UNREACHABLE at 02:14, that agent 1042 has been paused for 47 minutes, that RTP jitter spiked to 35ms on the France server, and that 3 calls in the last hour have no recordings --- all from a single Grafana dashboard, all alertable, all queryable in PromQL.
Standard node_exporter tells you the disk is 60% full. This exporter tells you the business is running.