Building a Custom Prometheus Exporter for Asterisk/ViciDial
A production-grade Python exporter that bridges Asterisk AMI and ViciDial MySQL into Prometheus metrics for real-time VoIP monitoring.
Table of Contents
- Why Build a Custom Exporter?
- Architecture Overview
- Prerequisites
- Project Structure
- The Complete Exporter
- Configuration and Startup
- Asterisk CLI Integration
- SIP Peer Metrics
- Channel and Call Metrics
- RTP Quality Metrics
- Transcoding Detection
- ViciDial Agent Metrics (MySQL)
- Queue Depth Monitoring
- Fail2ban Security Metrics
- Recording Integrity Checks
- Uptime and Conference Tracking
- SIP Phone Fleet Tracking (3-State Model)
- Metric Exposition and HTTP Server
- Metric Reference
- Systemd Service Configuration
- Prometheus Scrape Configuration
- Grafana Dashboard Panels
- Automated Deployment
- Production Tips
- Troubleshooting
- Extending the Exporter
1. Why Build a Custom Exporter?
If you run a VoIP call center on Asterisk with ViciDial, you already know that node_exporter alone tells you almost nothing useful. It will tell you that CPU is at 40% and disk is 60% full --- but it cannot tell you:
- How many SIP trunks are currently reachable, and what their latency is
- How many agents are logged in, on calls, or sitting in pause
- Whether RTP streams are suffering from jitter or packet loss right now
- How many calls are queued waiting for an available agent
- Whether recordings are being generated for every call (compliance requirement)
- Which codec mismatches are causing unnecessary transcoding load
- How many IPs fail2ban has currently banned
These are the metrics that matter at 2 AM when calls start dropping. Standard exporters do not understand Asterisk's internal state, ViciDial's MySQL schema, or the relationship between SIP registration and call routing. You need a custom exporter that speaks both Asterisk CLI and ViciDial SQL.
What We Are Building
A single Python process that runs on each VoIP server, collecting metrics from three sources:
| Source | Method | Metrics |
|---|---|---|
| Asterisk | CLI commands via `asterisk -rx` | SIP peers, channels, RTP stats, codecs, transcoding, uptime |
| ViciDial MySQL | SQL queries via mysql-connector | Agent states, queue depth, recording integrity, campaign stats |
| System tools | `fail2ban-client` | Active bans per jail |
It exposes everything on a single /metrics HTTP endpoint in Prometheus text format, scraped every 15 seconds by a central Prometheus instance.
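A scrape returns plain text in the Prometheus exposition format. A trimmed example (the metric names are the real ones defined later in this article; the values and peer name are illustrative):

```
# HELP asterisk_sip_peer_up SIP peer reachability (1=up, 0=down)
# TYPE asterisk_sip_peer_up gauge
asterisk_sip_peer_up{server="server1",peer="protech"} 1
asterisk_sip_peer_latency_ms{server="server1",peer="protech"} 23
asterisk_active_calls{server="server1"} 42
asterisk_agents_logged_in{server="server1"} 35
```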
Why Not Use Existing Asterisk Exporters?
There are a few community Asterisk exporters (notably asterisk_exporter from digium-cloud and various Go-based ones). They all share the same limitations:
- AMI-only --- They connect via the Asterisk Manager Interface (TCP socket), which requires opening another port, managing AMI credentials, and dealing with connection lifecycle. Our approach uses `asterisk -rx` CLI commands, which are simpler, require no additional ports, and work identically whether Asterisk is version 11 or 20.
- No ViciDial awareness --- No existing exporter knows about `vicidial_live_agents`, `vicidial_auto_calls`, or `recording_log`. ViciDial stores its operational state in MySQL, not in Asterisk.
- No fleet tracking --- In a multi-site call center, you need to track SIP phone registration states across sites (e.g., "are all 40 London Zoiper phones online?"). This requires cross-referencing `sip show peers` with `core show channels` to distinguish offline/idle/in-call.
- No security metrics --- VoIP servers are constant targets for SIP scanning and brute-force attacks. Monitoring fail2ban ban counts alongside SIP metrics gives you a single pane of glass.
2. Architecture Overview
+------------------+ +------------------+ +------------------+
| VoIP Server A | | VoIP Server B | | VoIP Server C |
| | | | | |
| asterisk_exporter| | asterisk_exporter| | asterisk_exporter|
| :9101 | | :9101 | | :9101 |
| node_exporter | | node_exporter | | node_exporter |
| :9100 | | :9100 | | :9100 |
+--------+---------+ +--------+---------+ +--------+---------+
| | |
+------------+------------+------------+------------+
|
+-------v--------+
| Prometheus |
| scrape /15s |
| retention: 30d |
+-------+--------+
|
+-------v--------+
| Grafana |
| dashboards |
| alerts |
+----------------+
Each VoIP server runs the exporter as a systemd service alongside node_exporter. A central Prometheus instance scrapes both endpoints. Grafana queries Prometheus for visualization and alerting.
The exporter is intentionally stateless --- it collects fresh data on every scrape request. This means:
- No persistent state to corrupt
- No risk of stale cached data
- Instant recovery after restart
- Each scrape reflects the current moment
3. Prerequisites
On Each VoIP Server
# Python 3.6+ (3.11 recommended)
python3 --version
# mysql-connector-python
python3 -m pip install mysql-connector-python
# Verify the import works
python3 -c "import mysql.connector; print('OK')"
MySQL User for the Exporter
Create a read-only MySQL user. The exporter only needs SELECT on a handful of ViciDial tables:
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'YOUR_EXPORTER_PASSWORD';
GRANT SELECT ON asterisk.vicidial_live_agents TO 'exporter'@'localhost';
GRANT SELECT ON asterisk.vicidial_auto_calls TO 'exporter'@'localhost';
GRANT SELECT ON asterisk.vicidial_closer_log TO 'exporter'@'localhost';
GRANT SELECT ON asterisk.recording_log TO 'exporter'@'localhost';
FLUSH PRIVILEGES;
Security note: Never use the ViciDial `cron` user or any user with write privileges. The exporter should be read-only by design. Use a dedicated user with minimal grants.
Asterisk CLI Access
The exporter runs as root (or the asterisk user) and calls asterisk -rx "<command>" directly. No AMI port, no AMI credentials, no TCP socket management. The asterisk binary must be in the system PATH.
Verify:
asterisk -rx "core show version"
asterisk -rx "sip show peers"
asterisk -rx "core show channels"
On the Central Monitoring Server
- Prometheus (v2.x+) with network access to port 9101 on each VoIP server
- Grafana (v10+) with Prometheus datasource configured
4. Project Structure
/opt/asterisk_exporter/
├── asterisk_exporter.py # The exporter (single file)
└── README.md # Optional: notes for your team
/etc/systemd/system/
└── asterisk_exporter.service # Systemd unit file
One file. No virtual environments, no dependency hell, no framework overhead. The only external dependency is mysql-connector-python. The HTTP server uses Python's built-in http.server.
5. The Complete Exporter
5.1 Configuration and Startup
The exporter is configured entirely via environment variables, set in the systemd service file. No config files to manage.
#!/usr/bin/env python3
"""
Asterisk/ViciDial Prometheus Exporter
Queries Asterisk CLI + ViciDial MySQL to expose VoIP metrics.
Runs on each monitored server, listens on :9101.
"""
import http.server
import subprocess
import re
import os
import time
import mysql.connector
from mysql.connector import Error
# ─── Configuration via environment variables ─────────────────────────
LISTEN_PORT = int(os.environ.get("EXPORTER_PORT", 9101))
MYSQL_HOST = os.environ.get("MYSQL_HOST", "localhost")
MYSQL_USER = os.environ.get("MYSQL_USER", "exporter")
MYSQL_PASS = os.environ.get("MYSQL_PASS", "YOUR_EXPORTER_PASSWORD")
MYSQL_DB = os.environ.get("MYSQL_DB", "asterisk")
SERVER_LABEL = os.environ.get("SERVER_LABEL", "server1")
# Optional: SIP phone fleet range (e.g., "1031-1070")
PHONE_FLEET_RANGE = os.environ.get("PHONE_FLEET_RANGE", "")
Design decisions:
- Environment variables over config files --- Systemd makes this trivial with `Environment=` directives, and it keeps the exporter a single portable file.
- `SERVER_LABEL` --- Every metric carries a `server` label so you can aggregate across a multi-server fleet in Prometheus without relying on `instance` labels (which contain IPs and break when servers move).
- `PHONE_FLEET_RANGE` --- Optional. Only enable on servers that have a known range of SIP phone extensions to track (e.g., a block of Zoiper softphones allocated to a specific office).
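For illustration, these variables map onto `Environment=` directives in the systemd unit (covered in full later); the values below are placeholders:

```ini
[Service]
Environment=EXPORTER_PORT=9101
Environment=MYSQL_HOST=localhost
Environment=MYSQL_USER=exporter
Environment=MYSQL_PASS=YOUR_EXPORTER_PASSWORD
Environment=SERVER_LABEL=server1
Environment=PHONE_FLEET_RANGE=1031-1070
```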
5.2 Asterisk CLI Integration
Rather than connecting to the Asterisk Manager Interface (AMI) over a TCP socket, we execute CLI commands directly. This is simpler, more reliable, and avoids AMI credential management.
def run_ast_cmd(cmd):
"""Run an Asterisk CLI command and return stdout.
Uses subprocess with a 10-second timeout to prevent hangs
if Asterisk is unresponsive. Returns empty string on any failure.
"""
try:
result = subprocess.run(
["asterisk", "-rx", cmd],
capture_output=True,
text=True,
timeout=10
)
return result.stdout
except Exception:
return ""
Why CLI over AMI?
| Aspect | CLI (`asterisk -rx`) | AMI (TCP socket) |
|---|---|---|
| Authentication | None (runs as root/asterisk user) | Requires AMI user/secret in manager.conf |
| Port requirements | None | TCP 5038 (another port to firewall) |
| Connection management | None (stateless per-call) | Must handle reconnects, keepalives |
| Asterisk version compat | Works on Asterisk 11 through 21+ | AMI protocol varies across versions |
| Output parsing | Text-based, grep-friendly | Event-based, more complex parsing |
| Performance | Fork per command (~5ms each) | Single persistent connection |
The CLI approach is slightly less efficient (one fork per command), but for a 15-second scrape interval collecting 6-8 commands, the total overhead is under 100ms. For a call center server already handling hundreds of calls, this is negligible.
5.3 SIP Peer Metrics
SIP trunks are the connection between your Asterisk server and your carriers. If a trunk goes down, outbound calls stop. Monitoring trunk status and latency is the single most important VoIP metric.
def collect_sip_peers():
"""Parse 'sip show peers' for status and latency.
Asterisk output format:
protech/protech 185.X.X.X D 5060 OK (23 ms)
mutitel_de/mutite 148.X.X.X D 5060 OK (45 ms)
1031/1031 10.X.X.X D 5060 Unspecified
We extract:
- peer name (before the /)
- IP address
- status (OK, UNREACHABLE, LAGGED, UNKNOWN)
- latency in ms (from the parenthetical)
"""
metrics = []
output = run_ast_cmd("sip show peers")
for line in output.splitlines():
m = re.match(
r'^(\S+)\s+(\d+\.\d+\.\d+\.\d+)\s+\S+\s+\S+\s+(\S+)\s+(\S+)',
line
)
if m:
peer = m.group(1).split('/')[0]
status_str = m.group(3)
latency_str = m.group(4)
# Binary up/down for simple alerting
is_up = 1 if status_str == "OK" else 0
metrics.append(
f'asterisk_sip_peer_up{{server="{SERVER_LABEL}",'
f'peer="{peer}"}} {is_up}'
)
# Full status string for detailed dashboards
metrics.append(
f'asterisk_sip_peer_status{{server="{SERVER_LABEL}",'
f'peer="{peer}",status="{status_str}"}} 1'
)
# Latency in ms (only present when peer responds to OPTIONS)
lat_match = re.search(r'(\d+)', latency_str)
if lat_match:
metrics.append(
f'asterisk_sip_peer_latency_ms{{server="{SERVER_LABEL}",'
f'peer="{peer}"}} {lat_match.group(1)}'
)
return metrics
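You can sanity-check the parsing regex offline against a captured line before deploying. The peer name and TEST-NET IP below are made up, but the line follows the `sip show peers` format shown in the docstring:

```python
import re

# Hypothetical captured line in the 'sip show peers' format above
line = "trunk_a/trunk_a          203.0.113.10    D   5060   OK (23 ms)"

m = re.match(r'^(\S+)\s+(\d+\.\d+\.\d+\.\d+)\s+\S+\s+\S+\s+(\S+)\s+(\S+)', line)
peer = m.group(1).split('/')[0]             # name before the slash
status = m.group(3)                         # "OK"
lat = re.search(r'(\d+)', m.group(4))       # group(4) is "(23", digits extracted
print(peer, status, lat.group(1))           # -> trunk_a OK 23
```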
Why two status metrics?
- `asterisk_sip_peer_up` (0 or 1) is perfect for alerting rules: `asterisk_sip_peer_up == 0` fires immediately.
- `asterisk_sip_peer_status` with a `status` label lets you build state-timeline panels in Grafana showing transitions between OK/LAGGED/UNREACHABLE over time.
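A trunk-down alert built on the binary metric might look like this (the rule name and `for:` duration are suggestions, not part of the exporter):

```yaml
- alert: SipPeerDown
  expr: asterisk_sip_peer_up == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.server }}: SIP peer {{ $labels.peer }} is down"
```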
5.4 Channel and Call Metrics
Active channels and calls are the heartbeat of your Asterisk server. A sudden drop to zero means something is very wrong. A sudden spike might mean a toll fraud attack.
def collect_channels():
"""Parse 'core show channels' for active call count and codec info.
The last line of 'core show channels' output looks like:
5 active channels
2 active calls
28 calls processed
We also parse 'sip show channelstats' to count channels by codec,
which helps identify transcoding overhead.
"""
metrics = []
output = run_ast_cmd("core show channels")
# Extract totals from the summary line
m = re.search(r'(\d+) active channel', output)
channels = int(m.group(1)) if m else 0
m2 = re.search(r'(\d+) active call', output)
calls = int(m2.group(1)) if m2 else 0
metrics.append(
f'asterisk_active_channels{{server="{SERVER_LABEL}"}} {channels}'
)
metrics.append(
f'asterisk_active_calls{{server="{SERVER_LABEL}"}} {calls}'
)
# Count channels by codec from channelstats
codec_counts = {}
stats_output = run_ast_cmd("sip show channelstats")
for line in stats_output.splitlines():
parts = line.split()
if len(parts) >= 12:
codec = parts[11] if len(parts) > 11 else "unknown"
if codec in ("alaw", "ulaw", "g722", "g729", "gsm", "opus"):
codec_counts[codec] = codec_counts.get(codec, 0) + 1
for codec, count in codec_counts.items():
metrics.append(
f'asterisk_channels_by_codec{{server="{SERVER_LABEL}",'
f'codec="{codec}"}} {count}'
)
return metrics
The channels vs. calls distinction matters. Each call typically uses 2 channels (one inbound leg, one outbound leg or agent leg). If you see 10 active calls but 25 active channels, you may have conference bridges or call recordings consuming extra channels.
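As a quick sanity check in Grafana, you can chart the channel-to-call ratio; values well above 2 suggest extra legs from conferences or recordings. A possible PromQL expression (`clamp_min` avoids division by zero when no calls are active):

```promql
asterisk_active_channels / clamp_min(asterisk_active_calls, 1)
```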
5.5 RTP Quality Metrics
This is where VoIP monitoring gets serious. RTP (Real-time Transport Protocol) carries the actual audio. Jitter, packet loss, and round-trip time directly determine call quality.
def collect_rtp_stats():
"""Parse 'sip show channelstats' for RTP quality metrics.
Asterisk output columns:
Peer Recv-Count Recv-Lost Recv-Loss% Recv-Jitter
Send-Count Send-Lost Send-Loss% Send-Jitter RTT
We extract per-peer:
- Receive packet loss percentage
- Receive jitter (in ms)
- Round-trip time (in ms)
These are the three key indicators of audio quality.
"""
metrics = []
output = run_ast_cmd("sip show channelstats")
for line in output.splitlines():
parts = line.split()
if len(parts) >= 10 and parts[0] != "Peer":
try:
peer = parts[0]
# Parse receive-side loss percentage
recv_loss_pct = (
float(parts[3].rstrip('%'))
if '%' in parts[3] else 0
)
# Parse receive-side jitter
recv_jitter = (
float(parts[4])
if parts[4].replace('.', '').isdigit() else 0
)
                # Parse round-trip time (column 9 in the layout above;
                # column 7 is Send-Loss%, which would never parse as a number)
                rtt = (
                    float(parts[9])
                    if len(parts) > 9
                    and parts[9].replace('.', '').isdigit()
                    else 0
                )
metrics.append(
f'asterisk_rtp_packet_loss_percent'
f'{{server="{SERVER_LABEL}",peer="{peer}"}} '
f'{recv_loss_pct}'
)
metrics.append(
f'asterisk_rtp_jitter_ms'
f'{{server="{SERVER_LABEL}",peer="{peer}"}} '
f'{recv_jitter}'
)
if rtt > 0:
metrics.append(
f'asterisk_rtp_rtt_ms'
f'{{server="{SERVER_LABEL}",peer="{peer}"}} '
f'{rtt}'
)
except (ValueError, IndexError):
continue
return metrics
Quality thresholds for VoIP:
| Metric | Good | Acceptable | Poor |
|---|---|---|---|
| Packet Loss | < 0.5% | 0.5-2% | > 2% |
| Jitter | < 20ms | 20-50ms | > 50ms |
| RTT | < 150ms | 150-300ms | > 300ms |
These numbers translate directly into Grafana gauge thresholds and Prometheus alert rules.
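Mapped to the "Poor" column above, a pair of alert rules might look like this (the thresholds and `for:` durations are starting points; tune them for your carriers):

```yaml
- alert: HighRtpPacketLoss
  expr: asterisk_rtp_packet_loss_percent > 2
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.server }}: peer {{ $labels.peer }} losing {{ $value }}% of RTP packets"
- alert: HighRtpJitter
  expr: asterisk_rtp_jitter_ms > 50
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.server }}: peer {{ $labels.peer }} jitter at {{ $value }}ms"
```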
5.6 Transcoding Detection
Transcoding (converting between codecs in real-time) consumes significant CPU. In a call center, unexpected transcoding usually means a misconfigured trunk or phone that negotiated the wrong codec.
def collect_transcoding():
"""Detect active transcoding by inspecting each SIP channel.
For each active SIP channel, we run 'core show channel <chan>' and
check the ReadTranscode/WriteTranscode fields. If either is "Yes",
Asterisk is converting audio in real-time.
We also extract the codec mismatch pairs (e.g., "ulaw->alaw")
and the globally allowed codecs from sip.conf.
"""
metrics = []
transcoding_count = 0
codec_mismatch_pairs = {}
# Get list of active SIP channels
output = run_ast_cmd("core show channels verbose")
sip_channels = []
for line in output.splitlines():
m = re.match(r'^(SIP/\S+)', line)
if m:
sip_channels.append(m.group(1))
for chan in sip_channels:
ch_output = run_ast_cmd(f"core show channel {chan}")
native = ""
read_tc = False
write_tc = False
read_path = ""
write_path = ""
peer_name = (
chan.split("/")[1].split("-")[0]
if "/" in chan else chan
)
for ch_line in ch_output.splitlines():
ch_line = ch_line.strip()
if ch_line.startswith("NativeFormats:"):
m = re.search(r'\(([^)]+)\)', ch_line)
if m:
native = m.group(1)
elif ch_line.startswith("ReadTranscode:"):
if "Yes" in ch_line:
read_tc = True
read_path = ch_line.split("Yes")[-1].strip()
elif ch_line.startswith("WriteTranscode:"):
if "Yes" in ch_line:
write_tc = True
write_path = ch_line.split("Yes")[-1].strip()
if native:
metrics.append(
f'asterisk_channel_native_codec'
f'{{server="{SERVER_LABEL}",channel="{peer_name}",'
f'codec="{native}"}} 1'
)
if read_tc or write_tc:
transcoding_count += 1
direction = "read" if read_tc else "write"
metrics.append(
f'asterisk_channel_transcoding'
f'{{server="{SERVER_LABEL}",channel="{peer_name}",'
f'codec="{native}",direction="{direction}"}} 1'
)
# Extract codec pairs from transcoding path
# e.g., "(alaw@8000)->(slin@8000)->(ulaw@8000)"
for path in [read_path, write_path]:
codecs_in_path = re.findall(r'(\w+)@\d+', path)
if len(codecs_in_path) >= 2:
src = codecs_in_path[0]
dst = codecs_in_path[-1]
if (src != dst
and src != "slin"
and dst != "slin"):
pair = f"{src}->{dst}"
codec_mismatch_pairs[pair] = (
codec_mismatch_pairs.get(pair, 0) + 1
)
metrics.append(
f'asterisk_transcoding_channels'
f'{{server="{SERVER_LABEL}"}} {transcoding_count}'
)
for pair, count in codec_mismatch_pairs.items():
metrics.append(
f'asterisk_codec_mismatch'
f'{{server="{SERVER_LABEL}",pair="{pair}"}} {count}'
)
# Export globally allowed codecs from sip.conf
settings = run_ast_cmd("sip show settings")
for line in settings.splitlines():
if "Codecs:" in line:
m = re.search(r'\(([^)]+)\)', line)
if m:
for codec in m.group(1).split("|"):
codec = codec.strip()
if codec:
metrics.append(
f'asterisk_sip_allowed_codec'
f'{{server="{SERVER_LABEL}",'
f'codec="{codec}"}} 1'
)
break
return metrics
Performance note: This function runs `core show channel <chan>` for each active SIP channel, which means N+1 subprocess calls during busy periods. In practice, even with 50 active calls, the total time is under 500ms. If your server handles 200+ concurrent calls, consider sampling instead of inspecting every channel.
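One possible sampling approach is sketched below. This is not part of the exporter above, and the 50-channel cap is an arbitrary choice; note that if you sample, `asterisk_transcoding_channels` becomes an estimate (scaling the sampled count by `len(channels) / limit` keeps it roughly comparable):

```python
import random

def sample_channels(channels, limit=50):
    """Inspect every channel up to `limit`; above that, inspect a
    random subset so scrape time stays bounded on busy servers."""
    if len(channels) <= limit:
        return channels
    return random.sample(channels, limit)

# With 200 active channels, only 50 get the expensive
# 'core show channel' treatment on this scrape.
subset = sample_channels([f"SIP/10{i}-0001" for i in range(200)])
print(len(subset))  # -> 50
```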
5.7 ViciDial Agent Metrics (MySQL)
This is where the exporter goes beyond what any Asterisk-only tool can provide. ViciDial stores agent state in MySQL, not in Asterisk. The vicidial_live_agents table is the source of truth for who is logged in and what they are doing.
def get_mysql_connection():
"""Get a MySQL connection with a 5-second timeout.
Returns None on failure rather than raising --- the exporter
should always return partial metrics rather than crashing.
"""
try:
return mysql.connector.connect(
host=MYSQL_HOST,
user=MYSQL_USER,
password=MYSQL_PASS,
database=MYSQL_DB,
connect_timeout=5
)
except Error:
return None
def collect_vicidial_agents():
"""Query ViciDial MySQL for agent states and queue depth.
Key table: vicidial_live_agents
- status: READY, INCALL, PAUSED, CLOSER, QUEUE, DISPO
- user: agent login ID
- pause_code: reason for pause (LUNCH, BREAK, etc.)
- last_state_change: timestamp of last status transition
We collect:
1. Aggregate counts by status (for overview dashboards)
2. Per-agent status with duration (for supervisor views)
3. Queue depth per campaign/ingroup
"""
metrics = []
conn = get_mysql_connection()
if not conn:
return metrics
try:
cursor = conn.cursor(dictionary=True)
# ── Aggregate agent counts by status ──
cursor.execute("""
SELECT status, COUNT(*) as cnt
FROM vicidial_live_agents
WHERE server_ip != ''
GROUP BY status
""")
logged_in = 0
incall = 0
paused = 0
waiting = 0
for row in cursor.fetchall():
s = row['status']
c = row['cnt']
logged_in += c
if s == 'INCALL':
incall = c
elif s == 'PAUSED':
paused = c
elif s in ('READY', 'CLOSER'):
waiting += c
metrics.append(
f'asterisk_agents_logged_in'
f'{{server="{SERVER_LABEL}"}} {logged_in}'
)
metrics.append(
f'asterisk_agents_incall'
f'{{server="{SERVER_LABEL}"}} {incall}'
)
metrics.append(
f'asterisk_agents_paused'
f'{{server="{SERVER_LABEL}"}} {paused}'
)
metrics.append(
f'asterisk_agents_waiting'
f'{{server="{SERVER_LABEL}"}} {waiting}'
)
# ── Per-agent status with duration ──
cursor.execute("""
SELECT user, status, pause_code,
TIMESTAMPDIFF(SECOND, last_state_change, NOW())
AS state_duration
FROM vicidial_live_agents
WHERE server_ip != ''
""")
for row in cursor.fetchall():
user = row['user']
status = row['status']
duration = row['state_duration'] or 0
metrics.append(
f'asterisk_agent_status'
f'{{server="{SERVER_LABEL}",agent="{user}",'
f'status="{status}"}} 1'
)
if status == 'INCALL':
metrics.append(
f'asterisk_agent_incall_duration_seconds'
f'{{server="{SERVER_LABEL}",agent="{user}"}} '
f'{duration}'
)
elif status == 'PAUSED':
pause_code = row['pause_code'] or 'NONE'
metrics.append(
f'asterisk_agent_pause_duration_seconds'
f'{{server="{SERVER_LABEL}",agent="{user}"}} '
f'{duration}'
)
metrics.append(
f'asterisk_agent_pause_code'
f'{{server="{SERVER_LABEL}",agent="{user}",'
f'pause_code="{pause_code}"}} 1'
)
cursor.close()
except Exception:
pass
finally:
try:
conn.close()
except Exception:
pass
return metrics
Why a new connection every scrape?
We open a fresh MySQL connection on each /metrics request and close it immediately after. This avoids:
- Stale connections --- MySQL's `wait_timeout` (default 28800s) will kill idle persistent connections, requiring reconnect logic
- Connection pool complexity --- For a 15-second scrape interval with queries that take <50ms, pooling adds complexity with zero benefit
- Resource leaks --- No long-lived connections to leak if the exporter encounters an error
The tradeoff is one TCP handshake per scrape (adds ~1ms on localhost). Completely negligible.
5.8 Queue Depth Monitoring
Calls waiting in queue is a critical real-time metric. If queue depth climbs, either agents are overwhelmed or something is preventing call distribution.
# ── Queue depth by campaign/ingroup ──
# (continued inside collect_vicidial_agents)
cursor.execute("""
SELECT campaign_id, COUNT(*) as cnt
FROM vicidial_auto_calls
WHERE status = 'LIVE'
GROUP BY campaign_id
""")
for row in cursor.fetchall():
ingroup = row['campaign_id']
cnt = row['cnt']
metrics.append(
f'asterisk_queue_depth'
f'{{server="{SERVER_LABEL}",ingroup="{ingroup}"}} {cnt}'
)
The vicidial_auto_calls table is ViciDial's real-time call routing table. Rows with status = 'LIVE' are calls that have been answered by the carrier but not yet connected to an agent.
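A queue-depth alert built on this metric might look like the following (the threshold of 10 is illustrative; size it per campaign):

```yaml
- alert: QueueBacklog
  expr: asterisk_queue_depth > 10
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.server }}: {{ $value }} calls waiting in {{ $labels.ingroup }}"
```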
5.9 Fail2ban Security Metrics
VoIP servers are under constant SIP scanning and registration brute-force attacks. Monitoring fail2ban activity alongside call metrics lets you correlate security events with call quality issues.
def collect_fail2ban():
"""Parse fail2ban-client for ban counts per jail.
Typical jails on a VoIP server:
- asterisk: SIP authentication failures
- apache-auth: Web interface brute force
- sshd: SSH brute force
We collect:
- Current active bans (gauge --- can go down as bans expire)
- Total historical bans (counter --- only goes up)
"""
metrics = []
try:
result = subprocess.run(
["fail2ban-client", "status"],
capture_output=True, text=True, timeout=5
)
jails = re.findall(r'Jail list:\s*(.*)', result.stdout)
if jails:
for jail in jails[0].split(','):
jail = jail.strip()
if not jail:
continue
jr = subprocess.run(
["fail2ban-client", "status", jail],
capture_output=True, text=True, timeout=5
)
banned = re.search(
r'Currently banned:\s+(\d+)', jr.stdout
)
total = re.search(
r'Total banned:\s+(\d+)', jr.stdout
)
if banned:
metrics.append(
f'asterisk_fail2ban_active_bans'
f'{{server="{SERVER_LABEL}",jail="{jail}"}} '
f'{banned.group(1)}'
)
if total:
metrics.append(
f'asterisk_fail2ban_bans_total'
f'{{server="{SERVER_LABEL}",jail="{jail}"}} '
f'{total.group(1)}'
)
except Exception:
pass
return metrics
Note: If fail2ban is not installed or not running, this function silently returns empty metrics. The exporter never crashes due to missing optional components.
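The regexes can be checked against captured `fail2ban-client` output. The sample strings below mimic the tree-style output of recent fail2ban versions, but the exact layout varies by version, so verify against your own servers:

```python
import re

# Hypothetical captured output of 'fail2ban-client status' and
# 'fail2ban-client status asterisk' (layout varies by version)
status = "Status\n|- Number of jail:\t2\n`- Jail list:\tasterisk, sshd"
jail_status = (
    "Status for the jail: asterisk\n"
    "`- Actions\n"
    "   |- Currently banned:\t7\n"
    "   |- Total banned:\t154\n"
)

jails = re.findall(r'Jail list:\s*(.*)', status)
names = [j.strip() for j in jails[0].split(',')]
banned = re.search(r'Currently banned:\s+(\d+)', jail_status)
total = re.search(r'Total banned:\s+(\d+)', jail_status)
print(names, banned.group(1), total.group(1))  # -> ['asterisk', 'sshd'] 7 154
```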
5.10 Recording Integrity Checks
In many jurisdictions, call centers must record all calls. This metric detects when calls complete but no recording is generated --- a compliance problem that can go unnoticed for days without monitoring.
def collect_recordings():
"""Check for calls without recordings in the last hour.
Joins vicidial_closer_log (completed calls) against
recording_log (actual recordings). Any inbound call longer
than 10 seconds that has no matching recording is flagged.
"""
metrics = []
conn = get_mysql_connection()
if not conn:
return metrics
try:
cursor = conn.cursor(dictionary=True)
cursor.execute("""
SELECT COUNT(*) as missing
FROM vicidial_closer_log cl
LEFT JOIN recording_log rl
ON rl.vicidial_id = cl.closecallid
AND rl.start_time >= DATE_SUB(NOW(), INTERVAL 1 HOUR)
WHERE cl.call_date >= DATE_SUB(NOW(), INTERVAL 1 HOUR)
AND cl.length_in_sec > 10
AND rl.recording_id IS NULL
""")
row = cursor.fetchone()
missing = row['missing'] if row else 0
metrics.append(
f'asterisk_recordings_missing'
f'{{server="{SERVER_LABEL}"}} {missing}'
)
cursor.close()
except Exception:
pass
finally:
try:
conn.close()
except Exception:
pass
return metrics
Alert rule example:
- alert: MissingRecordings
expr: asterisk_recordings_missing > 5
for: 10m
labels:
severity: warning
annotations:
summary: "{{ $labels.server }}: {{ $value }} calls missing recordings"
5.11 Uptime and Conference Tracking
Simple but useful metrics for operations dashboards.
def collect_uptime():
"""Get Asterisk uptime in seconds."""
metrics = []
output = run_ast_cmd("core show uptime seconds")
m = re.search(r'System uptime:\s+(\d+)', output)
if m:
metrics.append(
f'asterisk_uptime_seconds'
f'{{server="{SERVER_LABEL}"}} {m.group(1)}'
)
return metrics
def collect_confbridge():
"""Count active ConfBridge/MeetMe conferences.
ViciDial uses conference bridges for agent-call connections.
The count reflects active call legs being mixed.
"""
metrics = []
output = run_ast_cmd("confbridge list")
count = 0
for line in output.splitlines():
if re.match(r'^\d+', line):
count += 1
metrics.append(
f'asterisk_confbridge_count'
f'{{server="{SERVER_LABEL}"}} {count}'
)
return metrics
5.12 SIP Phone Fleet Tracking (3-State Model)
This is a feature unique to our exporter. In a call center with a known block of SIP phone extensions (e.g., 40 Zoiper softphones in a London office registered as extensions 1031-1070), you want to know each phone's state at a glance:
- State 0 (Offline): Phone is not registered. Agent has not started their softphone, or there is a network issue.
- State 1 (Idle): Phone is registered but not on a call. Agent is logged in but waiting.
- State 2 (In Call): Phone is registered and has an active call channel.
This requires cross-referencing two Asterisk data sources: sip show peers (registration) and core show channels verbose (active calls).
def collect_phone_fleet():
"""Track a block of SIP phones with 3-state monitoring.
Combines:
1. 'sip show peers' --- registration status and latency
2. 'core show channels verbose' --- which peers have active calls
State model:
0 = offline (not registered)
1 = idle (registered, no active call)
2 = in_call (registered, has active channel)
Also emits aggregate counts:
- phones_registered: how many of the fleet are online
- phones_incall: how many are currently on calls
- phones_total: total phones in the fleet
"""
metrics = []
if not PHONE_FLEET_RANGE:
return metrics
try:
start, end = PHONE_FLEET_RANGE.split("-")
phone_range = set(
str(i) for i in range(int(start), int(end) + 1)
)
except (ValueError, TypeError):
return metrics
# 1) Parse sip show peers for registration status
output = run_ast_cmd("sip show peers")
peer_info = {}
for line in output.splitlines():
parts = line.split()
if not parts:
continue
peer = parts[0].split('/')[0]
if peer not in phone_range:
continue
if "(Unspecified)" in line or "UNKNOWN" in line:
peer_info[peer] = None # Not registered
else:
ip_m = re.search(r'(\d+\.\d+\.\d+\.\d+)', line)
lat_m = re.search(r'\((\d+)\s*ms\)', line)
            status = (
                "OK" if "OK" in line
                # chan_sip prints "LAGGED" in caps (see 5.3)
                else ("LAGGED" if "LAGGED" in line else "OTHER")
            )
peer_info[peer] = {
"ip": ip_m.group(1) if ip_m else "",
"status": status,
"latency": int(lat_m.group(1)) if lat_m else 0,
}
# 2) Check active channels to find peers currently in calls
chan_output = run_ast_cmd("core show channels verbose")
peers_in_call = set()
for line in chan_output.splitlines():
m = re.match(r'^SIP/(\d+)-', line)
if m and m.group(1) in phone_range:
peers_in_call.add(m.group(1))
# 3) Build per-phone state metric
total = len(phone_range)
reg_count = 0
incall_count = 0
for peer in sorted(phone_range, key=int):
info = peer_info.get(peer)
        registered = (
            info is not None
            and info.get("status") in ("OK", "LAGGED")
        )
in_call = peer in peers_in_call
if registered:
reg_count += 1
ip = info["ip"]
if in_call:
state = 2
incall_count += 1
else:
state = 1
else:
state = 0
ip = ""
metrics.append(
f'asterisk_phone_fleet_state'
f'{{server="{SERVER_LABEL}",peer="{peer}",'
f'ip="{ip}"}} {state}'
)
if registered and info.get("latency"):
metrics.append(
f'asterisk_phone_fleet_latency_ms'
f'{{server="{SERVER_LABEL}",peer="{peer}"}} '
f'{info["latency"]}'
)
# Aggregate counts
metrics.append(
f'asterisk_phone_fleet_registered'
f'{{server="{SERVER_LABEL}"}} {reg_count}'
)
metrics.append(
f'asterisk_phone_fleet_incall'
f'{{server="{SERVER_LABEL}"}} {incall_count}'
)
metrics.append(
f'asterisk_phone_fleet_total'
f'{{server="{SERVER_LABEL}"}} {total}'
)
return metrics
Grafana state-timeline panel works beautifully with this metric. Set value mappings: 0 = red ("Offline"), 1 = yellow ("Idle"), 2 = green ("In Call"). You get a real-time heatmap of your entire phone fleet.
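For an at-a-glance offline count next to the timeline, a PromQL stat panel works well; the comparison filter keeps only phones currently in state 0:

```promql
count by (server) (asterisk_phone_fleet_state == 0)
```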
5.13 Metric Exposition and HTTP Server
All the collector functions are combined into a single /metrics endpoint using Python's built-in HTTP server. No Flask, no FastAPI, no external web framework required.
def collect_all():
"""Collect all metrics and format as Prometheus text exposition.
Each metric family gets a HELP and TYPE declaration,
followed by the actual metric lines from each collector.
"""
lines = [
# ── SIP Peers ──
"# HELP asterisk_sip_peer_up SIP peer reachability (1=up, 0=down)",
"# TYPE asterisk_sip_peer_up gauge",
"# HELP asterisk_sip_peer_latency_ms SIP peer qualify latency in ms",
"# TYPE asterisk_sip_peer_latency_ms gauge",
# ── Channels & Calls ──
"# HELP asterisk_active_calls Number of active calls",
"# TYPE asterisk_active_calls gauge",
"# HELP asterisk_active_channels Number of active channels",
"# TYPE asterisk_active_channels gauge",
"# HELP asterisk_channels_by_codec Channel count per codec",
"# TYPE asterisk_channels_by_codec gauge",
# ── RTP Quality ──
"# HELP asterisk_rtp_packet_loss_percent RTP packet loss percentage",
"# TYPE asterisk_rtp_packet_loss_percent gauge",
"# HELP asterisk_rtp_jitter_ms RTP jitter in ms",
"# TYPE asterisk_rtp_jitter_ms gauge",
"# HELP asterisk_rtp_rtt_ms RTP round trip time in ms",
"# TYPE asterisk_rtp_rtt_ms gauge",
# ── Transcoding ──
"# HELP asterisk_transcoding_channels Channels actively transcoding",
"# TYPE asterisk_transcoding_channels gauge",
"# HELP asterisk_channel_transcoding Channel is transcoding (1=yes)",
"# TYPE asterisk_channel_transcoding gauge",
"# HELP asterisk_codec_mismatch Active codec mismatch pairs",
"# TYPE asterisk_codec_mismatch gauge",
"# HELP asterisk_sip_allowed_codec Globally allowed codec",
"# TYPE asterisk_sip_allowed_codec gauge",
# ── Agents ──
"# HELP asterisk_agents_logged_in Number of agents logged in",
"# TYPE asterisk_agents_logged_in gauge",
"# HELP asterisk_agents_incall Number of agents in call",
"# TYPE asterisk_agents_incall gauge",
"# HELP asterisk_agents_paused Number of agents paused",
"# TYPE asterisk_agents_paused gauge",
"# HELP asterisk_agents_waiting Number of agents ready/waiting",
"# TYPE asterisk_agents_waiting gauge",
"# HELP asterisk_agent_incall_duration_seconds Per-agent in-call time",
"# TYPE asterisk_agent_incall_duration_seconds gauge",
"# HELP asterisk_agent_pause_duration_seconds Per-agent pause time",
"# TYPE asterisk_agent_pause_duration_seconds gauge",
"# HELP asterisk_queue_depth Calls waiting in queue per ingroup",
"# TYPE asterisk_queue_depth gauge",
# ── Security ──
"# HELP asterisk_fail2ban_active_bans Current fail2ban active bans",
"# TYPE asterisk_fail2ban_active_bans gauge",
"# HELP asterisk_fail2ban_bans_total Total fail2ban bans",
"# TYPE asterisk_fail2ban_bans_total counter",
# ── Operations ──
"# HELP asterisk_recordings_missing CDR entries without recordings",
"# TYPE asterisk_recordings_missing gauge",
"# HELP asterisk_uptime_seconds Asterisk system uptime",
"# TYPE asterisk_uptime_seconds gauge",
"# HELP asterisk_confbridge_count Active ConfBridge conferences",
"# TYPE asterisk_confbridge_count gauge",
# ── Phone Fleet ──
"# HELP asterisk_phone_fleet_state Phone state (0=offline, 1=idle, 2=in_call)",
"# TYPE asterisk_phone_fleet_state gauge",
"# HELP asterisk_phone_fleet_latency_ms Phone SIP latency in ms",
"# TYPE asterisk_phone_fleet_latency_ms gauge",
"# HELP asterisk_phone_fleet_registered Registered phone count",
"# TYPE asterisk_phone_fleet_registered gauge",
"# HELP asterisk_phone_fleet_incall Phones currently in a call",
"# TYPE asterisk_phone_fleet_incall gauge",
"# HELP asterisk_phone_fleet_total Total phones in fleet",
"# TYPE asterisk_phone_fleet_total gauge",
"", # Blank line before metric data
]
# Run all collectors
lines.extend(collect_sip_peers())
lines.extend(collect_phone_fleet())
lines.extend(collect_channels())
lines.extend(collect_rtp_stats())
lines.extend(collect_uptime())
lines.extend(collect_confbridge())
lines.extend(collect_vicidial_agents())
lines.extend(collect_fail2ban())
lines.extend(collect_transcoding())
lines.extend(collect_recordings())
return "\n".join(lines) + "\n"
class MetricsHandler(http.server.BaseHTTPRequestHandler):
"""HTTP handler that serves Prometheus metrics on /metrics."""
def do_GET(self):
if self.path == "/metrics":
body = collect_all()
self.send_response(200)
self.send_header(
"Content-Type", "text/plain; charset=utf-8"
)
self.end_headers()
self.wfile.write(body.encode())
else:
# Landing page with link to metrics
self.send_response(200)
self.send_header("Content-Type", "text/html")
self.end_headers()
self.wfile.write(
b"<html><body>"
b"<h2>Asterisk/ViciDial Exporter</h2>"
b"<a href='/metrics'>Metrics</a>"
b"</body></html>"
)
def log_message(self, format, *args):
"""Suppress per-request logging to avoid log noise."""
pass
if __name__ == "__main__":
server = http.server.HTTPServer(
("0.0.0.0", LISTEN_PORT), MetricsHandler
)
print(f"asterisk_exporter listening on :{LISTEN_PORT}")
server.serve_forever()
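One caveat of the stock `http.server.HTTPServer`: it handles one request at a time, so a scrape that stalls on a hung CLI command blocks the next scrape behind it. On Python 3.7+ there is a drop-in threaded variant. A minimal sketch (not part of the original exporter; `HealthHandler` here is a hypothetical stand-in for `MetricsHandler`):

```python
import http.server
import threading
import urllib.request

class HealthHandler(http.server.BaseHTTPRequestHandler):
    """Minimal stand-in for MetricsHandler, for illustration only."""
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.end_headers()
        self.wfile.write(b"ok\n")

    def log_message(self, fmt, *args):
        pass  # same per-request log suppression as the real handler

# ThreadingHTTPServer (Python 3.7+) serves each request in its own thread,
# so one slow scrape cannot block the next. Port 0 = pick any free port.
server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), HealthHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

body = urllib.request.urlopen(f"http://127.0.0.1:{port}/metrics").read()
server.shutdown()
```

Swapping `HTTPServer` for `ThreadingHTTPServer` in the `__main__` block is a one-line change if overlapping scrapes ever become a problem.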
Why manual text format instead of prometheus_client?
The prometheus_client Python library is the "official" way to write exporters. We chose manual text format for these reasons:
- Zero additional dependencies --- prometheus_client is another pip package to install and maintain. Our exporter has exactly one external dependency (mysql-connector-python).
- Full control over metric lifecycle --- With prometheus_client, once you create a metric with a label set, it persists until you explicitly remove it. When an agent logs out, their metric would remain with the last known value. With manual text format, metrics simply disappear when the agent is gone --- exactly what we want.
- Simpler mental model --- Each scrape is a fresh rendering of the current state. No stateful metric objects, no label management, no registry cleanup.
- Easier debugging --- curl localhost:9101/metrics shows you exactly what Prometheus will see. No hidden state.
The tradeoff is that we must manually write # HELP and # TYPE declarations. This is a one-time cost at the top of collect_all().
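One subtlety of hand-rolled exposition: label values containing backslashes, double quotes, or newlines must be escaped per the text format spec. The exporter as shown skips this, which is fine while peer and agent names stay alphanumeric; a helper like the following (an assumption, not in the original code) makes it safe in general:

```python
def escape_label_value(value: str) -> str:
    """Escape a label value per the Prometheus text exposition format."""
    return (
        value.replace("\\", "\\\\")  # backslash first so later escapes survive
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )

def metric_line(name: str, labels: dict, value) -> str:
    """Render one exposition line: name{k="v",...} value"""
    pairs = ",".join(
        f'{k}="{escape_label_value(str(v))}"' for k, v in labels.items()
    )
    return f"{name}{{{pairs}}} {value}"
```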
6. Metric Reference
Complete list of metrics exposed by the exporter:
Asterisk Core Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| asterisk_active_calls | gauge | server | Current active calls |
| asterisk_active_channels | gauge | server | Current active channels (typically 2x calls) |
| asterisk_uptime_seconds | gauge | server | Asterisk process uptime |
| asterisk_confbridge_count | gauge | server | Active conference bridge rooms |
SIP Peer Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| asterisk_sip_peer_up | gauge | server, peer | 1 if peer responds to OPTIONS, 0 otherwise |
| asterisk_sip_peer_status | gauge | server, peer, status | Status string (OK/UNREACHABLE/LAGGED) |
| asterisk_sip_peer_latency_ms | gauge | server, peer | Round-trip latency of SIP OPTIONS qualify |
RTP Quality Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| asterisk_rtp_packet_loss_percent | gauge | server, peer | Receive-side packet loss % |
| asterisk_rtp_jitter_ms | gauge | server, peer | Receive-side jitter in milliseconds |
| asterisk_rtp_rtt_ms | gauge | server, peer | Round-trip time in milliseconds |
Codec/Transcoding Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| asterisk_channels_by_codec | gauge | server, codec | Active channel count per codec |
| asterisk_transcoding_channels | gauge | server | Total channels doing codec conversion |
| asterisk_channel_transcoding | gauge | server, channel, codec, direction | Per-channel transcoding indicator |
| asterisk_channel_native_codec | gauge | server, channel, codec | Native codec of each channel |
| asterisk_codec_mismatch | gauge | server, pair | Count of active mismatched codec pairs |
| asterisk_sip_allowed_codec | gauge | server, codec | Codecs allowed in sip.conf |
ViciDial Agent Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| asterisk_agents_logged_in | gauge | server | Total agents logged into ViciDial |
| asterisk_agents_incall | gauge | server | Agents currently handling a call |
| asterisk_agents_paused | gauge | server | Agents in pause state |
| asterisk_agents_waiting | gauge | server | Agents in READY/CLOSER (available) |
| asterisk_agent_status | gauge | server, agent, status | Per-agent current status |
| asterisk_agent_incall_duration_seconds | gauge | server, agent | How long agent has been on current call |
| asterisk_agent_pause_duration_seconds | gauge | server, agent | How long agent has been paused |
| asterisk_agent_pause_code | gauge | server, agent, pause_code | Agent's current pause reason |
| asterisk_queue_depth | gauge | server, ingroup | Calls waiting per campaign/ingroup |
Security Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| asterisk_fail2ban_active_bans | gauge | server, jail | Currently banned IPs per jail |
| asterisk_fail2ban_bans_total | counter | server, jail | Cumulative bans since fail2ban start |
Operations Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| asterisk_recordings_missing | gauge | server | Calls without recordings in last hour |
Phone Fleet Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| asterisk_phone_fleet_state | gauge | server, peer, ip | 0=offline, 1=idle, 2=in_call |
| asterisk_phone_fleet_latency_ms | gauge | server, peer | SIP qualify latency for phone |
| asterisk_phone_fleet_registered | gauge | server | Count of registered phones |
| asterisk_phone_fleet_incall | gauge | server | Count of phones on active calls |
| asterisk_phone_fleet_total | gauge | server | Total phones in configured fleet |
7. Systemd Service Configuration
The exporter runs as a systemd service with environment-based configuration. Create this file on each VoIP server:
# /etc/systemd/system/asterisk_exporter.service
[Unit]
Description=Asterisk/ViciDial Prometheus Exporter
After=network.target mariadb.service asterisk.service
Wants=mariadb.service
[Service]
Type=simple
ExecStart=/usr/bin/python3 /opt/asterisk_exporter/asterisk_exporter.py
Restart=always
RestartSec=10
# ── Configuration ──
Environment=EXPORTER_PORT=9101
Environment=MYSQL_HOST=localhost
Environment=MYSQL_USER=exporter
Environment=MYSQL_PASS=YOUR_EXPORTER_PASSWORD
Environment=MYSQL_DB=asterisk
Environment=SERVER_LABEL=server1
# Optional: SIP phone fleet monitoring (empty = disabled)
# Environment=PHONE_FLEET_RANGE=1031-1070
[Install]
WantedBy=multi-user.target
Enable and Start
# Copy the exporter script
mkdir -p /opt/asterisk_exporter
cp asterisk_exporter.py /opt/asterisk_exporter/
chmod +x /opt/asterisk_exporter/asterisk_exporter.py
# Enable and start the service
systemctl daemon-reload
systemctl enable asterisk_exporter
systemctl start asterisk_exporter
# Verify it is running
systemctl status asterisk_exporter
curl -s localhost:9101/metrics | head -20
Per-Server Configuration
Each server needs its own SERVER_LABEL and optionally its own phone fleet range. Customize the Environment= lines in the service file:
| Server | SERVER_LABEL | PHONE_FLEET_RANGE | Notes |
|---|---|---|---|
| UK Primary | uk-primary | 1031-1070 | 40 London Zoiper phones |
| Romania | romania | | No phone fleet tracking |
| France | france | | No phone fleet tracking |
| Italy | italy | | No phone fleet tracking |
Service Behavior
- Restart=always --- If the exporter crashes or is killed, systemd restarts it after 10 seconds.
- After=mariadb.service asterisk.service --- Starts after MySQL and Asterisk, ensuring both are available when the exporter begins collecting.
- Wants=mariadb.service --- Declares a soft dependency. If MySQL is down, the exporter still starts and returns Asterisk-only metrics.
- Type=simple --- The exporter is a long-running foreground process. Systemd considers it started as soon as the main process has been launched.
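On the exporter side, these Environment= lines surface through os.environ at startup. A minimal sketch of the pattern (the real parsing lives in the exporter's configuration section earlier in the article; the defaults and the helper name here are assumptions):

```python
import os

def fleet_from_range(spec: str) -> list:
    """Parse a PHONE_FLEET_RANGE like '1031-1070' into extension numbers.
    An empty or unset value disables fleet tracking entirely."""
    if not spec:
        return []
    start, end = (int(x) for x in spec.split("-", 1))
    return list(range(start, end + 1))

# Environment-driven configuration, mirroring the unit file above.
LISTEN_PORT = int(os.environ.get("EXPORTER_PORT", "9101"))
SERVER_LABEL = os.environ.get("SERVER_LABEL", "server1")
PHONE_FLEET = fleet_from_range(os.environ.get("PHONE_FLEET_RANGE", ""))
```

Because the optional variable defaults to an empty string, servers without a phone fleet simply skip that collector.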
8. Prometheus Scrape Configuration
On your central Prometheus server, add the exporter targets alongside the standard node_exporter targets:
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
# ── System metrics (CPU, RAM, disk, network) ──
- job_name: "node"
static_configs:
- targets: ["voip-server-1:9100"]
labels:
server: "uk-primary"
- targets: ["voip-server-2:9100"]
labels:
server: "romania"
- targets: ["voip-server-3:9100"]
labels:
server: "france"
# ── VoIP & ViciDial metrics (our custom exporter) ──
- job_name: "asterisk"
scrape_interval: 15s
static_configs:
- targets: ["voip-server-1:9101"]
labels:
server: "uk-primary"
- targets: ["voip-server-2:9101"]
labels:
server: "romania"
- targets: ["voip-server-3:9101"]
labels:
server: "france"
Why 15-Second Scrape Interval?
- Too fast (5s): Each scrape runs 6-8 Asterisk CLI commands and 2-3 MySQL queries. At 5-second intervals, you are running CLI commands almost continuously. On a server handling 100+ calls, the subprocess overhead becomes measurable.
- Too slow (60s): You miss short-lived events. A trunk that goes UNREACHABLE for 30 seconds and recovers would not show up. Queue depth spikes that resolve within a minute would be invisible.
- 15 seconds: The sweet spot. Low overhead, high resolution. You catch events that last more than ~30 seconds (two consecutive scrapes), which covers all operationally significant incidents.
Label Alignment
Notice that the server label in Prometheus matches the SERVER_LABEL environment variable in the exporter. This means you can join node_exporter metrics with asterisk_exporter metrics using the server label:
# CPU usage alongside active calls for the same server
node_cpu_seconds_total{server="uk-primary", mode="idle"}
asterisk_active_calls{server="uk-primary"}
Firewall Rules
Each VoIP server must allow the Prometheus server to reach port 9101:
# On each VoIP server (replace PROMETHEUS_IP with your actual IP)
iptables -I INPUT -s PROMETHEUS_IP -p tcp --dport 9101 -j ACCEPT
iptables -I INPUT -s PROMETHEUS_IP -p tcp --dport 9100 -j ACCEPT
# Persist the rules
iptables-save > /etc/sysconfig/iptables # CentOS/openSUSE
# or
netfilter-persistent save # Debian/Ubuntu
9. Grafana Dashboard Panels
Here are production-tested panel configurations for the key metrics. These examples target Grafana 10+; settings are listed in shorthand rather than as raw panel JSON.
9.1 Fleet Overview --- Stat Panels
A top row of stat panels showing the most critical numbers at a glance.
Active Calls (all servers)
Panel type: Stat
Query: sum(asterisk_active_calls)
Thresholds: 0=green, 50=yellow, 100=red
Unit: none
Title: "Total Active Calls"
Agents Logged In
Panel type: Stat
Query: sum(asterisk_agents_logged_in)
Thresholds: 0=red, 1=yellow, 5=green
Unit: none
Title: "Agents Online"
SIP Trunks Down
Panel type: Stat
Query: count(asterisk_sip_peer_up == 0) or vector(0)
  (the "or vector(0)" pins the stat to 0 when all trunks are up; a bare count() returns no data)
Thresholds: 0=green, 1=yellow, 2=red
Unit: none
Title: "Trunks Down"
Color mode: Background
Queue Depth
Panel type: Stat
Query: sum(asterisk_queue_depth)
Thresholds: 0=green, 5=yellow, 10=red
Unit: none
Title: "Calls in Queue"
9.2 Calls and Agents --- Time Series
A multi-line time series showing calls and agent counts over time.
Panel type: Time Series
Queries:
A: asterisk_active_calls{server=~"$server"} (legend: "{{server}} calls")
B: asterisk_active_channels{server=~"$server"} (legend: "{{server}} channels")
C: asterisk_agents_incall{server=~"$server"} (legend: "{{server}} agents in-call")
D: asterisk_agents_waiting{server=~"$server"} (legend: "{{server}} agents waiting")
Axis: Left Y = count
Fill opacity: 10
Line width: 2
9.3 SIP Trunk Status --- State Timeline
A state-timeline panel showing each trunk's status over time, with color coding.
Panel type: State Timeline
Query: asterisk_sip_peer_up{server=~"$server"}
Legend: {{peer}}
Value mappings:
0 = "DOWN" (red)
1 = "UP" (green)
Show values: Always
Merge equal consecutive values: true
9.4 Trunk Latency --- Time Series
Panel type: Time Series
Query: asterisk_sip_peer_latency_ms{server=~"$server"}
Legend: {{peer}}
Unit: ms
Thresholds:
Line at 50ms (yellow)
Line at 100ms (red)
9.5 RTP Quality --- Three Gauges
Three gauge panels for the three RTP quality indicators.
Packet Loss
Panel type: Gauge
Query: max(asterisk_rtp_packet_loss_percent{server=~"$server"})
Unit: percent (0-100)
Min: 0, Max: 10
Thresholds: 0=green, 0.5=yellow, 2=red
Title: "Max Packet Loss"
Jitter
Panel type: Gauge
Query: max(asterisk_rtp_jitter_ms{server=~"$server"})
Unit: ms
Min: 0, Max: 100
Thresholds: 0=green, 20=yellow, 50=red
Title: "Max Jitter"
Round-Trip Time
Panel type: Gauge
Query: max(asterisk_rtp_rtt_ms{server=~"$server"})
Unit: ms
Min: 0, Max: 500
Thresholds: 0=green, 150=yellow, 300=red
Title: "Max RTT"
9.6 Agent Status --- Bar Gauge
A horizontal bar gauge showing per-agent in-call duration, highlighting agents who have been on unusually long calls.
Panel type: Bar Gauge
Query: asterisk_agent_incall_duration_seconds{server=~"$server"}
Legend: {{agent}}
Unit: seconds (s)
Orientation: Horizontal
Sort: Descending
Thresholds:
0=green (normal call)
600=yellow (10 min --- getting long)
1800=red (30 min --- unusually long)
9.7 Codec Distribution --- Pie Chart
Panel type: Pie Chart
Query: asterisk_channels_by_codec{server=~"$server"}
Legend: {{codec}}
Pie type: Donut
Title: "Active Codecs"
9.8 Phone Fleet --- State Timeline
For servers with phone fleet tracking enabled:
Panel type: State Timeline
Query: asterisk_phone_fleet_state{server="uk-primary"}
Legend: Ext {{peer}}
Value mappings:
0 = "Offline" (red)
1 = "Idle" (yellow)
2 = "In Call" (green)
Row height: 20
Show values: Auto
9.9 Fail2ban --- Time Series
Panel type: Time Series
Query: asterisk_fail2ban_active_bans{server=~"$server"}
Legend: {{server}} - {{jail}}
Unit: none
Title: "Active Fail2ban Bans"
Overrides:
Fill opacity: 20 (to make ban spikes visually prominent)
9.10 Missing Recordings --- Stat
Panel type: Stat
Query: asterisk_recordings_missing{server=~"$server"}
Thresholds: 0=green, 1=yellow, 5=red
Title: "Missing Recordings (1h)"
9.11 Suggested Dashboard Layout
Row 1: Fleet Overview (collapsed=no)
[Active Calls] [Agents Online] [Trunks Down] [Queue Depth] [Missing Recordings]
Row 2: Calls & Agents (collapsed=no)
[Calls + Agents time series, full width]
Row 3: SIP Trunks (collapsed=yes)
[Trunk status state-timeline] [Trunk latency time series]
Row 4: RTP Quality (collapsed=yes)
[Packet Loss gauge] [Jitter gauge] [RTT gauge]
[Packet Loss time series, full width]
Row 5: Agents Detail (collapsed=yes)
[In-call duration bar gauge] [Pause duration bar gauge]
[Pause codes table]
Row 6: Codecs & Transcoding (collapsed=yes)
[Codec pie chart] [Transcoding channels stat] [Codec mismatches table]
Row 7: Phone Fleet (collapsed=yes)
[Fleet state timeline, full width]
[Registered count] [In-call count] [Total count]
Row 8: Security (collapsed=yes)
[Fail2ban bans time series] [Ban totals table]
10. Automated Deployment
For deploying the exporter to multiple servers, use a shell script that handles the full lifecycle: binary detection, Python dependency installation, service file creation, and startup.
#!/bin/bash
# deploy-exporter.sh --- Deploy asterisk_exporter to a remote VoIP server
# Usage: ./deploy-exporter.sh <server_ip> <ssh_port> <server_label>
set -e
SERVER_IP="${1:?Usage: $0 <server_ip> <ssh_port> <server_label>}"
SSH_PORT="${2:-22}"
SERVER_LABEL="${3:?Provide server label}"
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
SSH_CMD="ssh -o StrictHostKeyChecking=no -p ${SSH_PORT} root@${SERVER_IP}"
echo "=== Deploying asterisk_exporter to ${SERVER_LABEL} ==="
# Create target directory
${SSH_CMD} "mkdir -p /opt/asterisk_exporter"
# Copy the exporter script
scp -o StrictHostKeyChecking=no -P ${SSH_PORT} \
${SCRIPT_DIR}/asterisk_exporter.py \
root@${SERVER_IP}:/opt/asterisk_exporter/
# Install dependencies and create systemd service
${SSH_CMD} bash << REMOTEOF
set -e
# Find Python 3
PYTHON_BIN=""
for p in python3.11 python3.9 python3.6 python3; do
if command -v \$p &>/dev/null; then
PYTHON_BIN=\$(command -v \$p)
break
fi
done
if [ -z "\$PYTHON_BIN" ]; then
echo "ERROR: No Python 3 found"
exit 1
fi
echo "Using Python: \$PYTHON_BIN"
# Install mysql-connector-python
\$PYTHON_BIN -m pip install mysql-connector-python 2>/dev/null \
|| \$PYTHON_BIN -m pip install "mysql-connector-python<8.1" 2>/dev/null \
|| true
# Verify import works
\$PYTHON_BIN -c "import mysql.connector; print('mysql-connector OK')"
chmod +x /opt/asterisk_exporter/asterisk_exporter.py
# Create systemd service
cat > /etc/systemd/system/asterisk_exporter.service << SVCFILE
[Unit]
Description=Asterisk/ViciDial Prometheus Exporter
After=network.target mariadb.service asterisk.service
Wants=mariadb.service
[Service]
Type=simple
ExecStart=\$PYTHON_BIN /opt/asterisk_exporter/asterisk_exporter.py
Restart=always
RestartSec=10
Environment=EXPORTER_PORT=9101
Environment=MYSQL_HOST=localhost
Environment=MYSQL_USER=exporter
Environment=MYSQL_PASS=YOUR_EXPORTER_PASSWORD
Environment=MYSQL_DB=asterisk
Environment=SERVER_LABEL=${SERVER_LABEL}
[Install]
WantedBy=multi-user.target
SVCFILE
systemctl daemon-reload
systemctl enable asterisk_exporter
systemctl restart asterisk_exporter
echo "asterisk_exporter started on :9101"
REMOTEOF
echo "=== Done: ${SERVER_LABEL} ==="
The script is idempotent --- running it again on the same server updates the exporter script and restarts the service without breaking anything.
11. Production Tips
Metric Cardinality
Cardinality (the total number of unique time series) is the primary scaling concern for Prometheus. Each unique combination of metric name + label values creates one time series.
Watch out for:
- Per-agent metrics with the agent label: 50 agents x 3 metrics = 150 series. Acceptable.
- Per-channel metrics (transcoding, codecs): With 100 concurrent calls, this could create 200+ series that appear and disappear every few minutes. Prometheus handles this fine, but Grafana queries over long time ranges will slow down.
- Per-peer RTP metrics: Each active call generates jitter/loss/rtt metrics. These are inherently high-cardinality but also inherently short-lived.
Mitigation strategies:
- Do not add labels you do not query. If you never filter by ip in Grafana, remove the ip label from phone fleet metrics.
- Use aggregate metrics where possible. asterisk_transcoding_channels (a single number) is more useful for alerting than 50 individual asterisk_channel_transcoding metrics.
- Set Prometheus retention appropriately. 30 days is a good default; going to 90 days with high-cardinality VoIP metrics will consume significant disk.
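To audit your own cardinality, count the series in a live scrape. This small helper (an assumption, not part of the exporter) parses exposition text and tallies series per metric family:

```python
from collections import Counter

def series_per_metric(exposition_text: str) -> Counter:
    """Count time series per metric family in Prometheus exposition text."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        # The metric name ends at the first "{" or space
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts
```

Feed it the output of `curl localhost:9101/metrics` and call `.most_common(5)` to see the worst offenders.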
Error Handling Philosophy
Every collector function follows the same pattern: catch exceptions, return empty metrics, never crash.
# Pattern used throughout the exporter:
def collect_something():
metrics = []
try:
# ... do work ...
except Exception:
pass # Return partial/empty metrics
return metrics
This is intentional. If MySQL is down, you still get Asterisk metrics. If Asterisk is restarting, you still get MySQL metrics. If fail2ban is not installed, everything else still works. The exporter degrades gracefully rather than failing completely.
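If the try/except scaffold feels repetitive, the same philosophy factors into a decorator (a sketch; the exporter as written repeats the pattern inline):

```python
import functools

def safe_collector(func):
    """Wrap a collector so any failure yields an empty metric list
    instead of crashing the whole scrape."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            return []  # degrade gracefully: this family just disappears
    return wrapper

@safe_collector
def collect_flaky():
    raise RuntimeError("MySQL is down")
```

Here `collect_flaky()` returns `[]` instead of raising, so a `collect_all()` using decorated collectors keeps serving every other metric family.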
Scrape Timeout
Prometheus's default scrape timeout is 10 seconds. Our exporter runs 6-8 CLI commands (each with a 10-second timeout) and 2-3 MySQL queries (5-second connection timeout). In the worst case, a single scrape could take several seconds.
If you see context deadline exceeded errors in Prometheus, increase the scrape timeout:
- job_name: "asterisk"
scrape_interval: 15s
scrape_timeout: 12s # Default is 10s
static_configs:
- targets: [...]
MySQL Connection Safety
The exporter's MySQL user should be strictly read-only with a query timeout:
-- Create user with 5-second query timeout
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'YOUR_EXPORTER_PASSWORD';
-- Grant SELECT only on required tables
GRANT SELECT ON asterisk.vicidial_live_agents TO 'exporter'@'localhost';
GRANT SELECT ON asterisk.vicidial_auto_calls TO 'exporter'@'localhost';
GRANT SELECT ON asterisk.vicidial_closer_log TO 'exporter'@'localhost';
GRANT SELECT ON asterisk.recording_log TO 'exporter'@'localhost';
-- Set max execution time in seconds (MariaDB 10.1+)
-- This prevents a runaway query from blocking the table.
-- NOTE: SET GLOBAL applies to every connection on the server,
-- not just the exporter user --- scope it deliberately.
SET GLOBAL max_statement_time = 5;
FLUSH PRIVILEGES;
Log Suppression
The log_message override in MetricsHandler suppresses per-request HTTP logging:
def log_message(self, format, *args):
pass
Without this, every 15-second Prometheus scrape generates a log line. Over 24 hours, that is 5,760 lines of GET /metrics 200 --- pure noise. Suppress it.
12. Troubleshooting
Exporter Is Not Starting
# Check systemd status
systemctl status asterisk_exporter
# Check logs
journalctl -u asterisk_exporter --no-pager -n 50
# Common issues:
# 1. Python not found --- check ExecStart path in service file
# 2. mysql-connector not installed --- run: python3 -m pip install mysql-connector-python
# 3. Port 9101 already in use --- check: ss -tlnp | grep 9101
Prometheus Shows "Target Down"
# From the Prometheus server, verify network connectivity
curl -s http://VOIP_SERVER_IP:9101/metrics | head -5
# If connection refused:
# 1. Exporter not running: ssh to server, check systemctl status
# 2. Firewall blocking: iptables -L -n | grep 9101
# 3. Wrong port: check EXPORTER_PORT in service file
# If timeout:
# 1. Network issue between Prometheus and VoIP server
# 2. Exporter hung: restart with systemctl restart asterisk_exporter
MySQL Metrics Missing (Asterisk Metrics Present)
# Test MySQL connectivity from the server itself
mysql -u exporter -p'YOUR_EXPORTER_PASSWORD' -e "SELECT 1" asterisk
# Common issues:
# 1. Wrong credentials in Environment= lines
# 2. User lacks SELECT privilege: SHOW GRANTS FOR 'exporter'@'localhost';
# 3. MySQL not running: systemctl status mariadb
# 4. Database name wrong: MYSQL_DB should be "asterisk" for ViciDial
Asterisk Metrics Missing (MySQL Metrics Present)
# Test Asterisk CLI access
asterisk -rx "core show version"
# Common issues:
# 1. Asterisk not running: systemctl status asterisk
# 2. Permission denied: exporter must run as root or asterisk user
# 3. asterisk binary not in PATH: which asterisk
Metrics Look Stale or Frozen
# Check the scrape duration
curl -s localhost:9101/metrics | wc -l
# If very few lines, some collectors are silently failing
# Time a scrape manually
time curl -s localhost:9101/metrics > /dev/null
# Should be < 3 seconds. If > 10s, a CLI command or MySQL query is hanging
# Check if Asterisk is responsive
time asterisk -rx "sip show peers"
# Should return in < 1 second
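To bisect a slow scrape down to the guilty collector, time each one from a Python shell. A hypothetical debugging helper (pass it the collector functions called in collect_all()):

```python
import time

def time_collectors(collectors):
    """Run each collector once; return {name: seconds}, slowest first."""
    timings = {}
    for func in collectors:
        start = time.monotonic()
        try:
            func()
        except Exception:
            pass  # a failing collector still gets timed
        timings[func.__name__] = time.monotonic() - start
    return dict(sorted(timings.items(), key=lambda kv: -kv[1]))
```

For example, `time_collectors([collect_sip_peers, collect_rtp_stats])` immediately shows which family is eating the scrape budget.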
High CPU from the Exporter
The transcoding detection feature runs core show channel for every active SIP channel. With 100+ concurrent calls, this means 100+ subprocess forks every 15 seconds.
If CPU is a concern, disable transcoding detection by commenting out the collect_transcoding() call in collect_all():
# lines.extend(collect_transcoding()) # Disabled: too many subprocess forks
"Too Many Open Files" Errors
Each subprocess fork opens file descriptors. On systems with low ulimit -n (default 1024), heavy scrape activity can exhaust them.
Fix in the systemd service file:
[Service]
LimitNOFILE=4096
13. Extending the Exporter
The modular structure (one function per metric family) makes it easy to add new collectors.
Adding a New Metric
- Write a collector function that returns a list of metric strings:
def collect_my_custom_metric():
    metrics = []
    value = 0  # ... gather data here ...
    metrics.append(
        f'asterisk_my_metric{{server="{SERVER_LABEL}"}} {value}'
    )
    return metrics
- Add # HELP and # TYPE declarations to collect_all():
"# HELP asterisk_my_metric Description of what it measures",
"# TYPE asterisk_my_metric gauge",
- Call the collector in collect_all():
lines.extend(collect_my_custom_metric())
- Restart the exporter:
systemctl restart asterisk_exporter
Ideas for Additional Metrics
| Metric | Source | Value |
|---|---|---|
| asterisk_calls_today_total | MySQL vicidial_closer_log | Daily call volume counter |
| asterisk_avg_wait_time_seconds | MySQL vicidial_closer_log | Average queue wait time |
| asterisk_dialplan_errors | asterisk -rx "dialplan show" | Dialplan syntax errors |
| asterisk_sip_registry_status | asterisk -rx "sip show registry" | Outbound registration status |
| asterisk_dahdi_spans_up | asterisk -rx "dahdi show status" | DAHDI/ISDN span status |
| asterisk_disk_recordings_gb | du -s /var/spool/asterisk/monitor | Recording storage usage |
| vicidial_campaign_calls_waiting | MySQL vicidial_auto_calls | Per-campaign queue depth |
| vicidial_list_penetration | MySQL vicidial_list | % of leads called per list |
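As a taste of how small one of these collectors can be: the recording storage metric needs no CLI at all. A sketch (the metric name comes from the table above; the default path and function name are assumptions):

```python
import os

def collect_recording_storage(path="/var/spool/asterisk/monitor",
                              server_label="server1"):
    """Emit recording storage usage in GB by walking the directory tree."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total_bytes += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file removed mid-walk; skip it
    gb = total_bytes / (1024 ** 3)
    return [
        f'asterisk_disk_recordings_gb{{server="{server_label}"}} {gb:.2f}'
    ]
```

Walking a large recording directory can itself be slow; in production you would likely cache the result or sample it less often than every scrape.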
Converting to AMI (If Needed)
If you outgrow the CLI approach and need lower-overhead collection (e.g., 200+ concurrent calls with transcoding detection), you can convert individual collectors to use AMI. The Asterisk Manager Interface uses a persistent TCP connection:
import socket
class AMIConnection:
def __init__(self, host, port, username, secret):
self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
self.sock.settimeout(10)
self.sock.connect((host, port))
self._read_until("Asterisk Call Manager")
self._send_action({
"Action": "Login",
"Username": username,
"Secret": secret,
})
response = self._read_response()
if "Success" not in response.get("Response", ""):
raise Exception(f"AMI login failed: {response}")
def _send_action(self, action):
msg = "\r\n".join(
f"{k}: {v}" for k, v in action.items()
) + "\r\n\r\n"
self.sock.sendall(msg.encode())
    def _read_until(self, marker):
        data = b""
        while marker.encode() not in data:
            chunk = self.sock.recv(4096)
            if not chunk:  # socket closed by Asterisk; avoid a busy loop
                raise ConnectionError("AMI connection closed")
            data += chunk
        return data.decode()
def _read_response(self):
data = self._read_until("\r\n\r\n")
result = {}
for line in data.strip().split("\r\n"):
if ": " in line:
key, val = line.split(": ", 1)
result[key] = val
return result
    def command(self, cmd):
        # Note: CLI output arrives as "Response: Follows" with a free-form
        # body terminated by "--END COMMAND--" (newer Asterisk uses
        # "Output:" headers instead), so parsing it may need
        # _read_until("--END COMMAND--") rather than the simple
        # key/value parse in _read_response.
        self._send_action({
            "Action": "Command",
            "Command": cmd,
        })
        return self._read_response()
def close(self):
self._send_action({"Action": "Logoff"})
self.sock.close()
For AMI to work, you need to configure /etc/asterisk/manager.conf:
[general]
enabled = yes
port = 5038
bindaddr = 127.0.0.1 ; Only localhost --- never expose AMI externally
[exporter]
secret = YOUR_AMI_SECRET
deny = 0.0.0.0/0.0.0.0
permit = 127.0.0.1/255.255.255.255 ; exactly one host, matching bindaddr
read = system,call,agent
write = command
Our recommendation: Stick with CLI unless you have a measured performance problem. The CLI approach has zero authentication surface, zero network exposure, and works identically across all Asterisk versions from 11 to 21.
Summary
What we built:
- A single-file Python exporter (~500 lines) that bridges Asterisk CLI and ViciDial MySQL into Prometheus metrics
- 30+ metric families covering SIP trunks, call quality, agent states, queue depth, security, and recording integrity
- Systemd service with environment-based configuration, auto-restart, and graceful dependency handling
- Prometheus scrape config for multi-server fleet monitoring at 15-second resolution
- Grafana dashboard panels for every metric family, from fleet overview stats to per-phone state timelines
- Automated deployment script for rolling out to new servers in minutes
The exporter runs in production across multiple VoIP servers, each handling hundreds of concurrent calls, scraped every 15 seconds. Total overhead per scrape: under 100ms of subprocess time and one MySQL connection lasting <50ms.
This is the monitoring stack that tells you a trunk went UNREACHABLE at 02:14, that agent 1042 has been paused for 47 minutes, that RTP jitter spiked to 35ms on the France server, and that 3 calls in the last hour have no recordings --- all from a single Grafana dashboard, all alertable, all queryable in PromQL.
Standard node_exporter tells you the disk is 60% full. This exporter tells you the business is running.