Building a Complete VoIP Monitoring Stack with Docker
Grafana + Prometheus + Loki + Homer + Smokeping + Blackbox Exporter
| Difficulty | Intermediate |
| Time to Complete | 3-4 hours |
| Prerequisites | Linux VPS (Ubuntu 22.04+), Docker, basic VoIP/SIP knowledge |
| Tested On | Ubuntu 24.04 LTS, Docker 24.x, 8 CPU / 16 GB RAM |
Table of Contents
- Introduction
- What You'll Build
- Architecture Overview
- Prerequisites
- Directory Structure
- Step 1: Environment Variables
- Step 2: Docker Compose Stack
- Step 3: PostgreSQL Init Script
- Step 4: Prometheus Configuration
- Step 5: Blackbox Exporter Modules
- Step 6: Prometheus Alert Rules
- Step 7: Loki Configuration
- Step 8: Smokeping Targets
- Step 9: Grafana Provisioning
- Step 10: Custom Asterisk Exporter
- Step 11: Remote Agent Installation Script
- Step 12: Backup Script
- Step 13: Launch and Verify
- Grafana Dashboard Ideas
- Tips and Tricks
- Troubleshooting
- Security Considerations
- What's Next
Introduction
If you run a VoIP operation -- a call center, a telecom platform, or even a handful of Asterisk/FreePBX servers -- you know the pain. A SIP trunk silently drops. Packet loss creeps up at 2 AM. An agent gets stuck in a zombie conference for three hours. Disk fills up with recordings and nobody notices until calls start failing.
The standard approach is to SSH into each server, run `sip show peers`, grep some logs, and hope you catch problems before your customers do. That does not scale past two servers.
This tutorial walks you through building a centralized VoIP monitoring stack that runs on a single Docker host and monitors any number of remote VoIP servers. It is based on a production system that monitors a multi-server ViciDial call center fleet across four data centers and seven SIP providers. Every configuration file in this tutorial comes from that real deployment, sanitized and annotated.
What problems does this solve?
- SIP trunk monitoring: Know within 5 minutes when a trunk goes UNREACHABLE, before calls start failing
- Call quality visibility: Track RTP packet loss, jitter, and codec transcoding across all servers in real time
- Centralized log search: Search Asterisk logs, ViciDial dialer logs, and syslog from all servers in one place
- SIP packet capture: Full SIP ladder diagrams (INVITE/200/BYE) for any call, stored and searchable for 7 days
- Network latency baselines: Continuous latency measurement to every SIP provider, with historical graphs
- Proactive alerting: 14 alert rules that catch trunk failures, zombie conferences, disk space, CPU spikes, and more
- Agent state monitoring: See which agents are logged in, paused, stuck, or in call -- across all servers
What You'll Build
When you finish this tutorial, you will have a single Docker Compose stack exposing these services:
| Service | Port | Purpose |
|---|---|---|
| Grafana | :3000 | Unified dashboards -- metrics, logs, SIP data, all in one UI |
| Prometheus | :9090 | Time-series metrics database (30-day retention) |
| Loki | :3100 | Log aggregation engine (7-day retention) |
| Homer | :9080 | SIP capture and search (7-day retention) |
| Smokeping | :8081 | Network latency graphs with historical baselines |
| Blackbox Exporter | (internal) | ICMP pings, TCP SIP port checks, HTTP probes |
| PostgreSQL | (internal) | Backend database for Homer SIP data |
On each remote VoIP server, you will install four lightweight agents:
| Agent | Port | Purpose |
|---|---|---|
| node_exporter | :9100 | System metrics (CPU, RAM, disk, network) |
| asterisk_exporter | :9101 | Custom Asterisk/ViciDial metrics (SIP peers, active calls, agent states, RTP quality, codecs) |
| promtail | :9080 | Ships Asterisk logs, ViciDial logs, and syslog to Loki |
| heplify | -- | Captures SIP packets off the wire and sends HEP to Homer |
What the dashboards look like
- VoIP Overview: A grid showing active calls, logged-in agents, SIP trunk status, and queue depth per server -- updated every 15 seconds
- SIP Trunk Health: Per-trunk latency gauges, up/down status with history, and time-to-failure trends
- RTP Quality: Packet loss percentage, jitter, and round-trip time per active channel, with heatmaps over time
- Agent Activity: Per-agent state timeline (READY / INCALL / PAUSED), pause durations, and zombie call detection
- Network Latency: Smokeping-style graphs showing latency distribution to each SIP provider over days/weeks
- Log Explorer: Full-text search across Asterisk `messages` logs, dialer logs, and syslog from all servers, with label filtering
- SIP Call Flow: Homer ladder diagrams showing the complete SIP dialog for any call (INVITE, 100, 180, 200, ACK, BYE)
Architecture Overview
┌─────────────────────────────────────────────────────────┐
│ MONITORING VPS (Docker Host) │
│ │
│ ┌───────────┐ ┌────────────┐ ┌──────────────────┐ │
│ │ Grafana │ │ Prometheus │ │ Loki │ │
│ │ :3000 │ │ :9090 │ │ :3100 │ │
│ │ ◄├──┤ │ │ │ │
│ │ ◄├──┼────────────┼──┤ │ │
│ │ ◄├──┤ │ │ │ │
│ └───────────┘ │ ┌───────┐ │ └────────▲─────────┘ │
│ │ │ Rules │ │ │ │
│ ┌───────────┐ │ │(14) │ │ ┌───────┴──────────┐ │
│ │ Smokeping │ │ └───────┘ │ │ Homer (webapp) │ │
│ │ :8081 │ │ │ │ │ :9080 │ │
│ └───────────┘ │ ▼ │ └───────▲──────────┘ │
│ │ ┌────────┐ │ │ │
│ ┌───────────┐ │ │Blackbox│ │ ┌──────┴───────────┐ │
│ │PostgreSQL │ │ │Exporter│ │ │ heplify-server │ │
│ │ (16) │ │ └────────┘ │ │ :9060/udp │ │
│ └─────▲─────┘ └────────────┘ └──────▲───────────┘ │
│ │ │ │
└────────┼───────────────────────────────┼───────────────┘
│ │
┌─────────────────────┼───────────────────────────────┼──────────────────┐
│ │ NETWORK │ │
│ ┌──────────────────┼───────────────────────────────┼────────────────┐ │
│ │ │ │
┌──────┴──┴──────┐ ┌──────┴──┴──────┐ ┌──────────────┐ ┌─────────────┐ │ │
│ VoIP Server 1 │ │ VoIP Server 2 │ │ VoIP Server 3│ │ SIP Providers│ │ │
│ │ │ │ │ │ │ │ │ │
│ node_exporter │ │ node_exporter │ │ node_exporter│ │ ICMP ping │ │ │
│ :9100 │ │ :9100 │ │ :9100 │ │ TCP :5060 │ │ │
│ ast_exporter │ │ ast_exporter │ │ ast_exporter │ │ │ │ │
│ :9101 │ │ :9101 │ │ :9101 │ └─────────────┘ │ │
│ promtail ──────┼───► Loki │ │ promtail │ │ │
│ heplify ──────┼───► heplify-server │ │ heplify │ │ │
└────────────────┘ └────────────────┘ └──────────────┘ │ │
│ │ │
└──────────────────────────────────────────────────────────────────────┘ │
│
┌────────────────────────────────────────────────────────────────────────┘
│ Blackbox Exporter probes: ICMP, TCP SIP (:5060), HTTP to all targets
└─────────────────────────────────────────────────────────────────────────
Data flow summary
- Metrics (pull): Prometheus scrapes node_exporter (:9100) and asterisk_exporter (:9101) on each VoIP server every 15 seconds. It also scrapes Blackbox Exporter results for external probes.
- Logs (push): Promtail on each VoIP server pushes Asterisk logs, ViciDial logs, and syslog to Loki on :3100.
- SIP packets (push): Heplify on each VoIP server captures SIP packets off the network interface and sends them via HEP protocol to heplify-server on :9060/udp.
- Latency (active): Smokeping sends FPing probes to all VoIP servers and SIP providers continuously.
- External probes (active): Blackbox Exporter probes SIP provider ports (TCP :5060), pings servers (ICMP), and checks HTTP endpoints.
- Visualization: Grafana connects to Prometheus, Loki, and PostgreSQL (Homer data) as data sources, providing a single pane of glass.
Prerequisites
Monitoring VPS requirements
- OS: Ubuntu 22.04+ or Debian 12+ (any Docker-capable Linux)
- CPU: 4+ cores (8 recommended)
- RAM: 8 GB minimum (16 GB recommended)
- Disk: 100 GB+ (Prometheus 30-day retention + Loki logs + Homer SIP data)
- Docker: Docker Engine 24.x+ and Docker Compose v2
- Network: Public IP; ports 3100 (Loki) and 9060 (HEP) must be reachable from the monitored servers, while 3000/8081/9080/9090 only need to be reachable from wherever you browse the dashboards
On each monitored VoIP server
- SSH access as root (for agent installation)
- Asterisk installed and running (any version 11+)
- Python 3.6+ with `mysql-connector-python` (for the custom exporter)
- MySQL/MariaDB with a read-only user for ViciDial queries (or your Asterisk CDR database)
Install Docker (if not already installed)
# Install Docker Engine
curl -fsSL https://get.docker.com | sh
# Install Docker Compose plugin
apt-get install -y docker-compose-plugin
# Verify
docker --version
docker compose version
Directory Structure
Create the full directory tree before starting:
mkdir -p /opt/monitoring/{prometheus/rules,loki,grafana/provisioning/{datasources,dashboards},smokeping/config,postgres-init,scripts}
cd /opt/monitoring
Final structure:
/opt/monitoring/
├── docker-compose.yml # All services
├── .env # Passwords, IPs, ports
├── postgres-init/
│ └── 01-init-dbs.sql # Homer database setup
├── prometheus/
│ ├── prometheus.yml # Scrape targets, job definitions
│ ├── blackbox.yml # Blackbox Exporter probe modules
│ └── rules/
│ └── alerts.yml # 14 alert rules in 8 groups
├── loki/
│ └── loki-config.yml # Loki storage, retention, limits
├── grafana/
│ └── provisioning/
│ ├── datasources/
│ │ └── all.yml # Prometheus, Loki, Homer, MySQL sources
│ └── dashboards/
│ └── dashboard.yml # Dashboard folder provisioning
├── smokeping/
│ └── config/
│ └── Targets # Ping targets (servers + providers)
└── scripts/
├── install-agents.sh # One-command remote agent installer
├── asterisk_exporter.py # Custom Prometheus exporter for Asterisk
└── backup-monitoring.sh # Daily config backup (7-day retention)
Step 1: Environment Variables
Create the .env file. This is the single place where all secrets and server-specific values live. Never commit this file to git.
cat > /opt/monitoring/.env << 'EOF'
# ──────────────────────────────────────────────────
# VoIP Monitoring Stack — Environment Variables
# ──────────────────────────────────────────────────
# ─── Passwords (CHANGE THESE) ───
POSTGRES_PASSWORD=YOUR_POSTGRES_PASSWORD
GRAFANA_ADMIN_PASSWORD=YOUR_GRAFANA_PASSWORD
HOMER_DB_PASSWORD=YOUR_HOMER_DB_PASSWORD
# ─── Monitoring VPS IP ───
MONITOR_IP=YOUR_MONITOR_VPS_IP
# ─── Service Ports ───
GRAFANA_PORT=3000
PROMETHEUS_PORT=9090
LOKI_PORT=3100
HOMER_PORT=9080
SMOKEPING_PORT=8081
HEP_PORT=9060
# ─── Retention ───
HOMER_RETENTION_DAYS=7
PROMETHEUS_RETENTION_DAYS=30
LOKI_RETENTION_DAYS=7
# ─── VoIP Server IPs (for reference) ───
VOIP_SERVER_1_IP=YOUR_VOIP_SERVER_1_IP
VOIP_SERVER_2_IP=YOUR_VOIP_SERVER_2_IP
VOIP_SERVER_3_IP=YOUR_VOIP_SERVER_3_IP
VOIP_SERVER_4_IP=YOUR_VOIP_SERVER_4_IP
# ─── VoIP MySQL Read-Only Access (for Grafana datasources) ───
VOIP_MYSQL_USER=YOUR_MYSQL_RO_USER
VOIP_MYSQL_PASSWORD=YOUR_MYSQL_RO_PASSWORD
VOIP_MYSQL_DB=asterisk
EOF
chmod 600 /opt/monitoring/.env
Security note: The `.env` file contains database passwords and should be readable only by root. The `chmod 600` ensures this.
Step 2: Docker Compose Stack
This is the core of the deployment. Eight services, one network, persistent volumes for all data.
cat > /opt/monitoring/docker-compose.yml << 'COMPOSE'
version: "3.8"
services:
# ─── PostgreSQL (Homer SIP data backend) ───────────────────────
postgres:
image: postgres:16-alpine
container_name: postgres
restart: unless-stopped
environment:
POSTGRES_USER: postgres
POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
volumes:
- postgres_data:/var/lib/postgresql/data
- ./postgres-init:/docker-entrypoint-initdb.d
healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres"]
interval: 10s
timeout: 5s
retries: 5
networks:
- monitoring
# ─── heplify-server (HEP collector → PostgreSQL) ──────────────
# Receives SIP packets from heplify agents on VoIP servers
# and stores them in PostgreSQL for Homer webapp to query.
heplify-server:
image: sipcapture/heplify-server:latest
container_name: heplify-server
restart: unless-stopped
ports:
- "9060:9060/udp" # HEP input (UDP)
- "9060:9060/tcp" # HEP input (TCP fallback)
command:
- "./heplify-server"
environment:
HEPLIFYSERVER_HEPADDR: "0.0.0.0:9060"
      HEPLIFYSERVER_DBSHEMA: "homer7" # "DBSHEMA" (sic) is the upstream variable name
HEPLIFYSERVER_DBDRIVER: "postgres"
HEPLIFYSERVER_DBADDR: "postgres:5432"
HEPLIFYSERVER_DBUSER: "homer"
HEPLIFYSERVER_DBPASS: "${HOMER_DB_PASSWORD}"
HEPLIFYSERVER_DBDATATABLE: "homer_data"
HEPLIFYSERVER_DBCONFTABLE: "homer_config"
HEPLIFYSERVER_DBDROPDAYS: 7 # Auto-purge SIP data older than 7 days
HEPLIFYSERVER_LOGLVL: "info"
HEPLIFYSERVER_LOGSTD: "true"
HEPLIFYSERVER_PROMADDR: "0.0.0.0:9096" # Expose metrics for Prometheus
HEPLIFYSERVER_DEDUP: "false"
HEPLIFYSERVER_ALEGIDS: "X-CID"
HEPLIFYSERVER_FORCEALEGID: "false"
depends_on:
postgres:
condition: service_healthy
networks:
- monitoring
# ─── Homer Web UI ──────────────────────────────────────────────
# SIP search interface: call flow diagrams, SIP message search,
# correlation by Call-ID, From, To, etc.
homer-webapp:
image: sipcapture/webapp:latest
container_name: homer-webapp
restart: unless-stopped
ports:
- "${HOMER_PORT:-9080}:80"
environment:
DB_HOST: postgres
DB_USER: homer
DB_PASS: ${HOMER_DB_PASSWORD}
depends_on:
postgres:
condition: service_healthy
heplify-server:
condition: service_started
networks:
- monitoring
# ─── Prometheus ────────────────────────────────────────────────
# Scrapes metrics from all exporters every 15s.
# 30-day retention. Hot-reload via /-/reload endpoint.
prometheus:
image: prom/prometheus:v2.51.0
container_name: prometheus
restart: unless-stopped
ports:
- "${PROMETHEUS_PORT:-9090}:9090"
command:
- "--config.file=/etc/prometheus/prometheus.yml"
- "--storage.tsdb.path=/prometheus"
- "--storage.tsdb.retention.time=30d"
- "--web.enable-lifecycle" # Enables /-/reload for config changes
volumes:
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
- ./prometheus/rules:/etc/prometheus/rules:ro
- ./prometheus/blackbox.yml:/etc/prometheus/blackbox.yml:ro
- prometheus_data:/prometheus
depends_on:
- blackbox-exporter
networks:
- monitoring
# ─── Blackbox Exporter (external probing) ──────────────────────
# Probes external endpoints: ICMP ping, TCP SIP port, HTTP checks.
# Prometheus scrapes the results via /probe endpoint.
blackbox-exporter:
image: prom/blackbox-exporter:v0.25.0
container_name: blackbox-exporter
restart: unless-stopped
volumes:
- ./prometheus/blackbox.yml:/config/blackbox.yml:ro
command:
- "--config.file=/config/blackbox.yml"
networks:
- monitoring
# ─── Loki (log aggregation) ────────────────────────────────────
# Receives logs from promtail agents. TSDB v13 schema, 7-day retention.
loki:
image: grafana/loki:2.9.6
container_name: loki
restart: unless-stopped
ports:
- "${LOKI_PORT:-3100}:3100"
command: -config.file=/etc/loki/loki-config.yml
volumes:
- ./loki/loki-config.yml:/etc/loki/loki-config.yml:ro
- loki_data:/loki
networks:
- monitoring
# ─── Grafana (unified dashboards) ──────────────────────────────
# Single pane of glass: Prometheus metrics, Loki logs, Homer SIP
# data, and direct MySQL queries to VoIP servers.
grafana:
image: grafana/grafana:10.4.1
container_name: grafana
restart: unless-stopped
ports:
- "${GRAFANA_PORT:-3000}:3000"
environment:
GF_SECURITY_ADMIN_USER: admin
GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
      # Plugin list must be comma-separated with no spaces or newlines
      GF_INSTALL_PLUGINS: "grafana-clock-panel,grafana-worldmap-panel,grafana-polystat-panel,grafana-piechart-panel,yesoreyeram-infinity-datasource"
GF_SERVER_ROOT_URL: "http://${MONITOR_IP:-localhost}:3000"
GF_SMTP_ENABLED: "false"
GF_ALERTING_ENABLED: "false" # Disable legacy alerting
GF_UNIFIED_ALERTING_ENABLED: "true" # Use unified alerting (Grafana 10+)
volumes:
- ./grafana/provisioning:/etc/grafana/provisioning:ro
- grafana_data:/var/lib/grafana
depends_on:
- prometheus
- loki
networks:
- monitoring
# ─── Smokeping ─────────────────────────────────────────────────
# Continuous FPing-based latency monitoring with historical graphs.
# Excellent for spotting intermittent packet loss patterns.
smokeping:
image: lscr.io/linuxserver/smokeping:latest
container_name: smokeping
restart: unless-stopped
ports:
- "${SMOKEPING_PORT:-8081}:80"
environment:
PUID: 1000
PGID: 1000
TZ: Europe/London # Adjust to your timezone
volumes:
- ./smokeping/config/Targets:/config/Targets:ro
- smokeping_data:/data
networks:
- monitoring
volumes:
postgres_data:
prometheus_data:
loki_data:
grafana_data:
smokeping_data:
networks:
monitoring:
driver: bridge
COMPOSE
Why these specific versions?
| Image | Version | Reason |
|---|---|---|
| Prometheus | v2.51.0 | Stable TSDB, native histogram support, lifecycle API |
| Loki | 2.9.6 | Last 2.x LTS before 3.0 breaking changes, TSDB v13 support |
| Grafana | 10.4.1 | Unified alerting, correlation features, stable plugin ecosystem |
| PostgreSQL | 16-alpine | Homer 7 compatibility, small image footprint |
| Blackbox | v0.25.0 | Stable release with all probe types we need |
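Before launching anything, it is worth letting Compose validate the file and interpolate the `.env` values -- this catches YAML indentation mistakes and unset variables early. A quick sanity-check sketch:

```shell
cd /opt/monitoring

# Validate the compose file and its variable interpolation;
# exits non-zero on YAML or .env errors
docker compose config --quiet && echo "compose file OK"

# Pre-pull all pinned images so the first "up -d" is fast
docker compose pull
```

`docker compose config` also prints the fully resolved file if you drop `--quiet`, which is handy for confirming that `${HOMER_PORT:-9080}`-style defaults expand the way you expect.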
Step 3: PostgreSQL Init Script
Homer needs two databases: one for SIP data, one for its configuration. This script runs automatically on first container start.
cat > /opt/monitoring/postgres-init/01-init-dbs.sql << 'EOF'
-- Create Homer user and databases
CREATE USER homer WITH PASSWORD 'YOUR_HOMER_DB_PASSWORD';
CREATE DATABASE homer_data OWNER homer;
CREATE DATABASE homer_config OWNER homer;
GRANT ALL PRIVILEGES ON DATABASE homer_data TO homer;
GRANT ALL PRIVILEGES ON DATABASE homer_config TO homer;
EOF
Important: Replace `YOUR_HOMER_DB_PASSWORD` with the same value you used for `HOMER_DB_PASSWORD` in the `.env` file. This SQL file runs only once -- when the PostgreSQL volume is first created.
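Because the init script only runs on an empty volume, fixing a wrong password after first launch needs one of two approaches. A sketch of both (the `psql` route is usually safer, since `down -v` destroys every volume in the stack, not just Homer's):

```shell
# Option A: wipe all volumes and re-init (DESTROYS Prometheus, Loki,
# and Grafana data too -- only sensible before the stack has real data)
# docker compose down -v && docker compose up -d

# Option B: apply the script by hand against the running container
docker exec -i postgres psql -U postgres < /opt/monitoring/postgres-init/01-init-dbs.sql
```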
Step 4: Prometheus Configuration
This is the most complex config file. It defines what Prometheus scrapes, how often, and from where.
cat > /opt/monitoring/prometheus/prometheus.yml << 'EOF'
global:
scrape_interval: 15s # How often to pull metrics from targets
evaluation_interval: 15s # How often to evaluate alert rules
rule_files:
- /etc/prometheus/rules/*.yml
alerting:
alertmanagers: []
# If you add Alertmanager later:
# alertmanagers:
# - static_configs:
# - targets: ["alertmanager:9093"]
scrape_configs:
# ─── Prometheus self-monitoring ───
- job_name: "prometheus"
static_configs:
- targets: ["localhost:9090"]
# ─── Blackbox Exporter self-metrics ───
- job_name: "blackbox"
static_configs:
- targets: ["blackbox-exporter:9115"]
# ─── heplify-server metrics (SIP packet rates, DB writes) ───
- job_name: "heplify-server"
static_configs:
- targets: ["heplify-server:9096"]
# ─── Node Exporter (system metrics per server) ────────────
# Each target is a VoIP server running node_exporter on :9100.
# Labels let you filter/group by server name in Grafana.
- job_name: "node"
static_configs:
- targets: ["YOUR_VOIP_SERVER_1_IP:9100"]
labels:
server: "voip-server-1"
alias: "primary"
- targets: ["YOUR_VOIP_SERVER_2_IP:9100"]
labels:
server: "voip-server-2"
alias: "secondary"
- targets: ["YOUR_VOIP_SERVER_3_IP:9100"]
labels:
server: "voip-server-3"
alias: "tertiary"
- targets: ["YOUR_VOIP_SERVER_4_IP:9100"]
labels:
server: "voip-server-4"
alias: "quaternary"
# ─── Asterisk Exporter (VoIP-specific metrics per server) ─
# Custom Python exporter that queries Asterisk AMI and
# ViciDial/CDR MySQL for SIP peer status, active calls,
# agent states, RTP quality, codecs, and more.
- job_name: "asterisk"
scrape_interval: 15s
static_configs:
- targets: ["YOUR_VOIP_SERVER_1_IP:9101"]
labels:
server: "voip-server-1"
alias: "primary"
- targets: ["YOUR_VOIP_SERVER_2_IP:9101"]
labels:
server: "voip-server-2"
alias: "secondary"
- targets: ["YOUR_VOIP_SERVER_3_IP:9101"]
labels:
server: "voip-server-3"
alias: "tertiary"
- targets: ["YOUR_VOIP_SERVER_4_IP:9101"]
labels:
server: "voip-server-4"
alias: "quaternary"
# ─── Blackbox: ICMP ping to all servers and SIP providers ─
# Tests basic reachability. Alert if any target is down for 5 min.
- job_name: "blackbox_icmp"
metrics_path: /probe
params:
module: [icmp]
static_configs:
- targets:
- YOUR_VOIP_SERVER_1_IP
- YOUR_VOIP_SERVER_2_IP
- YOUR_VOIP_SERVER_3_IP
- YOUR_VOIP_SERVER_4_IP
- YOUR_SIP_PROVIDER_1_IP # e.g., primary inbound provider
- YOUR_SIP_PROVIDER_2_IP # e.g., outbound provider
- YOUR_SIP_PROVIDER_3_IP # e.g., backup trunk
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
# ─── Blackbox: TCP probe SIP port 5060 ────────────────────
# Verifies SIP providers are accepting connections on :5060.
# More specific than ICMP -- detects SIP service crashes
# even when the host is still pingable.
- job_name: "blackbox_sip_tcp"
metrics_path: /probe
params:
module: [tcp_connect]
static_configs:
- targets:
- YOUR_SIP_PROVIDER_1_IP:5060
- YOUR_SIP_PROVIDER_2_IP:5060
- YOUR_SIP_PROVIDER_3_IP:5060
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
# ─── Blackbox: HTTP probe VoIP web interfaces ─────────────
# Checks that web UIs (ViciDial admin, FreePBX, etc.) respond.
# Accepts 200, 301, 302, 401, 403 as "up" (login pages return 401/403).
- job_name: "blackbox_http"
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- http://YOUR_VOIP_SERVER_1_IP/vicidial/
labels:
server: "voip-server-1"
- targets:
- http://YOUR_VOIP_SERVER_2_IP/vicidial/
labels:
server: "voip-server-2"
- targets:
- http://YOUR_VOIP_SERVER_3_IP/vicidial/
labels:
server: "voip-server-3"
- targets:
- http://YOUR_VOIP_SERVER_4_IP/vicidial/
labels:
server: "voip-server-4"
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
EOF
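Prometheus refuses to start on a malformed config, so lint the file with `promtool` before launch. Running it from the same pinned image guarantees version parity, and mounting the whole directory lets the `rule_files` glob resolve -- a sketch:

```shell
# Lint prometheus.yml using the promtool binary shipped in the image
docker run --rm \
  -v /opt/monitoring/prometheus:/etc/prometheus:ro \
  --entrypoint promtool \
  prom/prometheus:v2.51.0 \
  check config /etc/prometheus/prometheus.yml
```

Once Step 6 creates `rules/alerts.yml`, the same command also lints the alert rules via the `rule_files` glob.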
Understanding the relabel_configs (Blackbox pattern)
The relabel_configs block in the blackbox jobs is a standard Prometheus pattern that confuses newcomers. Here is what it does:
- The `targets` list contains the actual endpoints to probe (e.g., `YOUR_SIP_PROVIDER_1_IP:5060`)
- Prometheus needs to scrape the Blackbox Exporter, not the target directly
- The relabel rules:
  - Copy the target address into the `__param_target` label (it becomes the `?target=` query parameter)
  - Save it as the `instance` label (so it shows up correctly in Grafana)
  - Replace `__address__` with the Blackbox Exporter's address (where Prometheus actually sends the HTTP request)
Result: Prometheus sends GET http://blackbox-exporter:9115/probe?target=YOUR_SIP_PROVIDER_1_IP:5060&module=tcp_connect, and the Blackbox Exporter performs the actual probe.
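The three rules can be mimicked in a few lines of Python -- a sketch of the relabelling semantics, not Prometheus's actual implementation -- to make the before/after label sets concrete:

```python
def blackbox_relabel(target: str, exporter: str = "blackbox-exporter:9115") -> dict:
    """Apply the three relabel rules from the blackbox jobs, in order."""
    labels = {"__address__": target}
    # Rule 1: source __address__ -> target __param_target
    labels["__param_target"] = labels["__address__"]
    # Rule 2: source __param_target -> target instance
    labels["instance"] = labels["__param_target"]
    # Rule 3: static replacement of __address__ with the exporter address
    labels["__address__"] = exporter
    return labels

labels = blackbox_relabel("203.0.113.10:5060")
print(labels["__address__"])      # blackbox-exporter:9115 (where Prometheus connects)
print(labels["instance"])         # 203.0.113.10:5060 (what Grafana displays)
print(labels["__param_target"])   # 203.0.113.10:5060 (sent as ?target=...)
```

The key insight: after relabelling, Prometheus connects only to the exporter, while the original endpoint survives as the `instance` label and the `?target=` parameter.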
Step 5: Blackbox Exporter Modules
Four probe types, each tuned for VoIP infrastructure.
cat > /opt/monitoring/prometheus/blackbox.yml << 'EOF'
modules:
# ─── ICMP Ping ─────────────────────────────────────
# Basic reachability check. Force IPv4 to avoid
# dual-stack issues common in data centers.
icmp:
prober: icmp
timeout: 5s
icmp:
preferred_ip_protocol: ip4
# ─── TCP Connect ───────────────────────────────────
# Verify a TCP port accepts connections.
# Used for SIP :5060 checks.
tcp_connect:
prober: tcp
timeout: 5s
# ─── HTTP 2xx ──────────────────────────────────────
# Check web interfaces respond. We accept 401/403
# because login-protected pages (ViciDial, FreePBX)
# return these codes when not authenticated.
http_2xx:
prober: http
timeout: 10s
http:
method: GET
preferred_ip_protocol: ip4
valid_status_codes: [200, 301, 302, 401, 403]
      follow_redirects: true # replaces the deprecated no_follow_redirects option
# ─── SIP Options (TCP) ────────────────────────────
# TCP-level check specifically for SIP endpoints.
# For actual SIP OPTIONS probing, consider using
# a dedicated SIP prober like sipvicious or sipp.
sip_options:
prober: tcp
timeout: 5s
tcp:
preferred_ip_protocol: ip4
EOF
Step 6: Prometheus Alert Rules
14 alert rules organized into 8 groups. These cover the most common VoIP failure modes, from trunk failures to zombie conferences to disk space exhaustion.
cat > /opt/monitoring/prometheus/rules/alerts.yml << 'EOF'
groups:
# ═══════════════════════════════════════════════════════════
# GROUP 1: SIP Trunk Health
# ═══════════════════════════════════════════════════════════
- name: trunk_alerts
rules:
# Alert when a SIP trunk (not an agent extension) goes UNREACHABLE.
# The regex filter peer!~"[0-9]+" excludes numeric SIP peers
# (agent softphones), which go offline normally when agents log out.
- alert: SIPTrunkDown
expr: asterisk_sip_peer_status{status!="OK",peer!~"[0-9]+"} == 1
for: 5m
labels:
severity: critical
annotations:
summary: "SIP trunk {{ $labels.peer }} DOWN on {{ $labels.server }}"
description: >-
Trunk {{ $labels.peer }} has been unreachable for more than 5 minutes.
Check provider status page, verify SIP credentials, and inspect
Asterisk logs for registration failures.
# High trunk latency degrades audio quality before the trunk
# fully drops. 500ms threshold gives early warning.
- alert: SIPTrunkHighLatency
expr: asterisk_sip_peer_latency_ms{peer!~"[0-9]+"} > 500
for: 5m
labels:
severity: warning
annotations:
summary: "SIP trunk {{ $labels.peer }} high latency on {{ $labels.server }}"
description: >-
Trunk {{ $labels.peer }} qualify latency is {{ $value }}ms (>500ms for 5 min).
This may cause choppy audio. Check network path and provider load.
# ═══════════════════════════════════════════════════════════
# GROUP 2: Call Activity
# ═══════════════════════════════════════════════════════════
- name: call_alerts
rules:
# If a server has zero active calls for 30 minutes during
# business hours, something is probably wrong.
- alert: NoActiveCalls
expr: asterisk_active_calls == 0
for: 30m
labels:
severity: warning
annotations:
summary: "No active calls on {{ $labels.server }} for 30 minutes"
description: "Zero active calls. Check if dialer campaigns are running."
# A call lasting >2 hours is almost certainly a zombie conference
# (agent hung up but the bridge stayed open). These consume a
# channel and can block the agent from receiving new calls.
- alert: ZombieConference
expr: asterisk_agent_incall_duration_seconds > 7200
for: 1m
labels:
severity: warning
annotations:
summary: "Agent {{ $labels.agent }} stuck in call >2h on {{ $labels.server }}"
description: >-
Agent has been INCALL for {{ $value | humanizeDuration }}.
Possible zombie conference. Check with: asterisk -rx "confbridge list"
# ═══════════════════════════════════════════════════════════
# GROUP 3: Codec / Transcoding
# ═══════════════════════════════════════════════════════════
- name: codec_alerts
rules:
# Transcoding (e.g., G.729 → alaw) causes CPU load and can
# produce robotic-sounding audio. In a properly configured
# system, there should be zero transcoding.
- alert: ActiveTranscoding
expr: asterisk_transcoding_channels > 0
for: 2m
labels:
severity: warning
annotations:
summary: "{{ $value }} channels transcoding on {{ $labels.server }}"
description: >-
Asterisk is actively transcoding {{ $value }} channels.
This causes CPU load and can degrade audio quality (robotic voice).
Ensure all trunks and endpoints use the same codec (usually alaw or ulaw).
# ═══════════════════════════════════════════════════════════
# GROUP 4: RTP Quality
# ═══════════════════════════════════════════════════════════
- name: rtp_alerts
rules:
# >5% packet loss means noticeably degraded call quality.
# At >10%, calls become unusable.
- alert: HighRTPPacketLoss
expr: asterisk_rtp_packet_loss_percent > 5
for: 5m
labels:
severity: warning
annotations:
summary: "High RTP packet loss on {{ $labels.server }}"
description: "RTP packet loss is {{ $value }}% for peer {{ $labels.peer }}."
# ═══════════════════════════════════════════════════════════
# GROUP 5: System Resources
# ═══════════════════════════════════════════════════════════
- name: system_alerts
rules:
- alert: DiskSpaceHigh
expr: >-
(1 - node_filesystem_avail_bytes{mountpoint="/"}
/ node_filesystem_size_bytes{mountpoint="/"}) * 100 > 85
for: 5m
labels:
severity: warning
annotations:
summary: 'Disk usage >85% on {{ $labels.server }}'
description: 'Root filesystem is {{ $value | printf "%.1f" }}% full.'
- alert: CPUSustainedHigh
expr: >-
100 - (avg by(server)
(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
for: 10m
labels:
severity: warning
annotations:
summary: "CPU >90% sustained on {{ $labels.server }}"
description: "CPU usage has been above 90% for 10 minutes."
- alert: HighMemoryUsage
expr: >-
(1 - node_memory_MemAvailable_bytes
/ node_memory_MemTotal_bytes) * 100 > 90
for: 5m
labels:
severity: warning
annotations:
summary: "Memory usage >90% on {{ $labels.server }}"
description: 'Memory usage is {{ $value | printf "%.1f" }}%.'
- alert: HighLoadAverage
expr: >-
node_load5
/ count without(cpu, mode) (node_cpu_seconds_total{mode="idle"}) > 2
for: 10m
labels:
severity: warning
annotations:
summary: "High load average on {{ $labels.server }}"
description: '5-min load average is {{ $value | printf "%.1f" }}x CPU count.'
# ═══════════════════════════════════════════════════════════
# GROUP 6: Security
# ═══════════════════════════════════════════════════════════
- name: security_alerts
rules:
# >20 fail2ban bans in 5 minutes suggests a brute-force attack
# or a misconfigured SIP device flooding registrations.
- alert: Fail2banStorm
expr: increase(asterisk_fail2ban_bans_total[5m]) > 20
for: 1m
labels:
severity: warning
annotations:
summary: "Fail2ban storm on {{ $labels.server }}"
description: ">20 bans in 5 minutes. Possible SIP brute-force attack."
# ═══════════════════════════════════════════════════════════
# GROUP 7: Agent Behavior
# ═══════════════════════════════════════════════════════════
- name: agent_alerts
rules:
# Agents paused for >2 hours are probably AFK without logging out.
# This is an operational issue, not a system failure.
- alert: AgentStuckPaused
expr: asterisk_agent_pause_duration_seconds > 7200
for: 1m
labels:
severity: info
annotations:
summary: "Agent {{ $labels.agent }} paused >2 hours on {{ $labels.server }}"
description: "Agent has been in PAUSED state for {{ $value | humanizeDuration }}."
# ═══════════════════════════════════════════════════════════
# GROUP 8: External Probe Failures
# ═══════════════════════════════════════════════════════════
- name: probe_alerts
rules:
# Catches SIP provider outages, server unreachability,
# and web UI failures -- anything the Blackbox Exporter probes.
- alert: EndpointDown
expr: probe_success == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Endpoint {{ $labels.instance }} unreachable"
description: >-
Blackbox probe to {{ $labels.instance }} has been failing for 5 minutes.
Check network connectivity and service status.
# Detects when Prometheus itself can't reach an exporter.
# This means metrics are stale and other alerts won't fire.
- alert: ScrapeFailing
expr: up == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Scrape target {{ $labels.instance }} down"
description: >-
Prometheus cannot scrape {{ $labels.job }} target {{ $labels.instance }}.
All other alerts for this target are now blind.
EOF
Alert rule design notes
- `for` durations: Set to avoid flapping. Trunks get 5 minutes (brief drops are normal during re-registration). Zombie conferences only need 1 minute (a 2-hour call is already confirmed stuck).
- Severity levels: `critical` = immediate action needed (trunk down, endpoint unreachable). `warning` = investigate soon (high CPU, transcoding). `info` = operational awareness (agent paused).
- Peer filtering: `peer!~"[0-9]+"` excludes agent extension numbers (like `1001`, `1052`) from trunk alerts. Agent phones go offline normally; only named trunks (like `provider_inbound`) should trigger alerts.
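Prometheus anchors label-matcher regexes at both ends (the matcher is effectively `^(?:[0-9]+)$`), so `peer!~"[0-9]+"` drops only peers whose entire name is digits. A quick Python sanity check of that anchoring, using `re.fullmatch` to mimic it:

```python
import re

def is_agent_extension(peer: str) -> bool:
    # fullmatch mimics Prometheus's implicit ^(?:...)$ anchoring
    return re.fullmatch(r"[0-9]+", peer) is not None

peers = ["1001", "1052", "provider_inbound", "trunk2out"]
trunk_peers = [p for p in peers if not is_agent_extension(p)]
print(trunk_peers)  # ['provider_inbound', 'trunk2out']
```

Note that `trunk2out` survives the filter even though it contains digits -- only all-digit names are treated as agent extensions, which is exactly why the anchoring matters.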
Step 7: Loki Configuration
Loki stores logs pushed by Promtail agents. This config uses TSDB v13 with filesystem storage -- simple, no object storage needed.
cat > /opt/monitoring/loki/loki-config.yml << 'EOF'
auth_enabled: false
server:
http_listen_port: 3100
common:
path_prefix: /loki
storage:
filesystem:
chunks_directory: /loki/chunks
rules_directory: /loki/rules
replication_factor: 1
ring:
kvstore:
store: inmemory
# ─── Schema: TSDB v13 (recommended for Loki 2.9+) ───
schema_config:
configs:
- from: 2024-01-01
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: index_
period: 24h
# ─── Ingestion limits ───
# Tuned for 4-8 VoIP servers pushing Asterisk + syslog.
# Increase if you see "rate limit exceeded" errors in promtail.
limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h   # 7 days
  retention_period: 168h             # 7 days -- enforced by the compactor below
  max_query_series: 5000
  ingestion_rate_mb: 30              # MB/s ingestion rate
  ingestion_burst_size_mb: 60        # Allow bursts (e.g., log rotation)
  per_stream_rate_limit: 10MB
  per_stream_rate_limit_burst: 30MB
# ─── Compactor: handles retention enforcement ───
compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  delete_request_cancel_period: 10m
  retention_delete_delay: 2h         # Grace period before deletion
# ─── Retention: 7 days ───
# With the TSDB schema, retention is enforced by the compactor using
# limits_config.retention_period above; the legacy table_manager is
# not needed. max_look_back_period just caps how far back queries go.
chunk_store_config:
  max_look_back_period: 168h         # 7 days
EOF
Why TSDB v13 instead of BoltDB?
TSDB (Time Series Database) index is the modern storage engine for Loki. It replaces the older BoltDB-shipper approach and provides:
- Better query performance on label-heavy workloads
- Lower memory usage during compaction
- Built-in retention enforcement without a separate table manager
Step 8: Smokeping Targets
Smokeping provides beautiful latency graphs with packet loss visualization. Unlike Blackbox Exporter (which gives point-in-time metrics), Smokeping builds up long-term latency baselines that make intermittent network issues visible.
cat > /opt/monitoring/smokeping/config/Targets << 'EOF'
*** Targets ***
probe = FPing
menu = Top
title = VoIP Monitoring - Network Latency
remark = Centralized latency monitoring for all SIP servers and providers
# ─────────────────────────────────────────────────
# VoIP Servers
# ─────────────────────────────────────────────────
+ VoIP_Servers
menu = VoIP Servers
title = VoIP Servers
++ server_1
menu = Server 1 (Primary)
title = VoIP Server 1 - YOUR_VOIP_SERVER_1_IP
host = YOUR_VOIP_SERVER_1_IP
++ server_2
menu = Server 2 (Secondary)
title = VoIP Server 2 - YOUR_VOIP_SERVER_2_IP
host = YOUR_VOIP_SERVER_2_IP
++ server_3
menu = Server 3 (Tertiary)
title = VoIP Server 3 - YOUR_VOIP_SERVER_3_IP
host = YOUR_VOIP_SERVER_3_IP
++ server_4
menu = Server 4 (Quaternary)
title = VoIP Server 4 - YOUR_VOIP_SERVER_4_IP
host = YOUR_VOIP_SERVER_4_IP
# ─────────────────────────────────────────────────
# SIP Providers
# ─────────────────────────────────────────────────
+ SIP_Providers
menu = SIP Providers
title = SIP Provider Latency
++ provider_primary
menu = Primary Inbound
title = Primary Inbound Provider - YOUR_SIP_PROVIDER_1_IP
host = YOUR_SIP_PROVIDER_1_IP
++ provider_outbound
menu = Outbound
title = Outbound Provider - YOUR_SIP_PROVIDER_2_IP
host = YOUR_SIP_PROVIDER_2_IP
++ provider_backup
menu = Backup Trunk
title = Backup Trunk Provider - YOUR_SIP_PROVIDER_3_IP
host = YOUR_SIP_PROVIDER_3_IP
EOF
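Under the hood, each Smokeping round sends a burst of pings (20 per round with default FPing settings) and graphs the median RTT, the spread of the replies, and the loss. A rough Python sketch of that per-round summary -- an illustration of the idea, not Smokeping's actual code:

```python
import statistics

def summarize_round(rtts_ms, pings_sent=20):
    """Summarize one probe round the way Smokeping presents it:
    median RTT of the replies, spread of the replies ("smoke"),
    and loss as the fraction of pings that got no answer."""
    lost = pings_sent - len(rtts_ms)
    if not rtts_ms:
        return {"median_ms": None, "spread_ms": None, "loss_pct": 100.0}
    return {
        "median_ms": statistics.median(rtts_ms),
        "spread_ms": max(rtts_ms) - min(rtts_ms),
        "loss_pct": 100.0 * lost / pings_sent,
    }

# 18 replies out of 20 pings -> 10% loss; one 35 ms outlier widens the smoke
replies = [21.3, 20.9, 22.1, 21.0, 21.5, 35.2, 21.1, 20.8, 21.4,
           21.2, 21.6, 20.7, 21.9, 22.4, 21.3, 21.0, 21.8, 21.2]
print(summarize_round(replies))
```

This is why Smokeping surfaces intermittent problems that a single point-in-time probe misses: an outlier reply widens the spread even when the median stays flat.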
Step 9: Grafana Provisioning
Provisioning files auto-configure Grafana data sources and dashboard folders on first boot. No manual UI clicks needed.
Datasources
cat > /opt/monitoring/grafana/provisioning/datasources/all.yml << 'EOF'
apiVersion: 1
datasources:
# ─── Prometheus (metrics) ───
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: false
# ─── Loki (logs) ───
- name: Loki
type: loki
access: proxy
url: http://loki:3100
editable: false
# ─── Homer SIP data (PostgreSQL) ───
- name: Homer
type: postgres
access: proxy
url: postgres:5432
database: homer_data
user: homer
secureJsonData:
password: YOUR_HOMER_DB_PASSWORD
jsonData:
sslmode: disable
maxOpenConns: 5
maxIdleConns: 2
editable: false
# ─── VoIP Server MySQL (direct CDR queries) ───
# Optional: allows Grafana to query ViciDial/Asterisk CDR
# tables directly for call reports and agent analytics.
# Create a read-only MySQL user on each VoIP server first:
# CREATE USER 'grafana_ro'@'%' IDENTIFIED BY 'YOUR_PASSWORD';
# GRANT SELECT ON asterisk.* TO 'grafana_ro'@'%';
- name: VoIP-Server-1
type: mysql
access: proxy
url: YOUR_VOIP_SERVER_1_IP:3306
database: asterisk
user: YOUR_MYSQL_RO_USER
secureJsonData:
password: YOUR_MYSQL_RO_PASSWORD
jsonData:
maxOpenConns: 3
maxIdleConns: 1
editable: false
EOF
Dashboard provider
cat > /opt/monitoring/grafana/provisioning/dashboards/dashboard.yml << 'EOF'
apiVersion: 1
providers:
- name: "VoIP Monitoring"
orgId: 1
folder: "VoIP Monitoring"
type: file
disableDeletion: false
editable: true
updateIntervalSeconds: 30
options:
path: /etc/grafana/provisioning/dashboards
foldersFromFilesStructure: false
EOF
Tip: You can place `.json` dashboard files alongside `dashboard.yml` and they will be auto-imported into the "VoIP Monitoring" folder on startup. Export dashboards from the Grafana UI (Share > Export > Save to file) and place them here for infrastructure-as-code.
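One gotcha with that workflow: dashboards exported via "Export for sharing externally" include an `__inputs` block and `${DS_...}` datasource placeholders that file provisioning does not resolve. A small helper sketch that rewrites such an export for provisioning (the placeholder pattern follows Grafana's share-export format; verify against your own export file):

```python
import json
import re

def prepare_for_provisioning(exported: dict, datasource="Prometheus") -> dict:
    """Strip share-export metadata and resolve ${DS_*} placeholders so the
    JSON can be dropped next to dashboard.yml. Assumes the export contains
    Grafana's __inputs/__requires blocks and ${DS_*} datasource variables."""
    dash = dict(exported)
    dash.pop("__inputs", None)    # template inputs only used by UI import
    dash.pop("__requires", None)  # plugin requirements block
    text = json.dumps(dash)
    text = re.sub(r"\$\{DS_[A-Z0-9_]+\}", datasource, text)
    return json.loads(text)

exported = {
    "__inputs": [{"name": "DS_PROMETHEUS", "type": "datasource"}],
    "title": "VoIP Overview",
    "panels": [{"datasource": "${DS_PROMETHEUS}"}],
}
print(prepare_for_provisioning(exported)["panels"][0]["datasource"])
# → Prometheus
```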
Step 10: Custom Asterisk Exporter
This is a custom Prometheus exporter written in Python that runs on each VoIP server. It collects metrics that no off-the-shelf exporter provides: SIP peer status from Asterisk AMI, agent states from ViciDial MySQL, RTP quality statistics, codec transcoding detection, and more.
The exporter exposes a standard /metrics endpoint on port 9101 that Prometheus scrapes.
cat > /opt/monitoring/scripts/asterisk_exporter.py << 'EXPORTER'
#!/usr/bin/env python3
"""
Asterisk/VoIP Prometheus Exporter
Queries Asterisk CLI + MySQL CDR database to expose VoIP metrics.
Runs on each monitored server, listens on :9101.
Metrics exposed:
- asterisk_sip_peer_up/status/latency (SIP peer health)
- asterisk_active_calls/channels (call volume)
- asterisk_channels_by_codec (codec distribution)
- asterisk_rtp_packet_loss_percent (RTP quality)
- asterisk_rtp_jitter_ms (RTP jitter)
- asterisk_transcoding_channels (transcoding detection)
- asterisk_agents_logged_in/incall/paused/waiting (agent states)
- asterisk_agent_incall_duration_seconds (zombie call detection)
- asterisk_agent_pause_duration_seconds (stuck pause detection)
- asterisk_queue_depth (queue backlog)
- asterisk_fail2ban_active_bans/total (security)
- asterisk_uptime_seconds (system health)
- asterisk_confbridge_count (conference count)
"""
import http.server
import subprocess
import re
import os
import mysql.connector
from mysql.connector import Error
LISTEN_PORT = int(os.environ.get("EXPORTER_PORT", 9101))
MYSQL_HOST = os.environ.get("MYSQL_HOST", "localhost")
MYSQL_USER = os.environ.get("MYSQL_USER", "cron")
MYSQL_PASS = os.environ.get("MYSQL_PASS", "YOUR_MYSQL_PASSWORD")
MYSQL_DB = os.environ.get("MYSQL_DB", "asterisk")
SERVER_LABEL = os.environ.get("SERVER_LABEL", "server1")
def run_ast_cmd(cmd):
"""Run an Asterisk CLI command and return output."""
try:
result = subprocess.run(
["asterisk", "-rx", cmd],
capture_output=True, text=True, timeout=10
)
return result.stdout
except Exception:
return ""
def get_mysql_connection():
"""Get MySQL connection."""
try:
return mysql.connector.connect(
host=MYSQL_HOST, user=MYSQL_USER,
password=MYSQL_PASS, database=MYSQL_DB,
connect_timeout=5
)
except Error:
return None
def collect_sip_peers():
"""Parse 'sip show peers' for status and latency."""
metrics = []
output = run_ast_cmd("sip show peers")
for line in output.splitlines():
m = re.match(
r'^(\S+)\s+(\d+\.\d+\.\d+\.\d+)\s+\S+\s+\S+\s+(\S+)\s+(\S+)',
line
)
if m:
peer = m.group(1).split('/')[0]
status_str = m.group(3)
latency_str = m.group(4)
is_up = 1 if status_str == "OK" else 0
metrics.append(
f'asterisk_sip_peer_up{{server="{SERVER_LABEL}",'
f'peer="{peer}"}} {is_up}'
)
metrics.append(
f'asterisk_sip_peer_status{{server="{SERVER_LABEL}",'
f'peer="{peer}",status="{status_str}"}} 1'
)
lat_match = re.search(r'(\d+)', latency_str)
if lat_match:
metrics.append(
f'asterisk_sip_peer_latency_ms{{server="{SERVER_LABEL}",'
f'peer="{peer}"}} {lat_match.group(1)}'
)
return metrics
def collect_channels():
"""Parse 'core show channels' for active call count and codec info."""
metrics = []
output = run_ast_cmd("core show channels")
m = re.search(r'(\d+) active channel', output)
channels = int(m.group(1)) if m else 0
m2 = re.search(r'(\d+) active call', output)
calls = int(m2.group(1)) if m2 else 0
metrics.append(
f'asterisk_active_channels{{server="{SERVER_LABEL}"}} {channels}'
)
metrics.append(
f'asterisk_active_calls{{server="{SERVER_LABEL}"}} {calls}'
)
# Count codecs from channel stats
codec_counts = {}
stats_output = run_ast_cmd("sip show channelstats")
for line in stats_output.splitlines():
parts = line.split()
if len(parts) >= 12:
codec = parts[11] if len(parts) > 11 else "unknown"
if codec in ("alaw", "ulaw", "g722", "g729", "gsm", "opus"):
codec_counts[codec] = codec_counts.get(codec, 0) + 1
for codec, count in codec_counts.items():
metrics.append(
f'asterisk_channels_by_codec{{server="{SERVER_LABEL}",'
f'codec="{codec}"}} {count}'
)
return metrics
def collect_rtp_stats():
"""Parse 'sip show channelstats' for RTP quality metrics."""
metrics = []
output = run_ast_cmd("sip show channelstats")
for line in output.splitlines():
parts = line.split()
if len(parts) >= 10 and parts[0] != "Peer":
try:
peer = parts[0]
recv_loss_pct = (
float(parts[3].rstrip('%'))
if '%' in parts[3] else 0
)
recv_jitter = (
float(parts[4])
if parts[4].replace('.', '').isdigit() else 0
)
rtt = (
float(parts[7])
if len(parts) > 7
and parts[7].replace('.', '').isdigit()
else 0
)
metrics.append(
f'asterisk_rtp_packet_loss_percent'
f'{{server="{SERVER_LABEL}",peer="{peer}"}} '
f'{recv_loss_pct}'
)
metrics.append(
f'asterisk_rtp_jitter_ms'
f'{{server="{SERVER_LABEL}",peer="{peer}"}} '
f'{recv_jitter}'
)
if rtt > 0:
metrics.append(
f'asterisk_rtp_rtt_ms'
f'{{server="{SERVER_LABEL}",peer="{peer}"}} '
f'{rtt}'
)
except (ValueError, IndexError):
continue
return metrics
def collect_uptime():
"""Get Asterisk uptime."""
metrics = []
output = run_ast_cmd("core show uptime seconds")
m = re.search(r'System uptime:\s+(\d+)', output)
if m:
metrics.append(
f'asterisk_uptime_seconds{{server="{SERVER_LABEL}"}} {m.group(1)}'
)
return metrics
def collect_confbridge():
"""Count active ConfBridge conferences."""
metrics = []
output = run_ast_cmd("confbridge list")
count = 0
for line in output.splitlines():
if re.match(r'^\d+', line):
count += 1
metrics.append(
f'asterisk_confbridge_count{{server="{SERVER_LABEL}"}} {count}'
)
return metrics
def collect_transcoding():
"""Detect active transcoding by inspecting channel read/write codecs."""
metrics = []
transcoding_count = 0
output = run_ast_cmd("core show channels verbose")
sip_channels = []
for line in output.splitlines():
m = re.match(r'^(SIP/\S+)', line)
if m:
sip_channels.append(m.group(1))
for chan in sip_channels:
ch_output = run_ast_cmd(f"core show channel {chan}")
read_tc = False
write_tc = False
for ch_line in ch_output.splitlines():
ch_line = ch_line.strip()
if ch_line.startswith("ReadTranscode:") and "Yes" in ch_line:
read_tc = True
elif ch_line.startswith("WriteTranscode:") and "Yes" in ch_line:
write_tc = True
if read_tc or write_tc:
transcoding_count += 1
metrics.append(
f'asterisk_transcoding_channels{{server="{SERVER_LABEL}"}} '
f'{transcoding_count}'
)
return metrics
def collect_agents():
"""Query MySQL for agent states (ViciDial-specific, adapt for your PBX)."""
metrics = []
conn = get_mysql_connection()
if not conn:
return metrics
try:
cursor = conn.cursor(dictionary=True)
# Agent counts by status
cursor.execute("""
SELECT status, COUNT(*) as cnt
FROM vicidial_live_agents
WHERE server_ip != ''
GROUP BY status
""")
logged_in = incall = paused = waiting = 0
for row in cursor.fetchall():
s, c = row['status'], row['cnt']
logged_in += c
if s == 'INCALL':
incall = c
elif s == 'PAUSED':
paused = c
elif s in ('READY', 'CLOSER'):
waiting += c
metrics.append(
f'asterisk_agents_logged_in{{server="{SERVER_LABEL}"}} {logged_in}'
)
metrics.append(
f'asterisk_agents_incall{{server="{SERVER_LABEL}"}} {incall}'
)
metrics.append(
f'asterisk_agents_paused{{server="{SERVER_LABEL}"}} {paused}'
)
metrics.append(
f'asterisk_agents_waiting{{server="{SERVER_LABEL}"}} {waiting}'
)
# Per-agent status with duration (for zombie/stuck detection)
cursor.execute("""
SELECT user, status, pause_code,
TIMESTAMPDIFF(SECOND, last_state_change, NOW())
as state_duration
FROM vicidial_live_agents
WHERE server_ip != ''
""")
for row in cursor.fetchall():
user = row['user']
status = row['status']
duration = row['state_duration'] or 0
if status == 'INCALL':
metrics.append(
f'asterisk_agent_incall_duration_seconds'
f'{{server="{SERVER_LABEL}",agent="{user}"}} {duration}'
)
elif status == 'PAUSED':
metrics.append(
f'asterisk_agent_pause_duration_seconds'
f'{{server="{SERVER_LABEL}",agent="{user}"}} {duration}'
)
# Queue depth by campaign/ingroup
cursor.execute("""
SELECT campaign_id, COUNT(*) as cnt
FROM vicidial_auto_calls
WHERE status = 'LIVE'
GROUP BY campaign_id
""")
for row in cursor.fetchall():
metrics.append(
f'asterisk_queue_depth{{server="{SERVER_LABEL}",'
f'ingroup="{row["campaign_id"]}"}} {row["cnt"]}'
)
cursor.close()
except Exception:
pass
finally:
try:
conn.close()
except Exception:
pass
return metrics
def collect_fail2ban():
"""Parse fail2ban-client for ban counts."""
metrics = []
try:
result = subprocess.run(
["fail2ban-client", "status"],
capture_output=True, text=True, timeout=5
)
jails = re.findall(r'Jail list:\s*(.*)', result.stdout)
if jails:
for jail in jails[0].split(','):
jail = jail.strip()
if not jail:
continue
jr = subprocess.run(
["fail2ban-client", "status", jail],
capture_output=True, text=True, timeout=5
)
banned = re.search(
r'Currently banned:\s+(\d+)', jr.stdout
)
total = re.search(
r'Total banned:\s+(\d+)', jr.stdout
)
if banned:
metrics.append(
f'asterisk_fail2ban_active_bans'
f'{{server="{SERVER_LABEL}",jail="{jail}"}} '
f'{banned.group(1)}'
)
if total:
metrics.append(
f'asterisk_fail2ban_bans_total'
f'{{server="{SERVER_LABEL}",jail="{jail}"}} '
f'{total.group(1)}'
)
except Exception:
pass
return metrics
def collect_all():
"""Collect all metrics and return Prometheus text format."""
lines = [
"# HELP asterisk_sip_peer_up SIP peer reachability (1=up, 0=down)",
"# TYPE asterisk_sip_peer_up gauge",
"# HELP asterisk_sip_peer_latency_ms SIP peer qualify latency in ms",
"# TYPE asterisk_sip_peer_latency_ms gauge",
"# HELP asterisk_active_calls Number of active calls",
"# TYPE asterisk_active_calls gauge",
"# HELP asterisk_active_channels Number of active channels",
"# TYPE asterisk_active_channels gauge",
"# HELP asterisk_agents_logged_in Number of agents logged in",
"# TYPE asterisk_agents_logged_in gauge",
"# HELP asterisk_agents_incall Number of agents in call",
"# TYPE asterisk_agents_incall gauge",
"# HELP asterisk_agents_paused Number of agents paused",
"# TYPE asterisk_agents_paused gauge",
"# HELP asterisk_queue_depth Calls waiting in queue per ingroup",
"# TYPE asterisk_queue_depth gauge",
"# HELP asterisk_fail2ban_active_bans Current fail2ban active bans",
"# TYPE asterisk_fail2ban_active_bans gauge",
"# HELP asterisk_fail2ban_bans_total Total fail2ban bans",
"# TYPE asterisk_fail2ban_bans_total counter",
"# HELP asterisk_uptime_seconds Asterisk system uptime",
"# TYPE asterisk_uptime_seconds gauge",
"# HELP asterisk_confbridge_count Active ConfBridge conferences",
"# TYPE asterisk_confbridge_count gauge",
"# HELP asterisk_rtp_packet_loss_percent RTP packet loss percentage",
"# TYPE asterisk_rtp_packet_loss_percent gauge",
"# HELP asterisk_rtp_jitter_ms RTP jitter in ms",
"# TYPE asterisk_rtp_jitter_ms gauge",
"# HELP asterisk_transcoding_channels Channels actively transcoding",
"# TYPE asterisk_transcoding_channels gauge",
"",
]
lines.extend(collect_sip_peers())
lines.extend(collect_channels())
lines.extend(collect_rtp_stats())
lines.extend(collect_uptime())
lines.extend(collect_confbridge())
lines.extend(collect_agents())
lines.extend(collect_fail2ban())
lines.extend(collect_transcoding())
return "\n".join(lines) + "\n"
class MetricsHandler(http.server.BaseHTTPRequestHandler):
def do_GET(self):
if self.path == "/metrics":
body = collect_all()
self.send_response(200)
self.send_header("Content-Type", "text/plain; charset=utf-8")
self.end_headers()
self.wfile.write(body.encode())
else:
self.send_response(200)
self.send_header("Content-Type", "text/html")
self.end_headers()
self.wfile.write(b"<a href='/metrics'>Metrics</a>")
def log_message(self, format, *args):
pass # Suppress request logging
if __name__ == "__main__":
server = http.server.HTTPServer(("0.0.0.0", LISTEN_PORT), MetricsHandler)
print(f"asterisk_exporter listening on :{LISTEN_PORT}")
server.serve_forever()
EXPORTER
chmod +x /opt/monitoring/scripts/asterisk_exporter.py
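Before pointing Prometheus at the exporter, it is worth smoke-testing the `/metrics` output. This tiny parser is an illustration (not part of the exporter itself) for asserting that expected series are present in the exposition text:

```python
import re

# name, optional {labels}, value -- enough for gauge/counter lines
METRIC_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')

def parse_metrics(text):
    """Parse Prometheus text exposition format into
    {(name, frozenset(label pairs)): float}; # comments are skipped.
    Good enough for smoke tests, not a full OpenMetrics parser."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = METRIC_RE.match(line)
        if not m:
            continue
        name, labels, value = m.group(1), m.group(2) or "", m.group(3)
        pairs = frozenset(re.findall(r'(\w+)="([^"]*)"', labels))
        out[(name, pairs)] = float(value)
    return out

sample = '''# HELP asterisk_active_calls Number of active calls
# TYPE asterisk_active_calls gauge
asterisk_active_calls{server="server1"} 42
asterisk_sip_peer_up{server="server1",peer="provider_inbound"} 1
'''
metrics = parse_metrics(sample)
print(metrics[("asterisk_active_calls", frozenset({("server", "server1")}))])
# → 42.0
```

In practice you would feed it the body of `curl -s http://localhost:9101/metrics` and assert the trunk series you expect are present.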
Adapting for non-ViciDial systems
The collect_agents() function queries ViciDial-specific tables (vicidial_live_agents, vicidial_auto_calls). If you run FreePBX, FusionPBX, or plain Asterisk:
- FreePBX: Query the `asteriskcdrdb.cdr` table instead, or use the Asterisk AMI `QueueStatus` action
- FusionPBX: Query the `v_call_center_agents` table in PostgreSQL
- Plain Asterisk: Remove the MySQL dependency and rely solely on AMI commands (`queue show`, `sip show peers`)
The SIP peer, channel, RTP, and codec collection functions work with any Asterisk version 11+.
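For the plain-Asterisk route, queue depth can come from parsing `queue show` output instead of MySQL. A hedged sketch: the exact output format varies between Asterisk versions, so treat the regex as a starting point, not a guaranteed match:

```python
import re

# Typical `queue show` header line (format varies by Asterisk version):
#   support has 3 calls (max unlimited) in 'ringall' strategy ...
QUEUE_RE = re.compile(r"^(\S+)\s+has\s+(\d+)\s+calls?\b")

def parse_queue_show(output, server_label="server1"):
    """Turn `queue show` output into asterisk_queue_depth metric lines,
    mirroring the labels the MySQL-based collector emits."""
    metrics = []
    for line in output.splitlines():
        m = QUEUE_RE.match(line.strip())
        if m:
            queue, depth = m.group(1), m.group(2)
            metrics.append(
                f'asterisk_queue_depth{{server="{server_label}",'
                f'ingroup="{queue}"}} {depth}'
            )
    return metrics

sample = "support has 3 calls (max unlimited) in 'ringall' strategy\n"
print(parse_queue_show(sample))
# → ['asterisk_queue_depth{server="server1",ingroup="support"} 3']
```

Wire it into `collect_all()` in place of the queue-depth query inside `collect_agents()`, feeding it the output of `run_ast_cmd("queue show")`.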
Step 11: Remote Agent Installation Script
This script SSHs into a VoIP server and installs all four monitoring agents (node_exporter, heplify, promtail, asterisk_exporter) in one command. It auto-detects the OS (Ubuntu/Debian, CentOS, openSUSE) and adjusts package installation accordingly.
cat > /opt/monitoring/scripts/install-agents.sh << 'INSTALLER'
#!/bin/bash
# install-agents.sh — Install monitoring agents on a remote VoIP server
# Usage: ./install-agents.sh <server_ip> <ssh_port> <server_label> <monitor_vps_ip>
#
# Example:
# ./install-agents.sh 10.0.1.50 22 server1 10.0.0.10
#
# This installs:
# 1. heplify — SIP packet capture → Homer
# 2. node_exporter — System metrics → Prometheus
# 3. promtail — Log shipping → Loki
# 4. asterisk_exporter — VoIP metrics → Prometheus
set -e
SERVER_IP="${1:?Usage: $0 <server_ip> <ssh_port> <server_label> <monitor_vps_ip>}"
SSH_PORT="${2:-22}"
SERVER_LABEL="${3:?Provide server label (e.g., server1, primary, london)}"
MONITOR_IP="${4:?Provide monitoring VPS IP}"
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
echo "=== Installing monitoring agents on ${SERVER_LABEL} (${SERVER_IP}:${SSH_PORT}) ==="
echo "Monitor VPS: ${MONITOR_IP}"
echo ""
SSH_CMD="ssh -o StrictHostKeyChecking=no -p ${SSH_PORT} root@${SERVER_IP}"
# ─── 1. heplify (SIP capture agent) ───
echo "[1/4] Installing heplify..."
${SSH_CMD} bash << REMOTEOF
set -e
if [ ! -f /usr/local/bin/heplify ]; then
curl -sL https://github.com/sipcapture/heplify/releases/download/v1.67.1/heplify \
-o /usr/local/bin/heplify
chmod +x /usr/local/bin/heplify
echo " heplify binary installed"
else
echo " heplify already installed"
fi
cat > /etc/systemd/system/heplify.service << SVCFILE
[Unit]
Description=heplify SIP Capture Agent
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/bin/heplify -hs ${MONITOR_IP}:9060 -i any -dim "OPTIONS,NOTIFY" -e
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
SVCFILE
systemctl daemon-reload
systemctl enable heplify
systemctl restart heplify
echo " heplify service started"
REMOTEOF
# ─── 2. node_exporter (system metrics) ───
echo "[2/4] Installing node_exporter..."
${SSH_CMD} bash << 'REMOTEOF'
set -e
if [ ! -f /usr/local/bin/node_exporter ]; then
cd /tmp
curl -sL https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz | tar xz
cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
rm -rf node_exporter-1.7.0.linux-amd64*
echo " node_exporter binary installed"
else
echo " node_exporter already installed"
fi
cat > /etc/systemd/system/node_exporter.service << 'SVCFILE'
[Unit]
Description=Prometheus Node Exporter
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/bin/node_exporter --web.listen-address=:9100
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
SVCFILE
systemctl daemon-reload
systemctl enable node_exporter
systemctl restart node_exporter
echo " node_exporter service started"
REMOTEOF
# ─── 3. promtail (log shipping) ───
echo "[3/4] Installing promtail..."
${SSH_CMD} bash << REMOTEOF
set -e
if [ ! -f /usr/local/bin/promtail ]; then
cd /tmp
curl -sL https://github.com/grafana/loki/releases/download/v2.9.6/promtail-linux-amd64.zip \
-o promtail.zip
# Install unzip on whichever OS
if command -v apt-get &>/dev/null; then
apt-get install -y unzip 2>/dev/null || true
elif command -v zypper &>/dev/null; then
zypper install -y unzip 2>/dev/null || true
elif command -v yum &>/dev/null; then
yum install -y unzip 2>/dev/null || true
fi
unzip -o promtail.zip
mv promtail-linux-amd64 /usr/local/bin/promtail
chmod +x /usr/local/bin/promtail
rm -f promtail.zip
echo " promtail binary installed"
else
echo " promtail already installed"
fi
mkdir -p /etc/promtail /var/lib/promtail
cat > /etc/promtail/config.yml << CFGFILE
server:
http_listen_port: 9080
grpc_listen_port: 0
positions:
filename: /var/lib/promtail/positions.yaml
clients:
- url: http://${MONITOR_IP}:3100/loki/api/v1/push
scrape_configs:
# Asterisk main log (warnings, errors, notices)
- job_name: asterisk_messages
static_configs:
- targets: [localhost]
labels:
job: asterisk
server: ${SERVER_LABEL}
logtype: messages
__path__: /var/log/asterisk/messages
# Asterisk verbose log (if enabled)
- job_name: asterisk_full
static_configs:
- targets: [localhost]
labels:
job: asterisk
server: ${SERVER_LABEL}
logtype: full
__path__: /var/log/asterisk/full
# ViciDial/astguiclient logs (dialer, listener, etc.)
- job_name: vicidial
static_configs:
- targets: [localhost]
labels:
job: vicidial
server: ${SERVER_LABEL}
logtype: vicidial
__path__: /var/log/astguiclient/*.log
# System syslog
- job_name: syslog
static_configs:
- targets: [localhost]
labels:
job: syslog
server: ${SERVER_LABEL}
logtype: syslog
__path__: /var/log/messages
CFGFILE
cat > /etc/systemd/system/promtail.service << 'SVCFILE'
[Unit]
Description=Promtail Log Agent
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/bin/promtail -config.file=/etc/promtail/config.yml
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
SVCFILE
systemctl daemon-reload
systemctl enable promtail
systemctl restart promtail
echo " promtail service started"
REMOTEOF
# ─── 4. asterisk_exporter (VoIP metrics) ───
echo "[4/4] Installing asterisk_exporter..."
${SSH_CMD} "mkdir -p /opt/asterisk_exporter"
# Copy the exporter script to the remote server
scp -o StrictHostKeyChecking=no -P ${SSH_PORT} \
${SCRIPT_DIR}/asterisk_exporter.py \
root@${SERVER_IP}:/opt/asterisk_exporter/asterisk_exporter.py
${SSH_CMD} bash << REMOTEOF
set -e
# Find Python 3
PYTHON_BIN=""
for p in python3.11 python3.6 python3; do
if command -v \$p &>/dev/null; then
PYTHON_BIN=\$(command -v \$p)
break
fi
done
if [ -z "\$PYTHON_BIN" ]; then
if command -v yum &>/dev/null; then
yum install -y python3 python3-pip 2>/dev/null || true
PYTHON_BIN=\$(command -v python3)
fi
fi
echo " Using Python: \$PYTHON_BIN"
# Install MySQL connector
\$PYTHON_BIN -m pip install mysql-connector-python 2>/dev/null \
|| \$PYTHON_BIN -m pip install "mysql-connector-python<8.1" 2>/dev/null \
|| true
chmod +x /opt/asterisk_exporter/asterisk_exporter.py
cat > /etc/systemd/system/asterisk_exporter.service << SVCFILE
[Unit]
Description=Asterisk/VoIP Prometheus Exporter
After=network.target mariadb.service asterisk.service
Wants=mariadb.service
[Service]
Type=simple
ExecStart=\$PYTHON_BIN /opt/asterisk_exporter/asterisk_exporter.py
Restart=always
RestartSec=10
Environment=EXPORTER_PORT=9101
Environment=MYSQL_HOST=localhost
Environment=MYSQL_USER=YOUR_MYSQL_USER
Environment=MYSQL_PASS=YOUR_MYSQL_PASSWORD
Environment=MYSQL_DB=asterisk
Environment=SERVER_LABEL=${SERVER_LABEL}
[Install]
WantedBy=multi-user.target
SVCFILE
systemctl daemon-reload
systemctl enable asterisk_exporter
systemctl restart asterisk_exporter
echo " asterisk_exporter service started"
REMOTEOF
echo ""
echo "=== All 4 agents installed on ${SERVER_LABEL} (${SERVER_IP}) ==="
echo " heplify -> sending HEP to ${MONITOR_IP}:9060"
echo " node_exporter -> :9100"
echo " promtail -> shipping logs to ${MONITOR_IP}:3100"
echo " ast_exporter -> :9101"
echo ""
INSTALLER
chmod +x /opt/monitoring/scripts/install-agents.sh
Usage
# Install agents on your first VoIP server
./scripts/install-agents.sh YOUR_VOIP_SERVER_1_IP 22 server1 YOUR_MONITOR_VPS_IP
# Install on second server (custom SSH port)
./scripts/install-agents.sh YOUR_VOIP_SERVER_2_IP 9322 server2 YOUR_MONITOR_VPS_IP
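Once you have more than a couple of servers, it pays to drive the installer from a small inventory. A minimal wrapper sketch (the inventory layout and IPs are my own convention, not part of the tutorial's scripts):

```python
import subprocess

# Inventory: (ip, ssh_port, label) -- adjust to your fleet.
SERVERS = [
    ("10.0.1.50", "22", "server1"),
    ("10.0.2.50", "9322", "server2"),
]
MONITOR_IP = "10.0.0.10"

def build_cmd(ip, port, label, monitor_ip=MONITOR_IP):
    """Argument vector for one install-agents.sh invocation."""
    return ["./scripts/install-agents.sh", ip, port, label, monitor_ip]

def install_all():
    for ip, port, label in SERVERS:
        print(f"--- installing on {label} ({ip}) ---")
        # check=True aborts the rollout on the first failed server
        subprocess.run(build_cmd(ip, port, label), check=True)

# install_all()  # uncomment to run the rollout from /opt/monitoring
```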
Step 12: Backup Script
A simple daily backup that archives all configuration files. Add it to cron on the monitoring VPS.
cat > /opt/monitoring/scripts/backup-monitoring.sh << 'BACKUP'
#!/bin/bash
# backup-monitoring.sh — Backup all monitoring configs
# Run daily via cron: 0 2 * * * /opt/monitoring/scripts/backup-monitoring.sh
BACKUP_DIR="/var/backups/monitoring"
DATE=$(date +%Y%m%d_%H%M%S)
TARGET="${BACKUP_DIR}/monitoring_${DATE}.tar.gz"
mkdir -p "${BACKUP_DIR}"
tar czf "${TARGET}" \
/opt/monitoring/docker-compose.yml \
/opt/monitoring/.env \
/opt/monitoring/prometheus/ \
/opt/monitoring/grafana/ \
/opt/monitoring/loki/ \
/opt/monitoring/smokeping/ \
/opt/monitoring/scripts/ \
2>/dev/null
# Keep last 7 backups, delete older ones
ls -t "${BACKUP_DIR}"/monitoring_*.tar.gz | tail -n +8 | xargs rm -f 2>/dev/null
echo "Backup saved: ${TARGET} ($(du -h ${TARGET} | cut -f1))"
BACKUP
chmod +x /opt/monitoring/scripts/backup-monitoring.sh
Add to cron:
# Run backup daily at 2 AM. Append to the existing crontab rather than
# replacing it -- piping a bare echo into `crontab -` would wipe any other jobs.
( crontab -l 2>/dev/null; \
  echo "0 2 * * * /opt/monitoring/scripts/backup-monitoring.sh >> /var/log/monitoring-backup.log 2>&1" ) \
  | crontab -
Step 13: Launch and Verify
Start the stack
cd /opt/monitoring
docker compose up -d
Verify all containers are running
docker compose ps
Expected output:
NAME STATUS PORTS
blackbox-exporter Up (healthy)
grafana Up 0.0.0.0:3000->3000/tcp
heplify-server Up 0.0.0.0:9060->9060/tcp+udp
homer-webapp Up 0.0.0.0:9080->80/tcp
loki Up 0.0.0.0:3100->3100/tcp
postgres Up (healthy)
prometheus Up 0.0.0.0:9090->9090/tcp
smokeping Up 0.0.0.0:8081->80/tcp
Verify Prometheus targets
Open http://YOUR_MONITOR_VPS_IP:9090/targets in your browser. You should see all scrape jobs listed with their status (UP or DOWN). Jobs targeting remote VoIP servers will show DOWN until you install the agents.
Verify Loki is ready
curl -s http://localhost:3100/ready
# Expected: "ready"
Verify Homer is receiving data
After installing heplify on a VoIP server, check Homer at http://YOUR_MONITOR_VPS_IP:9080. Search for recent SIP traffic. Default login is admin / sipcapture.
Install agents on your VoIP servers
cd /opt/monitoring
./scripts/install-agents.sh YOUR_VOIP_SERVER_1_IP 22 server1 YOUR_MONITOR_VPS_IP
Wait 30 seconds, then check Prometheus targets again. The node and asterisk jobs for that server should show UP.
Log in to Grafana
Open http://YOUR_MONITOR_VPS_IP:3000 and log in with admin / the password from your .env file. The Prometheus, Loki, and Homer data sources should already be configured.
Grafana Dashboard Ideas
Here are PromQL queries you can use to build dashboards.
VoIP Overview Panel
# Active calls per server (stat panel)
asterisk_active_calls
# Total agents logged in (stat panel)
sum(asterisk_agents_logged_in)
# SIP trunk status table
asterisk_sip_peer_status{peer!~"[0-9]+"}
SIP Trunk Latency Graph
# Trunk latency over time (time series panel)
asterisk_sip_peer_latency_ms{peer!~"[0-9]+"}
RTP Quality Heatmap
# Packet loss distribution (heatmap panel)
asterisk_rtp_packet_loss_percent
Blackbox Probe Duration
# Probe response time (time series panel)
probe_duration_seconds{job="blackbox_icmp"}
# Probe success rate (stat panel, percentage)
avg_over_time(probe_success{job="blackbox_sip_tcp"}[1h]) * 100
System Resource Overview
# CPU usage per server (time series)
100 - (avg by(server) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage (gauge)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Disk usage percentage (gauge)
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100
Loki Log Query Examples
# Asterisk errors on a specific server
{job="asterisk", server="server1"} |= "ERROR"
# SIP registration failures across all servers
{job="asterisk"} |~ "Registration.*failed|UNREACHABLE"
# ViciDial dialer errors
{job="vicidial"} |= "ERROR" | logfmt
Tips and Tricks
1. Hot-reload Prometheus config without restart
After editing prometheus.yml or alert rules:
curl -X POST http://localhost:9090/-/reload
This works because we started Prometheus with --web.enable-lifecycle. No downtime, no data loss.
2. Check Prometheus config syntax before applying
docker exec prometheus promtool check config /etc/prometheus/prometheus.yml
docker exec prometheus promtool check rules /etc/prometheus/rules/alerts.yml
Always validate before reloading. A syntax error in rules will cause Prometheus to reject the entire reload.
3. Exclude noisy SIP messages from Homer
The heplify agent flag -dim "OPTIONS,NOTIFY" filters out SIP OPTIONS keepalives and NOTIFY events. These make up 90%+ of SIP traffic but are rarely useful for debugging. If you need them, remove the flag.
4. Use labels consistently
Every metric from the asterisk_exporter includes a server label. Use the same label values across all configs (prometheus.yml targets, promtail config, Smokeping targets). This lets you correlate metrics, logs, and SIP captures for the same server in Grafana.
5. Grafana variables for multi-server dashboards
Create a Grafana dashboard variable:
- Name: `server`
- Type: Query
- Data source: Prometheus
- Query: `label_values(asterisk_active_calls, server)`
Then use $server in all panel queries:
asterisk_active_calls{server="$server"}
This gives you a dropdown at the top of the dashboard to switch between servers.
6. Set up recording rules for expensive queries
If you have many SIP peers and the sip show peers command is slow, pre-compute aggregates:
# Add to prometheus/rules/recording.yml
groups:
- name: voip_recording_rules
interval: 30s
rules:
- record: job:asterisk_trunks_up:count
expr: count by(server) (asterisk_sip_peer_up{peer!~"[0-9]+"}==1)
- record: job:asterisk_trunks_total:count
expr: count by(server) (asterisk_sip_peer_up{peer!~"[0-9]+"})
7. Monitor Loki ingestion rate
If promtail stops shipping logs, you won't notice unless you check. Add this to your alert rules:
- alert: LokiIngestionStopped
expr: sum(rate(loki_distributor_bytes_received_total[5m])) == 0
for: 10m
labels:
severity: warning
annotations:
summary: "Loki is not receiving any logs"
8. Smokeping graph colors
Smokeping uses RRD graphs. The gray "smoke" shows the spread of round-trip times within each probe round, and the line color shifts away from green as packet loss increases. A narrow green line = stable; wide smoke or off-green colors = jitter and intermittent loss. If you see periodic patterns (e.g., loss every hour), it often correlates with backup jobs or log rotation on the target server.
9. Scale the asterisk_exporter
The exporter runs Asterisk CLI commands synchronously. On a busy server with 100+ active channels, core show channel <chan> for transcoding detection can take several seconds per channel. If scrape timeouts occur:
- Increase the Prometheus `scrape_timeout` for the asterisk job to 30s
- Or disable the `collect_transcoding()` function (it is the most expensive)
10. Persistent Docker volumes
All data is stored in named Docker volumes (prometheus_data, loki_data, etc.). This means docker compose down preserves data, but docker compose down -v destroys it. Never use -v unless you want a clean start.
Troubleshooting
Prometheus shows target as DOWN
Symptoms: Target status shows DOWN with connection refused or context deadline exceeded.
Checklist:
- Is the exporter running on the remote server?
  ssh root@YOUR_SERVER "systemctl status node_exporter"
- Is the port accessible from the monitoring VPS?
  curl -s http://YOUR_SERVER_IP:9100/metrics | head -5
- Is a firewall blocking the port?
  ssh root@YOUR_SERVER "iptables -L -n | grep 9100"
  # Or for firewalld:
  ssh root@YOUR_SERVER "firewall-cmd --list-ports"
- Add firewall rules if needed:
  # iptables
  iptables -I INPUT -p tcp --dport 9100 -s YOUR_MONITOR_VPS_IP -j ACCEPT
  iptables -I INPUT -p tcp --dport 9101 -s YOUR_MONITOR_VPS_IP -j ACCEPT
  # firewalld
  firewall-cmd --permanent --add-rich-rule='rule family=ipv4 source address=YOUR_MONITOR_VPS_IP port port=9100-9101 protocol=tcp accept'
  firewall-cmd --reload
Loki not receiving logs from promtail
Symptoms: No logs visible in Grafana Explore with the Loki data source.

Checklist:

- Check promtail status on the remote server:

  ```bash
  ssh root@YOUR_SERVER "systemctl status promtail"
  ssh root@YOUR_SERVER "journalctl -u promtail -n 50"
  ```

- Common errors:
  - `429 Too Many Requests`: Increase `ingestion_rate_mb` in `loki-config.yml`
  - `connection refused`: Verify Loki port 3100 is open on the monitoring VPS firewall
  - `file not found`: The log path in the promtail config does not exist on that server (e.g., `/var/log/asterisk/full` may not exist if `full` logging is disabled)

- Test Loki directly:

  ```bash
  curl -s http://YOUR_MONITOR_VPS_IP:3100/ready
  curl -s http://YOUR_MONITOR_VPS_IP:3100/loki/api/v1/labels
  ```
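If none of the above pinpoints the fault, it helps to separate "promtail is broken" from "Loki is broken" by bypassing promtail and writing one line straight to Loki's push endpoint. A stdlib-only sketch (the labels and URL are illustrative); an HTTP 204 response means Loki ingestion is healthy and the problem is on the promtail side:

```python
import json
import time
import urllib.request

def build_push_payload(labels: dict, line: str) -> bytes:
    """Body for Loki's push API: one stream, one line, ns-precision timestamp."""
    return json.dumps({
        "streams": [{"stream": labels,
                     "values": [[str(time.time_ns()), line]]}]
    }).encode()

def push_test_line(loki_url: str) -> int:
    """POST a synthetic log line and return the HTTP status (204 = accepted)."""
    req = urllib.request.Request(
        f"{loki_url}/loki/api/v1/push",
        data=build_push_payload({"job": "smoke-test", "host": "monitor"},
                                "loki push-path smoke test"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status

if __name__ == "__main__":
    try:
        # Run on the monitoring VPS itself so localhost reaches Loki
        print("status:", push_test_line("http://localhost:3100"))
    except OSError as exc:
        print("push failed:", exc)
```

The injected line should then appear in Grafana Explore under `{job="smoke-test"}`.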
Homer not showing SIP messages
Symptoms: Homer webapp loads but shows no SIP data.
Checklist:
- Is heplify running on the VoIP server?

  ```bash
  ssh root@YOUR_SERVER "systemctl status heplify"
  ```

- Is the HEP port accessible?

  ```bash
  # From the VoIP server, test connectivity to the monitor
  ssh root@YOUR_SERVER "nc -zvu YOUR_MONITOR_VPS_IP 9060"
  ```

- Check heplify-server logs:

  ```bash
  docker logs heplify-server --tail 50
  ```

- Check PostgreSQL has homer tables:

  ```bash
  docker exec postgres psql -U homer -d homer_data -c "\dt"
  ```

- Verify the homer user password matches between `.env`, `01-init-dbs.sql`, and the `heplify-server` environment variables.
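One caveat on the `nc -zvu` check: UDP has no handshake, so "success" only proves the local send path works. The same caveat applies to this stdlib sketch, which is handy on servers without `nc` installed (host and payload are placeholders); real confirmation is seeing the datagram arrive on the monitor, e.g. with `tcpdump -n udp port 9060`:

```python
import socket

def send_udp_probe(host: str, port: int,
                   payload: bytes = b"hep-port-probe") -> int:
    """Fire one UDP datagram; returns the byte count handed to the kernel.

    UDP is fire-and-forget, so success only proves the local send path
    works -- it does not prove the datagram reached the collector.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        return s.sendto(payload, (host, port))

if __name__ == "__main__":
    try:
        print("sent", send_udp_probe("YOUR_MONITOR_VPS_IP", 9060), "bytes")
    except OSError as exc:  # e.g. the placeholder hostname does not resolve
        print("send failed:", exc)
```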
Grafana data source connection errors
Symptoms: Grafana shows "Bad Gateway" or "Connection refused" for a data source.
Checklist:
- Data sources use Docker container names (e.g., `http://prometheus:9090`), not `localhost`. This is because Grafana runs inside the Docker network.
- For MySQL data sources pointing to external VoIP servers, use the actual IP address (not a container name).
- Verify the MySQL read-only user exists and can connect from the monitoring VPS IP:

  ```sql
  -- Run on the VoIP server's MySQL
  SELECT user, host FROM mysql.user WHERE user = 'grafana_ro';
  -- Must show host = '%' or the specific monitor VPS IP
  ```
Docker container keeps restarting
```bash
# Check logs for the failing container
docker logs <container_name> --tail 100

# Common causes:
# - postgres: init script SQL error (check 01-init-dbs.sql)
# - loki: permission error on /loki directory
# - prometheus: YAML syntax error in config
# - heplify-server: can't connect to postgres (check DB password)
```
High disk usage on monitoring VPS
```bash
# Check Docker volume sizes
docker system df -v

# Prometheus is usually the largest consumer.
# Reduce retention: change --storage.tsdb.retention.time=30d to 14d

# Force Loki compaction
curl -X POST http://localhost:3100/compactor/ring/delete

# Prune unused Docker resources
docker system prune -f
```
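Before cutting retention blindly, it is worth estimating what each setting actually costs. Prometheus's documented rule of thumb is disk ≈ retention_seconds × ingested_samples_per_second × bytes_per_sample (roughly 1-2 bytes per sample after compression; get your real ingest rate from `rate(prometheus_tsdb_head_samples_appended_total[5m])`). A quick sketch with an illustrative ingest rate:

```python
def tsdb_disk_estimate_gb(samples_per_second: float,
                          retention_days: int,
                          bytes_per_sample: float = 2.0) -> float:
    """Rough Prometheus disk need: retention * ingest rate * bytes/sample.

    1-2 bytes/sample is typical after compression; 2.0 is the
    conservative end of that range.
    """
    seconds = retention_days * 86400
    return samples_per_second * seconds * bytes_per_sample / 1e9

# e.g. ~5000 samples/s across a 4-server fleet (illustrative number)
for days in (14, 30):
    print(f"{days}d retention ~ {tsdb_disk_estimate_gb(5000, days):.1f} GB")
```

If the 30-day figure fits comfortably on your volume, shrinking retention buys you little; Loki chunks or call recordings are then the better place to look.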
Security Considerations
Firewall the monitoring ports
The monitoring stack exposes several ports. In production, restrict access:
```bash
# Allow only your admin IP to access dashboards
iptables -I INPUT -p tcp --dport 3000 -s YOUR_ADMIN_IP -j ACCEPT  # Grafana
iptables -I INPUT -p tcp --dport 9090 -s YOUR_ADMIN_IP -j ACCEPT  # Prometheus
iptables -I INPUT -p tcp --dport 9080 -s YOUR_ADMIN_IP -j ACCEPT  # Homer

# Allow only VoIP servers to push data
iptables -I INPUT -p tcp --dport 3100 -s YOUR_VOIP_SERVER_1_IP -j ACCEPT  # Loki
iptables -I INPUT -p udp --dport 9060 -s YOUR_VOIP_SERVER_1_IP -j ACCEPT  # HEP
# Repeat for each VoIP server

# Block all other access to these ports
iptables -A INPUT -p tcp --dport 3000 -j DROP
iptables -A INPUT -p tcp --dport 9090 -j DROP
# ... etc
```
Use read-only MySQL users
The asterisk_exporter and Grafana MySQL data sources should use a read-only MySQL user. Never give them write access to your production database.
```sql
-- On each VoIP server
CREATE USER 'grafana_ro'@'YOUR_MONITOR_VPS_IP' IDENTIFIED BY 'YOUR_STRONG_PASSWORD';
GRANT SELECT ON asterisk.* TO 'grafana_ro'@'YOUR_MONITOR_VPS_IP';
FLUSH PRIVILEGES;
```
Reverse proxy with TLS
For production use, put Grafana behind nginx or Caddy with TLS:
```nginx
# /etc/nginx/sites-available/grafana
server {
    listen 443 ssl;
    server_name monitoring.yourdomain.com;

    ssl_certificate     /etc/letsencrypt/live/monitoring.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/monitoring.yourdomain.com/privkey.pem;

    location / {
        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # WebSocket support (required for Grafana Live)
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```
Do not expose Prometheus externally
Prometheus has no built-in authentication. If you need remote access, use an SSH tunnel or VPN rather than exposing port 9090 to the internet.
What's Next
Once the base stack is running, consider these additions:
- **Alertmanager**: Add `prom/alertmanager` to the Docker Compose stack to send alerts via email, Slack, PagerDuty, or Telegram. Connect it by filling in the `alertmanagers` section in `prometheus.yml`.
- **Grafana Alerting**: Instead of Alertmanager, use Grafana's built-in unified alerting (already enabled in this stack) to create alert rules with contact points directly in the Grafana UI.
- **Recording cleanup monitoring**: Add a cron job that checks recording disk usage and alerts when the retention policy is not working.
- **SIP quality scoring**: Use the RTP metrics to compute a Mean Opinion Score (MOS) approximation:

  ```
  # Simplified R-factor → MOS conversion
  # R   = 93.2 - packet_loss*2.5 - jitter*0.03 - latency*0.024
  # MOS = 1 + 0.035*R + R*(R-60)*(100-R)*7e-6
  ```

- **Dashboard JSON exports**: Export your best dashboards as JSON and commit them to the `grafana/provisioning/dashboards/` directory for infrastructure-as-code.
- **Log alerting in Loki**: Use Grafana's log-based alerting to trigger on specific Asterisk log patterns (e.g., `UNREACHABLE`, `chan_sip.c: Failed to authenticate`).
- **Uptime monitoring**: Add an external uptime check (e.g., Uptime Kuma in Docker) that monitors the monitoring stack itself.
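The R-factor approximation above drops straight into code, e.g. as a helper for the custom exporter from Step 10. A minimal Python version using those simplified coefficients (a sketch, not the full ITU-T G.107 E-model):

```python
def mos_estimate(packet_loss_pct: float, jitter_ms: float,
                 latency_ms: float) -> float:
    """Approximate MOS from RTP stats via a simplified R-factor.

    R is clamped to [0, 100] before the standard R -> MOS polynomial.
    """
    r = 93.2 - packet_loss_pct * 2.5 - jitter_ms * 0.03 - latency_ms * 0.024
    r = max(0.0, min(100.0, r))
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6

# Illustrative values: a clean call vs. a degraded trunk
print(round(mos_estimate(0, 2, 20), 2))    # ~4.4  (good)
print(round(mos_estimate(5, 40, 150), 2))  # ~3.86 (noticeable degradation)
```

A MOS below roughly 3.5 is where callers start complaining, so that makes a sensible alert threshold for a gauge derived from this function.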
Summary
You now have a complete, production-grade VoIP monitoring stack:
- 7 Docker services working together on a single host
- 4 lightweight agents deployable to any number of VoIP servers with one script
- 14 alert rules covering trunk failures, call quality, system resources, agent behavior, and security
- Centralized logs searchable across all servers
- SIP packet capture with call flow diagrams
- Network latency baselines for all SIP providers
- Daily backups with 7-day retention
The total resource footprint on the monitoring VPS is approximately 2-3 GB RAM and minimal CPU at idle, scaling linearly with the number of monitored servers. The remote agents use less than 100 MB RAM combined per server.
This stack has been running in production monitoring a multi-server VoIP call center fleet (4 Asterisk/ViciDial servers, 7 SIP providers, 50+ agents) with zero data loss and sub-second query times in Grafana.
Built from production experience. Every configuration in this tutorial has been tested under real VoIP traffic.