Building a Complete VoIP Monitoring Stack with Docker

Grafana + Prometheus + Loki + Homer + Smokeping + Blackbox Exporter


Difficulty Intermediate
Time to Complete 3-4 hours
Prerequisites Linux VPS (Ubuntu 22.04+), Docker, basic VoIP/SIP knowledge
Tested On Ubuntu 24.04 LTS, Docker 24.x, 8 CPU / 16 GB RAM

Table of Contents

  1. Introduction
  2. What You'll Build
  3. Architecture Overview
  4. Prerequisites
  5. Directory Structure
  6. Step 1: Environment Variables
  7. Step 2: Docker Compose Stack
  8. Step 3: PostgreSQL Init Script
  9. Step 4: Prometheus Configuration
  10. Step 5: Blackbox Exporter Modules
  11. Step 6: Prometheus Alert Rules
  12. Step 7: Loki Configuration
  13. Step 8: Smokeping Targets
  14. Step 9: Grafana Provisioning
  15. Step 10: Custom Asterisk Exporter
  16. Step 11: Remote Agent Installation Script
  17. Step 12: Backup Script
  18. Step 13: Launch and Verify
  19. Grafana Dashboard Ideas
  20. Tips and Tricks
  21. Troubleshooting
  22. Security Considerations
  23. What's Next

Introduction

If you run a VoIP operation -- a call center, a telecom platform, or even a handful of Asterisk/FreePBX servers -- you know the pain. A SIP trunk silently drops. Packet loss creeps up at 2 AM. An agent gets stuck in a zombie conference for three hours. Disk fills up with recordings and nobody notices until calls start failing.

The standard approach is to SSH into each server, run sip show peers, grep some logs, and hope you catch problems before your customers do. That does not scale past two servers.

This tutorial walks you through building a centralized VoIP monitoring stack that runs on a single Docker host and monitors any number of remote VoIP servers. It is based on a production system that monitors a multi-server ViciDial call center fleet across four data centers and seven SIP providers. Every configuration file in this tutorial comes from that real deployment, sanitized and annotated.

What problems does this solve?

  • SIP trunk failures surface as alerts within minutes instead of customer complaints
  • Packet loss and latency trends become visible before they degrade call quality
  • Zombie conferences and stuck agents are flagged automatically
  • Disk, CPU, and memory exhaustion is caught before calls start failing
  • One dashboard replaces SSH-ing into every server to run ad-hoc commands


What You'll Build

When you finish this tutorial, you will have a single Docker Compose stack exposing these services:

Service Port Purpose
Grafana :3000 Unified dashboards -- metrics, logs, SIP data, all in one UI
Prometheus :9090 Time-series metrics database (30-day retention)
Loki :3100 Log aggregation engine (7-day retention)
Homer :9080 SIP capture and search (7-day retention)
Smokeping :8081 Network latency graphs with historical baselines
Blackbox Exporter (internal) ICMP pings, TCP SIP port checks, HTTP probes
PostgreSQL (internal) Backend database for Homer SIP data

On each remote VoIP server, you will install four lightweight agents:

Agent Port Purpose
node_exporter :9100 System metrics (CPU, RAM, disk, network)
asterisk_exporter :9101 Custom Asterisk/ViciDial metrics (SIP peers, active calls, agent states, RTP quality, codecs)
promtail :9080 Ships Asterisk logs, ViciDial logs, and syslog to Loki
heplify -- Captures SIP packets off the wire and sends HEP to Homer

What the dashboards look like


Architecture Overview

                         ┌─────────────────────────────────────────────────────────┐
                         │              MONITORING VPS (Docker Host)               │
                         │                                                         │
                         │  ┌───────────┐  ┌────────────┐  ┌──────────────────┐   │
                         │  │  Grafana   │  │ Prometheus │  │      Loki        │   │
                         │  │  :3000     │  │  :9090     │  │     :3100        │   │
                         │  │           ◄├──┤            │  │                  │   │
                         │  │           ◄├──┼────────────┼──┤                  │   │
                         │  │           ◄├──┤            │  │                  │   │
                         │  └───────────┘  │  ┌───────┐ │  └────────▲─────────┘   │
                         │                 │  │ Rules │ │           │              │
                         │  ┌───────────┐  │  │(14)   │ │  ┌───────┴──────────┐   │
                         │  │ Smokeping │  │  └───────┘ │  │  Homer (webapp)   │   │
                         │  │  :8081    │  │     │      │  │     :9080         │   │
                         │  └───────────┘  │     ▼      │  └───────▲──────────┘   │
                         │                 │ ┌────────┐ │          │              │
                         │  ┌───────────┐  │ │Blackbox│ │  ┌──────┴───────────┐   │
                         │  │PostgreSQL │  │ │Exporter│ │  │ heplify-server   │   │
                         │  │   (16)    │  │ └────────┘ │  │    :9060/udp     │   │
                         │  └─────▲─────┘  └────────────┘  └──────▲───────────┘   │
                         │        │                               │               │
                         └────────┼───────────────────────────────┼───────────────┘
                                  │                               │
            ┌─────────────────────┼───────────────────────────────┼──────────────────┐
            │                     │         NETWORK               │                  │
            │  ┌──────────────────┼───────────────────────────────┼────────────────┐ │
            │  │                                                                   │ │
     ┌──────┴──┴──────┐   ┌──────┴──┴──────┐   ┌──────────────┐   ┌─────────────┐ │ │
     │  VoIP Server 1 │   │  VoIP Server 2 │   │ VoIP Server 3│   │ SIP Providers│ │ │
     │                │   │                │   │              │   │             │ │ │
     │ node_exporter  │   │ node_exporter  │   │ node_exporter│   │ ICMP ping   │ │ │
     │ :9100          │   │ :9100          │   │ :9100        │   │ TCP :5060   │ │ │
     │ ast_exporter   │   │ ast_exporter   │   │ ast_exporter │   │             │ │ │
     │ :9101          │   │ :9101          │   │ :9101        │   └─────────────┘ │ │
     │ promtail ──────┼───► Loki           │   │ promtail     │                   │ │
     │ heplify  ──────┼───► heplify-server │   │ heplify      │                   │ │
     └────────────────┘   └────────────────┘   └──────────────┘                   │ │
            │                                                                      │ │
            └──────────────────────────────────────────────────────────────────────┘ │
                                                                                     │
            ┌────────────────────────────────────────────────────────────────────────┘
            │ Blackbox Exporter probes: ICMP, TCP SIP (:5060), HTTP to all targets
            └─────────────────────────────────────────────────────────────────────────

Data flow summary

  1. Metrics (pull): Prometheus scrapes node_exporter (:9100) and asterisk_exporter (:9101) on each VoIP server every 15 seconds. It also scrapes Blackbox Exporter results for external probes.
  2. Logs (push): Promtail on each VoIP server pushes Asterisk logs, ViciDial logs, and syslog to Loki on :3100.
  3. SIP packets (push): Heplify on each VoIP server captures SIP packets off the network interface and sends them via HEP protocol to heplify-server on :9060/udp.
  4. Latency (active): Smokeping sends FPing probes to all VoIP servers and SIP providers continuously.
  5. External probes (active): Blackbox Exporter probes SIP provider ports (TCP :5060), pings servers (ICMP), and checks HTTP endpoints.
  6. Visualization: Grafana connects to Prometheus, Loki, and PostgreSQL (Homer data) as data sources, providing a single pane of glass.

Prerequisites

Monitoring VPS requirements

  • Ubuntu 22.04 LTS or newer (this guide was tested on 24.04)
  • Docker Engine 24.x with the Compose plugin
  • Headroom for 30 days of metrics plus 7 days of logs and SIP data -- the reference deployment runs 8 CPU / 16 GB RAM

On each monitored VoIP server

  • Root SSH access (the agent installer in Step 11 runs as root)
  • Asterisk installed (ViciDial/FreePBX setups are covered by the custom exporter)
  • Reachability: the monitoring VPS must reach ports 9100 and 9101 on each server; each server must reach the VPS on 3100 (Loki) and 9060/udp (HEP)

Install Docker (if not already installed)

# Install Docker Engine
curl -fsSL https://get.docker.com | sh

# Install Docker Compose plugin
apt-get install -y docker-compose-plugin

# Verify
docker --version
docker compose version

Directory Structure

Create the full directory tree before starting:

mkdir -p /opt/monitoring/{prometheus/rules,loki,grafana/provisioning/{datasources,dashboards},smokeping/config,postgres-init,scripts}
cd /opt/monitoring

Final structure:

/opt/monitoring/
├── docker-compose.yml              # All services
├── .env                            # Passwords, IPs, ports
├── postgres-init/
│   └── 01-init-dbs.sql             # Homer database setup
├── prometheus/
│   ├── prometheus.yml              # Scrape targets, job definitions
│   ├── blackbox.yml                # Blackbox Exporter probe modules
│   └── rules/
│       └── alerts.yml              # 14 alert rules in 8 groups
├── loki/
│   └── loki-config.yml             # Loki storage, retention, limits
├── grafana/
│   └── provisioning/
│       ├── datasources/
│       │   └── all.yml             # Prometheus, Loki, Homer, MySQL sources
│       └── dashboards/
│           └── dashboard.yml       # Dashboard folder provisioning
├── smokeping/
│   └── config/
│       └── Targets                 # Ping targets (servers + providers)
└── scripts/
    ├── install-agents.sh           # One-command remote agent installer
    ├── asterisk_exporter.py        # Custom Prometheus exporter for Asterisk
    └── backup-monitoring.sh        # Daily config backup (7-day retention)
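The single mkdir above relies on bash brace expansion to create the whole nested tree at once. A quick way to sanity-check it -- sketched against a throwaway temp directory so it is safe to copy-paste anywhere; in production, run `find /opt/monitoring -type d` instead:

```shell
# Re-run the brace-expansion mkdir against a temp root and count what it creates.
ROOT=$(mktemp -d)
mkdir -p "$ROOT"/{prometheus/rules,loki,grafana/provisioning/{datasources,dashboards},smokeping/config,postgres-init,scripts}
find "$ROOT" -type d | wc -l    # root + 11 subdirectories = 12
```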

Step 1: Environment Variables

Create the .env file. This is the single place where all secrets and server-specific values live. Never commit this file to git.

cat > /opt/monitoring/.env << 'EOF'
# ──────────────────────────────────────────────────
# VoIP Monitoring Stack — Environment Variables
# ──────────────────────────────────────────────────

# ─── Passwords (CHANGE THESE) ───
POSTGRES_PASSWORD=YOUR_POSTGRES_PASSWORD
GRAFANA_ADMIN_PASSWORD=YOUR_GRAFANA_PASSWORD
HOMER_DB_PASSWORD=YOUR_HOMER_DB_PASSWORD

# ─── Monitoring VPS IP ───
MONITOR_IP=YOUR_MONITOR_VPS_IP

# ─── Service Ports ───
GRAFANA_PORT=3000
PROMETHEUS_PORT=9090
LOKI_PORT=3100
HOMER_PORT=9080
SMOKEPING_PORT=8081
HEP_PORT=9060

# ─── Retention ───
HOMER_RETENTION_DAYS=7
PROMETHEUS_RETENTION_DAYS=30
LOKI_RETENTION_DAYS=7

# ─── VoIP Server IPs (for reference) ───
VOIP_SERVER_1_IP=YOUR_VOIP_SERVER_1_IP
VOIP_SERVER_2_IP=YOUR_VOIP_SERVER_2_IP
VOIP_SERVER_3_IP=YOUR_VOIP_SERVER_3_IP
VOIP_SERVER_4_IP=YOUR_VOIP_SERVER_4_IP

# ─── VoIP MySQL Read-Only Access (for Grafana datasources) ───
VOIP_MYSQL_USER=YOUR_MYSQL_RO_USER
VOIP_MYSQL_PASSWORD=YOUR_MYSQL_RO_PASSWORD
VOIP_MYSQL_DB=asterisk
EOF

chmod 600 /opt/monitoring/.env

Security note: The .env file contains database passwords and should be readable only by root. The chmod 600 ensures this.
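Avoid inventing the three passwords by hand. A small sketch that generates strong random secrets with openssl -- paste the output over the CHANGE THESE placeholders above:

```shell
# Print one random 32-character hex secret per .env password variable.
for name in POSTGRES_PASSWORD GRAFANA_ADMIN_PASSWORD HOMER_DB_PASSWORD; do
  printf '%s=%s\n' "$name" "$(openssl rand -hex 16)"
done
```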


Step 2: Docker Compose Stack

This is the core of the deployment. Eight services, one network, persistent volumes for all data.

cat > /opt/monitoring/docker-compose.yml << 'COMPOSE'
version: "3.8"

services:
  # ─── PostgreSQL (Homer SIP data backend) ───────────────────────
  postgres:
    image: postgres:16-alpine
    container_name: postgres
    restart: unless-stopped
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./postgres-init:/docker-entrypoint-initdb.d
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - monitoring

  # ─── heplify-server (HEP collector → PostgreSQL) ──────────────
  # Receives SIP packets from heplify agents on VoIP servers
  # and stores them in PostgreSQL for Homer webapp to query.
  heplify-server:
    image: sipcapture/heplify-server:latest
    container_name: heplify-server
    restart: unless-stopped
    ports:
      - "9060:9060/udp"    # HEP input (UDP)
      - "9060:9060/tcp"    # HEP input (TCP fallback)
    command:
      - "./heplify-server"
    environment:
      HEPLIFYSERVER_HEPADDR: "0.0.0.0:9060"
      HEPLIFYSERVER_DBSHEMA: "homer7"    # sic -- "DBSHEMA" is the upstream spelling, do not "fix" it
      HEPLIFYSERVER_DBDRIVER: "postgres"
      HEPLIFYSERVER_DBADDR: "postgres:5432"
      HEPLIFYSERVER_DBUSER: "homer"
      HEPLIFYSERVER_DBPASS: "${HOMER_DB_PASSWORD}"
      HEPLIFYSERVER_DBDATATABLE: "homer_data"
      HEPLIFYSERVER_DBCONFTABLE: "homer_config"
      HEPLIFYSERVER_DBDROPDAYS: 7        # Auto-purge SIP data older than 7 days
      HEPLIFYSERVER_LOGLVL: "info"
      HEPLIFYSERVER_LOGSTD: "true"
      HEPLIFYSERVER_PROMADDR: "0.0.0.0:9096"   # Expose metrics for Prometheus
      HEPLIFYSERVER_DEDUP: "false"
      HEPLIFYSERVER_ALEGIDS: "X-CID"
      HEPLIFYSERVER_FORCEALEGID: "false"
    depends_on:
      postgres:
        condition: service_healthy
    networks:
      - monitoring

  # ─── Homer Web UI ──────────────────────────────────────────────
  # SIP search interface: call flow diagrams, SIP message search,
  # correlation by Call-ID, From, To, etc.
  homer-webapp:
    image: sipcapture/webapp:latest
    container_name: homer-webapp
    restart: unless-stopped
    ports:
      - "${HOMER_PORT:-9080}:80"
    environment:
      DB_HOST: postgres
      DB_USER: homer
      DB_PASS: ${HOMER_DB_PASSWORD}
    depends_on:
      postgres:
        condition: service_healthy
      heplify-server:
        condition: service_started
    networks:
      - monitoring

  # ─── Prometheus ────────────────────────────────────────────────
  # Scrapes metrics from all exporters every 15s.
  # 30-day retention. Hot-reload via /-/reload endpoint.
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "${PROMETHEUS_PORT:-9090}:9090"
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"
      - "--web.enable-lifecycle"           # Enables /-/reload for config changes
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/rules:/etc/prometheus/rules:ro
      - ./prometheus/blackbox.yml:/etc/prometheus/blackbox.yml:ro
      - prometheus_data:/prometheus
    depends_on:
      - blackbox-exporter
    networks:
      - monitoring

  # ─── Blackbox Exporter (external probing) ──────────────────────
  # Probes external endpoints: ICMP ping, TCP SIP port, HTTP checks.
  # Prometheus scrapes the results via /probe endpoint.
  blackbox-exporter:
    image: prom/blackbox-exporter:v0.25.0
    container_name: blackbox-exporter
    restart: unless-stopped
    volumes:
      - ./prometheus/blackbox.yml:/config/blackbox.yml:ro
    command:
      - "--config.file=/config/blackbox.yml"
    networks:
      - monitoring

  # ─── Loki (log aggregation) ────────────────────────────────────
  # Receives logs from promtail agents. TSDB v13 schema, 7-day retention.
  loki:
    image: grafana/loki:2.9.6
    container_name: loki
    restart: unless-stopped
    ports:
      - "${LOKI_PORT:-3100}:3100"
    command: -config.file=/etc/loki/loki-config.yml
    volumes:
      - ./loki/loki-config.yml:/etc/loki/loki-config.yml:ro
      - loki_data:/loki
    networks:
      - monitoring

  # ─── Grafana (unified dashboards) ──────────────────────────────
  # Single pane of glass: Prometheus metrics, Loki logs, Homer SIP
  # data, and direct MySQL queries to VoIP servers.
  grafana:
    image: grafana/grafana:10.4.1
    container_name: grafana
    restart: unless-stopped
    ports:
      - "${GRAFANA_PORT:-3000}:3000"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
      GF_INSTALL_PLUGINS: "grafana-clock-panel,grafana-worldmap-panel,grafana-polystat-panel,grafana-piechart-panel,yesoreyeram-infinity-datasource"
      GF_SERVER_ROOT_URL: "http://${MONITOR_IP:-localhost}:3000"
      GF_SMTP_ENABLED: "false"
      GF_ALERTING_ENABLED: "false"                 # Disable legacy alerting
      GF_UNIFIED_ALERTING_ENABLED: "true"           # Use unified alerting (Grafana 10+)
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - grafana_data:/var/lib/grafana
    depends_on:
      - prometheus
      - loki
    networks:
      - monitoring

  # ─── Smokeping ─────────────────────────────────────────────────
  # Continuous FPing-based latency monitoring with historical graphs.
  # Excellent for spotting intermittent packet loss patterns.
  smokeping:
    image: lscr.io/linuxserver/smokeping:latest
    container_name: smokeping
    restart: unless-stopped
    ports:
      - "${SMOKEPING_PORT:-8081}:80"
    environment:
      PUID: 1000
      PGID: 1000
      TZ: Europe/London     # Adjust to your timezone
    volumes:
      - ./smokeping/config/Targets:/config/Targets:ro
      - smokeping_data:/data
    networks:
      - monitoring

volumes:
  postgres_data:
  prometheus_data:
  loki_data:
  grafana_data:
  smokeping_data:

networks:
  monitoring:
    driver: bridge
COMPOSE

Why these specific versions?

Image Version Reason
Prometheus v2.51.0 Stable TSDB, native histogram support, lifecycle API
Loki 2.9.6 Last 2.x LTS before 3.0 breaking changes, TSDB v13 support
Grafana 10.4.1 Unified alerting, correlation features, stable plugin ecosystem
PostgreSQL 16-alpine Homer 7 compatibility, small image footprint
Blackbox v0.25.0 Stable release with all probe types we need

Step 3: PostgreSQL Init Script

Homer needs two databases: one for SIP data, one for its configuration. This script runs automatically on first container start.

cat > /opt/monitoring/postgres-init/01-init-dbs.sql << 'EOF'
-- Create Homer user and databases
CREATE USER homer WITH PASSWORD 'YOUR_HOMER_DB_PASSWORD';
CREATE DATABASE homer_data OWNER homer;
CREATE DATABASE homer_config OWNER homer;
GRANT ALL PRIVILEGES ON DATABASE homer_data TO homer;
GRANT ALL PRIVILEGES ON DATABASE homer_config TO homer;
EOF

Important: Replace YOUR_HOMER_DB_PASSWORD with the same value you used for HOMER_DB_PASSWORD in the .env file. This SQL file runs only once -- when the PostgreSQL volume is first created. If you change it afterwards, either remove the postgres_data volume and start over (docker compose down -v destroys all stored data) or apply the change manually with psql.
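Keeping the SQL placeholder and the .env value in sync by hand is error-prone. A sketch of stamping the password in with sed -- demonstrated here on throwaway copies so it is safe to run as-is; point the two path variables at the real files under /opt/monitoring before first launch (passwords containing / or & would need a different sed delimiter):

```shell
# Demo setup on temp files (use the real /opt/monitoring paths in production):
WORK=$(mktemp -d)
ENV_FILE="$WORK/.env"
SQL_FILE="$WORK/01-init-dbs.sql"
printf 'HOMER_DB_PASSWORD=demo-secret\n' > "$ENV_FILE"
printf "CREATE USER homer WITH PASSWORD 'YOUR_HOMER_DB_PASSWORD';\n" > "$SQL_FILE"

# The actual sync step: read the value from .env, substitute the placeholder.
HOMER_DB_PASSWORD=$(grep '^HOMER_DB_PASSWORD=' "$ENV_FILE" | cut -d= -f2-)
sed -i "s/YOUR_HOMER_DB_PASSWORD/${HOMER_DB_PASSWORD}/" "$SQL_FILE"
grep 'PASSWORD' "$SQL_FILE"
```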


Step 4: Prometheus Configuration

This is the most complex config file. It defines what Prometheus scrapes, how often, and from where.

cat > /opt/monitoring/prometheus/prometheus.yml << 'EOF'
global:
  scrape_interval: 15s       # How often to pull metrics from targets
  evaluation_interval: 15s   # How often to evaluate alert rules

rule_files:
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers: []
  # If you add Alertmanager later:
  # alertmanagers:
  #   - static_configs:
  #       - targets: ["alertmanager:9093"]

scrape_configs:
  # ─── Prometheus self-monitoring ───
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # ─── Blackbox Exporter self-metrics ───
  - job_name: "blackbox"
    static_configs:
      - targets: ["blackbox-exporter:9115"]

  # ─── heplify-server metrics (SIP packet rates, DB writes) ───
  - job_name: "heplify-server"
    static_configs:
      - targets: ["heplify-server:9096"]

  # ─── Node Exporter (system metrics per server) ────────────
  # Each target is a VoIP server running node_exporter on :9100.
  # Labels let you filter/group by server name in Grafana.
  - job_name: "node"
    static_configs:
      - targets: ["YOUR_VOIP_SERVER_1_IP:9100"]
        labels:
          server: "voip-server-1"
          alias: "primary"
      - targets: ["YOUR_VOIP_SERVER_2_IP:9100"]
        labels:
          server: "voip-server-2"
          alias: "secondary"
      - targets: ["YOUR_VOIP_SERVER_3_IP:9100"]
        labels:
          server: "voip-server-3"
          alias: "tertiary"
      - targets: ["YOUR_VOIP_SERVER_4_IP:9100"]
        labels:
          server: "voip-server-4"
          alias: "quaternary"

  # ─── Asterisk Exporter (VoIP-specific metrics per server) ─
  # Custom Python exporter that queries Asterisk AMI and
  # ViciDial/CDR MySQL for SIP peer status, active calls,
  # agent states, RTP quality, codecs, and more.
  - job_name: "asterisk"
    scrape_interval: 15s
    static_configs:
      - targets: ["YOUR_VOIP_SERVER_1_IP:9101"]
        labels:
          server: "voip-server-1"
          alias: "primary"
      - targets: ["YOUR_VOIP_SERVER_2_IP:9101"]
        labels:
          server: "voip-server-2"
          alias: "secondary"
      - targets: ["YOUR_VOIP_SERVER_3_IP:9101"]
        labels:
          server: "voip-server-3"
          alias: "tertiary"
      - targets: ["YOUR_VOIP_SERVER_4_IP:9101"]
        labels:
          server: "voip-server-4"
          alias: "quaternary"

  # ─── Blackbox: ICMP ping to all servers and SIP providers ─
  # Tests basic reachability. Alert if any target is down for 5 min.
  - job_name: "blackbox_icmp"
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
          - YOUR_VOIP_SERVER_1_IP
          - YOUR_VOIP_SERVER_2_IP
          - YOUR_VOIP_SERVER_3_IP
          - YOUR_VOIP_SERVER_4_IP
          - YOUR_SIP_PROVIDER_1_IP    # e.g., primary inbound provider
          - YOUR_SIP_PROVIDER_2_IP    # e.g., outbound provider
          - YOUR_SIP_PROVIDER_3_IP    # e.g., backup trunk
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

  # ─── Blackbox: TCP probe SIP port 5060 ────────────────────
  # Verifies SIP providers are accepting connections on :5060.
  # More specific than ICMP -- detects SIP service crashes
  # even when the host is still pingable.
  - job_name: "blackbox_sip_tcp"
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
          - YOUR_SIP_PROVIDER_1_IP:5060
          - YOUR_SIP_PROVIDER_2_IP:5060
          - YOUR_SIP_PROVIDER_3_IP:5060
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

  # ─── Blackbox: HTTP probe VoIP web interfaces ─────────────
  # Checks that web UIs (ViciDial admin, FreePBX, etc.) respond.
  # Accepts 200, 301, 302, 401, 403 as "up" (login pages return 401/403).
  - job_name: "blackbox_http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://YOUR_VOIP_SERVER_1_IP/vicidial/
        labels:
          server: "voip-server-1"
      - targets:
          - http://YOUR_VOIP_SERVER_2_IP/vicidial/
        labels:
          server: "voip-server-2"
      - targets:
          - http://YOUR_VOIP_SERVER_3_IP/vicidial/
        labels:
          server: "voip-server-3"
      - targets:
          - http://YOUR_VOIP_SERVER_4_IP/vicidial/
        labels:
          server: "voip-server-4"
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
EOF
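If you add or remove servers often, editing prometheus.yml and reloading gets tedious. Prometheus also supports file-based service discovery, which it re-reads automatically without a restart. A sketch of what the node job could look like -- the targets path and filename are assumptions, pick your own:

```yaml
# In prometheus.yml -- replaces the static_configs of the "node" job:
- job_name: "node"
  file_sd_configs:
    - files:
        - /etc/prometheus/targets/node.json
      refresh_interval: 1m
```

The node.json file then holds entries like [{"targets": ["YOUR_VOIP_SERVER_1_IP:9100"], "labels": {"server": "voip-server-1", "alias": "primary"}}]. Remember to mount the targets directory into the Prometheus container alongside the other volumes.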

Understanding the relabel_configs (Blackbox pattern)

The relabel_configs block in the blackbox jobs is a standard Prometheus pattern that confuses newcomers. Here is what it does:

  1. The targets list contains the actual endpoints to probe (e.g., YOUR_SIP_PROVIDER_1_IP:5060)
  2. Prometheus needs to scrape the Blackbox Exporter, not the target directly
  3. The relabel rules:
    • Copy the target address into the __param_target label (becomes ?target= query param)
    • Save it as the instance label (so it shows up correctly in Grafana)
    • Replace __address__ with the Blackbox Exporter's address (where Prometheus actually sends the HTTP request)

Result: Prometheus sends GET http://blackbox-exporter:9115/probe?target=YOUR_SIP_PROVIDER_1_IP:5060&module=tcp_connect, and the Blackbox Exporter performs the actual probe.
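The same transformation can be traced in plain shell, outside Prometheus, to make the three rules concrete (the provider address below is a documentation IP, not one from the config above):

```shell
target="203.0.113.10:5060"          # what static_configs provides as __address__
module="tcp_connect"

param_target="$target"              # rule 1: __address__ -> __param_target
instance="$param_target"            # rule 2: __param_target -> instance label
address="blackbox-exporter:9115"    # rule 3: __address__ -> the exporter itself

echo "GET http://${address}/probe?target=${param_target}&module=${module}"
```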


Step 5: Blackbox Exporter Modules

Four probe types, each tuned for VoIP infrastructure.

cat > /opt/monitoring/prometheus/blackbox.yml << 'EOF'
modules:
  # ─── ICMP Ping ─────────────────────────────────────
  # Basic reachability check. Force IPv4 to avoid
  # dual-stack issues common in data centers.
  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: ip4

  # ─── TCP Connect ───────────────────────────────────
  # Verify a TCP port accepts connections.
  # Used for SIP :5060 checks.
  tcp_connect:
    prober: tcp
    timeout: 5s

  # ─── HTTP 2xx ──────────────────────────────────────
  # Check web interfaces respond. We accept 401/403
  # because login-protected pages (ViciDial, FreePBX)
  # return these codes when not authenticated.
  http_2xx:
    prober: http
    timeout: 10s
    http:
      method: GET
      preferred_ip_protocol: ip4
      valid_status_codes: [200, 301, 302, 401, 403]
      follow_redirects: true

  # ─── SIP Options (TCP) ────────────────────────────
  # TCP-level check specifically for SIP endpoints.
  # For actual SIP OPTIONS probing, consider using
  # a dedicated SIP prober like sipvicious or sipp.
  sip_options:
    prober: tcp
    timeout: 5s
    tcp:
      preferred_ip_protocol: ip4
EOF
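If any of your web panels sit behind HTTPS (self-signed certificates are common on PBX admin UIs), a TLS-aware variant of http_2xx can be added alongside the modules above. The module name here is a suggestion:

```yaml
  # HTTPS variant: insist on TLS, tolerate self-signed certificates.
  http_2xx_tls:
    prober: http
    timeout: 10s
    http:
      method: GET
      preferred_ip_protocol: ip4
      valid_status_codes: [200, 301, 302, 401, 403]
      fail_if_not_ssl: true
      tls_config:
        insecure_skip_verify: true
```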

Step 6: Prometheus Alert Rules

14 alert rules organized into 8 groups. These cover the most common VoIP failure modes, from trunk failures to zombie conferences to disk space exhaustion.

cat > /opt/monitoring/prometheus/rules/alerts.yml << 'EOF'
groups:
  # ═══════════════════════════════════════════════════════════
  # GROUP 1: SIP Trunk Health
  # ═══════════════════════════════════════════════════════════
  - name: trunk_alerts
    rules:
      # Alert when a SIP trunk (not an agent extension) goes UNREACHABLE.
      # The regex filter peer!~"[0-9]+" excludes numeric SIP peers
      # (agent softphones), which go offline normally when agents log out.
      - alert: SIPTrunkDown
        expr: asterisk_sip_peer_status{status!="OK",peer!~"[0-9]+"} == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SIP trunk {{ $labels.peer }} DOWN on {{ $labels.server }}"
          description: >-
            Trunk {{ $labels.peer }} has been unreachable for more than 5 minutes.
            Check provider status page, verify SIP credentials, and inspect
            Asterisk logs for registration failures.

      # High trunk latency degrades audio quality before the trunk
      # fully drops. 500ms threshold gives early warning.
      - alert: SIPTrunkHighLatency
        expr: asterisk_sip_peer_latency_ms{peer!~"[0-9]+"} > 500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "SIP trunk {{ $labels.peer }} high latency on {{ $labels.server }}"
          description: >-
            Trunk {{ $labels.peer }} qualify latency is {{ $value }}ms (>500ms for 5 min).
            This may cause choppy audio. Check network path and provider load.

  # ═══════════════════════════════════════════════════════════
  # GROUP 2: Call Activity
  # ═══════════════════════════════════════════════════════════
  - name: call_alerts
    rules:
      # If a server has zero active calls for 30 minutes during
      # business hours, something is probably wrong.
      - alert: NoActiveCalls
        expr: asterisk_active_calls == 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "No active calls on {{ $labels.server }} for 30 minutes"
          description: "Zero active calls. Check if dialer campaigns are running."

      # A call lasting >2 hours is almost certainly a zombie conference
      # (agent hung up but the bridge stayed open). These consume a
      # channel and can block the agent from receiving new calls.
      - alert: ZombieConference
        expr: asterisk_agent_incall_duration_seconds > 7200
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Agent {{ $labels.agent }} stuck in call >2h on {{ $labels.server }}"
          description: >-
            Agent has been INCALL for {{ $value | humanizeDuration }}.
            Possible zombie conference. Check with: asterisk -rx "confbridge list"

  # ═══════════════════════════════════════════════════════════
  # GROUP 3: Codec / Transcoding
  # ═══════════════════════════════════════════════════════════
  - name: codec_alerts
    rules:
      # Transcoding (e.g., G.729 → alaw) causes CPU load and can
      # produce robotic-sounding audio. In a properly configured
      # system, there should be zero transcoding.
      - alert: ActiveTranscoding
        expr: asterisk_transcoding_channels > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} channels transcoding on {{ $labels.server }}"
          description: >-
            Asterisk is actively transcoding {{ $value }} channels.
            This causes CPU load and can degrade audio quality (robotic voice).
            Ensure all trunks and endpoints use the same codec (usually alaw or ulaw).

  # ═══════════════════════════════════════════════════════════
  # GROUP 4: RTP Quality
  # ═══════════════════════════════════════════════════════════
  - name: rtp_alerts
    rules:
      # >5% packet loss means noticeably degraded call quality.
      # At >10%, calls become unusable.
      - alert: HighRTPPacketLoss
        expr: asterisk_rtp_packet_loss_percent > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High RTP packet loss on {{ $labels.server }}"
          description: "RTP packet loss is {{ $value }}% for peer {{ $labels.peer }}."

  # ═══════════════════════════════════════════════════════════
  # GROUP 5: System Resources
  # ═══════════════════════════════════════════════════════════
  - name: system_alerts
    rules:
      - alert: DiskSpaceHigh
        expr: >-
          (1 - node_filesystem_avail_bytes{mountpoint="/"}
             / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Disk usage >85% on {{ $labels.server }}'
          description: 'Root filesystem is {{ $value | printf "%.1f" }}% full.'

      - alert: CPUSustainedHigh
        expr: >-
          100 - (avg by(server)
            (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU >90% sustained on {{ $labels.server }}"
          description: "CPU usage has been above 90% for 10 minutes."

      - alert: HighMemoryUsage
        expr: >-
          (1 - node_memory_MemAvailable_bytes
             / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage >90% on {{ $labels.server }}"
          description: 'Memory usage is {{ $value | printf "%.1f" }}%.'

      - alert: HighLoadAverage
        expr: >-
          node_load5
          / count without(cpu, mode) (node_cpu_seconds_total{mode="idle"}) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High load average on {{ $labels.server }}"
          description: '5-min load average is {{ $value | printf "%.1f" }}x CPU count.'

  # ═══════════════════════════════════════════════════════════
  # GROUP 6: Security
  # ═══════════════════════════════════════════════════════════
  - name: security_alerts
    rules:
      # >20 fail2ban bans in 5 minutes suggests a brute-force attack
      # or a misconfigured SIP device flooding registrations.
      - alert: Fail2banStorm
        expr: increase(asterisk_fail2ban_bans_total[5m]) > 20
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Fail2ban storm on {{ $labels.server }}"
          description: ">20 bans in 5 minutes. Possible SIP brute-force attack."

  # ═══════════════════════════════════════════════════════════
  # GROUP 7: Agent Behavior
  # ═══════════════════════════════════════════════════════════
  - name: agent_alerts
    rules:
      # Agents paused for >2 hours are probably AFK without logging out.
      # This is an operational issue, not a system failure.
      - alert: AgentStuckPaused
        expr: asterisk_agent_pause_duration_seconds > 7200
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "Agent {{ $labels.agent }} paused >2 hours on {{ $labels.server }}"
          description: "Agent has been in PAUSED state for {{ $value | humanizeDuration }}."

  # ═══════════════════════════════════════════════════════════
  # GROUP 8: External Probe Failures
  # ═══════════════════════════════════════════════════════════
  - name: probe_alerts
    rules:
      # Catches SIP provider outages, server unreachability,
      # and web UI failures -- anything the Blackbox Exporter probes.
      - alert: EndpointDown
        expr: probe_success == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Endpoint {{ $labels.instance }} unreachable"
          description: >-
            Blackbox probe to {{ $labels.instance }} has been failing for 5 minutes.
            Check network connectivity and service status.

      # Detects when Prometheus itself can't reach an exporter.
      # This means metrics are stale and other alerts won't fire.
      - alert: ScrapeFailing
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Scrape target {{ $labels.instance }} down"
          description: >-
            Prometheus cannot scrape {{ $labels.job }} target {{ $labels.instance }}.
            All other alerts for this target are now blind.
EOF

Alert rule design notes

- Every rule carries a `for:` duration, so a condition must hold continuously before the alert fires; this keeps transient spikes from paging anyone.
- Severity labels are routing hints: info is an operational nudge, warning means investigate soon, critical means calls are failing or about to.
- Summaries and descriptions interpolate {{ $labels.server }} and {{ $value }}, so a single rule covers every server Prometheus scrapes.

Step 7: Loki Configuration

Loki stores logs pushed by Promtail agents. This config uses TSDB v13 with filesystem storage -- simple, no object storage needed.

cat > /opt/monitoring/loki/loki-config.yml << 'EOF'
auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

# ─── Schema: TSDB v13 (recommended for Loki 2.9+) ───
schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

# ─── Ingestion limits ───
# Tuned for 4-8 VoIP servers pushing Asterisk + syslog.
# Increase if you see "rate limit exceeded" errors in promtail.
limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h    # 7 days
  retention_period: 168h              # 7 days; enforced by the compactor
  max_query_series: 5000
  ingestion_rate_mb: 30               # MB/s ingestion rate
  ingestion_burst_size_mb: 60         # Allow bursts (e.g., log rotation)
  per_stream_rate_limit: 10MB
  per_stream_rate_limit_burst: 30MB

# ─── Compactor: handles retention enforcement ───
compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  delete_request_cancel_period: 10m
  retention_delete_delay: 2h          # Grace period before deletion

# ─── Retention: 7 days ───
# With the TSDB index, retention is enforced by the compactor
# (retention_enabled above) based on limits_config.retention_period
# (set it to 168h for 7 days). The legacy table_manager and
# chunk_store_config blocks only apply to older index types and are
# intentionally omitted.
EOF
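
Before pointing Promtail agents at Loki, you can push a test entry by hand. The sketch below builds a payload for Loki's push API (`/loki/api/v1/push`, which returns HTTP 204 on success); the URL and labels are placeholders for your environment.

```python
import json
import time
import urllib.request

def build_push_payload(labels, line, ts_ns=None):
    """Build a body for Loki's /loki/api/v1/push endpoint:
    one stream, one log line, nanosecond-epoch timestamp as a string."""
    if ts_ns is None:
        ts_ns = time.time_ns()
    return {
        "streams": [{
            "stream": labels,
            "values": [[str(ts_ns), line]],
        }]
    }

def push(base_url, payload):
    """POST the payload; Loki answers 204 No Content on success."""
    req = urllib.request.Request(
        f"{base_url}/loki/api/v1/push",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Usage against the running stack (expects 204):
#   body = build_push_payload({"job": "smoke-test", "server": "monitor"},
#                             "loki ingestion smoke test")
#   push("http://localhost:3100", body)
```

Afterward, the entry should be queryable in Grafana's Explore view with `{job="smoke-test"}`.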

Why TSDB v13 instead of BoltDB?

TSDB (Time Series Database) index is the modern storage engine for Loki. It replaces the older BoltDB-shipper approach and provides:

- Smaller index files and lower memory use during queries and compaction.
- Better query parallelism, since index work can be sharded more finely.
- The supported path forward: BoltDB-shipper is deprecated in current Loki releases.

Step 8: Smokeping Targets

Smokeping provides beautiful latency graphs with packet loss visualization. Where a Blackbox Exporter probe records a single measurement per Prometheus scrape, Smokeping sends a burst of pings each polling round and plots the median and spread, building long-term latency baselines that make intermittent network issues visible.

cat > /opt/monitoring/smokeping/config/Targets << 'EOF'
*** Targets ***

probe = FPing

menu = Top
title = VoIP Monitoring - Network Latency
remark = Centralized latency monitoring for all SIP servers and providers

# ─────────────────────────────────────────────────
# VoIP Servers
# ─────────────────────────────────────────────────
+ VoIP_Servers
menu = VoIP Servers
title = VoIP Servers

++ server_1
menu = Server 1 (Primary)
title = VoIP Server 1 - YOUR_VOIP_SERVER_1_IP
host = YOUR_VOIP_SERVER_1_IP

++ server_2
menu = Server 2 (Secondary)
title = VoIP Server 2 - YOUR_VOIP_SERVER_2_IP
host = YOUR_VOIP_SERVER_2_IP

++ server_3
menu = Server 3 (Tertiary)
title = VoIP Server 3 - YOUR_VOIP_SERVER_3_IP
host = YOUR_VOIP_SERVER_3_IP

++ server_4
menu = Server 4 (Quaternary)
title = VoIP Server 4 - YOUR_VOIP_SERVER_4_IP
host = YOUR_VOIP_SERVER_4_IP

# ─────────────────────────────────────────────────
# SIP Providers
# ─────────────────────────────────────────────────
+ SIP_Providers
menu = SIP Providers
title = SIP Provider Latency

++ provider_primary
menu = Primary Inbound
title = Primary Inbound Provider - YOUR_SIP_PROVIDER_1_IP
host = YOUR_SIP_PROVIDER_1_IP

++ provider_outbound
menu = Outbound
title = Outbound Provider - YOUR_SIP_PROVIDER_2_IP
host = YOUR_SIP_PROVIDER_2_IP

++ provider_backup
menu = Backup Trunk
title = Backup Trunk Provider - YOUR_SIP_PROVIDER_3_IP
host = YOUR_SIP_PROVIDER_3_IP
EOF
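
Hand-editing the Targets file gets tedious past a handful of hosts. A small hypothetical generator like the one below keeps the inventory in one dict and renders the `+` group and `++` target stanzas; the header section (`*** Targets ***`, probe, menu) stays hand-written, and all names here are placeholders.

```python
def group_stanza(slug, menu, title):
    """Render a top-level '+' Smokeping group."""
    return f"+ {slug}\nmenu = {menu}\ntitle = {title}\n"

def target_stanza(slug, menu, title, host):
    """Render one '++' Smokeping target under a group."""
    return f"++ {slug}\nmenu = {menu}\ntitle = {title}\nhost = {host}\n"

def render_targets(groups):
    """groups: {group_slug: (group_title, [(slug, menu, host), ...])}"""
    out = []
    for gslug, (gtitle, hosts) in groups.items():
        out.append(group_stanza(gslug, gtitle, gtitle))
        for slug, menu, host in hosts:
            out.append(target_stanza(slug, menu, f"{menu} - {host}", host))
    return "\n".join(out)

# Usage: print the generated stanzas, then append them to the
# hand-written header of /opt/monitoring/smokeping/config/Targets.
#   print(render_targets({
#       "VoIP_Servers": ("VoIP Servers",
#           [("server_1", "Server 1 (Primary)", "10.0.1.50")]),
#   }))
```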

Step 9: Grafana Provisioning

Provisioning files auto-configure Grafana data sources and dashboard folders on first boot. No manual UI clicks needed.

Datasources

cat > /opt/monitoring/grafana/provisioning/datasources/all.yml << 'EOF'
apiVersion: 1

datasources:
  # ─── Prometheus (metrics) ───
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

  # ─── Loki (logs) ───
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: false

  # ─── Homer SIP data (PostgreSQL) ───
  - name: Homer
    type: postgres
    access: proxy
    url: postgres:5432
    database: homer_data
    user: homer
    secureJsonData:
      password: YOUR_HOMER_DB_PASSWORD
    jsonData:
      sslmode: disable
      maxOpenConns: 5
      maxIdleConns: 2
    editable: false

  # ─── VoIP Server MySQL (direct CDR queries) ───
  # Optional: allows Grafana to query ViciDial/Asterisk CDR
  # tables directly for call reports and agent analytics.
  # Create a read-only MySQL user on each VoIP server first:
  #   CREATE USER 'grafana_ro'@'%' IDENTIFIED BY 'YOUR_PASSWORD';
  #   GRANT SELECT ON asterisk.* TO 'grafana_ro'@'%';
  - name: VoIP-Server-1
    type: mysql
    access: proxy
    url: YOUR_VOIP_SERVER_1_IP:3306
    database: asterisk
    user: YOUR_MYSQL_RO_USER
    secureJsonData:
      password: YOUR_MYSQL_RO_PASSWORD
    jsonData:
      maxOpenConns: 3
      maxIdleConns: 1
    editable: false
EOF

Dashboard provider

cat > /opt/monitoring/grafana/provisioning/dashboards/dashboard.yml << 'EOF'
apiVersion: 1

providers:
  - name: "VoIP Monitoring"
    orgId: 1
    folder: "VoIP Monitoring"
    type: file
    disableDeletion: false
    editable: true
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards
      foldersFromFilesStructure: false
EOF

Tip: You can place .json dashboard files alongside dashboard.yml and they will be auto-imported into the "VoIP Monitoring" folder on startup. Export dashboards from the Grafana UI (Share > Export > Save to file) and place them here for infrastructure-as-code.
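
If you prefer scripting the export over UI clicks, Grafana's HTTP API can pull a dashboard by UID (`GET /api/dashboards/uid/<uid>`). The sketch below fetches one and strips the instance-specific `id` so the file is clean for provisioning; the UID, token, and output path are placeholders for your environment.

```python
import json
import urllib.request

def fetch_dashboard(base_url, api_token, uid):
    """GET /api/dashboards/uid/<uid> and return the decoded response."""
    req = urllib.request.Request(
        f"{base_url}/api/dashboards/uid/{uid}",
        headers={"Authorization": f"Bearer {api_token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def provisioning_json(api_response):
    """Extract the bare dashboard and drop the numeric id, which is
    instance-specific and must not appear in a provisioned file."""
    dashboard = dict(api_response["dashboard"])
    dashboard.pop("id", None)
    return dashboard

# Usage (uid and token are placeholders):
#   data = fetch_dashboard("http://localhost:3000", "YOUR_API_TOKEN",
#                          "voip-overview")
#   path = ("/opt/monitoring/grafana/provisioning/dashboards/"
#           "voip-overview.json")
#   with open(path, "w") as f:
#       json.dump(provisioning_json(data), f, indent=2)
```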


Step 10: Custom Asterisk Exporter

This is a custom Prometheus exporter, written in Python, that runs on each VoIP server. It collects metrics that no off-the-shelf exporter provides: SIP peer status from the Asterisk CLI, agent states from the ViciDial MySQL database, RTP quality statistics, codec transcoding detection, and more.

The exporter exposes a standard /metrics endpoint on port 9101 that Prometheus scrapes.

cat > /opt/monitoring/scripts/asterisk_exporter.py << 'EXPORTER'
#!/usr/bin/env python3
"""
Asterisk/VoIP Prometheus Exporter
Queries the Asterisk CLI + ViciDial MySQL database to expose VoIP metrics.
Runs on each monitored server, listens on :9101.

Metrics exposed:
  - asterisk_sip_peer_up/status/latency  (SIP peer health)
  - asterisk_active_calls/channels        (call volume)
  - asterisk_channels_by_codec            (codec distribution)
  - asterisk_rtp_packet_loss_percent      (RTP quality)
  - asterisk_rtp_jitter_ms               (RTP jitter)
  - asterisk_transcoding_channels         (transcoding detection)
  - asterisk_agents_logged_in/incall/paused/waiting  (agent states)
  - asterisk_agent_incall_duration_seconds  (zombie call detection)
  - asterisk_agent_pause_duration_seconds   (stuck pause detection)
  - asterisk_queue_depth                    (queue backlog)
  - asterisk_fail2ban_active_bans/total     (security)
  - asterisk_uptime_seconds                 (system health)
  - asterisk_confbridge_count               (conference count)
"""

import http.server
import subprocess
import re
import os

# mysql-connector-python may be missing on servers without a local DB
# (the installer's pip step is best-effort); degrade gracefully instead
# of crashing at import time -- MySQL-backed metrics are simply skipped.
try:
    import mysql.connector
    from mysql.connector import Error
except ImportError:
    mysql = None
    Error = Exception

LISTEN_PORT = int(os.environ.get("EXPORTER_PORT", 9101))
MYSQL_HOST = os.environ.get("MYSQL_HOST", "localhost")
MYSQL_USER = os.environ.get("MYSQL_USER", "cron")
MYSQL_PASS = os.environ.get("MYSQL_PASS", "YOUR_MYSQL_PASSWORD")
MYSQL_DB = os.environ.get("MYSQL_DB", "asterisk")
SERVER_LABEL = os.environ.get("SERVER_LABEL", "server1")


def run_ast_cmd(cmd):
    """Run an Asterisk CLI command and return output."""
    try:
        result = subprocess.run(
            ["asterisk", "-rx", cmd],
            capture_output=True, text=True, timeout=10
        )
        return result.stdout
    except Exception:
        return ""


def get_mysql_connection():
    """Get MySQL connection."""
    try:
        return mysql.connector.connect(
            host=MYSQL_HOST, user=MYSQL_USER,
            password=MYSQL_PASS, database=MYSQL_DB,
            connect_timeout=5
        )
    except Error:
        return None


def collect_sip_peers():
    """Parse 'sip show peers' for status and latency."""
    metrics = []
    output = run_ast_cmd("sip show peers")
    for line in output.splitlines():
        m = re.match(
            r'^(\S+)\s+(\d+\.\d+\.\d+\.\d+)\s+\S+\s+\S+\s+(\S+)\s+(\S+)',
            line
        )
        if m:
            peer = m.group(1).split('/')[0]
            status_str = m.group(3)
            latency_str = m.group(4)
            is_up = 1 if status_str == "OK" else 0
            metrics.append(
                f'asterisk_sip_peer_up{{server="{SERVER_LABEL}",'
                f'peer="{peer}"}} {is_up}'
            )
            metrics.append(
                f'asterisk_sip_peer_status{{server="{SERVER_LABEL}",'
                f'peer="{peer}",status="{status_str}"}} 1'
            )
            lat_match = re.search(r'(\d+)', latency_str)
            if lat_match:
                metrics.append(
                    f'asterisk_sip_peer_latency_ms{{server="{SERVER_LABEL}",'
                    f'peer="{peer}"}} {lat_match.group(1)}'
                )
    return metrics


def collect_channels():
    """Parse 'core show channels' for active call count and codec info."""
    metrics = []
    output = run_ast_cmd("core show channels")
    m = re.search(r'(\d+) active channel', output)
    channels = int(m.group(1)) if m else 0
    m2 = re.search(r'(\d+) active call', output)
    calls = int(m2.group(1)) if m2 else 0
    metrics.append(
        f'asterisk_active_channels{{server="{SERVER_LABEL}"}} {channels}'
    )
    metrics.append(
        f'asterisk_active_calls{{server="{SERVER_LABEL}"}} {calls}'
    )

    # Count codecs from channel stats
    codec_counts = {}
    stats_output = run_ast_cmd("sip show channelstats")
    for line in stats_output.splitlines():
        parts = line.split()
        if len(parts) >= 12:
            codec = parts[11]
            if codec in ("alaw", "ulaw", "g722", "g729", "gsm", "opus"):
                codec_counts[codec] = codec_counts.get(codec, 0) + 1
    for codec, count in codec_counts.items():
        metrics.append(
            f'asterisk_channels_by_codec{{server="{SERVER_LABEL}",'
            f'codec="{codec}"}} {count}'
        )
    return metrics


def collect_rtp_stats():
    """Parse 'sip show channelstats' for RTP quality metrics."""
    metrics = []
    output = run_ast_cmd("sip show channelstats")
    for line in output.splitlines():
        parts = line.split()
        if len(parts) >= 10 and parts[0] != "Peer":
            try:
                peer = parts[0]
                recv_loss_pct = (
                    float(parts[3].rstrip('%'))
                    if '%' in parts[3] else 0
                )
                recv_jitter = (
                    float(parts[4])
                    if parts[4].replace('.', '').isdigit() else 0
                )
                rtt = (
                    float(parts[7])
                    if len(parts) > 7
                    and parts[7].replace('.', '').isdigit()
                    else 0
                )
                metrics.append(
                    f'asterisk_rtp_packet_loss_percent'
                    f'{{server="{SERVER_LABEL}",peer="{peer}"}} '
                    f'{recv_loss_pct}'
                )
                metrics.append(
                    f'asterisk_rtp_jitter_ms'
                    f'{{server="{SERVER_LABEL}",peer="{peer}"}} '
                    f'{recv_jitter}'
                )
                if rtt > 0:
                    metrics.append(
                        f'asterisk_rtp_rtt_ms'
                        f'{{server="{SERVER_LABEL}",peer="{peer}"}} '
                        f'{rtt}'
                    )
            except (ValueError, IndexError):
                continue
    return metrics


def collect_uptime():
    """Get Asterisk uptime."""
    metrics = []
    output = run_ast_cmd("core show uptime seconds")
    m = re.search(r'System uptime:\s+(\d+)', output)
    if m:
        metrics.append(
            f'asterisk_uptime_seconds{{server="{SERVER_LABEL}"}} {m.group(1)}'
        )
    return metrics


def collect_confbridge():
    """Count active ConfBridge conferences."""
    metrics = []
    output = run_ast_cmd("confbridge list")
    count = 0
    for line in output.splitlines():
        if re.match(r'^\d+', line):
            count += 1
    metrics.append(
        f'asterisk_confbridge_count{{server="{SERVER_LABEL}"}} {count}'
    )
    return metrics


def collect_transcoding():
    """Detect active transcoding by inspecting channel read/write codecs."""
    metrics = []
    transcoding_count = 0

    output = run_ast_cmd("core show channels verbose")
    sip_channels = []
    for line in output.splitlines():
        m = re.match(r'^(SIP/\S+)', line)
        if m:
            sip_channels.append(m.group(1))

    for chan in sip_channels:
        ch_output = run_ast_cmd(f"core show channel {chan}")
        read_tc = False
        write_tc = False
        for ch_line in ch_output.splitlines():
            ch_line = ch_line.strip()
            if ch_line.startswith("ReadTranscode:") and "Yes" in ch_line:
                read_tc = True
            elif ch_line.startswith("WriteTranscode:") and "Yes" in ch_line:
                write_tc = True
        if read_tc or write_tc:
            transcoding_count += 1

    metrics.append(
        f'asterisk_transcoding_channels{{server="{SERVER_LABEL}"}} '
        f'{transcoding_count}'
    )
    return metrics


def collect_agents():
    """Query MySQL for agent states (ViciDial-specific, adapt for your PBX)."""
    metrics = []
    conn = get_mysql_connection()
    if not conn:
        return metrics
    try:
        cursor = conn.cursor(dictionary=True)

        # Agent counts by status
        cursor.execute("""
            SELECT status, COUNT(*) as cnt
            FROM vicidial_live_agents
            WHERE server_ip != ''
            GROUP BY status
        """)
        logged_in = incall = paused = waiting = 0
        for row in cursor.fetchall():
            s, c = row['status'], row['cnt']
            logged_in += c
            if s == 'INCALL':
                incall = c
            elif s == 'PAUSED':
                paused = c
            elif s in ('READY', 'CLOSER'):
                waiting += c

        metrics.append(
            f'asterisk_agents_logged_in{{server="{SERVER_LABEL}"}} {logged_in}'
        )
        metrics.append(
            f'asterisk_agents_incall{{server="{SERVER_LABEL}"}} {incall}'
        )
        metrics.append(
            f'asterisk_agents_paused{{server="{SERVER_LABEL}"}} {paused}'
        )
        metrics.append(
            f'asterisk_agents_waiting{{server="{SERVER_LABEL}"}} {waiting}'
        )

        # Per-agent status with duration (for zombie/stuck detection)
        cursor.execute("""
            SELECT user, status, pause_code,
                   TIMESTAMPDIFF(SECOND, last_state_change, NOW())
                     as state_duration
            FROM vicidial_live_agents
            WHERE server_ip != ''
        """)
        for row in cursor.fetchall():
            user = row['user']
            status = row['status']
            duration = row['state_duration'] or 0
            if status == 'INCALL':
                metrics.append(
                    f'asterisk_agent_incall_duration_seconds'
                    f'{{server="{SERVER_LABEL}",agent="{user}"}} {duration}'
                )
            elif status == 'PAUSED':
                metrics.append(
                    f'asterisk_agent_pause_duration_seconds'
                    f'{{server="{SERVER_LABEL}",agent="{user}"}} {duration}'
                )

        # Queue depth by campaign/ingroup
        cursor.execute("""
            SELECT campaign_id, COUNT(*) as cnt
            FROM vicidial_auto_calls
            WHERE status = 'LIVE'
            GROUP BY campaign_id
        """)
        for row in cursor.fetchall():
            metrics.append(
                f'asterisk_queue_depth{{server="{SERVER_LABEL}",'
                f'ingroup="{row["campaign_id"]}"}} {row["cnt"]}'
            )

        cursor.close()
    except Exception:
        pass
    finally:
        try:
            conn.close()
        except Exception:
            pass
    return metrics


def collect_fail2ban():
    """Parse fail2ban-client for ban counts."""
    metrics = []
    try:
        result = subprocess.run(
            ["fail2ban-client", "status"],
            capture_output=True, text=True, timeout=5
        )
        jails = re.findall(r'Jail list:\s*(.*)', result.stdout)
        if jails:
            for jail in jails[0].split(','):
                jail = jail.strip()
                if not jail:
                    continue
                jr = subprocess.run(
                    ["fail2ban-client", "status", jail],
                    capture_output=True, text=True, timeout=5
                )
                banned = re.search(
                    r'Currently banned:\s+(\d+)', jr.stdout
                )
                total = re.search(
                    r'Total banned:\s+(\d+)', jr.stdout
                )
                if banned:
                    metrics.append(
                        f'asterisk_fail2ban_active_bans'
                        f'{{server="{SERVER_LABEL}",jail="{jail}"}} '
                        f'{banned.group(1)}'
                    )
                if total:
                    metrics.append(
                        f'asterisk_fail2ban_bans_total'
                        f'{{server="{SERVER_LABEL}",jail="{jail}"}} '
                        f'{total.group(1)}'
                    )
    except Exception:
        pass
    return metrics


def collect_all():
    """Collect all metrics and return Prometheus text format."""
    lines = [
        "# HELP asterisk_sip_peer_up SIP peer reachability (1=up, 0=down)",
        "# TYPE asterisk_sip_peer_up gauge",
        "# HELP asterisk_sip_peer_latency_ms SIP peer qualify latency in ms",
        "# TYPE asterisk_sip_peer_latency_ms gauge",
        "# HELP asterisk_active_calls Number of active calls",
        "# TYPE asterisk_active_calls gauge",
        "# HELP asterisk_active_channels Number of active channels",
        "# TYPE asterisk_active_channels gauge",
        "# HELP asterisk_agents_logged_in Number of agents logged in",
        "# TYPE asterisk_agents_logged_in gauge",
        "# HELP asterisk_agents_incall Number of agents in call",
        "# TYPE asterisk_agents_incall gauge",
        "# HELP asterisk_agents_paused Number of agents paused",
        "# TYPE asterisk_agents_paused gauge",
        "# HELP asterisk_queue_depth Calls waiting in queue per ingroup",
        "# TYPE asterisk_queue_depth gauge",
        "# HELP asterisk_fail2ban_active_bans Current fail2ban active bans",
        "# TYPE asterisk_fail2ban_active_bans gauge",
        "# HELP asterisk_fail2ban_bans_total Total fail2ban bans",
        "# TYPE asterisk_fail2ban_bans_total counter",
        "# HELP asterisk_uptime_seconds Asterisk system uptime",
        "# TYPE asterisk_uptime_seconds gauge",
        "# HELP asterisk_confbridge_count Active ConfBridge conferences",
        "# TYPE asterisk_confbridge_count gauge",
        "# HELP asterisk_rtp_packet_loss_percent RTP packet loss percentage",
        "# TYPE asterisk_rtp_packet_loss_percent gauge",
        "# HELP asterisk_rtp_jitter_ms RTP jitter in ms",
        "# TYPE asterisk_rtp_jitter_ms gauge",
        "# HELP asterisk_transcoding_channels Channels actively transcoding",
        "# TYPE asterisk_transcoding_channels gauge",
        "",
    ]
    lines.extend(collect_sip_peers())
    lines.extend(collect_channels())
    lines.extend(collect_rtp_stats())
    lines.extend(collect_uptime())
    lines.extend(collect_confbridge())
    lines.extend(collect_agents())
    lines.extend(collect_fail2ban())
    lines.extend(collect_transcoding())
    return "\n".join(lines) + "\n"


class MetricsHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = collect_all()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.end_headers()
            self.wfile.write(body.encode())
        else:
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(b"<a href='/metrics'>Metrics</a>")

    def log_message(self, format, *args):
        pass  # Suppress request logging


if __name__ == "__main__":
    server = http.server.HTTPServer(("0.0.0.0", LISTEN_PORT), MetricsHandler)
    print(f"asterisk_exporter listening on :{LISTEN_PORT}")
    server.serve_forever()
EXPORTER

chmod +x /opt/monitoring/scripts/asterisk_exporter.py
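
Because the exporter hand-builds its exposition text, a malformed f-string can silently break scraping. This small, hypothetical validator checks that every non-comment line looks like a Prometheus sample; it is a sanity check, not a full parser of the exposition format.

```python
import re

# One sample line: metric_name{optional="labels"} numeric_value
SAMPLE_RE = re.compile(
    r'^[a-zA-Z_:][a-zA-Z0-9_:]*'             # metric name
    r'(\{[^}]*\})?'                          # optional label set
    r' -?(\d+(\.\d+)?([eE][+-]?\d+)?|NaN)$'  # numeric value
)

def invalid_lines(exposition_text):
    """Return any non-comment, non-blank lines that don't look like
    valid Prometheus samples."""
    return [
        line for line in exposition_text.splitlines()
        if line and not line.startswith("#") and not SAMPLE_RE.match(line)
    ]

# Usage against the running exporter:
#   import urllib.request
#   text = urllib.request.urlopen(
#       "http://localhost:9101/metrics").read().decode()
#   bad = invalid_lines(text)
#   assert bad == [], bad
```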

Adapting for non-ViciDial systems

The collect_agents() function queries ViciDial-specific tables (vicidial_live_agents, vicidial_auto_calls). If you run FreePBX, FusionPBX, or plain Asterisk:

- Swap the ViciDial queries for your own PBX's tables, or drop the MySQL dependency entirely and parse CLI output such as queue show.
- Keep the metric names unchanged so the alert rules and dashboards from earlier steps keep working.
- If you have no agent concept at all, simply return an empty list from collect_agents().

The SIP peer, channel, RTP, and codec collection functions work with any chan_sip-based Asterisk (version 11+); PJSIP deployments will need the equivalent pjsip show commands instead.
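
For plain-Asterisk queues, here is a minimal, database-free sketch of a collect_agents() replacement that parses `queue show` output. The exact layout of that output varies across Asterisk versions, so treat the regex as a starting point, not a guarantee.

```python
import re

# 'queue show' header lines look roughly like:
#   support has 2 calls (max unlimited, strategy: ringall) ...
# Adjust the pattern to match your Asterisk version's layout.
QUEUE_RE = re.compile(r'^(\S+)\s+has\s+(\d+)\s+calls')

def collect_queues(queue_show_output, server_label="server1"):
    """Emit asterisk_queue_depth samples from 'queue show' output, as a
    drop-in replacement for the ViciDial query in collect_agents()."""
    metrics = []
    for line in queue_show_output.splitlines():
        m = QUEUE_RE.match(line)
        if m:
            metrics.append(
                f'asterisk_queue_depth{{server="{server_label}",'
                f'ingroup="{m.group(1)}"}} {m.group(2)}'
            )
    return metrics
```

Inside the exporter you would feed it `run_ast_cmd("queue show")` instead of the MySQL cursor.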


Step 11: Remote Agent Installation Script

This script SSHs into a VoIP server and installs all four monitoring agents (node_exporter, heplify, promtail, asterisk_exporter) in one command. It auto-detects the OS (Ubuntu/Debian, CentOS, openSUSE) and adjusts package installation accordingly.

cat > /opt/monitoring/scripts/install-agents.sh << 'INSTALLER'
#!/bin/bash
# install-agents.sh — Install monitoring agents on a remote VoIP server
# Usage: ./install-agents.sh <server_ip> <ssh_port> <server_label> <monitor_vps_ip>
#
# Example:
#   ./install-agents.sh 10.0.1.50 22 server1 10.0.0.10
#
# This installs:
#   1. heplify        — SIP packet capture → Homer
#   2. node_exporter  — System metrics → Prometheus
#   3. promtail       — Log shipping → Loki
#   4. asterisk_exporter — VoIP metrics → Prometheus

set -e

SERVER_IP="${1:?Usage: $0 <server_ip> <ssh_port> <server_label> <monitor_vps_ip>}"
SSH_PORT="${2:-22}"
SERVER_LABEL="${3:?Provide server label (e.g., server1, primary, london)}"
MONITOR_IP="${4:?Provide monitoring VPS IP}"

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"

echo "=== Installing monitoring agents on ${SERVER_LABEL} (${SERVER_IP}:${SSH_PORT}) ==="
echo "Monitor VPS: ${MONITOR_IP}"
echo ""

SSH_CMD="ssh -o StrictHostKeyChecking=no -p ${SSH_PORT} root@${SERVER_IP}"

# ─── 1. heplify (SIP capture agent) ───
echo "[1/4] Installing heplify..."
${SSH_CMD} bash << REMOTEOF
set -e
if [ ! -f /usr/local/bin/heplify ]; then
    curl -sL https://github.com/sipcapture/heplify/releases/download/v1.67.1/heplify \
      -o /usr/local/bin/heplify
    chmod +x /usr/local/bin/heplify
    echo "  heplify binary installed"
else
    echo "  heplify already installed"
fi

cat > /etc/systemd/system/heplify.service << SVCFILE
[Unit]
Description=heplify SIP Capture Agent
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/heplify -hs ${MONITOR_IP}:9060 -i any -dim "OPTIONS,NOTIFY" -e
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
SVCFILE

systemctl daemon-reload
systemctl enable heplify
systemctl restart heplify
echo "  heplify service started"
REMOTEOF

# ─── 2. node_exporter (system metrics) ───
echo "[2/4] Installing node_exporter..."
${SSH_CMD} bash << 'REMOTEOF'
set -e
if [ ! -f /usr/local/bin/node_exporter ]; then
    cd /tmp
    curl -sL https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz | tar xz
    cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
    rm -rf node_exporter-1.7.0.linux-amd64*
    echo "  node_exporter binary installed"
else
    echo "  node_exporter already installed"
fi

cat > /etc/systemd/system/node_exporter.service << 'SVCFILE'
[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/node_exporter --web.listen-address=:9100
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
SVCFILE

systemctl daemon-reload
systemctl enable node_exporter
systemctl restart node_exporter
echo "  node_exporter service started"
REMOTEOF

# ─── 3. promtail (log shipping) ───
echo "[3/4] Installing promtail..."
${SSH_CMD} bash << REMOTEOF
set -e
if [ ! -f /usr/local/bin/promtail ]; then
    cd /tmp
    curl -sL https://github.com/grafana/loki/releases/download/v2.9.6/promtail-linux-amd64.zip \
      -o promtail.zip
    # Install unzip on whichever OS
    if command -v apt-get &>/dev/null; then
        apt-get install -y unzip 2>/dev/null || true
    elif command -v zypper &>/dev/null; then
        zypper install -y unzip 2>/dev/null || true
    elif command -v yum &>/dev/null; then
        yum install -y unzip 2>/dev/null || true
    fi
    unzip -o promtail.zip
    mv promtail-linux-amd64 /usr/local/bin/promtail
    chmod +x /usr/local/bin/promtail
    rm -f promtail.zip
    echo "  promtail binary installed"
else
    echo "  promtail already installed"
fi

mkdir -p /etc/promtail /var/lib/promtail

cat > /etc/promtail/config.yml << CFGFILE
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /var/lib/promtail/positions.yaml

clients:
  - url: http://${MONITOR_IP}:3100/loki/api/v1/push

scrape_configs:
  # Asterisk main log (warnings, errors, notices)
  - job_name: asterisk_messages
    static_configs:
      - targets: [localhost]
        labels:
          job: asterisk
          server: ${SERVER_LABEL}
          logtype: messages
          __path__: /var/log/asterisk/messages

  # Asterisk verbose log (if enabled)
  - job_name: asterisk_full
    static_configs:
      - targets: [localhost]
        labels:
          job: asterisk
          server: ${SERVER_LABEL}
          logtype: full
          __path__: /var/log/asterisk/full

  # ViciDial/astguiclient logs (dialer, listener, etc.)
  - job_name: vicidial
    static_configs:
      - targets: [localhost]
        labels:
          job: vicidial
          server: ${SERVER_LABEL}
          logtype: vicidial
          __path__: /var/log/astguiclient/*.log

  # System syslog
  - job_name: syslog
    static_configs:
      - targets: [localhost]
        labels:
          job: syslog
          server: ${SERVER_LABEL}
          logtype: syslog
          __path__: /var/log/messages
CFGFILE

cat > /etc/systemd/system/promtail.service << 'SVCFILE'
[Unit]
Description=Promtail Log Agent
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/promtail -config.file=/etc/promtail/config.yml
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
SVCFILE

systemctl daemon-reload
systemctl enable promtail
systemctl restart promtail
echo "  promtail service started"
REMOTEOF

# ─── 4. asterisk_exporter (VoIP metrics) ───
echo "[4/4] Installing asterisk_exporter..."
${SSH_CMD} "mkdir -p /opt/asterisk_exporter"

# Copy the exporter script to the remote server
scp -o StrictHostKeyChecking=no -P ${SSH_PORT} \
  ${SCRIPT_DIR}/asterisk_exporter.py \
  root@${SERVER_IP}:/opt/asterisk_exporter/asterisk_exporter.py

${SSH_CMD} bash << REMOTEOF
set -e

# Find Python 3
PYTHON_BIN=""
for p in python3.11 python3.6 python3; do
    if command -v \$p &>/dev/null; then
        PYTHON_BIN=\$(command -v \$p)
        break
    fi
done

if [ -z "\$PYTHON_BIN" ]; then
    if command -v apt-get &>/dev/null; then
        apt-get install -y python3 python3-pip 2>/dev/null || true
    elif command -v zypper &>/dev/null; then
        zypper install -y python3 python3-pip 2>/dev/null || true
    elif command -v yum &>/dev/null; then
        yum install -y python3 python3-pip 2>/dev/null || true
    fi
    PYTHON_BIN=\$(command -v python3)
fi

if [ -z "\$PYTHON_BIN" ]; then
    echo "  ERROR: python3 not found and could not be installed" >&2
    exit 1
fi

echo "  Using Python: \$PYTHON_BIN"

# Install MySQL connector
\$PYTHON_BIN -m pip install mysql-connector-python 2>/dev/null \
  || \$PYTHON_BIN -m pip install "mysql-connector-python<8.1" 2>/dev/null \
  || true

chmod +x /opt/asterisk_exporter/asterisk_exporter.py

cat > /etc/systemd/system/asterisk_exporter.service << SVCFILE
[Unit]
Description=Asterisk/VoIP Prometheus Exporter
After=network.target mariadb.service asterisk.service
Wants=mariadb.service

[Service]
Type=simple
ExecStart=\$PYTHON_BIN /opt/asterisk_exporter/asterisk_exporter.py
Restart=always
RestartSec=10
Environment=EXPORTER_PORT=9101
Environment=MYSQL_HOST=localhost
Environment=MYSQL_USER=YOUR_MYSQL_USER
Environment=MYSQL_PASS=YOUR_MYSQL_PASSWORD
Environment=MYSQL_DB=asterisk
Environment=SERVER_LABEL=${SERVER_LABEL}

[Install]
WantedBy=multi-user.target
SVCFILE

systemctl daemon-reload
systemctl enable asterisk_exporter
systemctl restart asterisk_exporter
echo "  asterisk_exporter service started"
REMOTEOF

echo ""
echo "=== All 4 agents installed on ${SERVER_LABEL} (${SERVER_IP}) ==="
echo "  heplify       -> sending HEP to ${MONITOR_IP}:9060"
echo "  node_exporter -> :9100"
echo "  promtail      -> shipping logs to ${MONITOR_IP}:3100"
echo "  ast_exporter  -> :9101"
echo ""
INSTALLER

chmod +x /opt/monitoring/scripts/install-agents.sh

Usage

# Install agents on your first VoIP server
./scripts/install-agents.sh YOUR_VOIP_SERVER_1_IP 22 server1 YOUR_MONITOR_VPS_IP

# Install on second server (custom SSH port)
./scripts/install-agents.sh YOUR_VOIP_SERVER_2_IP 9322 server2 YOUR_MONITOR_VPS_IP
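
Past a couple of servers, it is handy to drive the installer from an inventory instead of retyping the arguments. A hypothetical fleet-rollout sketch (all IPs and labels are placeholders):

```python
import subprocess

# Server inventory: (ip, ssh_port, label). Placeholder values.
SERVERS = [
    ("YOUR_VOIP_SERVER_1_IP", 22, "server1"),
    ("YOUR_VOIP_SERVER_2_IP", 9322, "server2"),
]
MONITOR_IP = "YOUR_MONITOR_VPS_IP"

def install_cmd(ip, ssh_port, label, monitor_ip,
                script="./scripts/install-agents.sh"):
    """Build the argv for one install-agents.sh invocation."""
    return [script, ip, str(ssh_port), label, monitor_ip]

def rollout(servers, monitor_ip):
    """Run the installer against each server in turn; check=True stops
    the rollout at the first failing server."""
    for ip, port, label in servers:
        subprocess.run(install_cmd(ip, port, label, monitor_ip),
                       check=True)

# Usage, from /opt/monitoring:
#   rollout(SERVERS, MONITOR_IP)
```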

Step 12: Backup Script

A simple daily backup that archives all configuration files. Add it to cron on the monitoring VPS.

cat > /opt/monitoring/scripts/backup-monitoring.sh << 'BACKUP'
#!/bin/bash
# backup-monitoring.sh — Backup all monitoring configs
# Run daily via cron: 0 2 * * * /opt/monitoring/scripts/backup-monitoring.sh

BACKUP_DIR="/var/backups/monitoring"
DATE=$(date +%Y%m%d_%H%M%S)
TARGET="${BACKUP_DIR}/monitoring_${DATE}.tar.gz"

mkdir -p "${BACKUP_DIR}"

tar czf "${TARGET}" \
  /opt/monitoring/docker-compose.yml \
  /opt/monitoring/.env \
  /opt/monitoring/prometheus/ \
  /opt/monitoring/grafana/ \
  /opt/monitoring/loki/ \
  /opt/monitoring/smokeping/ \
  /opt/monitoring/scripts/ \
  2>/dev/null

# Keep last 7 backups, delete older ones
ls -t "${BACKUP_DIR}"/monitoring_*.tar.gz | tail -n +8 | xargs rm -f 2>/dev/null

echo "Backup saved: ${TARGET} ($(du -h "${TARGET}" | cut -f1))"
BACKUP

chmod +x /opt/monitoring/scripts/backup-monitoring.sh
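The retention one-liner (`ls -t | tail -n +8 | xargs rm -f`) is easy to get wrong by one. Here is a self-contained demo you can run anywhere to convince yourself it keeps exactly the 7 newest archives:

```shell
# Demo of the retention line from backup-monitoring.sh: create 10 fake
# archives in a throwaway directory, prune, and count the survivors.
demo_dir=$(mktemp -d)
for i in $(seq -w 1 10); do
  touch "${demo_dir}/monitoring_202501${i}_020000.tar.gz"
  sleep 0.01   # distinct mtimes so ls -t orders them newest-first reliably
done

# Same pattern as the script: list newest first, skip the first 7, delete the rest
ls -t "${demo_dir}"/monitoring_*.tar.gz | tail -n +8 | xargs rm -f

remaining=$(ls "${demo_dir}"/monitoring_*.tar.gz | wc -l)
echo "remaining backups: ${remaining}"   # → 7
rm -rf "${demo_dir}"
```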

Add to cron:

# Run backup daily at 2 AM. Note: piping straight into `crontab -` replaces
# the ENTIRE crontab, so list the existing entries first and append to them.
( crontab -l 2>/dev/null; \
  echo "0 2 * * * /opt/monitoring/scripts/backup-monitoring.sh >> /var/log/monitoring-backup.log 2>&1" ) \
  | crontab -

Step 13: Launch and Verify

Start the stack

cd /opt/monitoring
docker compose up -d

Verify all containers are running

docker compose ps

Expected output:

NAME               STATUS                    PORTS
blackbox-exporter  Up (healthy)
grafana            Up                        0.0.0.0:3000->3000/tcp
heplify-server     Up                        0.0.0.0:9060->9060/tcp+udp
homer-webapp       Up                        0.0.0.0:9080->80/tcp
loki               Up                        0.0.0.0:3100->3100/tcp
postgres           Up (healthy)
prometheus         Up                        0.0.0.0:9090->9090/tcp
smokeping          Up                        0.0.0.0:8081->80/tcp

Verify Prometheus targets

Open http://YOUR_MONITOR_VPS_IP:9090/targets in your browser. You should see all scrape jobs listed with their status (UP or DOWN). Jobs targeting remote VoIP servers will show DOWN until you install the agents.

Verify Loki is ready

curl -s http://localhost:3100/ready
# Expected: "ready"

Verify Homer is receiving data

After installing heplify on a VoIP server, check Homer at http://YOUR_MONITOR_VPS_IP:9080. Search for recent SIP traffic. Default login is admin / sipcapture.

Install agents on your VoIP servers

cd /opt/monitoring
./scripts/install-agents.sh YOUR_VOIP_SERVER_1_IP 22 server1 YOUR_MONITOR_VPS_IP

Wait 30 seconds, then check Prometheus targets again. The node and asterisk jobs for that server should show UP.

Log in to Grafana

Open http://YOUR_MONITOR_VPS_IP:3000 and log in with admin / the password from your .env file. The Prometheus, Loki, and Homer data sources should already be configured.
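The individual checks above can be rolled into one script you run on the monitoring VPS after every change. This is a sketch: the ports match this tutorial's defaults, `/-/healthy` and `/ready` are the standard Prometheus and Loki health endpoints, `/api/health` is Grafana's, and the Homer probe only confirms the webapp answers at all:

```shell
#!/bin/bash
# health-check.sh (sketch) — probe each monitoring endpoint, print UP/DOWN.
check() {  # usage: check <name> <url> [expected-substring]
  local name="$1" url="$2" expect="${3:-}" body
  if body=$(curl -fsS -m 5 "${url}" 2>/dev/null); then
    # If an expected substring was given, require it in the response body
    if [ -z "${expect}" ] || printf '%s' "${body}" | grep -q "${expect}"; then
      echo "UP    ${name}"
      return 0
    fi
  fi
  echo "DOWN  ${name}"
  return 1
}

status=0
check prometheus "http://localhost:9090/-/healthy"         || status=1
check loki       "http://localhost:3100/ready"      ready  || status=1
check grafana    "http://localhost:3000/api/health" ok     || status=1
check homer      "http://localhost:9080/"                  || status=1

if [ "${status}" -eq 0 ]; then echo "all services up"; else echo "some services down"; fi
```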


Grafana Dashboard Ideas

Here are PromQL queries you can use to build dashboards.

VoIP Overview Panel

# Active calls per server (stat panel)
asterisk_active_calls

# Total agents logged in (stat panel)
sum(asterisk_agents_logged_in)

# SIP trunk status table
asterisk_sip_peer_status{peer!~"[0-9]+"}

SIP Trunk Latency Graph

# Trunk latency over time (time series panel)
asterisk_sip_peer_latency_ms{peer!~"[0-9]+"}

RTP Quality Heatmap

# Packet loss distribution (heatmap panel)
asterisk_rtp_packet_loss_percent

Blackbox Probe Duration

# Probe response time (time series panel)
probe_duration_seconds{job="blackbox_icmp"}

# Probe success rate (stat panel, percentage)
avg_over_time(probe_success{job="blackbox_sip_tcp"}[1h]) * 100

System Resource Overview

# CPU usage per server (time series)
100 - (avg by(server) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage (gauge)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk usage percentage (gauge)
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100

Loki Log Query Examples

# Asterisk errors on a specific server
{job="asterisk", server="server1"} |= "ERROR"

# SIP registration failures across all servers
{job="asterisk"} |~ "Registration.*failed|UNREACHABLE"

# ViciDial dialer errors
{job="vicidial"} |= "ERROR" | logfmt

Tips and Tricks

1. Hot-reload Prometheus config without restart

After editing prometheus.yml or alert rules:

curl -X POST http://localhost:9090/-/reload

This works because we started Prometheus with --web.enable-lifecycle. No downtime, no data loss.

2. Check Prometheus config syntax before applying

docker exec prometheus promtool check config /etc/prometheus/prometheus.yml
docker exec prometheus promtool check rules /etc/prometheus/rules/alerts.yml

Always validate before reloading. A syntax error in rules will cause Prometheus to reject the entire reload.

3. Exclude noisy SIP messages from Homer

The heplify agent flag -dim "OPTIONS,NOTIFY" filters out SIP OPTIONS keepalives and NOTIFY events. These make up 90%+ of SIP traffic but are rarely useful for debugging. If you need them, remove the flag.

4. Use labels consistently

Every metric from the asterisk_exporter includes a server label. Use the same label values across all configs (prometheus.yml targets, promtail config, Smokeping targets). This lets you correlate metrics, logs, and SIP captures for the same server in Grafana.

5. Grafana variables for multi-server dashboards

Create a Grafana dashboard query variable (Dashboard settings → Variables): name it server, point it at the Prometheus data source, and use a query such as label_values(asterisk_active_calls, server). Any metric that carries the server label works here.

Then use $server in all panel queries:

asterisk_active_calls{server="$server"}

This gives you a dropdown at the top of the dashboard to switch between servers.

6. Set up recording rules for expensive queries

If you have many SIP peers, Grafana queries over the per-peer series can get slow. Pre-compute the aggregates you actually chart:

# Add to prometheus/rules/recording.yml
groups:
  - name: voip_recording_rules
    interval: 30s
    rules:
      - record: job:asterisk_trunks_up:count
        expr: count by(server) (asterisk_sip_peer_up{peer!~"[0-9]+"}==1)
      - record: job:asterisk_trunks_total:count
        expr: count by(server) (asterisk_sip_peer_up{peer!~"[0-9]+"})

7. Monitor Loki ingestion rate

If promtail stops shipping logs, you won't notice unless you check. Add this to your alert rules:

- alert: LokiIngestionStopped
  expr: sum(rate(loki_distributor_bytes_received_total[5m])) == 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Loki is not receiving any logs"

8. Smokeping graph colors

Smokeping uses RRD graphs. The gray bands show packet loss. Narrow green lines = stable. Wide gray bands = intermittent loss. If you see periodic patterns (e.g., loss every hour), it often correlates with backup jobs or log rotation on the target server.

9. Scale the asterisk_exporter

The exporter runs Asterisk CLI commands synchronously. On a busy server with 100+ active channels, core show channel <chan> for transcoding detection can take several seconds per channel. If scrape timeouts occur, try one or more of the following: raise scrape_timeout (and scrape_interval) for the asterisk job in prometheus.yml, cache the output of the expensive commands between scrapes, or skip the per-channel commands once the channel count exceeds a threshold.

10. Persistent Docker volumes

All data is stored in named Docker volumes (prometheus_data, loki_data, etc.). This means docker compose down preserves data, but docker compose down -v destroys it. Never use -v unless you want a clean start.
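The backup script in Step 12 only covers configuration; the named volumes hold the actual metrics, dashboards, and logs. A sketch of a volume-level backup using a throwaway helper container follows. Note that Compose usually prefixes volume names with the project name (e.g. monitoring_prometheus_data), so check docker volume ls and adjust; the script prints the commands by default and only executes them when passed --run:

```shell
#!/bin/bash
# volume-backup.sh (sketch) — archive the contents of the stack's named
# Docker volumes via a helper container. Volume names are assumptions;
# verify them with `docker volume ls` first.
BACKUP_DIR="${BACKUP_DIR:-/var/backups/monitoring-volumes}"
DATE=$(date +%Y%m%d)

for vol in prometheus_data grafana_data loki_data; do
  # Mount the volume read-only and tar it into the backup directory
  cmd="docker run --rm -v ${vol}:/data:ro -v ${BACKUP_DIR}:/backup alpine tar czf /backup/${vol}_${DATE}.tar.gz -C /data ."
  if [ "${1:-}" = "--run" ]; then
    mkdir -p "${BACKUP_DIR}"
    ${cmd}
  else
    echo "${cmd}"   # preview mode: show the command without Docker installed
  fi
done
```

Archiving a live Prometheus TSDB can produce an inconsistent copy; for a clean backup, stop the container first or use Prometheus's snapshot API (which requires starting it with --web.enable-admin-api).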


Troubleshooting

Prometheus shows target as DOWN

Symptoms: Target status shows DOWN with connection refused or context deadline exceeded.

Checklist:

  1. Is the exporter running on the remote server?
    ssh root@YOUR_SERVER "systemctl status node_exporter"
    
  2. Is the port accessible?
    curl -s http://YOUR_SERVER_IP:9100/metrics | head -5
    
  3. Is a firewall blocking the port?
    ssh root@YOUR_SERVER "iptables -L -n | grep 9100"
    # Or for firewalld:
    ssh root@YOUR_SERVER "firewall-cmd --list-ports"
    
  4. Add firewall rules if needed:
    # iptables
    iptables -I INPUT -p tcp --dport 9100 -s YOUR_MONITOR_VPS_IP -j ACCEPT
    iptables -I INPUT -p tcp --dport 9101 -s YOUR_MONITOR_VPS_IP -j ACCEPT
    
    # firewalld
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="YOUR_MONITOR_VPS_IP" port port="9100-9101" protocol="tcp" accept'
    firewall-cmd --reload
    

Loki not receiving logs from promtail

Symptoms: No logs visible in Grafana Explore with Loki data source.

Checklist:

  1. Check promtail status on the remote server:
    ssh root@YOUR_SERVER "systemctl status promtail"
    ssh root@YOUR_SERVER "journalctl -u promtail -n 50"
    
  2. Common errors:
    • 429 Too Many Requests: Increase ingestion_rate_mb in loki-config.yml
    • connection refused: Verify Loki port 3100 is open on the monitoring VPS firewall
    • file not found: The log path in promtail config does not exist on that server (e.g., /var/log/asterisk/full may not exist if full logging is disabled)
  3. Test Loki directly:
    curl -s http://YOUR_MONITOR_VPS_IP:3100/ready
    curl -s http://YOUR_MONITOR_VPS_IP:3100/loki/api/v1/labels
    

Homer not showing SIP messages

Symptoms: Homer webapp loads but shows no SIP data.

Checklist:

  1. Is heplify running on the VoIP server?
    ssh root@YOUR_SERVER "systemctl status heplify"
    
  2. Is the HEP port accessible?
    # From the VoIP server, test connectivity to the monitor
    ssh root@YOUR_SERVER "nc -zvu YOUR_MONITOR_VPS_IP 9060"
    # Note: UDP checks are best-effort; nc only reports a closed port if an
    # ICMP port-unreachable comes back, so a "success" here is not conclusive
    
  3. Check heplify-server logs:
    docker logs heplify-server --tail 50
    
  4. Check PostgreSQL has homer tables:
    docker exec postgres psql -U homer -d homer_data -c "\dt"
    
  5. Verify the homer user password matches between .env, 01-init-dbs.sql, and the heplify-server environment variables.

Grafana data source connection errors

Symptoms: Grafana shows "Bad Gateway" or "Connection refused" for a data source.

Checklist:

  1. Data source URLs must use the Docker service names, not localhost, because Grafana resolves them inside the Compose network:
    # correct: http://prometheus:9090, http://loki:3100
    # wrong:   http://localhost:9090
  2. Confirm the target container is actually running:
    docker compose ps
  3. Check Grafana's own logs for the underlying error:
    docker logs grafana --tail 50

Docker container keeps restarting

# Check logs for the failing container
docker logs <container_name> --tail 100

# Common causes:
# - postgres: init script SQL error (check 01-init-dbs.sql)
# - loki: permission error on /loki directory
# - prometheus: YAML syntax error in config
# - heplify-server: can't connect to postgres (check DB password)

High disk usage on monitoring VPS

# Check Docker volume sizes
docker system df -v

# Prometheus is usually the largest consumer
# Reduce retention: change --storage.tsdb.retention.time=30d to 14d

# Loki retention is applied by the compactor on its own schedule; to reclaim
# space, lower retention_period in loki-config.yml and wait for the next
# compaction cycle rather than trying to force one by hand

# Prune unused Docker resources
docker system prune -f

Security Considerations

Firewall the monitoring ports

The monitoring stack exposes several ports. In production, restrict access:

# Allow only your admin IP to access dashboards
iptables -I INPUT -p tcp --dport 3000 -s YOUR_ADMIN_IP -j ACCEPT    # Grafana
iptables -I INPUT -p tcp --dport 9090 -s YOUR_ADMIN_IP -j ACCEPT    # Prometheus
iptables -I INPUT -p tcp --dport 9080 -s YOUR_ADMIN_IP -j ACCEPT    # Homer

# Allow only VoIP servers to push data
iptables -I INPUT -p tcp --dport 3100 -s YOUR_VOIP_SERVER_1_IP -j ACCEPT  # Loki
iptables -I INPUT -p udp --dport 9060 -s YOUR_VOIP_SERVER_1_IP -j ACCEPT  # HEP
# Repeat for each VoIP server

# Block all other access to these ports
iptables -A INPUT -p tcp --dport 3000 -j DROP
iptables -A INPUT -p tcp --dport 9090 -j DROP
# ... etc

Use read-only MySQL users

The asterisk_exporter and Grafana MySQL data sources should use a read-only MySQL user. Never give them write access to your production database.

-- On each VoIP server
CREATE USER 'grafana_ro'@'YOUR_MONITOR_VPS_IP' IDENTIFIED BY 'YOUR_STRONG_PASSWORD';
GRANT SELECT ON asterisk.* TO 'grafana_ro'@'YOUR_MONITOR_VPS_IP';
FLUSH PRIVILEGES;

Reverse proxy with TLS

For production use, put Grafana behind nginx or Caddy with TLS:

# /etc/nginx/sites-available/grafana
server {
    listen 443 ssl;
    server_name monitoring.yourdomain.com;

    ssl_certificate     /etc/letsencrypt/live/monitoring.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/monitoring.yourdomain.com/privkey.pem;

    location / {
        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # WebSocket support (required for Grafana Live)
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}

Do not expose Prometheus externally

Prometheus has no built-in authentication. If you need remote access, use an SSH tunnel or VPN rather than exposing port 9090 to the internet.
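One way to do that from your workstation is a local port forward over SSH. A sketch, with YOUR_MONITOR_VPS_IP standing in for your real host:

```shell
# From your workstation (not the VPS): tunnel Prometheus and Grafana to
# local ports, then browse http://localhost:9090 and http://localhost:3000.
# YOUR_MONITOR_VPS_IP is a placeholder for your monitoring host.
ssh -f -N \
    -o BatchMode=yes -o ConnectTimeout=5 -o ExitOnForwardFailure=yes \
    -L 9090:localhost:9090 \
    -L 3000:localhost:3000 \
    "root@YOUR_MONITOR_VPS_IP" \
  || echo "tunnel failed (check the hostname and your SSH access)"
```

The -f flag backgrounds the tunnel once the forwards are established; kill the ssh process to close it. A WireGuard or OpenVPN link works just as well if you already run one.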


What's Next

Once the base stack is running, consider these additions:

  1. Alertmanager: Add prom/alertmanager to the Docker Compose stack to send alerts via email, Slack, PagerDuty, or Telegram. Connect it by filling in the alertmanagers section in prometheus.yml.

  2. Grafana Alerting: Instead of Alertmanager, use Grafana's built-in unified alerting (already enabled in this stack) to create alert rules with contact points directly in the Grafana UI.

  3. Recording cleanup monitoring: Add a cron job that checks recording disk usage and alerts when the retention policy is not working.

  4. SIP quality scoring: Use the RTP metrics to compute a Mean Opinion Score (MOS) approximation:

    # Simplified R-factor → MOS conversion
    # R = 93.2 - packet_loss*2.5 - jitter*0.03 - latency*0.024
    # MOS = 1 + 0.035*R + R*(R-60)*(100-R)*7e-6
    
  5. Dashboard JSON exports: Export your best dashboards as JSON and commit them to the grafana/provisioning/dashboards/ directory for infrastructure-as-code.

  6. Log alerting in Loki: Use Grafana's log-based alerting to trigger on specific Asterisk log patterns (e.g., UNREACHABLE, chan_sip.c: Failed to authenticate).

  7. Uptime monitoring: Add an external uptime check (e.g., Uptime Kuma in Docker) that monitors the monitoring stack itself.
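The MOS formula in item 4 can be implemented as Prometheus recording rules. The sketch below uses only this stack's packet-loss metric; the jitter and latency terms are dropped because their metric names are not defined in this tutorial, so treat it as a loss-only approximation. Rules within a group evaluate in order, so the MOS rule can reference the R-factor rule:

```yaml
# prometheus/rules/mos.yml (sketch) — loss-only MOS approximation
groups:
  - name: voip_mos
    interval: 30s
    rules:
      # Simplified R-factor: only the packet-loss term from the formula above
      - record: server:asterisk_rfactor:estimate
        expr: 93.2 - asterisk_rtp_packet_loss_percent * 2.5
      # Standard R-factor → MOS conversion, clamped to the valid 1-5 range
      - record: server:asterisk_mos:estimate
        expr: >
          clamp_min(clamp_max(
            1 + 0.035 * server:asterisk_rfactor:estimate
              + server:asterisk_rfactor:estimate
                * (server:asterisk_rfactor:estimate - 60)
                * (100 - server:asterisk_rfactor:estimate) * 7e-6,
          5), 1)
```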


Summary

You now have a complete, production-grade VoIP monitoring stack:

  • Prometheus scraping metrics from every VoIP server, with alert rules
  • Grafana as the single pane of glass over metrics, logs, and SIP data
  • Loki and promtail centralizing Asterisk and ViciDial logs
  • Homer and heplify capturing SIP signaling for call debugging
  • Smokeping and the Blackbox Exporter watching latency, loss, and endpoints
  • A custom asterisk_exporter exposing calls, trunks, and agents

The total resource footprint on the monitoring VPS is approximately 2-3 GB RAM and minimal CPU at idle, scaling linearly with the number of monitored servers. The remote agents use less than 100 MB RAM combined per server.

This stack has been running in production monitoring a multi-server VoIP call center fleet (4 Asterisk/ViciDial servers, 7 SIP providers, 50+ agents) with zero data loss and sub-second query times in Grafana.


Built from production experience. Every configuration in this tutorial has been tested under real VoIP traffic.

Need expert help with your setup?

VoIP infrastructure consulting, AI voice agent integration, monitoring stacks, scaling — I've done it all in production.

Get a Free Consultation