
# Building Custom Claude Code Skills for VoIP Infrastructure Operations


15 Slash Commands for Monitoring, Investigation & Lookup


Audience: DevOps engineers, sysadmins, VoIP/telecom operators, and Claude Code power users.

What you will build: A complete AI-powered operations toolkit -- 15 custom slash commands that turn Claude Code into a senior infrastructure engineer who knows your servers, your databases, your SIP trunks, and your investigation playbooks by heart.

Prerequisites: Claude Code CLI installed, SSH access to your servers, basic familiarity with Asterisk/VoIP concepts.


## Table of Contents

  1. Why AI-Assisted Operations Skills
  2. Architecture Overview
  3. The SKILL.md File Format
  4. Directory Structure
  5. Operations Skills (6)
  6. Investigation Skills (5)
  7. Lookup Skills (4)
  8. Complete Example Skills
  9. Production Safety Hook
  10. MCP Grafana Integration
  11. Settings Configuration
  12. Skill Design Patterns & Tips
  13. Investigation Workflow Patterns
  14. Permission Management
  15. Putting It All Together

## 1. Why AI-Assisted Operations Skills

Traditional infrastructure monitoring gives you dashboards. Runbooks give you procedures. But neither thinks. Neither correlates. Neither adapts.

When you build custom Claude Code skills for your infrastructure, you get something qualitatively different:

Context-aware investigation. Instead of checking five different tools manually, you type /call-investigate +44XXXXXXXXXX and Claude traces the call through DID routing, carrier logs, Asterisk dialplans, SIP traces, agent state, and audio recordings -- correlating everything into a single diagnosis.

Institutional knowledge embedded in code. Every skill file encodes your team's hard-won knowledge: which hangup cause means what, which server uses which MySQL credentials, where the recordings live, what "normal" looks like for your trunks. New team members get the senior engineer's playbook on day one.

The 10x multiplier is real. Here is what changes:

| Task | Without Skills | With Skills |
|---|---|---|
| Health check across 5 servers | 5-10 min (SSH each, run commands, compare) | 15 sec (`/health`) |
| Investigate a dropped call | 30-60 min (find logs, trace routing, check carrier) | 2 min (`/call-investigate`) |
| Check why agent has no calls | 15-20 min (check ranks, ingroups, login state) | 30 sec (`/agent-ranks agent123`) |
| Diagnose audio quality complaint | 1-2 hours (Homer, recordings, codecs, network) | 5 min (`/audio-quality`) |
| Full server audit | 45-60 min | 3 min (`/audit-server`) |

Each skill is a Markdown file. No plugins to install, no APIs to build, no code to compile. You write the investigation procedure in natural language, and Claude executes it using the tools you allow.


## 2. Architecture Overview

```
+------------------+     SSH (key-based)     +-------------------+
|                  |------------------------->| VoIP Server 1     |
|  VPS / Jump Box  |------------------------->| VoIP Server 2     |
|  (Claude Code)   |------------------------->| VoIP Server 3     |
|                  |------------------------->| Replica DB        |
|  ~/.claude/      |                          +-------------------+
|    skills/       |
|      health/     |     Docker (local)       +-------------------+
|      calls/      |------------------------->| Grafana           |
|      agents/     |------------------------->| Prometheus        |
|      ...         |------------------------->| Loki              |
|    hooks/        |------------------------->| Homer (SIP/RTCP)  |
|    settings.json |------------------------->| Smokeping         |
+------------------+                          +-------------------+
        |
        | MCP (Model Context Protocol)
        v
+------------------+
| Grafana MCP      |
| (mcp-grafana)    |
| - Dashboards     |
| - PromQL queries |
| - Loki log search|
+------------------+
```

Key principle: Claude Code runs on a central VPS/jump box that has SSH access to all production servers and Docker access to monitoring containers. Skills teach Claude how to use these access paths to answer operational questions.
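Those access paths assume working SSH host aliases on the jump box. A minimal sketch of a `~/.ssh/config` entry the skills rely on (the alias, IP, and key path are placeholders for your own values):

```
# ~/.ssh/config on the jump box -- one entry per production server.
# Alias, IP, and key path below are placeholders.
Host server-a
    HostName YOUR_SERVER_IP
    User root
    IdentityFile ~/.ssh/voip-ops-key
    ConnectTimeout 5
```

With key-based entries like this, `ssh server-a "uptime"` runs non-interactively, which is what every skill below depends on.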


## 3. The SKILL.md File Format

Every skill is a single Markdown file named SKILL.md inside its own directory under ~/.claude/skills/. The file has two parts: a YAML frontmatter header and a Markdown body.

### Frontmatter (Required)

```yaml
---
name: skill-name
description: One-line description shown in skill listings and used for matching.
user-invocable: true
allowed-tools: Bash(ssh *), Bash(docker *), Bash(curl *)
---
```

| Field | Purpose |
|---|---|
| `name` | The slash command name. Users type `/name` to invoke. |
| `description` | Shown in help listings. Also used by Claude to decide when to suggest the skill. Be specific -- mention the problem types this skill addresses. |
| `user-invocable` | Set to `true` so users can trigger it directly with `/name`. |
| `allowed-tools` | Whitelist of tools the skill can use. Uses glob patterns. `Bash(ssh *)` means "allow any Bash command starting with `ssh`". |

### Allowed-Tools Patterns

```yaml
# SSH to any server
allowed-tools: Bash(ssh *)

# SSH + Docker + curl + ping
allowed-tools: Bash(ssh *), Bash(docker *), Bash(curl *), Bash(ping *)

# SSH + local audio tools
allowed-tools: Bash(ssh *), Bash(curl *), Bash(sox *), Bash(soxi *), Bash(ffprobe *)
```

The tool patterns act as a security boundary. A skill that only needs SSH cannot accidentally execute Docker commands or write files. Design skills with the minimum tools they need.

### Body (The Investigation Procedure)

The Markdown body is the actual instruction set. Claude reads this as its playbook when the skill is invoked. It should contain:

  1. What to do -- step-by-step procedures
  2. How to access resources -- SSH commands, SQL queries, API calls
  3. How to interpret results -- reference tables, thresholds, known patterns
  4. Server-specific variations -- different credentials, paths, or versions per server
  5. Output formatting -- how to present results to the user

The body supports a special variable: $ARGUMENTS -- whatever the user typed after the slash command. For example, if the user types /health server-a, then $ARGUMENTS is server-a.


## 4. Directory Structure

```
~/.claude/
  settings.json              # Global settings (permissions, hooks, env)
  settings.local.json        # Per-machine permission overrides
  hooks/
    protect-production.sh    # Safety hook: blocks dangerous commands
  skills/
    health/
      SKILL.md               # /health skill
    calls/
      SKILL.md               # /calls skill
    agents/
      SKILL.md               # /agents skill
    replication/
      SKILL.md               # /replication skill
    audit-server/
      SKILL.md               # /audit-server skill
    trunk-status/
      SKILL.md               # /trunk-status skill
    audio-quality/
      SKILL.md               # /audio-quality skill
    call-investigate/
      SKILL.md               # /call-investigate skill
    call-drops/
      SKILL.md               # /call-drops skill
    lagged/
      SKILL.md               # /lagged skill
    network-check/
      SKILL.md               # /network-check skill
    agent-ranks/
      SKILL.md               # /agent-ranks skill
    did-lookup/
      SKILL.md               # /did-lookup skill
    reports/
      SKILL.md               # /reports skill
    listen-recording/
      SKILL.md               # /listen-recording skill
```

Each skill gets its own directory. This is a Claude Code convention -- the directory name matches the skill name.
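Creating a new skill is therefore just a directory plus a file. A sketch of scaffolding a hypothetical `my-skill` (the `SKILLS_DIR` override and the stub content are illustrative):

```shell
# Sketch: scaffold a new skill named "my-skill" (hypothetical); the
# directory name must match the frontmatter `name:` field
SKILLS_DIR="${SKILLS_DIR:-$HOME/.claude/skills}"
mkdir -p "$SKILLS_DIR/my-skill"
cat > "$SKILLS_DIR/my-skill/SKILL.md" <<'EOF'
---
name: my-skill
description: Stub skill; replace with a real investigation procedure.
user-invocable: true
allowed-tools: Bash(ssh *)
---

# My Skill

If $ARGUMENTS is provided, check only that target.
EOF
echo "created $SKILLS_DIR/my-skill/SKILL.md"
```

After saving the file, `/my-skill` becomes available the next time Claude Code loads its skills.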


## 5. Operations Skills

These six skills answer the question: "What is happening right now?"

### 5.1 `/health` -- Quick Health Check

Purpose: Single-command health sweep across all production servers.

What it checks per server: hostname and uptime, active Asterisk channels, SIP peer count, MySQL status, disk usage, and fail2ban state (plus replication lag on the replica).

Design pattern: One SSH command per server that gathers all metrics, minimizing round-trips. Results presented as a table with WARNING/CRITICAL flags.

Usage:

```
/health              # Check all servers
/health server-a     # Check specific server
```

### 5.2 `/calls` -- Live Calls

Purpose: Real-time view of active calls across the infrastructure.

What it shows:

### 5.3 `/agents` -- Agent Status

Purpose: All logged-in agents with detailed status.

What it shows per agent:

Flags:

### 5.4 `/replication` -- Database Replication

Purpose: Check MariaDB multi-source replication health.

What it checks: per-connection IO and SQL thread state plus Seconds_Behind_Master for every multi-source replication channel.

Special feature: Pass `fix` as an argument to get suggested repair commands.

### 5.5 `/audit-server` -- Deep Server Audit

Purpose: Comprehensive server audit covering system, Asterisk, database, security, ViciDial, and logs.

Sections: System resources, Asterisk health, database status, security posture, ViciDial process status, recent errors.

Output: Organized by severity -- CRITICAL, WARNING, INFO.

### 5.6 `/trunk-status` -- SIP Trunk Status

Purpose: Check SIP trunk registration and connectivity.

Includes: Trunk inventory per server, quick all-server check loop, and a troubleshooting workflow (ping, firewall, registration, DNS, qualify, carrier logs).


## 6. Investigation Skills

These five skills answer the question: "Why did this happen?"

### 6.1 `/audio-quality` -- Voice Quality Investigation

Tools used: Homer RTCP (PostgreSQL), audio analysis service (NISQA neural scoring + Silero VAD), Asterisk logs, SIP peer stats, Smokeping, codec verification.

Investigation flow:

  1. Find the calls
  2. Identify endpoints (agent IP, trunk IP)
  3. Query Homer RTCP for packet loss and jitter
  4. Check Asterisk logs for codec errors, RTP switching
  5. Check live SIP quality
  6. Download and analyze recording
  7. Check network (Smokeping, ping, UDP buffers)

### 6.2 `/call-investigate` -- Deep Call Tracing

The most detailed skill. Traces a call through its entire lifecycle:

  1. Find call records (inbound/outbound/archived)
  2. Check carrier log (SIP-level hangup causes)
  3. Check DID routing
  4. Trace in Asterisk logs
  5. Search Homer SIP traces
  6. Find and analyze recording
  7. Check agent state at time of call

Includes reference tables for hangup causes (16 = normal clearing, 17 = user busy, 18 = no user responding, etc.) and problem statuses (DISMX, DCMX, DROP, TIMEOT, etc.).

### 6.3 `/call-drops` -- Drop & Failure Analysis

Purpose: Systematic analysis of problem dispositions.

Covers: DROP (queue timeout), DISMX/DCMX (mid-call disconnect), TIMEOT (agent timeout), AFTHRS (after hours), with carrier-level detail and historical baseline comparison.

### 6.4 `/lagged` -- Agent LAGGED Events

Purpose: Investigate ViciDial heartbeat failures that kick agents offline.

Correlation: Matches LAGGED timestamps against Homer RTCP data to determine if the cause was network (jitter spike, packet loss) or client-side (browser crash, PC freeze).

### 6.5 `/network-check` -- Network Quality

Tools: Homer RTCP analysis, Smokeping, direct ping, UDP buffer stats, SIP peer latency, live RTP channel stats, MTR traceroute.

Thresholds for packet loss, jitter, and latency are documented inline in the skill body, so results can be flagged as OK/WARNING/CRITICAL consistently.


## 7. Lookup Skills

These four skills answer the question: "What is this configured to do?"

### 7.1 `/agent-ranks` -- Rank & Routing Diagnostics

Purpose: Understand why calls go to specific agents.

Checks: Ingroup assignments, routing method, rank/weight configuration, active closer campaigns, call distribution fairness, ranking inconsistencies, and can simulate "who would get the next call right now?"

7.2 /did-lookup -- DID Routing

Purpose: Trace how a phone number is routed through the system.

Covers: DID configuration, company name mapping, call history, dialplan routing path, and can manage company-to-DID mappings.

### 7.3 `/reports` -- ViciDial Report Generation

Purpose: Quick access to 15+ built-in ViciDial reports plus direct SQL.

Provides: URL templates with proper parameters for agent performance, inbound stats, carrier logs, LAGGED reports, call exports, DID stats, and more. Also includes custom SQL queries for when built-in reports are not enough.

### 7.4 `/listen-recording` -- Recording Analysis

Purpose: Download and analyze call recordings with neural quality scoring.

Tools: NISQA (neural audio quality model), Silero VAD (voice activity detection for silence analysis), SoX (waveform analysis), ffprobe (format inspection).

Supports: Both MIX (combined stereo) and ORIG (separate caller/agent legs) recording formats.


## 8. Complete Example Skills

Here are four complete skill files you can adapt for your infrastructure.

### Example 1: `/health` -- Server Health Check

```yaml
---
name: health
description: Quick health check across all VoIP production servers. Shows Asterisk, MySQL, disk, uptime, fail2ban, replication.
user-invocable: true
allowed-tools: Bash(ssh *)
---
```

# Server Health Check

Run a quick health check across all production VoIP servers.
Use SSH config names (server-a, server-b, server-c, etc.).

If $ARGUMENTS is provided, check only those servers.
Otherwise check all production servers.

For each server, run ONE ssh command that gathers:
1. `hostname` and `uptime`
2. `asterisk -rx "core show channels" | tail -1` (active calls)
3. `asterisk -rx "sip show peers" | tail -1` (SIP peers)
4. `mysqladmin status 2>/dev/null | head -1` (MySQL uptime/threads/queries)
5. `df -h / | tail -1` (disk usage)
6. `fail2ban-client status 2>/dev/null | head -2` (fail2ban)

Combine all into a single SSH command per server to minimize round-trips.

Present results in a clean table format. Flag any issues:
- Disk > 80% = WARNING
- No active Asterisk channels when agents should be online = WARNING
- fail2ban not running = CRITICAL
- MySQL not responding = CRITICAL

Also check replication on the replica server (ssh your-replica):
- `mysql -u YOUR_REPL_USER -pYOUR_REPL_PASS -e "SHOW ALL SLAVES STATUS\G" | grep -E "Connection_name|Slave_IO|Slave_SQL|Seconds_Behind"`

Server reference:
- server-a (YOUR_SERVER_IP) -- Primary, Asterisk 18
- server-b (YOUR_SERVER_IP) -- Secondary, Asterisk 16
- server-c (YOUR_SERVER_IP) -- Tertiary, Asterisk 13
- server-d (YOUR_SERVER_IP) -- Standalone

### Example 2: `/call-investigate` -- Deep Call Tracing

```yaml
---
name: call-investigate
description: Deep investigation of specific calls by phone number, uniqueid, or agent ID. Traces full call path from DID through routing to agent, checks carrier logs, SIP traces, recordings, and dispositions. Use for any call complaint or incident.
user-invocable: true
allowed-tools: Bash(ssh *), Bash(docker *), Bash(curl *)
---
```

# Call Investigation

Deep-dive into specific calls. $ARGUMENTS: phone number(s), uniqueid(s),
agent ID(s), or date range.

## Step 1: Find the Call Records

### Inbound calls (vicidial_closer_log)
```sql
SELECT call_date, phone_number, length_in_sec, status, term_reason,
       uniqueid, closecallid, user, campaign_id, queue_seconds,
       comments
FROM vicidial_closer_log
WHERE phone_number LIKE '%NUMBER%'
  AND call_date >= 'YYYY-MM-DD'
ORDER BY call_date DESC LIMIT 20;
```

### Outbound calls (vicidial_log)

```sql
SELECT call_date, phone_number, length_in_sec, status, term_reason,
       uniqueid, user, campaign_id
FROM vicidial_log
WHERE phone_number LIKE '%NUMBER%'
  AND call_date >= 'YYYY-MM-DD'
ORDER BY call_date DESC LIMIT 20;
```

### By agent

```sql
SELECT call_date, phone_number, length_in_sec, status, term_reason,
       uniqueid, campaign_id
FROM vicidial_closer_log WHERE user='AGENT_ID' AND call_date >= CURDATE()
UNION ALL
SELECT call_date, phone_number, length_in_sec, status, term_reason,
       uniqueid, campaign_id
FROM vicidial_log WHERE user='AGENT_ID' AND call_date >= CURDATE()
ORDER BY call_date DESC LIMIT 30;
```

## Step 2: Check Carrier Log (SIP-level detail)

```sql
SELECT call_date, channel, server_ip, dialstatus,
       hangup_cause, sip_hangup_cause, sip_hangup_reason,
       dial_time, answered_time, dead_sec
FROM vicidial_carrier_log
WHERE uniqueid='UNIQUEID'
ORDER BY call_date;
```

Key hangup causes: 16 = normal clearing, 17 = user busy, 18 = no user responding, 21 = call rejected, 34 = no circuit/channel available.

Key dialstatuses: ANSWER, BUSY, NOANSWER, CONGESTION, CHANUNAVAIL, FAILED.

## Step 3: Check DID Routing

```sql
SELECT did_id, did_pattern, did_description, did_route,
       did_agent_a, extension, exten_context, group_id
FROM vicidial_inbound_dids
WHERE did_pattern LIKE '%DID_NUMBER%';
```

## Step 4: Trace in Asterisk Logs

```bash
# Find the call in Asterisk logs by uniqueid or phone number
ssh your-server "grep -E 'UNIQUEID|PHONE_NUMBER' /var/log/asterisk/messages | tail -30"

# Trace full SIP dialog by Call-ID
ssh your-server "grep 'CALL_ID' /var/log/asterisk/messages | tail -50"
```

What to look for: the Q.850 hangup cause, SIP error responses, RTP timeout disconnects, and codec negotiation warnings around the call's timestamps.

## Step 5: Check Homer SIP Traces (if available)

```bash
docker exec -i postgres psql -U homer -d homer_data -c "
SELECT create_date, protocol_header->>'method' as method,
       protocol_header->>'srcIp' as src,
       protocol_header->>'dstIp' as dst
FROM hep_proto_1_default_YYYYMMDD_HHMM
WHERE raw::text LIKE '%PHONE_NUMBER%'
ORDER BY create_date DESC LIMIT 20;
"
```

Note: SIP table is hep_proto_1_*, RTCP is hep_proto_5_*. Partitions are by UTC time (6-hour windows).
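Since the partition name encodes the 6-hour UTC window, a small shell sketch can compute the current suffix instead of working it out by hand:

```shell
# Sketch: compute the current Homer partition suffix
# (partitions start at 00/06/12/18 UTC)
hour=$(date -u +%H)
window=$(printf '%02d' $(( (10#$hour / 6) * 6 )))
suffix="$(date -u +%Y%m%d)_${window}00"
echo "hep_proto_1_default_${suffix}"   # SIP; swap in hep_proto_5_ for RTCP
```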

## Step 6: Check Recording

```bash
ssh your-server "mysql asterisk -e \"SELECT recording_id, filename,
  location, length_in_sec FROM recording_log WHERE lead_id IN (
  SELECT lead_id FROM vicidial_closer_log WHERE uniqueid='UNIQUEID'
) ORDER BY start_time DESC LIMIT 5;\""

# Audio analysis (if analysis service is running)
curl -s "http://localhost:8084/analyze?uniqueid=UNIQUEID&server=SERVER_KEY" | jq .
```

## Step 7: Check Agent State at Time of Call

```sql
SELECT event_time, user, pause_epoch, wait_epoch, talk_epoch,
       dispo_epoch, status, sub_status, pause_type, dead_sec
FROM vicidial_agent_log
WHERE user='AGENT_ID'
  AND event_time >= 'YYYY-MM-DD HH:MM:00'
  AND event_time <= 'YYYY-MM-DD HH:MM:59'
ORDER BY event_time;
```

## Problem Status Reference

| Status | Meaning | Investigation |
|---|---|---|
| DISMX | Disconnect mid-call (inbound) | Check carrier_log, network, agent connection |
| DCMX | Disconnect mid-call (outbound) | Same as above |
| DROP | Call dropped from queue (timeout) | Check queue timeout, agent availability |
| TIMEOT | Agent didn't answer in time | Check alert settings, softphone |
| ADCT | Auto-disconnect | Check dead_max campaign setting |
| AFTHRS | After hours routing | Check ingroup after_hours settings |
| NANQUE | No agent, no queue | Check no_agent_no_queue setting |
| HXFER | Hangup during transfer | Check transfer target availability |
| XDROP | External drop | Carrier/trunk issue |
| LAGGED | Agent lagged out | Network -- use /lagged skill |

## MySQL Access Per Server


### Example 3: `/audio-quality` -- Voice Quality Investigation

```yaml
---
name: audio-quality
description: Investigate audio quality issues for specific calls or agents. Uses Homer RTCP, audio analysis service, Asterisk logs, recording playback, codec checks. Use when agents or clients complain about voice quality, one-way audio, choppy audio, echo, or silence.
user-invocable: true
allowed-tools: Bash(ssh *), Bash(docker *), Bash(curl *), Bash(ping *)
---
```

# Audio Quality Investigation

Investigate voice quality issues using ALL available tools.
$ARGUMENTS can be: phone number(s), agent ID(s), or "all" for a general sweep.

## Available Tools

### 1. Homer RTCP Analysis (PostgreSQL via Docker)

Query RTCP data from Homer to check packet loss and jitter.

```bash
# Connect to Homer DB
docker exec -i postgres psql -U homer -d homer_data

# Find RTCP table names (6-hour partitions, UTC time)
docker exec -i postgres psql -U homer -d homer_data -c "\dt hep_proto_5_default_*" | tail -20

# Query RTCP from a specific source IP
docker exec -i postgres psql -U homer -d homer_data -c "
SELECT
  create_date,
  protocol_header->>'srcIp' as src,
  protocol_header->>'dstIp' as dst,
  (raw::jsonb->'sender_information'->>'packets')::bigint as pkts,
  (raw::jsonb->'report_blocks'->0->>'fraction_lost')::bigint as frac_lost,
  (raw::jsonb->'report_blocks'->0->>'ia_jitter')::bigint as jitter,
  (raw::jsonb->'report_blocks'->0->>'packets_lost')::bigint as lost
FROM hep_proto_5_default_YYYYMMDD_HHMM
WHERE protocol_header->>'srcIp' LIKE 'IP_PATTERN%'
  AND create_date > NOW() - INTERVAL '2 hours'
ORDER BY create_date DESC LIMIT 50;
"
```

CRITICAL: Table partitions are by UTC time. If your VPS is in CET (UTC+1), and it is 14:00 CET, that is 13:00 UTC, so use the table *_1200 (covers 12:00-18:00 UTC).

Interpreting RTCP values: `frac_lost` is an 8-bit fixed-point fraction (divide by 256, so 13 ≈ 5% loss), `ia_jitter` is reported in RTP timestamp units (divide by 8 for milliseconds with 8 kHz codecs), and `lost` is the cumulative packet count. Sustained loss above roughly 1% or jitter above roughly 30 ms is usually audible.

### 2. Audio Analysis Service (FastAPI)

If you have a neural audio quality service running:

```bash
# Analyze a specific recording (by uniqueid)
curl -s "http://localhost:8084/analyze?uniqueid=UNIQUEID&server=SERVERNAME" | jq .

# AI-powered analysis (uses an LLM to interpret scores)
curl -s "http://localhost:8084/ai-analyze?uniqueid=UNIQUEID&server=SERVERNAME" | jq .
```

### 3. Asterisk Logs (on production servers via SSH)

```bash
# Check for codec issues
ssh your-server "grep 'Unknown RTP codec' /var/log/asterisk/messages | tail -20"

# Check for RTP source switching (NAT issues)
ssh your-server "grep 'Strict RTP' /var/log/asterisk/messages | tail -20"

# Check for jitter buffer resyncs
ssh your-server "grep 'Resyncing the jb' /var/log/asterisk/messages | tail -20"
```

### 4. SIP Peer Quality (live agent quality)

```bash
# Check agent SIP registration quality
ssh your-server "asterisk -rx 'sip show peer AGENT_EXT'"
# Look for: Status (latency), Useragent (softphone version), codecs

# Live RTP stats for all active channels
ssh your-server "asterisk -rx 'sip show channelstats'"
# Shows: Recv/Sent packets, Lost packets, Jitter, RTT per channel
```

### 5. Codec Verification

```bash
# Check what codecs an agent negotiated
ssh your-server "asterisk -rx 'core show channel SIP/AGENT-CHANNELID'"
# Look for: NativeFormats, ReadFormat, WriteFormat
# If Read != Write, there is transcoding (quality loss)

# Check trunk codec config
ssh your-server "grep -A5 'TRUNK_NAME' /etc/asterisk/sip-vicidial.conf"
```

### 6. Network Quality (Smokeping + Ping)

```bash
# Direct ping test
ping -c 10 TARGET_IP

# Check UDP buffer overflows (on production server)
ssh your-server "cat /proc/net/snmp | grep Udp"
# RcvbufErrors > 0 = packets dropped due to small UDP buffers
```
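To reduce that output to a single number you can act on, a small awk helper (a sketch; the function name is illustrative) pairs the `Udp:` header line with its value line:

```shell
# Sketch: extract RcvbufErrors by pairing the Udp: header line with the
# Udp: value line (works on `cat /proc/net/snmp` output piped over SSH)
udp_rcvbuf_errors() {
  awk '/^Udp:/ && !done { for (i = 1; i <= NF; i++) name[i] = $i
                          getline
                          for (i = 1; i <= NF; i++) val[name[i]] = $i
                          done = 1 }
       END { print val["RcvbufErrors"] }'
}

# Demo with sample output; in practice:
#   ssh your-server "cat /proc/net/snmp" | udp_rcvbuf_errors
printf 'Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors\nUdp: 1000 2 0 900 5 0\n' | udp_rcvbuf_errors   # -> 5
```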

## Investigation Workflow

  1. Find the calls: Query closer_log or call_log by phone number
  2. Identify endpoints: Agent ID -> SIP peer -> agent IP. Trunk -> trunk IP
  3. Check Homer RTCP: Query for both directions (trunk->server, server->agent)
  4. Check Asterisk logs: Codec errors, RTP switching, jitter resyncs
  5. Check live SIP quality: sip show peer, sip show channelstats
  6. Listen to recording: Download and analyze via audio analysis service
  7. Check network: Smokeping, ping, UDP buffers

## Common Root Causes (from real investigations)


### Example 4: `/trunk-status` -- SIP Trunk Status

```yaml
---
name: trunk-status
description: Check SIP trunk status across all VoIP servers. Shows registration state, latency, active calls per trunk. Use when calls fail to connect, trunks go UNREACHABLE, or provider issues suspected.
user-invocable: true
allowed-tools: Bash(ssh *)
---
```

# SIP Trunk Status Check

Check all SIP trunks across production servers.
$ARGUMENTS: server name or "all".

## Per Server Check

```bash
# Show all SIP peers with status
ssh your-server "asterisk -rx 'sip show peers'"

# Show only trunks (filter out agent extensions)
ssh your-server "asterisk -rx 'sip show peers' | grep -E 'trunk_name|UNREACHABLE'"

# Detailed info for a specific trunk
ssh your-server "asterisk -rx 'sip show peer TRUNKNAME'"
```

## Trunk Inventory by Server

Maintain a table mapping trunks to providers and purposes:

| Server | Trunk | Provider IP | Purpose |
|---|---|---|---|
| server-a | provider1_de | YOUR_PROVIDER_IP | Primary inbound |
| server-a | provider1_uk | YOUR_PROVIDER_IP | UK outbound |
| server-a | provider2 | YOUR_PROVIDER_IP | Inbound |
| server-b | provider3 | YOUR_PROVIDER_IP | Regional inbound |
| server-c | provider1 | YOUR_PROVIDER_IP | General |

## Quick All-Server Trunk Check

```bash
for srv in server-a server-b server-c server-d; do
  echo "=== $srv ==="
  ssh $srv "asterisk -rx 'sip show peers' | grep -cE 'OK|UNREACHABLE|UNKNOWN'" 2>/dev/null
  ssh $srv "asterisk -rx 'sip show peers' | grep -E 'UNREACHABLE|UNKNOWN'" 2>/dev/null
  echo ""
done
```

## Troubleshooting UNREACHABLE Trunks

1. Ping the provider IP: `ssh your-server "ping -c 3 PROVIDER_IP"`

2. Check the firewall (must be whitelisted if the final rule is DROP): `ssh your-server "iptables -S INPUT | grep PROVIDER_IP"`

3. Check SIP registration: `ssh your-server "asterisk -rx 'sip show registry'"`

4. Check if the provider changed IP: `ssh your-server "dig SIP_HOSTNAME"`

5. Test SIP OPTIONS: `ssh your-server "asterisk -rx 'sip qualify peer TRUNKNAME'"`

6. Check the carrier log for failures:

   ```sql
   SELECT call_date, dialstatus, hangup_cause, sip_hangup_cause
   FROM vicidial_carrier_log
   WHERE channel LIKE '%TRUNKNAME%'
   ORDER BY call_date DESC LIMIT 10;
   ```

---

## 9. Production Safety Hook

This is the most important file in your setup. The safety hook runs **before every Bash command** Claude executes, and blocks dangerous operations on production servers.

### Why You Need This

AI assistants are powerful but imperfect. Without guardrails:
- A "cleanup" task might `rm -rf` a critical directory
- A SQL query might accidentally `DROP TABLE` instead of `SELECT`
- A config edit might break live call routing
- An Asterisk restart might drop 50 active calls

### The Hook: `~/.claude/hooks/protect-production.sh`

```bash
#!/bin/bash
# Hook: Block dangerous operations on production servers
# Exit 0 = allow, Exit 2 = block (with stderr message)

INPUT=$(cat)
COMMAND=$(echo "$INPUT" | jq -r '.tool_input.command // empty')

# Production server IPs and SSH config names
PROD_SERVERS="YOUR_SERVER_IP_1|YOUR_SERVER_IP_2|YOUR_SERVER_IP_3|server-a|server-b|server-c"

# Check if command targets a production server
targets_prod() {
  echo "$COMMAND" | grep -qE "$PROD_SERVERS"
}

# Block dangerous patterns on production
if targets_prod; then
  # Block rm -rf on remote servers
  if echo "$COMMAND" | grep -qE 'rm\s+(-rf|-fr)\s+/'; then
    echo "BLOCKED: rm -rf on production server. This could delete critical data." >&2
    exit 2
  fi

  # Block DROP/TRUNCATE on production databases
  if echo "$COMMAND" | grep -qiE '(DROP\s+TABLE|TRUNCATE\s+TABLE|DROP\s+DATABASE|DELETE\s+FROM\s+vicidial)'; then
    echo "BLOCKED: Destructive SQL on production. Use SELECT first, then ask for explicit approval." >&2
    exit 2
  fi

  # Block modifying Asterisk config files
  if echo "$COMMAND" | grep -qE '(sed|awk|tee|>|>>)\s.*(extensions\.conf|extensions-vicidial\.conf|sip-vicidial\.conf|customexte\.conf|sip\.conf)'; then
    echo "BLOCKED: Modifying Asterisk config on production. These affect live calls. Ask for explicit approval first." >&2
    exit 2
  fi

  # Block systemctl stop/restart asterisk without approval
  if echo "$COMMAND" | grep -qE 'systemctl\s+(stop|restart)\s+asterisk'; then
    echo "BLOCKED: Stopping/restarting Asterisk on production will drop all active calls." >&2
    exit 2
  fi
fi

# Allow everything else
exit 0
```

Make it executable:

```bash
chmod +x ~/.claude/hooks/protect-production.sh
```

### How It Works

The hook receives the command as JSON on stdin. It extracts the Bash command, checks if it targets any production server (by IP or SSH config name), and blocks four categories of dangerous operations:
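Before trusting the hook, it helps to exercise its grep patterns against sample commands. A simplified sketch (pattern matching only, without the JSON extraction or the production-server check):

```shell
# Sketch: exercise a few of the hook's blocking patterns against sample
# commands (simplified; the real hook also checks the target server)
is_blocked() {
  echo "$1" | grep -qiE 'rm\s+(-rf|-fr)\s+/|DROP\s+TABLE|TRUNCATE\s+TABLE|systemctl\s+(stop|restart)\s+asterisk'
}

for cmd in \
  'ssh server-a "uptime"' \
  'ssh server-a "systemctl restart asterisk"' \
  'ssh server-b "rm -rf /var/spool/asterisk"'; do
  if is_blocked "$cmd"; then echo "BLOCK: $cmd"; else echo "allow: $cmd"; fi
done
```

For an end-to-end test, pipe a full PreToolUse payload such as `{"tool_input":{"command":"..."}}` into the hook script itself and check its exit code.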

| Pattern | What It Blocks | Why |
|---|---|---|
| `rm -rf /` on prod | Recursive file deletion | Could delete recordings, configs, databases |
| `DROP TABLE`, `TRUNCATE`, `DELETE FROM vicidial*` | Destructive SQL | Call logs, agent data, routing config |
| `sed`/`awk`/`tee`/`>` on Asterisk `.conf` files | Config file modification | Changes affect live call routing |
| `systemctl stop/restart asterisk` | Asterisk service control | Drops all active calls immediately |

Exit codes:

- Exit 0 -- allow the command to run.
- Exit 2 -- block the command; the stderr message is surfaced to Claude as the reason for the block.

### Registering the Hook

The hook is registered in ~/.claude/settings.json under the hooks section (see Settings Configuration below).

### Real Incident That Prompted This

On a production system, an AI assistant replaced a ring group extension in customexte.conf with a Hangup() command as part of a "cleanup." This caused all after-hours and no-agent calls to be silently dropped instead of ringing backup phones. At least 11 calls were lost overnight before the issue was noticed. The safety hook prevents this class of error entirely.


## 10. MCP Grafana Integration

The Model Context Protocol (MCP) lets Claude Code interact directly with Grafana for dashboards, Prometheus queries, and Loki log searches -- without needing to SSH anywhere.

### Setup

Install the Grafana MCP server:

```bash
pip install mcp-grafana
# or via uvx:
uvx mcp-grafana
```

Create a Grafana API service account and token:

  1. In Grafana, go to Administration > Service Accounts
  2. Create a new service account with Viewer role
  3. Generate a token

Configure the MCP server. Add to your project's .mcp.json or configure in Claude Code settings:

```json
{
  "mcpServers": {
    "grafana": {
      "command": "uvx",
      "args": ["mcp-grafana"],
      "env": {
        "GRAFANA_URL": "http://localhost:3000",
        "GRAFANA_API_KEY": "YOUR_GRAFANA_SERVICE_ACCOUNT_TOKEN"
      }
    }
  }
}
```

### Available MCP Tools

Once configured, Claude Code gains these tools:

| Tool | Purpose |
|---|---|
| mcp__grafana__search_dashboards | Find dashboards by name |
| mcp__grafana__get_dashboard_by_uid | Get full dashboard JSON |
| mcp__grafana__get_dashboard_panel_queries | Extract panel queries |
| mcp__grafana__query_prometheus | Execute PromQL queries directly |
| mcp__grafana__query_loki_logs | Search Loki logs |
| mcp__grafana__list_loki_label_names | Browse Loki label taxonomy |
| mcp__grafana__list_loki_label_values | Get values for a label |
| mcp__grafana__list_datasources | List all configured datasources |
| mcp__grafana__list_prometheus_metric_names | Browse available metrics |
| mcp__grafana__list_prometheus_label_values | Query Prometheus labels |
| mcp__grafana__create_annotation | Add annotations to dashboards |
| mcp__grafana__get_panel_image | Render panel as image |

### How Skills Use MCP

Your skills can reference MCP tools for deeper investigation. For example, in a /network-check skill, after checking Homer RTCP directly, you might also query Prometheus for historical latency data:

```
# In the skill body, you can note:
If Prometheus node_exporter is available, also check:
- mcp__grafana__query_prometheus with query: rate(node_network_receive_drop_total[5m])
- mcp__grafana__query_prometheus with query: node_network_mtu_bytes
```

The MCP tools are available alongside Bash/SSH tools, giving skills access to both real-time (SSH to servers) and historical (Prometheus/Loki time series) data.

### Example: Datasource Configuration

A typical Grafana setup for VoIP monitoring includes:

| Datasource | Type | Purpose |
|---|---|---|
| Prometheus | prometheus | Server metrics (CPU, memory, disk, network) |
| Loki | loki | Aggregated log streams from all servers |
| Homer | postgres | SIP/RTCP data for call quality analysis |
| ViciDial | mysql | Direct database queries for call center data |

## 11. Settings Configuration

### `~/.claude/settings.json`

This is the global settings file. It controls permissions, hooks, environment variables, and plugins.

```json
{
  "permissions": {
    "allow": [
      "Bash(*)",
      "Read(*)",
      "Write(*)",
      "Edit(*)"
    ],
    "deny": [
      "Bash(rm -rf /*)",
      "Bash(dd if=*)",
      "Bash(mkfs*)"
    ]
  },
  "effortLevel": "high",
  "env": {
    "CLAUDE_AUTOCOMPACT_PCT_OVERRIDE": "70",
    "PATH": "/root/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
  },
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "/root/.claude/hooks/protect-production.sh"
          }
        ]
      }
    ],
    "Notification": [
      {
        "matcher": "",
        "hooks": [
          {
            "type": "command",
            "command": "echo '\\a'"
          }
        ]
      }
    ]
  }
}
```

### Key Settings Explained

`effortLevel: "high"` -- Tells Claude to be thorough. For infrastructure operations, you want Claude to check multiple sources, correlate data, and provide detailed analysis rather than quick surface-level answers.

`CLAUDE_AUTOCOMPACT_PCT_OVERRIDE: "70"` -- Auto-compact context at 70% of the context window. Infrastructure investigations can generate a lot of output (SQL results, log excerpts, RTCP data). Setting this to 70% gives Claude room to work without losing important context from earlier in the conversation.

`hooks.PreToolUse` -- The safety hook runs before every Bash command. The `matcher: "Bash"` means it only triggers for Bash tool calls (not Read, Write, or Edit).

`hooks.Notification` -- Terminal bell when Claude needs attention. Useful when running long investigations -- you can switch to another terminal and get notified when Claude has results.

`permissions.deny` -- Hard blocks on truly catastrophic commands. These cannot be overridden by skills or conversation.

~/.claude/settings.local.json

This file stores per-machine permission overrides. As you use Claude Code and approve tool permissions, they accumulate here. For a VoIP operations setup, you will typically see patterns like:

{
  "permissions": {
    "allow": [
      "Bash(ssh:*)",
      "Bash(docker exec:*)",
      "Bash(docker ps:*)",
      "Bash(curl:*)",
      "Bash(ping:*)",
      "Bash(python3:*)",
      "mcp__grafana__list_datasources",
      "mcp__grafana__query_prometheus",
      "mcp__grafana__query_loki_logs",
      "mcp__grafana__search_dashboards",
      "Skill(health)",
      "Skill(call-investigate)",
      "Skill(listen-recording)"
    ]
  }
}

Tip: Review this file periodically. Remove permissions you no longer need. The principle of least privilege applies to AI assistants too.


12. Skill Design Patterns & Tips

Pattern 1: Minimize SSH Round-Trips

Bad: Five separate SSH commands to one server.

# Bad - 5 round trips
ssh your-server "hostname"
ssh your-server "uptime"
ssh your-server "df -h /"
ssh your-server "asterisk -rx 'core show channels'"
ssh your-server "fail2ban-client status"

Good: One SSH command that gathers everything.

# Good - 1 round trip
ssh your-server "hostname; uptime; df -h / | tail -1; asterisk -rx 'core show channels' | tail -1; fail2ban-client status 2>/dev/null | head -2"

When checking multiple servers, run the SSH commands in parallel (Claude Code can make multiple Bash calls simultaneously).
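The same fan-out can be done in plain shell. A minimal sketch -- the server names and the probe command are illustrative placeholders from this tutorial:

```shell
# check_all runs the same combined probe on every host concurrently,
# then waits for all sessions; one slow server no longer serializes the rest.
check_all() {
  for host in "$@"; do
    ssh "$host" "hostname; uptime; df -h / | tail -1" \
      > "/tmp/health-$host.txt" 2>&1 &
  done
  wait  # results land in /tmp/health-<host>.txt
}
```

Usage: `check_all server-a server-b server-c` gathers all three reports in roughly the time of the slowest single SSH round-trip.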

Pattern 2: Conditional Arguments with $ARGUMENTS

Skills should handle both specific and broad requests:

If $ARGUMENTS is provided (e.g., "server-a" or "agent123"), check only
that target. Otherwise check all servers/agents.

This makes skills flexible -- /health checks everything, /health server-a checks one server.
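A minimal sketch of how that instruction plays out in the command Claude generates -- here `$ARGUMENTS` is modeled as an optional first argument, and the server list is illustrative:

```shell
# pick_targets expands an empty argument to the full server list,
# or passes a specific target (server name, agent ID) through unchanged.
pick_targets() {
  target="${1:-all}"            # empty $ARGUMENTS -> check everything
  if [ "$target" = "all" ]; then
    echo "server-a server-b server-c"
  else
    echo "$target"              # a specific server or agent ID
  fi
}
```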

Pattern 3: Embed Interpretation Tables

Do not just show raw data. Include reference tables so Claude can interpret results:

**Key hangup causes:**
- 16 = Normal clearing (good)
- 17 = User busy
- 34 = No circuit available (trunk congestion)
- 38 = Network out of order
**NISQA Scores (1-5 scale):**
| Score | Quality |
|-------|---------|
| 4.0+  | Excellent |
| 3.5-4.0 | Good |
| 3.0-3.5 | Fair |
| < 3.0 | Poor |

Pattern 4: Include Server-Specific Variations

Real infrastructure is messy. Servers run different versions, use different credentials, have different paths:

## MySQL Access Per Server
- **server-a/server-b**: `ssh server-name "mysql asterisk -e '...'"`
  (root, no password needed via SSH)
- **server-c (older)**: `ssh server-c "mysql -u YOUR_USER -pYOUR_PASS asterisk -e '...'"`
- **replica**: `ssh replica "mysql -u YOUR_USER -pYOUR_PASS dbname -e '...'"`

## Server-Specific Notes
- **server-a/server-b**: ConfBridge, newer Asterisk, ORIG recordings available
- **server-c**: MeetMe (requires DAHDI), Asterisk 11, older OS

Pattern 5: Severity Flags

Define clear severity levels in investigation skills:

Flag any issues:
- Disk > 80% = WARNING
- Disk > 95% = CRITICAL
- fail2ban not running = CRITICAL
- Replication lag > 60s = WARNING
- Replication lag > 300s = CRITICAL
- SIP peer UNREACHABLE = CRITICAL

Claude will use these to prioritize its output, putting critical items at the top.
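The same thresholds can be encoded directly in a skill's gathering script, so severity is computed at the source. A sketch for the disk thresholds above; the function name is illustrative:

```shell
# disk_severity maps a used-disk percentage to the flags defined above:
# > 95% = CRITICAL, > 80% = WARNING, otherwise OK.
disk_severity() {
  pct="$1"                      # used disk percentage, e.g. 83
  if [ "$pct" -gt 95 ]; then echo "CRITICAL: disk at ${pct}%"
  elif [ "$pct" -gt 80 ]; then echo "WARNING: disk at ${pct}%"
  else echo "OK: disk at ${pct}%"
  fi
}
```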

Pattern 6: Cross-Skill References

Skills can suggest other skills for deeper investigation:

| LAGGED | Agent lagged out | Network issue -- use /lagged skill |
If audio quality issues are found, suggest using /listen-recording
for detailed NISQA analysis.

Pattern 7: Include Both URL and SQL Approaches

For report-type skills, provide both the web URL (for sharing with non-technical users) and the direct SQL (for programmatic access):

### Agent Performance
URL: http://YOUR_SERVER_IP/vicidial/AST_agent_performance_detail.php?query_date=...

### Direct SQL equivalent:
```sql
SELECT user, COUNT(*) as calls, SUM(length_in_sec) as total_talk
FROM vicidial_closer_log WHERE call_date >= CURDATE()
GROUP BY user ORDER BY calls DESC;
```

Pattern 8: Cleanup After Investigation

Skills that download files should include cleanup instructions:

## Cleanup
```bash
rm -f /tmp/call*.mp3 /tmp/call*.wav /tmp/trimmed.wav /tmp/spectrogram.png
```

Tip: Test Skills Incrementally

When building a new skill:

  1. Start with the frontmatter and one step
  2. Test it with /your-skill
  3. Add more steps, testing after each
  4. Add interpretation tables last
  5. Add server-specific notes as you discover edge cases

Tip: Keep Skills Focused

Each skill should do one thing well. If a skill is growing beyond ~150 lines, consider splitting it. The /audio-quality skill is one of the largest at ~140 lines because audio investigation genuinely requires that many tools. But /health is only ~40 lines because it has a focused purpose.

Tip: Use Allowed-Tools as Documentation

The allowed-tools field is not just a security boundary -- it tells you (and future maintainers) what resources a skill needs:

# This skill only needs SSH -- it is a pure remote investigation
allowed-tools: Bash(ssh *)

# This skill needs SSH + Docker + HTTP -- it correlates multiple data sources
allowed-tools: Bash(ssh *), Bash(docker *), Bash(curl *), Bash(ping *)

# This skill needs SSH + local audio tools -- it processes files locally
allowed-tools: Bash(ssh *), Bash(curl *), Bash(sox *), Bash(soxi *), Bash(ffprobe *)

13. Investigation Workflow Patterns

The Funnel Pattern

Start broad, narrow down. Most investigation skills follow this pattern:

Step 1: Find the records (broad SQL query)
    |
Step 2: Get SIP/carrier-level detail (narrower)
    |
Step 3: Check infrastructure state (Asterisk logs, SIP peers)
    |
Step 4: Check external data (Homer, Smokeping, recordings)
    |
Step 5: Correlate and diagnose

The Correlation Pattern

The real power of AI-assisted investigation is correlation. A human checking Homer RTCP, then Asterisk logs, then ViciDial tables has to hold all that context in their head. Claude does this naturally:

Agent LAGGED at 14:32:15
  + Homer RTCP shows jitter spike at 14:32:10 from agent IP
  + 3 other agents on same IP also LAGGED within 30 seconds
  = Diagnosis: Office internet dropout

Design skills to gather correlated data points and let Claude connect the dots.

The Baseline Pattern

For drop/failure analysis, always compare against historical baselines:

-- Is today's drop rate abnormal?
SELECT DATE(call_date) as day,
       COUNT(*) as total,
       SUM(CASE WHEN status='DROP' THEN 1 ELSE 0 END) as drops,
       ROUND(SUM(CASE WHEN status='DROP' THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) as pct
FROM vicidial_closer_log
WHERE call_date >= DATE_SUB(CURDATE(), INTERVAL 7 DAY)
GROUP BY day ORDER BY day;

Without a baseline, "50 drops today" means nothing. With a baseline, "50 drops today vs. average 12" is a clear alarm.
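The comparison itself can be automated. A sketch, assuming the SQL output above is fed in as whitespace-separated `day total drops` rows with today last; the 2x-average alert threshold is an illustrative choice:

```shell
# drop_baseline reads "day total drops" rows on stdin (today last) and
# flags today's drop count against the average of the preceding days.
drop_baseline() {
  awk '{ drops[NR] = $3 }
       END {
         today = drops[NR]
         for (i = 1; i < NR; i++) sum += drops[i]
         avg = sum / (NR - 1)
         printf "today=%d avg=%.1f %s\n", today, avg,
                (today > 2 * avg) ? "ALERT" : "ok"
       }'
}
```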


14. Permission Management

Three Layers of Permission Control

  1. settings.json deny list -- Hard blocks. Cannot be overridden.
  2. Safety hook -- Context-aware blocks (only on production servers).
  3. Skill allowed-tools -- Per-skill tool whitelist.

Permission Flow

User types /health
  |
  v
Claude reads SKILL.md
  -> allowed-tools: Bash(ssh *)
  |
  v
Claude generates: ssh server-a "hostname; uptime; ..."
  |
  v
PreToolUse hook runs (protect-production.sh)
  -> Command targets production? Yes
  -> Is it dangerous? (rm -rf, DROP TABLE, etc.)
    -> No: exit 0 (allow)
  |
  v
settings.json permissions check
  -> Bash(ssh *) matches allow list? Yes
  |
  v
Command executes

When to Allow vs. Deny

Allow broadly in settings.json for your normal workflow tools:

"allow": ["Bash(*)", "Read(*)", "Write(*)", "Edit(*)"]

Deny specifically for catastrophic operations:

"deny": ["Bash(rm -rf /*)", "Bash(dd if=*)", "Bash(mkfs*)"]

Use hooks for context-dependent blocking (same command might be fine on a test server but dangerous on production).
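A minimal sketch of such a context-aware guard -- this is not the tutorial's full protect-production.sh, and the string matching is deliberately simplistic. Claude Code passes the tool call as JSON on stdin; a PreToolUse hook that exits 0 allows the call, while exit code 2 blocks it and feeds the stderr message back to Claude:

```shell
# guard reads the tool-call payload on stdin and mirrors the hook contract:
# return 0 = allow, return 2 = block (with a reason on stderr).
guard() {
  input=$(cat)
  case "$input" in
    *production*)                      # context: command touches production
      case "$input" in
        *'rm -rf'*|*'DROP TABLE'*|*mkfs*)
          echo "Blocked: destructive command aimed at production" >&2
          return 2 ;;
      esac ;;
  esac
  return 0                             # same command is fine elsewhere
}
```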

Use allowed-tools in skills to prevent scope creep (a health check skill should not need to write files).


15. Putting It All Together

Quick Start: Build Your First Three Skills

If you are starting from scratch, build these three skills first:

  1. /health -- Gives you immediate value. One command to check all servers.
  2. /calls -- Real-time situational awareness.
  3. /call-investigate -- The workhorse for any incident.

The Feedback Loop

The best skills come from real incidents. Every time you investigate a problem manually, ask yourself which commands you ran, which data sources you checked, and how you correlated the results into a diagnosis. Write that down as a SKILL.md. Next time, you type a slash command instead of spending 30 minutes.

Scaling to New Infrastructure

The skill pattern works for any infrastructure that Claude Code can reach via SSH, Docker, HTTP, or MCP.

The framework is the same: a SKILL.md file that teaches Claude your operational procedures, reference data, and interpretation rules. The tools change; the pattern does not.


Appendix A: All 15 Skills at a Glance

| # | Skill | Type | Tools | Key Data Sources |
|---|-------|------|-------|------------------|
| 1 | /health | Operations | SSH | Asterisk, MySQL, disk, fail2ban |
| 2 | /calls | Operations | SSH | Asterisk channels, live_agents, auto_calls |
| 3 | /agents | Operations | SSH | vicidial_live_agents, vicidial_users |
| 4 | /replication | Operations | SSH | SHOW ALL SLAVES STATUS |
| 5 | /audit-server | Operations | SSH | System, Asterisk, DB, security, logs |
| 6 | /trunk-status | Operations | SSH | SIP peers, carrier_log, iptables |
| 7 | /audio-quality | Investigation | SSH, Docker, curl, ping | Homer RTCP, NISQA, Asterisk logs, codecs |
| 8 | /call-investigate | Investigation | SSH, Docker, curl | closer_log, carrier_log, DIDs, Homer, recordings |
| 9 | /call-drops | Investigation | SSH, Docker | Problem statuses, carrier detail, baselines |
| 10 | /lagged | Investigation | SSH, Docker | agent_log, Homer RTCP, SIP peers |
| 11 | /network-check | Investigation | SSH, Docker, curl, ping | Homer RTCP, Smokeping, UDP buffers, MTR |
| 12 | /agent-ranks | Lookup | SSH | inbound_group_agents, routing config |
| 13 | /did-lookup | Lookup | SSH | inbound_dids, company mapping, dialplan |
| 14 | /reports | Lookup | SSH, curl | ViciDial PHP reports + direct SQL |
| 15 | /listen-recording | Lookup | SSH, curl, sox, ffprobe | recording_log, NISQA, Silero VAD, SoX |

Appendix B: Common ViciDial Status Codes

For reference in your skills:

Call Disposition Statuses

| Status | Meaning |
|--------|---------|
| SALE | Successful sale/conversion |
| NI | Not interested |
| A | Answering machine |
| CALLBK | Callback scheduled |
| DNC | Do not call |
| XFER | Transferred |
| DROP | Dropped from queue (no agent) |
| DISMX | Disconnect mid-call (inbound) |
| DCMX | Disconnect mid-call (outbound) |
| TIMEOT | Agent timeout |
| ADCT | Auto-disconnect (dead channel) |
| AFTHRS | After hours |
| NANQUE | No agent, no queue |
| HXFER | Hangup during transfer |
| XDROP | External drop |
| LAGGED | Agent heartbeat failure |

SIP Hangup Causes

| Code | Meaning |
|------|---------|
| 16 | Normal clearing |
| 17 | User busy |
| 18 | No user responding |
| 20 | Subscriber absent |
| 21 | Call rejected |
| 31 | Normal, unspecified |
| 34 | No circuit available |
| 38 | Network out of order |
| 127 | Internal error |

RTCP Quality Thresholds

| Metric | Good | Warning | Critical |
|--------|------|---------|----------|
| Packet loss (fraction_lost) | 0 | >5 (of 255) | >25 |
| Jitter | <20ms | >50ms | >100ms |
| Latency (RTT) | <100ms | >200ms | >300ms |
| UDP RcvbufErrors | 0 | >0 | Increasing |

NISQA Audio Quality Scores

| Score | Quality |
|-------|---------|
| 4.0+ | Excellent |
| 3.5-4.0 | Good |
| 3.0-3.5 | Fair |
| 2.5-3.0 | Poor |
| < 2.5 | Bad |

Appendix C: SSH Configuration for Multi-Server Access

For skills to work efficiently, configure SSH with ControlMaster for persistent connections:

# ~/.ssh/config

Host server-a
    HostName YOUR_SERVER_IP
    Port 9322
    User root
    IdentityFile ~/.ssh/id_ed25519
    ControlMaster auto
    ControlPath ~/.ssh/sockets/%r@%h-%p
    ControlPersist 600

Host server-b
    HostName YOUR_SERVER_IP
    Port 9322
    User root
    IdentityFile ~/.ssh/id_ed25519
    ControlMaster auto
    ControlPath ~/.ssh/sockets/%r@%h-%p
    ControlPersist 600

# Repeat for each server...

Host replica
    HostName YOUR_REPLICA_IP
    Port 9322
    User root
    IdentityFile ~/.ssh/id_ed25519
    ControlMaster auto
    ControlPath ~/.ssh/sockets/%r@%h-%p
    ControlPersist 600

Create the sockets directory:

mkdir -p ~/.ssh/sockets

ControlPersist 600 keeps SSH connections open for 10 minutes after the last use. This means the first /health check opens connections to all servers, and subsequent skill invocations reuse them -- making everything feel instant.
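You can inspect the shared connection with OpenSSH's standard control commands. A sketch -- the helper name is illustrative:

```shell
# master_status asks the ControlMaster socket for status without opening a
# new session; it prints "Master running (pid=...)" when the socket is live.
master_status() {
  ssh -O check "$1" 2>&1
}
# To force a fresh connection before ControlPersist expires:
#   ssh -O exit server-a
```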


This tutorial documents a production system managing 5 VoIP servers, 1,500+ DIDs, 50+ agents, and thousands of daily calls. The skills were developed iteratively over weeks of real operations, each one born from an actual incident or repeated manual investigation. The framework scales to any infrastructure that Claude Code can reach.
