# Building Custom Claude Code Skills for VoIP Infrastructure Operations

## 15 Slash Commands for Monitoring, Investigation & Lookup
Audience: DevOps engineers, sysadmins, VoIP/telecom operators, and Claude Code power users.
What you will build: A complete AI-powered operations toolkit -- 15 custom slash commands that turn Claude Code into a senior infrastructure engineer who knows your servers, your databases, your SIP trunks, and your investigation playbooks by heart.
Prerequisites: Claude Code CLI installed, SSH access to your servers, basic familiarity with Asterisk/VoIP concepts.
## Table of Contents
- Why AI-Assisted Operations Skills
- Architecture Overview
- The SKILL.md File Format
- Directory Structure
- Operations Skills (6)
- Investigation Skills (5)
- Lookup Skills (4)
- Complete Example Skills
- Production Safety Hook
- MCP Grafana Integration
- Settings Configuration
- Skill Design Patterns & Tips
- Investigation Workflow Patterns
- Permission Management
- Putting It All Together
## 1. Why AI-Assisted Operations Skills
Traditional infrastructure monitoring gives you dashboards. Runbooks give you procedures. But neither thinks. Neither correlates. Neither adapts.
When you build custom Claude Code skills for your infrastructure, you get something qualitatively different:
Context-aware investigation. Instead of checking five different tools manually, you type /call-investigate +44XXXXXXXXXX and Claude traces the call through DID routing, carrier logs, Asterisk dialplans, SIP traces, agent state, and audio recordings -- correlating everything into a single diagnosis.
Institutional knowledge embedded in code. Every skill file encodes your team's hard-won knowledge: which hangup cause means what, which server uses which MySQL credentials, where the recordings live, what "normal" looks like for your trunks. New team members get the senior engineer's playbook on day one.
The 10x multiplier is real. Here is what changes:
| Task | Without Skills | With Skills |
|---|---|---|
| Health check across 5 servers | 5-10 min (SSH each, run commands, compare) | 15 sec (/health) |
| Investigate a dropped call | 30-60 min (find logs, trace routing, check carrier) | 2 min (/call-investigate) |
| Check why agent has no calls | 15-20 min (check ranks, ingroups, login state) | 30 sec (/agent-ranks agent123) |
| Diagnose audio quality complaint | 1-2 hours (Homer, recordings, codecs, network) | 5 min (/audio-quality) |
| Full server audit | 45-60 min | 3 min (/audit-server) |
Each skill is a Markdown file. No plugins to install, no APIs to build, no code to compile. You write the investigation procedure in natural language, and Claude executes it using the tools you allow.
## 2. Architecture Overview
```
+------------------+ SSH (key-based)          +-------------------+
|                  |------------------------->| VoIP Server 1     |
| VPS / Jump Box   |------------------------->| VoIP Server 2     |
| (Claude Code)    |------------------------->| VoIP Server 3     |
|                  |------------------------->| Replica DB        |
| ~/.claude/       |                          +-------------------+
|   skills/        |
|     health/      | Docker (local)           +-------------------+
|     calls/       |------------------------->| Grafana           |
|     agents/      |------------------------->| Prometheus        |
|     ...          |------------------------->| Loki              |
|   hooks/         |------------------------->| Homer (SIP/RTCP)  |
|   settings.json  |------------------------->| Smokeping         |
+------------------+                          +-------------------+
        |
        | MCP (Model Context Protocol)
        v
+------------------+
| Grafana MCP      |
| (mcp-grafana)    |
| - Dashboards     |
| - PromQL queries |
| - Loki log search|
+------------------+
```
Key principle: Claude Code runs on a central VPS/jump box that has SSH access to all production servers and Docker access to monitoring containers. Skills teach Claude how to use these access paths to answer operational questions.
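Since every skill leans on those SSH config names, it helps to define them once on the jump box. A minimal `~/.ssh/config` sketch -- the aliases match the `server-a`-style names used throughout this article; IPs and key path are placeholders:

```
Host server-a
    HostName YOUR_SERVER_IP
    User root
    IdentityFile ~/.ssh/id_ed25519

Host server-b server-c server-d your-replica
    User root
    IdentityFile ~/.ssh/id_ed25519
```

With key-based auth configured this way, `ssh server-a "uptime"` works non-interactively, which is what the skills assume.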
## 3. The SKILL.md File Format
Every skill is a single Markdown file named SKILL.md inside its own directory under ~/.claude/skills/. The file has two parts: a YAML frontmatter header and a Markdown body.
### Frontmatter (Required)
```yaml
---
name: skill-name
description: One-line description shown in skill listings and used for matching.
user-invocable: true
allowed-tools: Bash(ssh *), Bash(docker *), Bash(curl *)
---
```
| Field | Purpose |
|---|---|
| `name` | The slash command name. Users type `/name` to invoke. |
| `description` | Shown in help listings. Also used by Claude to decide when to suggest the skill. Be specific -- mention the problem types this skill addresses. |
| `user-invocable` | Set to `true` so users can trigger it directly with `/name`. |
| `allowed-tools` | Whitelist of tools the skill can use. Uses glob patterns. `Bash(ssh *)` means "allow any Bash command starting with `ssh`". |
### Allowed-Tools Patterns
```yaml
# SSH to any server
allowed-tools: Bash(ssh *)

# SSH + Docker + curl + ping
allowed-tools: Bash(ssh *), Bash(docker *), Bash(curl *), Bash(ping *)

# SSH + local audio tools
allowed-tools: Bash(ssh *), Bash(curl *), Bash(sox *), Bash(soxi *), Bash(ffprobe *)
```
The tool patterns act as a security boundary. A skill that only needs SSH cannot accidentally execute Docker commands or write files. Design skills with the minimum tools they need.
### Body (The Investigation Procedure)
The Markdown body is the actual instruction set. Claude reads this as its playbook when the skill is invoked. It should contain:
- What to do -- step-by-step procedures
- How to access resources -- SSH commands, SQL queries, API calls
- How to interpret results -- reference tables, thresholds, known patterns
- Server-specific variations -- different credentials, paths, or versions per server
- Output formatting -- how to present results to the user
The body supports a special variable: $ARGUMENTS -- whatever the user typed after the slash command. For example, if the user types /health server-a, then $ARGUMENTS is server-a.
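Putting the format together, here is a minimal, purely illustrative skill that uses `$ARGUMENTS` -- the name, body, and table are hypothetical, just to show the shape of a complete file:

```markdown
---
name: whois-agent
description: Illustrative example -- look up one agent by ID.
user-invocable: true
allowed-tools: Bash(ssh *)
---
# Agent Lookup

The user invoked /whois-agent $ARGUMENTS.

SSH to the primary server and query vicidial_users for the agent ID
given in $ARGUMENTS, then summarize the result in one line.
```

Invoking `/whois-agent agent123` would substitute `agent123` for `$ARGUMENTS` before Claude reads the playbook.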
## 4. Directory Structure
```
~/.claude/
  settings.json            # Global settings (permissions, hooks, env)
  settings.local.json      # Per-machine permission overrides
  hooks/
    protect-production.sh  # Safety hook: blocks dangerous commands
  skills/
    health/
      SKILL.md             # /health skill
    calls/
      SKILL.md             # /calls skill
    agents/
      SKILL.md             # /agents skill
    replication/
      SKILL.md             # /replication skill
    audit-server/
      SKILL.md             # /audit-server skill
    trunk-status/
      SKILL.md             # /trunk-status skill
    audio-quality/
      SKILL.md             # /audio-quality skill
    call-investigate/
      SKILL.md             # /call-investigate skill
    call-drops/
      SKILL.md             # /call-drops skill
    lagged/
      SKILL.md             # /lagged skill
    network-check/
      SKILL.md             # /network-check skill
    agent-ranks/
      SKILL.md             # /agent-ranks skill
    did-lookup/
      SKILL.md             # /did-lookup skill
    reports/
      SKILL.md             # /reports skill
    listen-recording/
      SKILL.md             # /listen-recording skill
```
Each skill gets its own directory. This is a Claude Code convention -- the directory name matches the skill name.
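To bootstrap that layout, a short sketch -- directory names exactly as in the tree above; adjust the list to the skills you actually build:

```shell
# Create the skill skeleton from the directory tree above.
base="$HOME/.claude/skills"
for s in health calls agents replication audit-server trunk-status \
         audio-quality call-investigate call-drops lagged network-check \
         agent-ranks did-lookup reports listen-recording; do
  mkdir -p "$base/$s"
  touch "$base/$s/SKILL.md"
done
ls "$base"
```

Each empty `SKILL.md` is then filled in with the frontmatter and playbook body from section 3.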
## 5. Operations Skills
These six skills answer the question: "What is happening right now?"
### 5.1 `/health` -- Quick Health Check
Purpose: Single-command health sweep across all production servers.
What it checks per server:
- Hostname and uptime
- Asterisk active channels and SIP peer count
- MySQL status (uptime, threads, queries)
- Disk usage (flags >80%)
- fail2ban status (flags if not running)
- Replication status on the replica server
Design pattern: One SSH command per server that gathers all metrics, minimizing round-trips. Results presented as a table with WARNING/CRITICAL flags.
Usage:
```
/health            # Check all servers
/health server-a   # Check specific server
```
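The "one SSH command per server" design pattern can be sketched as a helper that builds the combined remote command once -- the commands come from the checklist above, and the helper name is illustrative:

```shell
# Hypothetical helper: assemble the single round-trip command the
# /health skill runs on each server (commands from the checklist above).
health_cmd() {
  printf '%s; ' \
    'hostname' \
    'uptime' \
    'asterisk -rx "core show channels" | tail -1' \
    'mysqladmin status 2>/dev/null | head -1' \
    'df -h / | tail -1' \
    'fail2ban-client status 2>/dev/null | head -2'
  echo
}

# The skill would then run something like:
#   ssh server-a "$(health_cmd)"
health_cmd
```

One round-trip per server keeps a five-server sweep in the 15-second range instead of dozens of sequential SSH sessions.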
### 5.2 `/calls` -- Live Calls
Purpose: Real-time view of active calls across the infrastructure.
What it shows:
- Active Asterisk channels per server
- Agents currently on calls (from `vicidial_live_agents` where status IN INCALL, QUEUE, CLOSER)
- Calls waiting in queue (from `vicidial_auto_calls`)
- Problem statuses (DROP, etc.)
### 5.3 `/agents` -- Agent Status
Purpose: All logged-in agents with detailed status.
What it shows per agent:
- Agent ID, name, status, campaign, pause code, time in current state, calls today
Flags:
- Paused >30 minutes = highlighted
- LAGGED status = CRITICAL
- 0 calls after 2+ hours logged in = noted
### 5.4 `/replication` -- Database Replication
Purpose: Check MariaDB multi-source replication health.
What it checks:
- IO_Running and SQL_Running per connection
- Seconds_Behind_Master (>60s WARNING, >300s CRITICAL)
- Last errors
- Disk space on replica
Special feature: Pass `fix` as an argument to get suggested repair commands.
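The lag thresholds above map naturally to a tiny classifier. A sketch -- the helper name is illustrative, and the input is whatever `Seconds_Behind_Master` the skill just read:

```shell
# Classify Seconds_Behind_Master per the thresholds above
# (>300s CRITICAL, >60s WARNING).
lag_level() {
  local s=$1
  if [ "$s" -gt 300 ]; then
    echo CRITICAL
  elif [ "$s" -gt 60 ]; then
    echo WARNING
  else
    echo OK
  fi
}

lag_level 45    # OK
lag_level 120   # WARNING
lag_level 600   # CRITICAL
```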
### 5.5 `/audit-server` -- Deep Server Audit
Purpose: Comprehensive server audit covering system, Asterisk, database, security, ViciDial, and logs.
Sections: System resources, Asterisk health, database status, security posture, ViciDial process status, recent errors.
Output: Organized by severity -- CRITICAL, WARNING, INFO.
### 5.6 `/trunk-status` -- SIP Trunk Status
Purpose: Check SIP trunk registration and connectivity.
Includes: Trunk inventory per server, quick all-server check loop, and a troubleshooting workflow (ping, firewall, registration, DNS, qualify, carrier logs).
## 6. Investigation Skills
These five skills answer the question: "Why did this happen?"
### 6.1 `/audio-quality` -- Voice Quality Investigation
Tools used: Homer RTCP (PostgreSQL), audio analysis service (NISQA neural scoring + Silero VAD), Asterisk logs, SIP peer stats, Smokeping, codec verification.
Investigation flow:
- Find the calls
- Identify endpoints (agent IP, trunk IP)
- Query Homer RTCP for packet loss and jitter
- Check Asterisk logs for codec errors, RTP switching
- Check live SIP quality
- Download and analyze recording
- Check network (Smokeping, ping, UDP buffers)
### 6.2 `/call-investigate` -- Deep Call Tracing
The most detailed skill. Traces a call through its entire lifecycle:
- Find call records (inbound/outbound/archived)
- Check carrier log (SIP-level hangup causes)
- Check DID routing
- Trace in Asterisk logs
- Search Homer SIP traces
- Find and analyze recording
- Check agent state at time of call
Includes reference tables for hangup causes (16=Normal, 17=Busy, 18=No response, etc.) and problem statuses (DISMX, DCMX, DROP, TIMEOT, etc.).
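As a quick mental model of how those two reference tables combine during triage, here is a hedged sketch -- the mappings mirror the tables in the full skill below, but the helper itself is hypothetical:

```shell
# Hypothetical triage helper combining hangup cause (Q.850) and
# Asterisk dialstatus, per the skill's reference tables.
triage() {
  local cause=$1 dialstatus=$2
  case "$cause" in
    16) echo "cause 16: normal clearing" ;;
    17) echo "cause 17: far end busy" ;;
    34) echo "cause 34: no circuit -- trunk congestion" ;;
    *)  echo "cause $cause: see the hangup cause table" ;;
  esac
  case "$dialstatus" in
    CHANUNAVAIL|CONGESTION)
      echo "dialstatus $dialstatus: trunk/network problem" ;;
    CANCEL)
      echo "dialstatus $dialstatus: caller hung up during ring" ;;
    *)
      echo "dialstatus $dialstatus" ;;
  esac
}

triage 34 CHANUNAVAIL
```

Encoding the table this way is exactly what the skill body does for Claude -- the interpretation lives next to the lookup, so a raw number becomes a diagnosis in one step.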
### 6.3 `/call-drops` -- Drop & Failure Analysis
Purpose: Systematic analysis of problem dispositions.
Covers: DROP (queue timeout), DISMX/DCMX (mid-call disconnect), TIMEOT (agent timeout), AFTHRS (after hours), with carrier-level detail and historical baseline comparison.
### 6.4 `/lagged` -- Agent LAGGED Events
Purpose: Investigate ViciDial heartbeat failures that kick agents offline.
Correlation: Matches LAGGED timestamps against Homer RTCP data to determine if the cause was network (jitter spike, packet loss) or client-side (browser crash, PC freeze).
### 6.5 `/network-check` -- Network Quality
Tools: Homer RTCP analysis, Smokeping, direct ping, UDP buffer stats, SIP peer latency, live RTP channel stats, MTR traceroute.
Thresholds documented inline:
- Packet loss >1% = quality degraded
- Jitter >50ms = choppy audio
- Latency >200ms = noticeable delay
- UDP RcvbufErrors increasing = server dropping packets
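These thresholds can be applied mechanically to raw RTCP numbers. A sketch, assuming RTCP's 0-255 `fraction_lost` scale and Asterisk's timestamp-unit jitter (divide by 8 for milliseconds on 8 kHz codecs):

```shell
# Hypothetical helper: convert raw RTCP fields and apply the
# thresholds above (loss >1%, jitter >50ms).
rtcp_flags() {
  local frac_lost=$1 ia_jitter=$2
  local loss_pct=$(( frac_lost * 100 / 255 ))
  local jitter_ms=$(( ia_jitter / 8 ))
  [ "$loss_pct" -gt 1 ] && echo "WARN: packet loss ${loss_pct}% > 1%"
  [ "$jitter_ms" -gt 50 ] && echo "WARN: jitter ${jitter_ms}ms > 50ms"
  true
}

rtcp_flags 13 480   # ~5% loss and 60 ms jitter -> both warnings fire
```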
## 7. Lookup Skills
These four skills answer the question: "What is this configured to do?"
### 7.1 `/agent-ranks` -- Rank & Routing Diagnostics
Purpose: Understand why calls go to specific agents.
Checks: Ingroup assignments, routing method, rank/weight configuration, active closer campaigns, call distribution fairness, ranking inconsistencies, and can simulate "who would get the next call right now?"
### 7.2 `/did-lookup` -- DID Routing
Purpose: Trace how a phone number is routed through the system.
Covers: DID configuration, company name mapping, call history, dialplan routing path, and can manage company-to-DID mappings.
### 7.3 `/reports` -- ViciDial Report Generation
Purpose: Quick access to 15+ built-in ViciDial reports plus direct SQL.
Provides: URL templates with proper parameters for agent performance, inbound stats, carrier logs, LAGGED reports, call exports, DID stats, and more. Also includes custom SQL queries for when built-in reports are not enough.
### 7.4 `/listen-recording` -- Recording Analysis
Purpose: Download and analyze call recordings with neural quality scoring.
Tools: NISQA (neural audio quality model), Silero VAD (voice activity detection for silence analysis), SoX (waveform analysis), ffprobe (format inspection).
Supports: Both MIX (combined stereo) and ORIG (separate caller/agent legs) recording formats.
## 8. Complete Example Skills
Here are four complete skill files you can adapt for your infrastructure.
### Example 1: `/health` -- Server Health Check
```yaml
---
name: health
description: Quick health check across all VoIP production servers. Shows Asterisk, MySQL, disk, uptime, fail2ban, replication.
user-invocable: true
allowed-tools: Bash(ssh *)
---
```
# Server Health Check
Run a quick health check across all production VoIP servers.
Use SSH config names (server-a, server-b, server-c, etc.).
If $ARGUMENTS is provided, check only those servers.
Otherwise check all production servers.
For each server, run ONE ssh command that gathers:
1. `hostname` and `uptime`
2. `asterisk -rx "core show channels" | tail -1` (active calls)
3. `asterisk -rx "sip show peers" | tail -1` (SIP peers)
4. `mysqladmin status 2>/dev/null | head -1` (MySQL uptime/threads/queries)
5. `df -h / | tail -1` (disk usage)
6. `fail2ban-client status 2>/dev/null | head -2` (fail2ban)
Combine all into a single SSH command per server to minimize round-trips.
Present results in a clean table format. Flag any issues:
- Disk > 80% = WARNING
- No active Asterisk channels when agents should be online = WARNING
- fail2ban not running = CRITICAL
- MySQL not responding = CRITICAL
Also check replication on the replica server (ssh your-replica):
- `mysql -u YOUR_REPL_USER -pYOUR_REPL_PASS -e "SHOW ALL SLAVES STATUS\G" | grep -E "Connection_name|Slave_IO|Slave_SQL|Seconds_Behind"`
Server reference:
- server-a (YOUR_SERVER_IP) -- Primary, Asterisk 18
- server-b (YOUR_SERVER_IP) -- Secondary, Asterisk 16
- server-c (YOUR_SERVER_IP) -- Tertiary, Asterisk 13
- server-d (YOUR_SERVER_IP) -- Standalone
### Example 2: `/call-investigate` -- Deep Call Tracing
```yaml
---
name: call-investigate
description: Deep investigation of specific calls by phone number, uniqueid, or agent ID. Traces full call path from DID through routing to agent, checks carrier logs, SIP traces, recordings, and dispositions. Use for any call complaint or incident.
user-invocable: true
allowed-tools: Bash(ssh *), Bash(docker *), Bash(curl *)
---
```
# Call Investigation
Deep-dive into specific calls. $ARGUMENTS: phone number(s), uniqueid(s),
agent ID(s), or date range.
## Step 1: Find the Call Records
### Inbound calls (vicidial_closer_log)
```sql
SELECT call_date, phone_number, length_in_sec, status, term_reason,
uniqueid, closecallid, user, campaign_id, queue_seconds,
comments
FROM vicidial_closer_log
WHERE phone_number LIKE '%NUMBER%'
AND call_date >= 'YYYY-MM-DD'
ORDER BY call_date DESC LIMIT 20;
```

### Outbound calls (vicidial_log)

```sql
SELECT call_date, phone_number, length_in_sec, status, term_reason,
       uniqueid, user, campaign_id
FROM vicidial_log
WHERE phone_number LIKE '%NUMBER%'
  AND call_date >= 'YYYY-MM-DD'
ORDER BY call_date DESC LIMIT 20;
```

### By agent

```sql
SELECT call_date, phone_number, length_in_sec, status, term_reason,
       uniqueid, campaign_id
FROM vicidial_closer_log WHERE user='AGENT_ID' AND call_date >= CURDATE()
UNION ALL
SELECT call_date, phone_number, length_in_sec, status, term_reason,
       uniqueid, campaign_id
FROM vicidial_log WHERE user='AGENT_ID' AND call_date >= CURDATE()
ORDER BY call_date DESC LIMIT 30;
```

## Step 2: Check Carrier Log (SIP-level detail)

```sql
SELECT call_date, channel, server_ip, dialstatus,
       hangup_cause, sip_hangup_cause, sip_hangup_reason,
       dial_time, answered_time, dead_sec
FROM vicidial_carrier_log
WHERE uniqueid='UNIQUEID'
ORDER BY call_date;
```
Key hangup causes:
- 16 = Normal clearing (good)
- 17 = User busy
- 18 = No user responding
- 20 = Subscriber absent
- 21 = Call rejected
- 31 = Normal, unspecified
- 34 = No circuit available (trunk congestion)
- 38 = Network out of order
- 127 = Internal error
Key dialstatuses:
- ANSWER = call connected
- BUSY = far end busy
- NOANSWER = ring timeout
- CANCEL = caller hung up during ring
- CHANUNAVAIL = trunk/channel problem
- CONGESTION = network congestion
## Step 3: Check DID Routing

```sql
SELECT did_id, did_pattern, did_description, did_route,
       did_agent_a, extension, exten_context, group_id
FROM vicidial_inbound_dids
WHERE did_pattern LIKE '%DID_NUMBER%';
```

## Step 4: Trace in Asterisk Logs

```bash
# Find the call in Asterisk logs by uniqueid or phone number
ssh your-server "grep -E 'UNIQUEID|PHONE_NUMBER' /var/log/asterisk/messages | tail -30"

# Trace full SIP dialog by Call-ID
ssh your-server "grep 'CALL_ID' /var/log/asterisk/messages | tail -50"
```
What to look for:
- `Ringing() -> Wait() -> AGI routing` = normal flow
- `DISMX`/`DCMX` = disconnect mid-call (abnormal)
- `func_hangupcause.c: Unable to find` = abnormal hangup
- `chan_sip.c: Failed to authenticate` = SIP auth issue
- `Strict RTP switching` = NAT/media IP mismatch
- `bridge_channel.c: Channel left` = check timing
## Step 5: Check Homer SIP Traces (if available)

```bash
docker exec -i postgres psql -U homer -d homer_data -c "
SELECT create_date, protocol_header->>'method' as method,
       protocol_header->>'srcIp' as src,
       protocol_header->>'dstIp' as dst
FROM hep_proto_1_default_YYYYMMDD_HHMM
WHERE raw::text LIKE '%PHONE_NUMBER%'
ORDER BY create_date DESC LIMIT 20;
"
```

Note: SIP tables are `hep_proto_1_*`, RTCP tables are `hep_proto_5_*`.
Partitions are by UTC time (6-hour windows).
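To avoid doing the window math by hand, a hypothetical helper that prints the partition suffix for the current moment (6-hour UTC windows, per the note above):

```shell
# Hypothetical helper: print the Homer RTCP partition name for "now"
# (windows start at 0000, 0600, 1200, 1800 UTC).
homer_partition() {
  local day hour window
  day=$(date -u +%Y%m%d)
  hour=$(date -u +%H)
  window=$(( (10#$hour / 6) * 6 ))
  printf 'hep_proto_5_default_%s_%02d00\n' "$day" "$window"
}

homer_partition
```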
## Step 6: Check Recording

```bash
ssh your-server "mysql asterisk -e \"SELECT recording_id, filename,
  location, length_in_sec FROM recording_log WHERE lead_id IN (
  SELECT lead_id FROM vicidial_closer_log WHERE uniqueid='UNIQUEID'
) ORDER BY start_time DESC LIMIT 5;\""

# Audio analysis (if analysis service is running)
curl -s "http://localhost:8084/analyze?uniqueid=UNIQUEID&server=SERVER_KEY" | jq .
```
## Step 7: Check Agent State at Time of Call

```sql
SELECT event_time, user, pause_epoch, wait_epoch, talk_epoch,
       dispo_epoch, status, sub_status, pause_type, dead_sec
FROM vicidial_agent_log
WHERE user='AGENT_ID'
  AND event_time >= 'YYYY-MM-DD HH:MM:00'
  AND event_time <= 'YYYY-MM-DD HH:MM:59'
ORDER BY event_time;
```
## Problem Status Reference
| Status | Meaning | Investigation |
|---|---|---|
| DISMX | Disconnect mid-call (inbound) | Check carrier_log, network, agent connection |
| DCMX | Disconnect mid-call (outbound) | Same as above |
| DROP | Call dropped from queue (timeout) | Check queue timeout, agent availability |
| TIMEOT | Agent didn't answer in time | Check alert settings, softphone |
| ADCT | Auto-disconnect | Check dead_max campaign setting |
| AFTHRS | After hours routing | Check ingroup after_hours settings |
| NANQUE | No agent, no queue | Check no_agent_no_queue setting |
| HXFER | Hangup during transfer | Check transfer target availability |
| XDROP | External drop | Carrier/trunk issue |
| LAGGED | Agent lagged out | Network -- use /lagged skill |
## MySQL Access Per Server

- server-a/server-b: `ssh your-server "mysql asterisk -e '...'"`
- server-c (older): `ssh your-server "mysql -u YOUR_USER -pYOUR_PASS asterisk -e '...'"`
- replica (read-only): `ssh your-replica "mysql -u YOUR_USER -pYOUR_PASS dbname -e '...'"`
### Example 3: `/audio-quality` -- Voice Quality Investigation
```yaml
---
name: audio-quality
description: Investigate audio quality issues for specific calls or agents. Uses Homer RTCP, audio analysis service, Asterisk logs, recording playback, codec checks. Use when agents or clients complain about voice quality, one-way audio, choppy audio, echo, or silence.
user-invocable: true
allowed-tools: Bash(ssh *), Bash(docker *), Bash(curl *), Bash(ping *)
---
```
# Audio Quality Investigation
Investigate voice quality issues using ALL available tools.
$ARGUMENTS can be: phone number(s), agent ID(s), or "all" for a general sweep.
## Available Tools
### 1. Homer RTCP Analysis (PostgreSQL via Docker)
Query RTCP data from Homer to check packet loss and jitter.
```bash
# Connect to Homer DB
docker exec -i postgres psql -U homer -d homer_data
# Find RTCP table names (6-hour partitions, UTC time)
docker exec -i postgres psql -U homer -d homer_data -c "\dt hep_proto_5_default_*" | tail -20
# Query RTCP from a specific source IP
docker exec -i postgres psql -U homer -d homer_data -c "
SELECT
create_date,
protocol_header->>'srcIp' as src,
protocol_header->>'dstIp' as dst,
(raw::jsonb->'sender_information'->>'packets')::bigint as pkts,
(raw::jsonb->'report_blocks'->0->>'fraction_lost')::bigint as frac_lost,
(raw::jsonb->'report_blocks'->0->>'ia_jitter')::bigint as jitter,
(raw::jsonb->'report_blocks'->0->>'packets_lost')::bigint as lost
FROM hep_proto_5_default_YYYYMMDD_HHMM
WHERE protocol_header->>'srcIp' LIKE 'IP_PATTERN%'
AND create_date > NOW() - INTERVAL '2 hours'
ORDER BY create_date DESC LIMIT 50;
"
CRITICAL: Table partitions are by UTC time. If your VPS is in
CET (UTC+1), and it is 14:00 CET, that is 13:00 UTC, so use
the table *_1200 (covers 12:00-18:00 UTC).
Interpreting RTCP values:
fraction_lost: 0-255 scale (0=perfect, 255=100% loss). >5 is bad.ia_jitter: In timestamp units. Divide by 8 for milliseconds.50ms is bad.
packets_lostvalues of 16777215 (2^24 - 1) are overflow -- treat as 0.
### 2. Audio Analysis Service (FastAPI)

If you have a neural audio quality service running:

```bash
# Analyze a specific recording (by uniqueid)
curl -s "http://localhost:8084/analyze?uniqueid=UNIQUEID&server=SERVERNAME" | jq .

# AI-powered analysis (uses an LLM to interpret scores)
curl -s "http://localhost:8084/ai-analyze?uniqueid=UNIQUEID&server=SERVERNAME" | jq .
```
### 3. Asterisk Logs (on production servers via SSH)

```bash
# Check for codec issues
ssh your-server "grep 'Unknown RTP codec' /var/log/asterisk/messages | tail -20"

# Check for RTP source switching (NAT issues)
ssh your-server "grep 'Strict RTP' /var/log/asterisk/messages | tail -20"

# Check for jitter buffer resyncs
ssh your-server "grep 'Resyncing the jb' /var/log/asterisk/messages | tail -20"
```
### 4. SIP Peer Quality (live agent quality)

```bash
# Check agent SIP registration quality
ssh your-server "asterisk -rx 'sip show peer AGENT_EXT'"
# Look for: Status (latency), Useragent (softphone version), codecs

# Live RTP stats for all active channels
ssh your-server "asterisk -rx 'sip show channelstats'"
# Shows: Recv/Sent packets, Lost packets, Jitter, RTT per channel
```
### 5. Codec Verification

```bash
# Check what codecs an agent negotiated
ssh your-server "asterisk -rx 'core show channel SIP/AGENT-CHANNELID'"
# Look for: NativeFormats, ReadFormat, WriteFormat
# If Read != Write, there is transcoding (quality loss)

# Check trunk codec config
ssh your-server "grep -A5 'TRUNK_NAME' /etc/asterisk/sip-vicidial.conf"
```
### 6. Network Quality (Smokeping + Ping)

```bash
# Direct ping test
ping -c 10 TARGET_IP

# Check UDP buffer overflows (on production server)
ssh your-server "cat /proc/net/snmp | grep Udp"
# RcvbufErrors > 0 = packets dropped due to small UDP buffers
```
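A hypothetical helper along the same lines, run locally on the server under test, that flags a default receive buffer below the 2 MB floor this skill recommends elsewhere:

```shell
# Hypothetical check: warn if the kernel's default UDP receive
# buffer is under 2 MB. Read-only; it only suggests the sysctl fix.
check_udp_buf() {
  local want=2097152 cur
  cur=$(cat /proc/sys/net/core/rmem_default 2>/dev/null || echo 0)
  if [ "$cur" -lt "$want" ]; then
    echo "WARNING: rmem_default=$cur < $want (consider sysctl -w net.core.rmem_default=$want)"
  else
    echo "OK: rmem_default=$cur"
  fi
}

check_udp_buf
```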
## Investigation Workflow

- Find the calls: Query closer_log or call_log by phone number
- Identify endpoints: Agent ID -> SIP peer -> agent IP. Trunk -> trunk IP
- Check Homer RTCP: Query for both directions (trunk->server, server->agent)
- Check Asterisk logs: Codec errors, RTP switching, jitter resyncs
- Check live SIP quality: `sip show peer`, `sip show channelstats`
- Listen to recording: Download and analyze via audio analysis service
- Check network: Smokeping, ping, UDP buffers
## Common Root Causes (from real investigations)
- Codec mismatch: Agent softphone doesn't offer alaw, causing ulaw-to-alaw transcoding through conference bridge = quality loss
- Trunk provider packet loss: Transient. Check Homer RTCP from provider IP ranges.
- Old softphone versions: May send unknown codec IDs, have high latency, or lack proper codec support
- RTP Keepalive disabled: NAT binding timeout causes intermittent one-way audio
- UDP buffer overflow: Default rmem_default too small for busy servers. Should be at least 2MB.
- Conference bridge overhead: MeetMe always transcodes to slin internally, adding overhead vs. ConfBridge
### Example 4: `/trunk-status` -- SIP Trunk Status
```yaml
---
name: trunk-status
description: Check SIP trunk status across all VoIP servers. Shows registration state, latency, active calls per trunk. Use when calls fail to connect, trunks go UNREACHABLE, or provider issues suspected.
user-invocable: true
allowed-tools: Bash(ssh *)
---
```
# SIP Trunk Status Check
Check all SIP trunks across production servers.
$ARGUMENTS: server name or "all".
## Per Server Check
```bash
# Show all SIP peers with status
ssh your-server "asterisk -rx 'sip show peers'"
# Show only trunks (filter out agent extensions)
ssh your-server "asterisk -rx 'sip show peers' | grep -E 'trunk_name|UNREACHABLE'"
# Detailed info for a specific trunk
ssh your-server "asterisk -rx 'sip show peer TRUNKNAME'"
```

## Trunk Inventory by Server
Maintain a table mapping trunks to providers and purposes:
| Server | Trunk | Provider IP | Purpose |
|---|---|---|---|
| server-a | provider1_de | YOUR_PROVIDER_IP | Primary inbound |
| server-a | provider1_uk | YOUR_PROVIDER_IP | UK outbound |
| server-a | provider2 | YOUR_PROVIDER_IP | Inbound |
| server-b | provider3 | YOUR_PROVIDER_IP | Regional inbound |
| server-c | provider1 | YOUR_PROVIDER_IP | General |
## Quick All-Server Trunk Check

```bash
for srv in server-a server-b server-c server-d; do
  echo "=== $srv ==="
  ssh $srv "asterisk -rx 'sip show peers' | grep -cE 'OK|UNREACHABLE|UNKNOWN'" 2>/dev/null
  ssh $srv "asterisk -rx 'sip show peers' | grep -E 'UNREACHABLE|UNKNOWN'" 2>/dev/null
  echo ""
done
```
## Troubleshooting UNREACHABLE Trunks

1. Ping the provider IP: `ssh your-server "ping -c 3 PROVIDER_IP"`
2. Check firewall (must be whitelisted if final rule is DROP): `ssh your-server "iptables -S INPUT | grep PROVIDER_IP"`
3. Check SIP registration: `ssh your-server "asterisk -rx 'sip show registry'"`
4. Check if provider changed IP: `ssh your-server "dig SIP_HOSTNAME"`
5. Test SIP OPTIONS: `ssh your-server "asterisk -rx 'sip qualify peer TRUNKNAME'"`
6. Check carrier log for failures: `SELECT call_date, dialstatus, hangup_cause, sip_hangup_cause FROM vicidial_carrier_log WHERE channel LIKE '%TRUNKNAME%' ORDER BY call_date DESC LIMIT 10;`
---
## 9. Production Safety Hook
This is the most important file in your setup. The safety hook runs **before every Bash command** Claude executes, and blocks dangerous operations on production servers.
### Why You Need This
AI assistants are powerful but imperfect. Without guardrails:
- A "cleanup" task might `rm -rf` a critical directory
- A SQL query might accidentally `DROP TABLE` instead of `SELECT`
- A config edit might break live call routing
- An Asterisk restart might drop 50 active calls
### The Hook: `~/.claude/hooks/protect-production.sh`
```bash
#!/bin/bash
# Hook: Block dangerous operations on production servers
# Exit 0 = allow, Exit 2 = block (with stderr message)
INPUT=$(cat)
COMMAND=$(echo "$INPUT" | jq -r '.tool_input.command // empty')
# Production server IPs and SSH config names
PROD_SERVERS="YOUR_SERVER_IP_1|YOUR_SERVER_IP_2|YOUR_SERVER_IP_3|server-a|server-b|server-c"
# Check if command targets a production server
targets_prod() {
echo "$COMMAND" | grep -qE "$PROD_SERVERS"
}
# Block dangerous patterns on production
if targets_prod; then
# Block rm -rf on remote servers
if echo "$COMMAND" | grep -qE 'rm\s+(-rf|-fr)\s+/'; then
echo "BLOCKED: rm -rf on production server. This could delete critical data." >&2
exit 2
fi
# Block DROP/TRUNCATE on production databases
if echo "$COMMAND" | grep -qiE '(DROP\s+TABLE|TRUNCATE\s+TABLE|DROP\s+DATABASE|DELETE\s+FROM\s+vicidial)'; then
echo "BLOCKED: Destructive SQL on production. Use SELECT first, then ask for explicit approval." >&2
exit 2
fi
# Block modifying Asterisk config files
if echo "$COMMAND" | grep -qE '(sed|awk|tee|>|>>)\s.*(extensions\.conf|extensions-vicidial\.conf|sip-vicidial\.conf|customexte\.conf|sip\.conf)'; then
echo "BLOCKED: Modifying Asterisk config on production. These affect live calls. Ask for explicit approval first." >&2
exit 2
fi
# Block systemctl stop/restart asterisk without approval
if echo "$COMMAND" | grep -qE 'systemctl\s+(stop|restart)\s+asterisk'; then
echo "BLOCKED: Stopping/restarting Asterisk on production will drop all active calls." >&2
exit 2
fi
fi
# Allow everything else
exit 0
```

Make it executable:

```bash
chmod +x ~/.claude/hooks/protect-production.sh
```
### How It Works
The hook receives the command as JSON on stdin. It extracts the Bash command, checks if it targets any production server (by IP or SSH config name), and blocks four categories of dangerous operations:
| Pattern | What It Blocks | Why |
|---|---|---|
| `rm -rf` on prod | Recursive file deletion | Could delete recordings, configs, databases |
| `DROP TABLE`, `TRUNCATE`, `DELETE FROM vicidial*` | Destructive SQL | Call logs, agent data, routing config |
| `sed`/`awk`/`tee`/`>` on Asterisk `.conf` files | Config file modification | Changes affect live call routing |
| `systemctl stop/restart asterisk` | Asterisk service control | Drops all active calls immediately |
Exit codes:
- 0 = Allow the command
- 2 = Block the command (message shown to user via stderr)
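You can exercise the same stdin/exit-code contract locally before wiring the hook into settings. This harness replicates just one rule and swaps jq for sed purely so the matching logic runs anywhere; it is an illustration of the contract, not the hook itself:

```shell
# Local harness for the hook contract: JSON on stdin,
# exit 0 = allow, exit 2 = block. One rule only, sed instead of jq.
hook() {
  local cmd
  cmd=$(sed -n 's/.*"command" *: *"\(.*\)".*/\1/p')
  if echo "$cmd" | grep -qE 'server-a|server-b'; then
    if echo "$cmd" | grep -qE 'systemctl +(stop|restart) +asterisk'; then
      echo "BLOCKED: restarting Asterisk drops active calls" >&2
      return 2
    fi
  fi
  return 0
}

printf '{"tool_input":{"command":"ssh server-a systemctl restart asterisk"}}' | hook 2>/dev/null
echo "restart asterisk -> rc=$?"   # rc=2 (blocked)
printf '{"tool_input":{"command":"ssh server-a uptime"}}' | hook
echo "uptime -> rc=$?"             # rc=0 (allowed)
```

Feeding hand-built JSON like this is also a quick way to regression-test new block patterns before they guard real traffic.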
### Registering the Hook
The hook is registered in ~/.claude/settings.json under the hooks section (see Settings Configuration below).
### Real Incident That Prompted This
On a production system, an AI assistant replaced a ring group extension in customexte.conf with a Hangup() command as part of a "cleanup." This caused all after-hours and no-agent calls to be silently dropped instead of ringing backup phones. At least 11 calls were lost overnight before the issue was noticed. The safety hook prevents this class of error entirely.
## 10. MCP Grafana Integration
The Model Context Protocol (MCP) lets Claude Code interact directly with Grafana for dashboards, Prometheus queries, and Loki log searches -- without needing to SSH anywhere.
### Setup
Install the Grafana MCP server:
```bash
pip install mcp-grafana
# or via uvx:
uvx mcp-grafana
```
Create a Grafana API service account and token:
- In Grafana, go to Administration > Service Accounts
- Create a new service account with Viewer role
- Generate a token
Configure the MCP server. Add to your project's .mcp.json or configure in Claude Code settings:
```json
{
  "mcpServers": {
    "grafana": {
      "command": "uvx",
      "args": ["mcp-grafana"],
      "env": {
        "GRAFANA_URL": "http://localhost:3000",
        "GRAFANA_API_KEY": "YOUR_GRAFANA_SERVICE_ACCOUNT_TOKEN"
      }
    }
  }
}
```
### Available MCP Tools
Once configured, Claude Code gains these tools:
| Tool | Purpose |
|---|---|
| `mcp__grafana__search_dashboards` | Find dashboards by name |
| `mcp__grafana__get_dashboard_by_uid` | Get full dashboard JSON |
| `mcp__grafana__get_dashboard_panel_queries` | Extract panel queries |
| `mcp__grafana__query_prometheus` | Execute PromQL queries directly |
| `mcp__grafana__query_loki_logs` | Search Loki logs |
| `mcp__grafana__list_loki_label_names` | Browse Loki label taxonomy |
| `mcp__grafana__list_loki_label_values` | Get values for a label |
| `mcp__grafana__list_datasources` | List all configured datasources |
| `mcp__grafana__list_prometheus_metric_names` | Browse available metrics |
| `mcp__grafana__list_prometheus_label_values` | Query Prometheus labels |
| `mcp__grafana__create_annotation` | Add annotations to dashboards |
| `mcp__grafana__get_panel_image` | Render panel as image |
### How Skills Use MCP
Your skills can reference MCP tools for deeper investigation. For example, in a /network-check skill, after checking Homer RTCP directly, you might also query Prometheus for historical latency data:
```text
# In the skill body, you can note:
If Prometheus node_exporter is available, also check:
- mcp__grafana__query_prometheus with query: rate(node_network_receive_drop_total[5m])
- mcp__grafana__query_prometheus with query: node_network_mtu_bytes
```
The MCP tools are available alongside Bash/SSH tools, giving skills access to both real-time (SSH to servers) and historical (Prometheus/Loki time series) data.
### Example: Datasource Configuration
A typical Grafana setup for VoIP monitoring includes:
| Datasource | Type | Purpose |
|---|---|---|
| Prometheus | prometheus | Server metrics (CPU, memory, disk, network) |
| Loki | loki | Aggregated log streams from all servers |
| Homer | postgres | SIP/RTCP data for call quality analysis |
| ViciDial | mysql | Direct database queries for call center data |
11. Settings Configuration
~/.claude/settings.json
This is the global settings file. It controls permissions, hooks, environment variables, and plugins.
```json
{
  "permissions": {
    "allow": [
      "Bash(*)",
      "Read(*)",
      "Write(*)",
      "Edit(*)"
    ],
    "deny": [
      "Bash(rm -rf /*)",
      "Bash(dd if=*)",
      "Bash(mkfs*)"
    ]
  },
  "effortLevel": "high",
  "env": {
    "CLAUDE_AUTOCOMPACT_PCT_OVERRIDE": "70",
    "PATH": "/root/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
  },
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "/root/.claude/hooks/protect-production.sh"
          }
        ]
      }
    ],
    "Notification": [
      {
        "matcher": "",
        "hooks": [
          {
            "type": "command",
            "command": "printf '\\a'"
          }
        ]
      }
    ]
  }
}
```

Note the bell uses `printf '\a'` rather than `echo '\a'` -- bash's builtin `echo` prints the escape literally unless given `-e`, while `printf` interprets it everywhere.
Key Settings Explained
`effortLevel: "high"` -- Tells Claude to be thorough. For infrastructure operations, you want Claude to check multiple sources, correlate data, and provide detailed analysis rather than quick surface-level answers.

`CLAUDE_AUTOCOMPACT_PCT_OVERRIDE: "70"` -- Auto-compact context at 70% of the context window. Infrastructure investigations can generate a lot of output (SQL results, log excerpts, RTCP data). Setting this to 70% gives Claude room to work without losing important context from earlier in the conversation.

`hooks.PreToolUse` -- The safety hook runs before every Bash command. The `matcher: "Bash"` means it only triggers for Bash tool calls (not Read, Write, or Edit).

`hooks.Notification` -- Terminal bell when Claude needs attention. Useful when running long investigations -- you can switch to another terminal and get notified when Claude has results.

`permissions.deny` -- Hard blocks on truly catastrophic commands. These cannot be overridden by skills or conversation.
~/.claude/settings.local.json
This file stores per-machine permission overrides. As you use Claude Code and approve tool permissions, they accumulate here. For a VoIP operations setup, you will typically see patterns like:
```json
{
  "permissions": {
    "allow": [
      "Bash(ssh:*)",
      "Bash(docker exec:*)",
      "Bash(docker ps:*)",
      "Bash(curl:*)",
      "Bash(ping:*)",
      "Bash(python3:*)",
      "mcp__grafana__list_datasources",
      "mcp__grafana__query_prometheus",
      "mcp__grafana__query_loki_logs",
      "mcp__grafana__search_dashboards",
      "Skill(health)",
      "Skill(call-investigate)",
      "Skill(listen-recording)"
    ]
  }
}
```
Tip: Review this file periodically. Remove permissions you no longer need. The principle of least privilege applies to AI assistants too.
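To make that periodic review easier, here is a small sketch that lists the unique permission entries granted in the file. It assumes only standard `grep`/`sort`, and that the file follows the shape shown above; the helper name `list_permissions` is ours, not a Claude Code command:

```shell
#!/usr/bin/env bash
# List the unique permission strings in settings.local.json so stale
# grants are easy to spot and prune. This is a rough text-level scan,
# not a full JSON parse (use jq for that if it is available).
list_permissions() {
  local file="${1:-$HOME/.claude/settings.local.json}"
  grep -oE '"[^"]+"' "$file" \
    | tr -d '"' \
    | grep -vE '^(permissions|allow|deny)$' \
    | sort -u
}
```

Run it occasionally and ask of each line: "did I grant this on purpose, and do I still need it?"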
12. Skill Design Patterns & Tips
Pattern 1: Minimize SSH Round-Trips
Bad: Five separate SSH commands to one server.
```bash
# Bad - 5 round trips
ssh your-server "hostname"
ssh your-server "uptime"
ssh your-server "df -h /"
ssh your-server "asterisk -rx 'core show channels'"
ssh your-server "fail2ban-client status"
```
Good: One SSH command that gathers everything.
```bash
# Good - 1 round trip
ssh your-server "hostname; uptime; df -h / | tail -1; asterisk -rx 'core show channels' | tail -1; fail2ban-client status 2>/dev/null | head -2"
```
When checking multiple servers, run the SSH commands in parallel (Claude Code can make multiple Bash calls simultaneously).
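The same fan-out can be done from a plain shell script. A minimal sketch, assuming placeholder host names (`server-a` etc.) and a combined probe command like the one above:

```shell
#!/usr/bin/env bash
# Fan one combined probe out to several servers in parallel, then
# collect the results. SERVERS and PROBE are placeholders for your
# own inventory and checks.
SERVERS=${SERVERS:-"server-a server-b server-c"}
PROBE='hostname; uptime; df -h / | tail -1'

gather_all() {
  for host in $SERVERS; do
    # One SSH session per host, all launched concurrently
    ssh "$host" "$PROBE" > "/tmp/health-$host.out" 2>&1 &
  done
  wait  # block until every probe has returned
  for host in $SERVERS; do
    echo "=== $host ==="
    cat "/tmp/health-$host.out"
  done
}
```

With ControlMaster sockets warm (see Appendix C), the wall-clock time is roughly that of the slowest single server rather than the sum of all of them.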
Pattern 2: Conditional Arguments with $ARGUMENTS
Skills should handle both specific and broad requests:
```
If $ARGUMENTS is provided (e.g., "server-a" or "agent123"), check only
that target. Otherwise check all servers/agents.
```
This makes skills flexible -- /health checks everything, /health server-a checks one server.
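In the skill body this branch is prose that Claude follows, but the equivalent logic, sketched in shell with a placeholder server list and a helper name of our own (`pick_targets`), is just:

```shell
#!/usr/bin/env bash
# If a target argument is given, check only it; otherwise check the
# whole fleet. ALL_SERVERS is a placeholder for your inventory.
ALL_SERVERS="server-a server-b server-c replica"

pick_targets() {
  local arg="$1"
  if [ -n "$arg" ]; then
    echo "$arg"            # /health server-a -> just that host
  else
    echo "$ALL_SERVERS"    # /health -> everything
  fi
}
```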
Pattern 3: Embed Interpretation Tables
Do not just show raw data. Include reference tables so Claude can interpret results:
```markdown
**Key hangup causes:**
- 16 = Normal clearing (good)
- 17 = User busy
- 34 = No circuit available (trunk congestion)
- 38 = Network out of order

**NISQA Scores (1-5 scale):**
| Score | Quality |
|-------|---------|
| 4.0+ | Excellent |
| 3.5-4.0 | Good |
| 3.0-3.5 | Fair |
| < 3.0 | Poor |
```
Pattern 4: Include Server-Specific Variations
Real infrastructure is messy. Servers run different versions, use different credentials, have different paths:
```markdown
## MySQL Access Per Server
- **server-a/server-b**: `ssh server-name "mysql asterisk -e '...'"`
  (root, no password needed via SSH)
- **server-c (older)**: `ssh server-c "mysql -u YOUR_USER -pYOUR_PASS asterisk -e '...'"`
- **replica**: `ssh replica "mysql -u YOUR_USER -pYOUR_PASS dbname -e '...'"`

## Server-Specific Notes
- **server-a/server-b**: ConfBridge, newer Asterisk, ORIG recordings available
- **server-c**: MeetMe (requires DAHDI), Asterisk 11, older OS
```
Pattern 5: Severity Flags
Define clear severity levels in investigation skills:
```
Flag any issues:
- Disk > 80% = WARNING
- Disk > 95% = CRITICAL
- fail2ban not running = CRITICAL
- Replication lag > 60s = WARNING
- Replication lag > 300s = CRITICAL
- SIP peer UNREACHABLE = CRITICAL
```
Claude will use these to prioritize its output, putting critical items at the top.
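The thresholds are also trivial to encode directly if a skill shells out to a helper script. A minimal sketch for the disk rule only, with a function name (`disk_severity`) of our own choosing and the same cutoffs as the list above:

```shell
#!/usr/bin/env bash
# Map a disk-usage percentage to the severity levels defined above:
# > 95 = CRITICAL, > 80 = WARNING, otherwise OK.
disk_severity() {
  local pct=$1
  if   [ "$pct" -gt 95 ]; then echo CRITICAL
  elif [ "$pct" -gt 80 ]; then echo WARNING
  else                         echo OK
  fi
}
```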
Pattern 6: Cross-Skill References
Skills can suggest other skills for deeper investigation:
```
| LAGGED | Agent lagged out | Network issue -- use /lagged skill |

If audio quality issues are found, suggest using /listen-recording
for detailed NISQA analysis.
```
Pattern 7: Include Both URL and SQL Approaches
For report-type skills, provide both the web URL (for sharing with non-technical users) and the direct SQL (for programmatic access):
````markdown
### Agent Performance
URL: http://YOUR_SERVER_IP/vicidial/AST_agent_performance_detail.php?query_date=...

### Direct SQL equivalent:
```sql
SELECT user, COUNT(*) AS calls, SUM(length_in_sec) AS total_talk
FROM vicidial_closer_log
WHERE call_date >= CURDATE()
GROUP BY user ORDER BY calls DESC;
```
````
Pattern 8: Cleanup After Investigation
Skills that download files should include cleanup instructions:
## Cleanup
```bash
rm -f /tmp/call*.mp3 /tmp/call*.wav /tmp/trimmed.wav /tmp/spectrogram.png
```
Tip: Test Skills Incrementally
When building a new skill:
- Start with the frontmatter and one step
- Test it with `/your-skill`
- Add more steps, testing after each
- Add interpretation tables last
- Add server-specific notes as you discover edge cases
Tip: Keep Skills Focused
Each skill should do one thing well. If a skill is growing beyond ~150 lines, consider splitting it. The /audio-quality skill is one of the largest at ~140 lines because audio investigation genuinely requires that many tools. But /health is only ~40 lines because it has a focused purpose.
Tip: Use Allowed-Tools as Documentation
The allowed-tools field is not just security -- it tells you (and future maintainers) what resources a skill needs:
```yaml
# This skill only needs SSH -- it is a pure remote investigation
allowed-tools: Bash(ssh *)

# This skill needs SSH + Docker + HTTP -- it correlates multiple data sources
allowed-tools: Bash(ssh *), Bash(docker *), Bash(curl *), Bash(ping *)

# This skill needs SSH + local audio tools -- it processes files locally
allowed-tools: Bash(ssh *), Bash(curl *), Bash(sox *), Bash(soxi *), Bash(ffprobe *)
```
13. Investigation Workflow Patterns
The Funnel Pattern
Start broad, narrow down. Most investigation skills follow this pattern:
```
Step 1: Find the records (broad SQL query)
        |
Step 2: Get SIP/carrier-level detail (narrower)
        |
Step 3: Check infrastructure state (Asterisk logs, SIP peers)
        |
Step 4: Check external data (Homer, Smokeping, recordings)
        |
Step 5: Correlate and diagnose
```
The Correlation Pattern
The real power of AI-assisted investigation is correlation. A human checking Homer RTCP, then Asterisk logs, then ViciDial tables has to hold all that context in their head. Claude does this naturally:
```
Agent LAGGED at 14:32:15
+ Homer RTCP shows jitter spike at 14:32:10 from agent IP
+ 3 other agents on same IP also LAGGED within 30 seconds
= Diagnosis: Office internet dropout
```
Design skills to gather correlated data points and let Claude connect the dots.
The Baseline Pattern
For drop/failure analysis, always compare against historical baselines:
```sql
-- Is today's drop rate abnormal?
SELECT DATE(call_date) AS day,
       COUNT(*) AS total,
       SUM(CASE WHEN status='DROP' THEN 1 ELSE 0 END) AS drops,
       ROUND(SUM(CASE WHEN status='DROP' THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS pct
FROM vicidial_closer_log
WHERE call_date >= DATE_SUB(CURDATE(), INTERVAL 7 DAY)
GROUP BY day ORDER BY day;
```
Without a baseline, "50 drops today" means nothing. With a baseline, "50 drops today vs. average 12" is a clear alarm.
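Once the weekly numbers are back, the comparison itself is mechanical. A sketch in shell/awk that flags today's count against the mean of the prior days -- it assumes the field layout of the query above (`day total drops pct`, one row per day, today last) and a 2x-average alarm threshold of our own choosing:

```shell
#!/usr/bin/env bash
# Read "day total drops pct" rows (last row = today) and flag today's
# drop count if it exceeds twice the average of the earlier days.
flag_drops() {
  awk '{ drops[NR] = $3 }
       END {
         today = drops[NR]
         for (i = 1; i < NR; i++) sum += drops[i]
         avg = (NR > 1) ? sum / (NR - 1) : 0
         if (avg > 0 && today > 2 * avg)
           printf "ALARM: %d drops today vs avg %.1f\n", today, avg
         else
           printf "OK: %d drops today vs avg %.1f\n", today, avg
       }'
}
```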
14. Permission Management
Three Layers of Permission Control
1. **`settings.json` deny list** -- Hard blocks. Cannot be overridden.
2. **Safety hook** -- Context-aware blocks (only on production servers).
3. **Skill `allowed-tools`** -- Per-skill tool whitelist.
Permission Flow
```
User types /health
        |
        v
Claude reads SKILL.md
  -> allowed-tools: Bash(ssh *)
        |
        v
Claude generates: ssh server-a "hostname; uptime; ..."
        |
        v
PreToolUse hook runs (protect-production.sh)
  -> Command targets production? Yes
  -> Is it dangerous? (rm -rf, DROP TABLE, etc.)
  -> No: exit 0 (allow)
        |
        v
settings.json permissions check
  -> Bash(ssh *) matches allow list? Yes
        |
        v
Command executes
```
When to Allow vs. Deny
Allow broadly in settings.json for your normal workflow tools:
```json
"allow": ["Bash(*)", "Read(*)", "Write(*)", "Edit(*)"]
```
Deny specifically for catastrophic operations:
```json
"deny": ["Bash(rm -rf /*)", "Bash(dd if=*)", "Bash(mkfs*)"]
```
Use hooks for context-dependent blocking (same command might be fine on a test server but dangerous on production).
Use allowed-tools in skills to prevent scope creep (a health check skill should not need to write files).
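To make the context-dependent layer concrete: a PreToolUse hook receives the pending tool call as JSON on stdin and can block it by exiting non-zero with a message on stderr. The sketch below is an illustrative simplification, not the full protect-production.sh from earlier -- the host pattern (`prod`) and dangerous-command list are placeholders, and a robust hook would parse the JSON with jq rather than pattern-match the raw string:

```shell
#!/usr/bin/env bash
# Sketch of context-dependent blocking: allow a command everywhere
# except when it is both dangerous AND aimed at a production host.
check_command() {
  local cmd="$1"
  case "$cmd" in
    *prod*)  # command mentions a production host (placeholder match)
      case "$cmd" in
        *"rm -rf"*|*"DROP TABLE"*|*"mkfs"*)
          echo "Blocked: dangerous command on production" >&2
          return 2 ;;   # non-zero = block the tool call
      esac ;;
  esac
  return 0              # everything else is allowed
}

# In a real hook you would extract the command from the stdin JSON,
# e.g. with jq, then:  check_command "$cmd"; exit $?
```

The same `rm -rf` that is blocked against `prod-a` sails through against a dev box, which is exactly the behavior a static deny list cannot express.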
15. Putting It All Together
Quick Start: Build Your First Three Skills
If you are starting from scratch, build these three skills first:
1. `/health` -- Gives you immediate value. One command to check all servers.
2. `/calls` -- Real-time situational awareness.
3. `/call-investigate` -- The workhorse for any incident.
Checklist
- Create the `~/.claude/skills/` directory structure
- Write your safety hook at `~/.claude/hooks/protect-production.sh`
- Configure `~/.claude/settings.json` with hooks, permissions, and env
- Set up SSH config with ControlMaster for fast connections
- Create your first skill (`/health`) and test it
- Add server-specific notes as you discover differences
- Build investigation skills as incidents teach you what to check
- Set up MCP Grafana for dashboard and metric access
- Review and prune `settings.local.json` periodically
The Feedback Loop
The best skills come from real incidents. Every time you investigate a problem manually, ask yourself:
- What steps did I follow?
- What data sources did I check?
- What reference information did I need to look up?
- What thresholds told me something was wrong?
Write that down as a SKILL.md. Next time, you type a slash command instead of spending 30 minutes.
Scaling to New Infrastructure
The skill pattern works for any infrastructure that Claude Code can reach via SSH, Docker, HTTP, or MCP:
- **Kubernetes clusters**: Skills that run `kubectl` commands and interpret pod/node status
- **Cloud infrastructure**: Skills that use the `aws`, `gcloud`, or `az` CLIs
- **Network equipment**: Skills that SSH to switches/routers and parse `show` command output
- **CI/CD pipelines**: Skills that check build status, deployment logs, test results
- **Database clusters**: Skills that check replication, slow queries, connection pools
The framework is the same: a SKILL.md file that teaches Claude your operational procedures, reference data, and interpretation rules. The tools change; the pattern does not.
Appendix A: All 15 Skills at a Glance
| # | Skill | Type | Tools | Key Data Sources |
|---|---|---|---|---|
| 1 | `/health` | Operations | SSH | Asterisk, MySQL, disk, fail2ban |
| 2 | `/calls` | Operations | SSH | Asterisk channels, live_agents, auto_calls |
| 3 | `/agents` | Operations | SSH | vicidial_live_agents, vicidial_users |
| 4 | `/replication` | Operations | SSH | SHOW ALL SLAVES STATUS |
| 5 | `/audit-server` | Operations | SSH | System, Asterisk, DB, security, logs |
| 6 | `/trunk-status` | Operations | SSH | SIP peers, carrier_log, iptables |
| 7 | `/audio-quality` | Investigation | SSH, Docker, curl, ping | Homer RTCP, NISQA, Asterisk logs, codecs |
| 8 | `/call-investigate` | Investigation | SSH, Docker, curl | closer_log, carrier_log, DIDs, Homer, recordings |
| 9 | `/call-drops` | Investigation | SSH, Docker | Problem statuses, carrier detail, baselines |
| 10 | `/lagged` | Investigation | SSH, Docker | agent_log, Homer RTCP, SIP peers |
| 11 | `/network-check` | Investigation | SSH, Docker, curl, ping | Homer RTCP, Smokeping, UDP buffers, MTR |
| 12 | `/agent-ranks` | Lookup | SSH | inbound_group_agents, routing config |
| 13 | `/did-lookup` | Lookup | SSH | inbound_dids, company mapping, dialplan |
| 14 | `/reports` | Lookup | SSH, curl | ViciDial PHP reports + direct SQL |
| 15 | `/listen-recording` | Lookup | SSH, curl, sox, ffprobe | recording_log, NISQA, Silero VAD, SoX |
Appendix B: Common ViciDial Status Codes
For reference in your skills:
Call Disposition Statuses
| Status | Meaning |
|---|---|
| SALE | Successful sale/conversion |
| NI | Not interested |
| A | Answering machine |
| CALLBK | Callback scheduled |
| DNC | Do not call |
| XFER | Transferred |
| DROP | Dropped from queue (no agent) |
| DISMX | Disconnect mid-call (inbound) |
| DCMX | Disconnect mid-call (outbound) |
| TIMEOT | Agent timeout |
| ADCT | Auto-disconnect (dead channel) |
| AFTHRS | After hours |
| NANQUE | No agent, no queue |
| HXFER | Hangup during transfer |
| XDROP | External drop |
| LAGGED | Agent heartbeat failure |
SIP Hangup Causes
| Code | Meaning |
|---|---|
| 16 | Normal clearing |
| 17 | User busy |
| 18 | No user responding |
| 20 | Subscriber absent |
| 21 | Call rejected |
| 31 | Normal, unspecified |
| 34 | No circuit available |
| 38 | Network out of order |
| 127 | Internal error |
RTCP Quality Thresholds
| Metric | Good | Warning | Critical |
|---|---|---|---|
| Packet loss (fraction_lost) | 0 | >5 (of 255) | >25 |
| Jitter | <20ms | >50ms | >100ms |
| Latency (RTT) | <100ms | >200ms | >300ms |
| UDP RcvbufErrors | 0 | >0 | Increasing |
NISQA Audio Quality Scores
| Score | Quality |
|---|---|
| 4.0+ | Excellent |
| 3.5-4.0 | Good |
| 3.0-3.5 | Fair |
| 2.5-3.0 | Poor |
| < 2.5 | Bad |
Appendix C: SSH Configuration for Multi-Server Access
For skills to work efficiently, configure SSH with ControlMaster for persistent connections:
```
# ~/.ssh/config
Host server-a
    HostName YOUR_SERVER_IP
    Port 9322
    User root
    IdentityFile ~/.ssh/id_ed25519
    ControlMaster auto
    ControlPath ~/.ssh/sockets/%r@%h-%p
    ControlPersist 600

Host server-b
    HostName YOUR_SERVER_IP
    Port 9322
    User root
    IdentityFile ~/.ssh/id_ed25519
    ControlMaster auto
    ControlPath ~/.ssh/sockets/%r@%h-%p
    ControlPersist 600

# Repeat for each server...

Host replica
    HostName YOUR_REPLICA_IP
    Port 9322
    User root
    IdentityFile ~/.ssh/id_ed25519
    ControlMaster auto
    ControlPath ~/.ssh/sockets/%r@%h-%p
    ControlPersist 600
```
Create the sockets directory:
```bash
mkdir -p ~/.ssh/sockets
```

`ControlPersist 600` keeps SSH connections open for 10 minutes after the last use. This means the first `/health` check opens connections to all servers, and subsequent skill invocations reuse them -- making everything feel instant.
This tutorial documents a production system managing 5 VoIP servers, 1,500+ DIDs, 50+ agents, and thousands of daily calls. The skills were developed iteratively over weeks of real operations, each one born from an actual incident or repeated manual investigation. The framework scales to any infrastructure that Claude Code can reach.