
# Building Custom Claude Code Skills for VoIP Infrastructure Operations


15 Slash Commands for Monitoring, Investigation & Lookup


Audience: DevOps engineers, sysadmins, VoIP/telecom operators, and Claude Code power users.

What you will build: A complete AI-powered operations toolkit -- 15 custom slash commands that turn Claude Code into a senior infrastructure engineer who knows your servers, your databases, your SIP trunks, and your investigation playbooks by heart.

Prerequisites: Claude Code CLI installed, SSH access to your servers, basic familiarity with Asterisk/VoIP concepts.


## Table of Contents

  1. Why AI-Assisted Operations Skills
  2. Architecture Overview
  3. The SKILL.md File Format
  4. Directory Structure
  5. Operations Skills (6)
  6. Investigation Skills (5)
  7. Lookup Skills (4)
  8. Complete Example Skills
  9. Production Safety Hook
  10. MCP Grafana Integration
  11. Settings Configuration
  12. Skill Design Patterns & Tips
  13. Investigation Workflow Patterns
  14. Permission Management
  15. Putting It All Together

## 1. Why AI-Assisted Operations Skills

Traditional infrastructure monitoring gives you dashboards. Runbooks give you procedures. But neither thinks. Neither correlates. Neither adapts.

When you build custom Claude Code skills for your infrastructure, you get something qualitatively different:

Context-aware investigation. Instead of checking five different tools manually, you type /call-investigate +44XXXXXXXXXX and Claude traces the call through DID routing, carrier logs, Asterisk dialplans, SIP traces, agent state, and audio recordings -- correlating everything into a single diagnosis.

Institutional knowledge embedded in code. Every skill file encodes your team's hard-won knowledge: which hangup cause means what, which server uses which MySQL credentials, where the recordings live, what "normal" looks like for your trunks. New team members get the senior engineer's playbook on day one.

The 10x multiplier is real. Here is what changes:

| Task | Without Skills | With Skills |
|---|---|---|
| Health check across 5 servers | 5-10 min (SSH each, run commands, compare) | 15 sec (`/health`) |
| Investigate a dropped call | 30-60 min (find logs, trace routing, check carrier) | 2 min (`/call-investigate`) |
| Check why agent has no calls | 15-20 min (check ranks, ingroups, login state) | 30 sec (`/agent-ranks agent123`) |
| Diagnose audio quality complaint | 1-2 hours (Homer, recordings, codecs, network) | 5 min (`/audio-quality`) |
| Full server audit | 45-60 min | 3 min (`/audit-server`) |

Each skill is a Markdown file. No plugins to install, no APIs to build, no code to compile. You write the investigation procedure in natural language, and Claude executes it using the tools you allow.


## 2. Architecture Overview

```
+------------------+     SSH (key-based)     +-------------------+
|                  |------------------------->| VoIP Server 1     |
|  VPS / Jump Box  |------------------------->| VoIP Server 2     |
|  (Claude Code)   |------------------------->| VoIP Server 3     |
|                  |------------------------->| Replica DB        |
|  ~/.claude/      |                          +-------------------+
|    skills/       |
|      health/     |     Docker (local)       +-------------------+
|      calls/      |------------------------->| Grafana           |
|      agents/     |------------------------->| Prometheus        |
|      ...         |------------------------->| Loki              |
|    hooks/        |------------------------->| Homer (SIP/RTCP)  |
|    settings.json |------------------------->| Smokeping         |
+------------------+                          +-------------------+
        |
        | MCP (Model Context Protocol)
        v
+------------------+
| Grafana MCP      |
| (mcp-grafana)    |
| - Dashboards     |
| - PromQL queries |
| - Loki log search|
+------------------+
```

Key principle: Claude Code runs on a central VPS/jump box that has SSH access to all production servers and Docker access to monitoring containers. Skills teach Claude how to use these access paths to answer operational questions.
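Those access paths assume working SSH host aliases on the jump box. A minimal sketch of a `~/.ssh/config` entry the skills rely on (the alias, IP, and key path are placeholders for your own values):

```
# ~/.ssh/config on the jump box -- one entry per production server.
# Alias, IP, and key path below are placeholders.
Host server-a
    HostName YOUR_SERVER_IP
    User root
    IdentityFile ~/.ssh/voip-ops-key
    ConnectTimeout 5
```

With key-based entries like this, `ssh server-a "uptime"` runs non-interactively, which is what every skill below depends on.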


## 3. The SKILL.md File Format

Every skill is a single Markdown file named SKILL.md inside its own directory under ~/.claude/skills/. The file has two parts: a YAML frontmatter header and a Markdown body.

### Frontmatter (Required)

```yaml
---
name: skill-name
description: One-line description shown in skill listings and used for matching.
user-invocable: true
allowed-tools: Bash(ssh *), Bash(docker *), Bash(curl *)
---
```

| Field | Purpose |
|---|---|
| `name` | The slash command name. Users type `/name` to invoke. |
| `description` | Shown in help listings. Also used by Claude to decide when to suggest the skill. Be specific -- mention the problem types this skill addresses. |
| `user-invocable` | Set to `true` so users can trigger it directly with `/name`. |
| `allowed-tools` | Whitelist of tools the skill can use. Uses glob patterns. `Bash(ssh *)` means "allow any Bash command starting with `ssh`". |

### Allowed-Tools Patterns

```yaml
# SSH to any server
allowed-tools: Bash(ssh *)

# SSH + Docker + curl + ping
allowed-tools: Bash(ssh *), Bash(docker *), Bash(curl *), Bash(ping *)

# SSH + local audio tools
allowed-tools: Bash(ssh *), Bash(curl *), Bash(sox *), Bash(soxi *), Bash(ffprobe *)
```

The tool patterns act as a security boundary. A skill that only needs SSH cannot accidentally execute Docker commands or write files. Design skills with the minimum tools they need.

### Body (The Investigation Procedure)

The Markdown body is the actual instruction set. Claude reads this as its playbook when the skill is invoked. It should contain:

  1. What to do -- step-by-step procedures
  2. How to access resources -- SSH commands, SQL queries, API calls
  3. How to interpret results -- reference tables, thresholds, known patterns
  4. Server-specific variations -- different credentials, paths, or versions per server
  5. Output formatting -- how to present results to the user

The body supports a special variable: $ARGUMENTS -- whatever the user typed after the slash command. For example, if the user types /health server-a, then $ARGUMENTS is server-a.


## 4. Directory Structure

```
~/.claude/
  settings.json              # Global settings (permissions, hooks, env)
  settings.local.json        # Per-machine permission overrides
  hooks/
    protect-production.sh    # Safety hook: blocks dangerous commands
  skills/
    health/
      SKILL.md               # /health skill
    calls/
      SKILL.md               # /calls skill
    agents/
      SKILL.md               # /agents skill
    replication/
      SKILL.md               # /replication skill
    audit-server/
      SKILL.md               # /audit-server skill
    trunk-status/
      SKILL.md               # /trunk-status skill
    audio-quality/
      SKILL.md               # /audio-quality skill
    call-investigate/
      SKILL.md               # /call-investigate skill
    call-drops/
      SKILL.md               # /call-drops skill
    lagged/
      SKILL.md               # /lagged skill
    network-check/
      SKILL.md               # /network-check skill
    agent-ranks/
      SKILL.md               # /agent-ranks skill
    did-lookup/
      SKILL.md               # /did-lookup skill
    reports/
      SKILL.md               # /reports skill
    listen-recording/
      SKILL.md               # /listen-recording skill
```

Each skill gets its own directory. This is a Claude Code convention -- the directory name matches the skill name.
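Creating a new skill is therefore just a directory plus a file. A sketch of scaffolding a hypothetical `my-skill` (the `SKILLS_DIR` override and the stub content are illustrative):

```shell
# Sketch: scaffold a new skill named "my-skill" (hypothetical); the
# directory name must match the frontmatter `name:` field
SKILLS_DIR="${SKILLS_DIR:-$HOME/.claude/skills}"
mkdir -p "$SKILLS_DIR/my-skill"
cat > "$SKILLS_DIR/my-skill/SKILL.md" <<'EOF'
---
name: my-skill
description: Stub skill; replace with a real investigation procedure.
user-invocable: true
allowed-tools: Bash(ssh *)
---

# My Skill

If $ARGUMENTS is provided, check only that target.
EOF
echo "created $SKILLS_DIR/my-skill/SKILL.md"
```

After saving the file, `/my-skill` becomes available the next time Claude Code loads its skills.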


## 5. Operations Skills

These six skills answer the question: "What is happening right now?"

### 5.1 `/health` -- Quick Health Check

Purpose: Single-command health sweep across all production servers.

What it checks per server: hostname and uptime, active Asterisk channels, SIP peer count, MySQL status, disk usage, and fail2ban state (plus replication lag on the replica).

Design pattern: One SSH command per server that gathers all metrics, minimizing round-trips. Results presented as a table with WARNING/CRITICAL flags.

Usage:

```
/health              # Check all servers
/health server-a     # Check specific server
```

### 5.2 `/calls` -- Live Calls

Purpose: Real-time view of active calls across the infrastructure.

What it shows:

### 5.3 `/agents` -- Agent Status

Purpose: All logged-in agents with detailed status.

What it shows per agent:

Flags:

### 5.4 `/replication` -- Database Replication

Purpose: Check MariaDB multi-source replication health.

What it checks: per-connection IO and SQL thread state plus Seconds_Behind_Master for every multi-source replication channel.

Special feature: Pass `fix` as an argument to get suggested repair commands.

### 5.5 `/audit-server` -- Deep Server Audit

Purpose: Comprehensive server audit covering system, Asterisk, database, security, ViciDial, and logs.

Sections: System resources, Asterisk health, database status, security posture, ViciDial process status, recent errors.

Output: Organized by severity -- CRITICAL, WARNING, INFO.

### 5.6 `/trunk-status` -- SIP Trunk Status

Purpose: Check SIP trunk registration and connectivity.

Includes: Trunk inventory per server, quick all-server check loop, and a troubleshooting workflow (ping, firewall, registration, DNS, qualify, carrier logs).


## 6. Investigation Skills

These five skills answer the question: "Why did this happen?"

### 6.1 `/audio-quality` -- Voice Quality Investigation

Tools used: Homer RTCP (PostgreSQL), audio analysis service (NISQA neural scoring + Silero VAD), Asterisk logs, SIP peer stats, Smokeping, codec verification.

Investigation flow:

  1. Find the calls
  2. Identify endpoints (agent IP, trunk IP)
  3. Query Homer RTCP for packet loss and jitter
  4. Check Asterisk logs for codec errors, RTP switching
  5. Check live SIP quality
  6. Download and analyze recording
  7. Check network (Smokeping, ping, UDP buffers)

### 6.2 `/call-investigate` -- Deep Call Tracing

The most detailed skill. Traces a call through its entire lifecycle:

  1. Find call records (inbound/outbound/archived)
  2. Check carrier log (SIP-level hangup causes)
  3. Check DID routing
  4. Trace in Asterisk logs
  5. Search Homer SIP traces
  6. Find and analyze recording
  7. Check agent state at time of call

Includes reference tables for hangup causes (16 = normal clearing, 17 = user busy, 18 = no user responding, etc.) and problem statuses (DISMX, DCMX, DROP, TIMEOT, etc.).

### 6.3 `/call-drops` -- Drop & Failure Analysis

Purpose: Systematic analysis of problem dispositions.

Covers: DROP (queue timeout), DISMX/DCMX (mid-call disconnect), TIMEOT (agent timeout), AFTHRS (after hours), with carrier-level detail and historical baseline comparison.

### 6.4 `/lagged` -- Agent LAGGED Events

Purpose: Investigate ViciDial heartbeat failures that kick agents offline.

Correlation: Matches LAGGED timestamps against Homer RTCP data to determine if the cause was network (jitter spike, packet loss) or client-side (browser crash, PC freeze).

### 6.5 `/network-check` -- Network Quality

Tools: Homer RTCP analysis, Smokeping, direct ping, UDP buffer stats, SIP peer latency, live RTP channel stats, MTR traceroute.

Thresholds for packet loss, jitter, and latency are documented inline in the skill body, so results can be flagged as OK/WARNING/CRITICAL consistently.


## 7. Lookup Skills

These four skills answer the question: "What is this configured to do?"

### 7.1 `/agent-ranks` -- Rank & Routing Diagnostics

Purpose: Understand why calls go to specific agents.

Checks: Ingroup assignments, routing method, rank/weight configuration, active closer campaigns, call distribution fairness, ranking inconsistencies, and can simulate "who would get the next call right now?"

7.2 /did-lookup -- DID Routing

Purpose: Trace how a phone number is routed through the system.

Covers: DID configuration, company name mapping, call history, dialplan routing path, and can manage company-to-DID mappings.

### 7.3 `/reports` -- ViciDial Report Generation

Purpose: Quick access to 15+ built-in ViciDial reports plus direct SQL.

Provides: URL templates with proper parameters for agent performance, inbound stats, carrier logs, LAGGED reports, call exports, DID stats, and more. Also includes custom SQL queries for when built-in reports are not enough.

### 7.4 `/listen-recording` -- Recording Analysis

Purpose: Download and analyze call recordings with neural quality scoring.

Tools: NISQA (neural audio quality model), Silero VAD (voice activity detection for silence analysis), SoX (waveform analysis), ffprobe (format inspection).

Supports: Both MIX (combined stereo) and ORIG (separate caller/agent legs) recording formats.


## 8. Complete Example Skills

Here are four complete skill files you can adapt for your infrastructure.

### Example 1: `/health` -- Server Health Check

```yaml
---
name: health
description: Quick health check across all VoIP production servers. Shows Asterisk, MySQL, disk, uptime, fail2ban, replication.
user-invocable: true
allowed-tools: Bash(ssh *)
---
```

# Server Health Check

Run a quick health check across all production VoIP servers.
Use SSH config names (server-a, server-b, server-c, etc.).

If $ARGUMENTS is provided, check only those servers.
Otherwise check all production servers.

For each server, run ONE ssh command that gathers:
1. `hostname` and `uptime`
2. `asterisk -rx "core show channels" | tail -1` (active calls)
3. `asterisk -rx "sip show peers" | tail -1` (SIP peers)
4. `mysqladmin status 2>/dev/null | head -1` (MySQL uptime/threads/queries)
5. `df -h / | tail -1` (disk usage)
6. `fail2ban-client status 2>/dev/null | head -2` (fail2ban)

Combine all into a single SSH command per server to minimize round-trips.

Present results in a clean table format. Flag any issues:
- Disk > 80% = WARNING
- No active Asterisk channels when agents should be online = WARNING
- fail2ban not running = CRITICAL
- MySQL not responding = CRITICAL

Also check replication on the replica server (ssh your-replica):
- `mysql -u YOUR_REPL_USER -pYOUR_REPL_PASS -e "SHOW ALL SLAVES STATUS\G" | grep -E "Connection_name|Slave_IO|Slave_SQL|Seconds_Behind"`

Server reference:
- server-a (YOUR_SERVER_IP) -- Primary, Asterisk 18
- server-b (YOUR_SERVER_IP) -- Secondary, Asterisk 16
- server-c (YOUR_SERVER_IP) -- Tertiary, Asterisk 13
- server-d (YOUR_SERVER_IP) -- Standalone

### Example 2: `/call-investigate` -- Deep Call Tracing

```yaml
---
name: call-investigate
description: Deep investigation of specific calls by phone number, uniqueid, or agent ID. Traces full call path from DID through routing to agent, checks carrier logs, SIP traces, recordings, and dispositions. Use for any call complaint or incident.
user-invocable: true
allowed-tools: Bash(ssh *), Bash(docker *), Bash(curl *)
---
```

# Call Investigation

Deep-dive into specific calls. $ARGUMENTS: phone number(s), uniqueid(s),
agent ID(s), or date range.

## Step 1: Find the Call Records

### Inbound calls (vicidial_closer_log)
```sql
SELECT call_date, phone_number, length_in_sec, status, term_reason,
       uniqueid, closecallid, user, campaign_id, queue_seconds,
       comments
FROM vicidial_closer_log
WHERE phone_number LIKE '%NUMBER%'
  AND call_date >= 'YYYY-MM-DD'
ORDER BY call_date DESC LIMIT 20;
```

### Outbound calls (vicidial_log)

```sql
SELECT call_date, phone_number, length_in_sec, status, term_reason,
       uniqueid, user, campaign_id
FROM vicidial_log
WHERE phone_number LIKE '%NUMBER%'
  AND call_date >= 'YYYY-MM-DD'
ORDER BY call_date DESC LIMIT 20;
```

### By agent

```sql
SELECT call_date, phone_number, length_in_sec, status, term_reason,
       uniqueid, campaign_id
FROM vicidial_closer_log WHERE user='AGENT_ID' AND call_date >= CURDATE()
UNION ALL
SELECT call_date, phone_number, length_in_sec, status, term_reason,
       uniqueid, campaign_id
FROM vicidial_log WHERE user='AGENT_ID' AND call_date >= CURDATE()
ORDER BY call_date DESC LIMIT 30;
```

## Step 2: Check Carrier Log (SIP-level detail)

```sql
SELECT call_date, channel, server_ip, dialstatus,
       hangup_cause, sip_hangup_cause, sip_hangup_reason,
       dial_time, answered_time, dead_sec
FROM vicidial_carrier_log
WHERE uniqueid='UNIQUEID'
ORDER BY call_date;
```

Key hangup causes: 16 = normal clearing, 17 = user busy, 18 = no user responding, 21 = call rejected, 34 = no circuit/channel available.

Key dialstatuses: ANSWER, BUSY, NOANSWER, CONGESTION, CHANUNAVAIL, FAILED.

## Step 3: Check DID Routing

```sql
SELECT did_id, did_pattern, did_description, did_route,
       did_agent_a, extension, exten_context, group_id
FROM vicidial_inbound_dids
WHERE did_pattern LIKE '%DID_NUMBER%';
```

## Step 4: Trace in Asterisk Logs

```bash
# Find the call in Asterisk logs by uniqueid or phone number
ssh your-server "grep -E 'UNIQUEID|PHONE_NUMBER' /var/log/asterisk/messages | tail -30"

# Trace full SIP dialog by Call-ID
ssh your-server "grep 'CALL_ID' /var/log/asterisk/messages | tail -50"
```

What to look for: the Q.850 hangup cause, SIP error responses, RTP timeout disconnects, and codec negotiation warnings around the call's timestamps.

## Step 5: Check Homer SIP Traces (if available)

```bash
docker exec -i postgres psql -U homer -d homer_data -c "
SELECT create_date, protocol_header->>'method' as method,
       protocol_header->>'srcIp' as src,
       protocol_header->>'dstIp' as dst
FROM hep_proto_1_default_YYYYMMDD_HHMM
WHERE raw::text LIKE '%PHONE_NUMBER%'
ORDER BY create_date DESC LIMIT 20;
"
```

Note: SIP table is hep_proto_1_*, RTCP is hep_proto_5_*. Partitions are by UTC time (6-hour windows).
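Since the partition name encodes the 6-hour UTC window, a small shell sketch can compute the current suffix instead of working it out by hand:

```shell
# Sketch: compute the current Homer partition suffix
# (partitions start at 00/06/12/18 UTC)
hour=$(date -u +%H)
window=$(printf '%02d' $(( (10#$hour / 6) * 6 )))
suffix="$(date -u +%Y%m%d)_${window}00"
echo "hep_proto_1_default_${suffix}"   # SIP; swap in hep_proto_5_ for RTCP
```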

## Step 6: Check Recording

```bash
ssh your-server "mysql asterisk -e \"SELECT recording_id, filename,
  location, length_in_sec FROM recording_log WHERE lead_id IN (
  SELECT lead_id FROM vicidial_closer_log WHERE uniqueid='UNIQUEID'
) ORDER BY start_time DESC LIMIT 5;\""

# Audio analysis (if analysis service is running)
curl -s "http://localhost:8084/analyze?uniqueid=UNIQUEID&server=SERVER_KEY" | jq .
```

## Step 7: Check Agent State at Time of Call

```sql
SELECT event_time, user, pause_epoch, wait_epoch, talk_epoch,
       dispo_epoch, status, sub_status, pause_type, dead_sec
FROM vicidial_agent_log
WHERE user='AGENT_ID'
  AND event_time >= 'YYYY-MM-DD HH:MM:00'
  AND event_time <= 'YYYY-MM-DD HH:MM:59'
ORDER BY event_time;
```

## Problem Status Reference

| Status | Meaning | Investigation |
|---|---|---|
| DISMX | Disconnect mid-call (inbound) | Check carrier_log, network, agent connection |
| DCMX | Disconnect mid-call (outbound) | Same as above |
| DROP | Call dropped from queue (timeout) | Check queue timeout, agent availability |
| TIMEOT | Agent didn't answer in time | Check alert settings, softphone |
| ADCT | Auto-disconnect | Check dead_max campaign setting |
| AFTHRS | After hours routing | Check ingroup after_hours settings |
| NANQUE | No agent, no queue | Check no_agent_no_queue setting |
| HXFER | Hangup during transfer | Check transfer target availability |
| XDROP | External drop | Carrier/trunk issue |
| LAGGED | Agent lagged out | Network -- use /lagged skill |

## MySQL Access Per Server


### Example 3: `/audio-quality` -- Voice Quality Investigation

```yaml
---
name: audio-quality
description: Investigate audio quality issues for specific calls or agents. Uses Homer RTCP, audio analysis service, Asterisk logs, recording playback, codec checks. Use when agents or clients complain about voice quality, one-way audio, choppy audio, echo, or silence.
user-invocable: true
allowed-tools: Bash(ssh *), Bash(docker *), Bash(curl *), Bash(ping *)
---
```

# Audio Quality Investigation

Investigate voice quality issues using ALL available tools.
$ARGUMENTS can be: phone number(s), agent ID(s), or "all" for a general sweep.

## Available Tools

### 1. Homer RTCP Analysis (PostgreSQL via Docker)

Query RTCP data from Homer to check packet loss and jitter.

```bash
# Connect to Homer DB
docker exec -i postgres psql -U homer -d homer_data

# Find RTCP table names (6-hour partitions, UTC time)
docker exec -i postgres psql -U homer -d homer_data -c "\dt hep_proto_5_default_*" | tail -20

# Query RTCP from a specific source IP
docker exec -i postgres psql -U homer -d homer_data -c "
SELECT
  create_date,
  protocol_header->>'srcIp' as src,
  protocol_header->>'dstIp' as dst,
  (raw::jsonb->'sender_information'->>'packets')::bigint as pkts,
  (raw::jsonb->'report_blocks'->0->>'fraction_lost')::bigint as frac_lost,
  (raw::jsonb->'report_blocks'->0->>'ia_jitter')::bigint as jitter,
  (raw::jsonb->'report_blocks'->0->>'packets_lost')::bigint as lost
FROM hep_proto_5_default_YYYYMMDD_HHMM
WHERE protocol_header->>'srcIp' LIKE 'IP_PATTERN%'
  AND create_date > NOW() - INTERVAL '2 hours'
ORDER BY create_date DESC LIMIT 50;
"
```

CRITICAL: Table partitions are by UTC time. If your VPS is in CET (UTC+1), and it is 14:00 CET, that is 13:00 UTC, so use the table *_1200 (covers 12:00-18:00 UTC).

Interpreting RTCP values: `frac_lost` is an 8-bit fixed-point fraction (divide by 256, so 13 ≈ 5% loss), `ia_jitter` is reported in RTP timestamp units (divide by 8 for milliseconds with 8 kHz codecs), and `lost` is the cumulative packet count. Sustained loss above roughly 1% or jitter above roughly 30 ms is usually audible.

### 2. Audio Analysis Service (FastAPI)

If you have a neural audio quality service running:

```bash
# Analyze a specific recording (by uniqueid)
curl -s "http://localhost:8084/analyze?uniqueid=UNIQUEID&server=SERVERNAME" | jq .

# AI-powered analysis (uses an LLM to interpret scores)
curl -s "http://localhost:8084/ai-analyze?uniqueid=UNIQUEID&server=SERVERNAME" | jq .
```

### 3. Asterisk Logs (on production servers via SSH)

```bash
# Check for codec issues
ssh your-server "grep 'Unknown RTP codec' /var/log/asterisk/messages | tail -20"

# Check for RTP source switching (NAT issues)
ssh your-server "grep 'Strict RTP' /var/log/asterisk/messages | tail -20"

# Check for jitter buffer resyncs
ssh your-server "grep 'Resyncing the jb' /var/log/asterisk/messages | tail -20"
```

### 4. SIP Peer Quality (live agent quality)

```bash
# Check agent SIP registration quality
ssh your-server "asterisk -rx 'sip show peer AGENT_EXT'"
# Look for: Status (latency), Useragent (softphone version), codecs

# Live RTP stats for all active channels
ssh your-server "asterisk -rx 'sip show channelstats'"
# Shows: Recv/Sent packets, Lost packets, Jitter, RTT per channel
```

### 5. Codec Verification

```bash
# Check what codecs an agent negotiated
ssh your-server "asterisk -rx 'core show channel SIP/AGENT-CHANNELID'"
# Look for: NativeFormats, ReadFormat, WriteFormat
# If Read != Write, there is transcoding (quality loss)

# Check trunk codec config
ssh your-server "grep -A5 'TRUNK_NAME' /etc/asterisk/sip-vicidial.conf"
```

### 6. Network Quality (Smokeping + Ping)

```bash
# Direct ping test
ping -c 10 TARGET_IP

# Check UDP buffer overflows (on production server)
ssh your-server "cat /proc/net/snmp | grep Udp"
# RcvbufErrors > 0 = packets dropped due to small UDP buffers
```
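To reduce that output to a single number you can act on, a small awk helper (a sketch; the function name is illustrative) pairs the `Udp:` header line with its value line:

```shell
# Sketch: extract RcvbufErrors by pairing the Udp: header line with the
# Udp: value line (works on `cat /proc/net/snmp` output piped over SSH)
udp_rcvbuf_errors() {
  awk '/^Udp:/ && !done { for (i = 1; i <= NF; i++) name[i] = $i
                          getline
                          for (i = 1; i <= NF; i++) val[name[i]] = $i
                          done = 1 }
       END { print val["RcvbufErrors"] }'
}

# Demo with sample output; in practice:
#   ssh your-server "cat /proc/net/snmp" | udp_rcvbuf_errors
printf 'Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors\nUdp: 1000 2 0 900 5 0\n' | udp_rcvbuf_errors   # -> 5
```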

## Investigation Workflow

  1. Find the calls: Query closer_log or call_log by phone number
  2. Identify endpoints: Agent ID -> SIP peer -> agent IP. Trunk -> trunk IP
  3. Check Homer RTCP: Query for both directions (trunk->server, server->agent)
  4. Check Asterisk logs: Codec errors, RTP switching, jitter resyncs
  5. Check live SIP quality: sip show peer, sip show channelstats
  6. Listen to recording: Download and analyze via audio analysis service
  7. Check network: Smokeping, ping, UDP buffers

## Common Root Causes (from real investigations)


### Example 4: `/trunk-status` -- SIP Trunk Status

```yaml
---
name: trunk-status
description: Check SIP trunk status across all VoIP servers. Shows registration state, latency, active calls per trunk. Use when calls fail to connect, trunks go UNREACHABLE, or provider issues suspected.
user-invocable: true
allowed-tools: Bash(ssh *)
---
```

# SIP Trunk Status Check

Check all SIP trunks across production servers.
$ARGUMENTS: server name or "all".

## Per Server Check

```bash
# Show all SIP peers with status
ssh your-server "asterisk -rx 'sip show peers'"

# Show only trunks (filter out agent extensions)
ssh your-server "asterisk -rx 'sip show peers' | grep -E 'trunk_name|UNREACHABLE'"

# Detailed info for a specific trunk
ssh your-server "asterisk -rx 'sip show peer TRUNKNAME'"
```

## Trunk Inventory by Server

Maintain a table mapping trunks to providers and purposes:

| Server | Trunk | Provider IP | Purpose |
|---|---|---|---|
| server-a | provider1_de | YOUR_PROVIDER_IP | Primary inbound |
| server-a | provider1_uk | YOUR_PROVIDER_IP | UK outbound |
| server-a | provider2 | YOUR_PROVIDER_IP | Inbound |
| server-b | provider3 | YOUR_PROVIDER_IP | Regional inbound |
| server-c | provider1 | YOUR_PROVIDER_IP | General |

## Quick All-Server Trunk Check

```bash
for srv in server-a server-b server-c server-d; do
  echo "=== $srv ==="
  ssh $srv "asterisk -rx 'sip show peers' | grep -cE 'OK|UNREACHABLE|UNKNOWN'" 2>/dev/null
  ssh $srv "asterisk -rx 'sip show peers' | grep -E 'UNREACHABLE|UNKNOWN'" 2>/dev/null
  echo ""
done
```

## Troubleshooting UNREACHABLE Trunks

1. Ping the provider IP: `ssh your-server "ping -c 3 PROVIDER_IP"`

2. Check the firewall (must be whitelisted if the final rule is DROP): `ssh your-server "iptables -S INPUT | grep PROVIDER_IP"`

3. Check SIP registration: `ssh your-server "asterisk -rx 'sip show registry'"`

4. Check if the provider changed IP: `ssh your-server "dig SIP_HOSTNAME"`

5. Test SIP OPTIONS: `ssh your-server "asterisk -rx 'sip qualify peer TRUNKNAME'"`

6. Check the carrier log for failures:

   ```sql
   SELECT call_date, dialstatus, hangup_cause, sip_hangup_cause
   FROM vicidial_carrier_log
   WHERE channel LIKE '%TRUNKNAME%'
   ORDER BY call_date DESC LIMIT 10;
   ```

---

## 9. Production Safety Hook

This is the most important file in your setup. The safety hook runs **before every Bash command** Claude executes, and blocks dangerous operations on production servers.

### Why You Need This

AI assistants are powerful but imperfect. Without guardrails:
- A "cleanup" task might `rm -rf` a critical directory
- A SQL query might accidentally `DROP TABLE` instead of `SELECT`
- A config edit might break live call routing
- An Asterisk restart might drop 50 active calls

### The Hook: `~/.claude/hooks/protect-production.sh`

```bash
#!/bin/bash
# Hook: Block dangerous operations on production servers
# Exit 0 = allow, Exit 2 = block (with stderr message)

INPUT=$(cat)
COMMAND=$(echo "$INPUT" | jq -r '.tool_input.command // empty')

# Production server IPs and SSH config names
PROD_SERVERS="YOUR_SERVER_IP_1|YOUR_SERVER_IP_2|YOUR_SERVER_IP_3|server-a|server-b|server-c"

# Check if command targets a production server
targets_prod() {
  echo "$COMMAND" | grep -qE "$PROD_SERVERS"
}

# Block dangerous patterns on production
if targets_prod; then
  # Block rm -rf on remote servers
  if echo "$COMMAND" | grep -qE 'rm\s+(-rf|-fr)\s+/'; then
    echo "BLOCKED: rm -rf on production server. This could delete critical data." >&2
    exit 2
  fi

  # Block DROP/TRUNCATE on production databases
  if echo "$COMMAND" | grep -qiE '(DROP\s+TABLE|TRUNCATE\s+TABLE|DROP\s+DATABASE|DELETE\s+FROM\s+vicidial)'; then
    echo "BLOCKED: Destructive SQL on production. Use SELECT first, then ask for explicit approval." >&2
    exit 2
  fi

  # Block modifying Asterisk config files
  if echo "$COMMAND" | grep -qE '(sed|awk|tee|>|>>)\s.*(extensions\.conf|extensions-vicidial\.conf|sip-vicidial\.conf|customexte\.conf|sip\.conf)'; then
    echo "BLOCKED: Modifying Asterisk config on production. These affect live calls. Ask for explicit approval first." >&2
    exit 2
  fi

  # Block systemctl stop/restart asterisk without approval
  if echo "$COMMAND" | grep -qE 'systemctl\s+(stop|restart)\s+asterisk'; then
    echo "BLOCKED: Stopping/restarting Asterisk on production will drop all active calls." >&2
    exit 2
  fi
fi

# Allow everything else
exit 0
```

Make it executable:

```bash
chmod +x ~/.claude/hooks/protect-production.sh
```

### How It Works

The hook receives the command as JSON on stdin. It extracts the Bash command, checks if it targets any production server (by IP or SSH config name), and blocks four categories of dangerous operations:
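Before trusting the hook, it helps to exercise its grep patterns against sample commands. A simplified sketch (pattern matching only, without the JSON extraction or the production-server check):

```shell
# Sketch: exercise a few of the hook's blocking patterns against sample
# commands (simplified; the real hook also checks the target server)
is_blocked() {
  echo "$1" | grep -qiE 'rm\s+(-rf|-fr)\s+/|DROP\s+TABLE|TRUNCATE\s+TABLE|systemctl\s+(stop|restart)\s+asterisk'
}

for cmd in \
  'ssh server-a "uptime"' \
  'ssh server-a "systemctl restart asterisk"' \
  'ssh server-b "rm -rf /var/spool/asterisk"'; do
  if is_blocked "$cmd"; then echo "BLOCK: $cmd"; else echo "allow: $cmd"; fi
done
```

For an end-to-end test, pipe a full PreToolUse payload such as `{"tool_input":{"command":"..."}}` into the hook script itself and check its exit code.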

| Pattern | What It Blocks | Why |
|---|---|---|
| `rm -rf /` on prod | Recursive file deletion | Could delete recordings, configs, databases |
| `DROP TABLE`, `TRUNCATE`, `DELETE FROM vicidial*` | Destructive SQL | Call logs, agent data, routing config |
| `sed`/`awk`/`tee`/`>` on Asterisk `.conf` files | Config file modification | Changes affect live call routing |
| `systemctl stop/restart asterisk` | Asterisk service control | Drops all active calls immediately |

Exit codes:

- Exit 0 -- allow the command to run.
- Exit 2 -- block the command; the stderr message is surfaced to Claude as the reason for the block.

### Registering the Hook

The hook is registered in ~/.claude/settings.json under the hooks section (see Settings Configuration below).

### Real Incident That Prompted This

On a production system, an AI assistant replaced a ring group extension in customexte.conf with a Hangup() command as part of a "cleanup." This caused all after-hours and no-agent calls to be silently dropped instead of ringing backup phones. At least 11 calls were lost overnight before the issue was noticed. The safety hook prevents this class of error entirely.


## 10. MCP Grafana Integration

The Model Context Protocol (MCP) lets Claude Code interact directly with Grafana for dashboards, Prometheus queries, and Loki log searches -- without needing to SSH anywhere.

### Setup

Install the Grafana MCP server:

```bash
pip install mcp-grafana
# or via uvx:
uvx mcp-grafana
```

Create a Grafana API service account and token:

  1. In Grafana, go to Administration > Service Accounts
  2. Create a new service account with Viewer role
  3. Generate a token

Configure the MCP server. Add to your project's .mcp.json or configure in Claude Code settings:

```json
{
  "mcpServers": {
    "grafana": {
      "command": "uvx",
      "args": ["mcp-grafana"],
      "env": {
        "GRAFANA_URL": "http://localhost:3000",
        "GRAFANA_API_KEY": "YOUR_GRAFANA_SERVICE_ACCOUNT_TOKEN"
      }
    }
  }
}
```

### Available MCP Tools

Once configured, Claude Code gains these tools:

| Tool | Purpose |
|---|---|
| mcp__grafana__search_dashboards | Find dashboards by name |
| mcp__grafana__get_dashboard_by_uid | Get full dashboard JSON |
| mcp__grafana__get_dashboard_panel_queries | Extract panel queries |
| mcp__grafana__query_prometheus | Execute PromQL queries directly |
| mcp__grafana__query_loki_logs | Search Loki logs |
| mcp__grafana__list_loki_label_names | Browse Loki label taxonomy |
| mcp__grafana__list_loki_label_values | Get values for a label |
| mcp__grafana__list_datasources | List all configured datasources |
| mcp__grafana__list_prometheus_metric_names | Browse available metrics |
| mcp__grafana__list_prometheus_label_values | Query Prometheus labels |
| mcp__grafana__create_annotation | Add annotations to dashboards |
| mcp__grafana__get_panel_image | Render panel as image |

### How Skills Use MCP

Your skills can reference MCP tools for deeper investigation. For example, in a /network-check skill, after checking Homer RTCP directly, you might also query Prometheus for historical latency data:

```
# In the skill body, you can note:
If Prometheus node_exporter is available, also check:
- mcp__grafana__query_prometheus with query: rate(node_network_receive_drop_total[5m])
- mcp__grafana__query_prometheus with query: node_network_mtu_bytes
```

The MCP tools are available alongside Bash/SSH tools, giving skills access to both real-time (SSH to servers) and historical (Prometheus/Loki time series) data.

### Example: Datasource Configuration

A typical Grafana setup for VoIP monitoring includes:

| Datasource | Type | Purpose |
|---|---|---|
| Prometheus | prometheus | Server metrics (CPU, memory, disk, network) |
| Loki | loki | Aggregated log streams from all servers |
| Homer | postgres | SIP/RTCP data for call quality analysis |
| ViciDial | mysql | Direct database queries for call center data |

## 11. Settings Configuration

### `~/.claude/settings.json`

This is the global settings file. It controls permissions, hooks, environment variables, and plugins.

```json
{
  "permissions": {
    "allow": [
      "Bash(*)",
      "Read(*)",
      "Write(*)",
      "Edit(*)"
    ],
    "deny": [
      "Bash(rm -rf /*)",
      "Bash(dd if=*)",
      "Bash(mkfs*)"
    ]
  },
  "effortLevel": "high",
  "env": {
    "CLAUDE_AUTOCOMPACT_PCT_OVERRIDE": "70",
    "PATH": "/root/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
  },
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "/root/.claude/hooks/protect-production.sh"
          }
        ]
      }
    ],
    "Notification": [
      {
        "matcher": "",
        "hooks": [
          {
            "type": "command",
            "command": "echo '\\a'"
          }
        ]
      }
    ]
  }
}
```

### Key Settings Explained

`effortLevel: "high"` -- Tells Claude to be thorough. For infrastructure operations, you want Claude to check multiple sources, correlate data, and provide detailed analysis rather than quick surface-level answers.

`CLAUDE_AUTOCOMPACT_PCT_OVERRIDE: "70"` -- Auto-compact context at 70% of the context window. Infrastructure investigations can generate a lot of output (SQL results, log excerpts, RTCP data). Setting this to 70% gives Claude room to work without losing important context from earlier in the conversation.

`hooks.PreToolUse` -- The safety hook runs before every Bash command. The `matcher: "Bash"` means it only triggers for Bash tool calls (not Read, Write, or Edit).

`hooks.Notification` -- Terminal bell when Claude needs attention. Useful when running long investigations -- you can switch to another terminal and get notified when Claude has results.

`permissions.deny` -- Hard blocks on truly catastrophic commands. These cannot be overridden by skills or conversation.

~/.claude/settings.local.json

This file stores per-machine permission overrides. As you use Claude Code and approve tool permissions, they accumulate here. For a VoIP operations setup, you will typically see patterns like:

{
  "permissions": {
    "allow": [
      "Bash(ssh:*)",
      "Bash(docker exec:*)",
      "Bash(docker ps:*)",
      "Bash(curl:*)",
      "Bash(ping:*)",
      "Bash(python3:*)",
      "mcp__grafana__list_datasources",
      "mcp__grafana__query_prometheus",
      "mcp__grafana__query_loki_logs",
      "mcp__grafana__search_dashboards",
      "Skill(health)",
      "Skill(call-investigate)",
      "Skill(listen-recording)"
    ]
  }
}

Tip: Review this file periodically. Remove permissions you no longer need. The principle of least privilege applies to AI assistants too.


12. Skill Design Patterns & Tips

Pattern 1: Minimize SSH Round-Trips

Bad: Five separate SSH commands to one server.

# Bad - 5 round trips
ssh your-server "hostname"
ssh your-server "uptime"
ssh your-server "df -h /"
ssh your-server "asterisk -rx 'core show channels'"
ssh your-server "fail2ban-client status"

Good: One SSH command that gathers everything.

# Good - 1 round trip
ssh your-server "hostname; uptime; df -h / | tail -1; asterisk -rx 'core show channels' | tail -1; fail2ban-client status 2>/dev/null | head -2"

When checking multiple servers, run the SSH commands in parallel (Claude Code can make multiple Bash calls simultaneously).
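The same fan-out can be done in plain shell. A minimal sketch -- the server names and the probe command are illustrative placeholders from this tutorial:

```shell
# check_all runs the same combined probe on every host concurrently,
# then waits for all sessions; one slow server no longer serializes the rest.
check_all() {
  for host in "$@"; do
    ssh "$host" "hostname; uptime; df -h / | tail -1" \
      > "/tmp/health-$host.txt" 2>&1 &
  done
  wait  # results land in /tmp/health-<host>.txt
}
```

Usage: `check_all server-a server-b server-c` gathers all three reports in roughly the time of the slowest single SSH round-trip.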

Pattern 2: Conditional Arguments with $ARGUMENTS

Skills should handle both specific and broad requests:

If $ARGUMENTS is provided (e.g., "server-a" or "agent123"), check only
that target. Otherwise check all servers/agents.

This makes skills flexible -- /health checks everything, /health server-a checks one server.
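A minimal sketch of how that instruction plays out in the command Claude generates -- here `$ARGUMENTS` is modeled as an optional first argument, and the server list is illustrative:

```shell
# pick_targets expands an empty argument to the full server list,
# or passes a specific target (server name, agent ID) through unchanged.
pick_targets() {
  target="${1:-all}"            # empty $ARGUMENTS -> check everything
  if [ "$target" = "all" ]; then
    echo "server-a server-b server-c"
  else
    echo "$target"              # a specific server or agent ID
  fi
}
```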

Pattern 3: Embed Interpretation Tables

Do not just show raw data. Include reference tables so Claude can interpret results:

**Key hangup causes:**
- 16 = Normal clearing (good)
- 17 = User busy
- 34 = No circuit available (trunk congestion)
- 38 = Network out of order
**NISQA Scores (1-5 scale):**
| Score | Quality |
|-------|---------|
| 4.0+  | Excellent |
| 3.5-4.0 | Good |
| 3.0-3.5 | Fair |
| < 3.0 | Poor |

Pattern 4: Include Server-Specific Variations

Real infrastructure is messy. Servers run different versions, use different credentials, have different paths:

## MySQL Access Per Server
- **server-a/server-b**: `ssh server-name "mysql asterisk -e '...'"`
  (root, no password needed via SSH)
- **server-c (older)**: `ssh server-c "mysql -u YOUR_USER -pYOUR_PASS asterisk -e '...'"`
- **replica**: `ssh replica "mysql -u YOUR_USER -pYOUR_PASS dbname -e '...'"`

## Server-Specific Notes
- **server-a/server-b**: ConfBridge, newer Asterisk, ORIG recordings available
- **server-c**: MeetMe (requires DAHDI), Asterisk 11, older OS

Pattern 5: Severity Flags

Define clear severity levels in investigation skills:

Flag any issues:
- Disk > 80% = WARNING
- Disk > 95% = CRITICAL
- fail2ban not running = CRITICAL
- Replication lag > 60s = WARNING
- Replication lag > 300s = CRITICAL
- SIP peer UNREACHABLE = CRITICAL

Claude will use these to prioritize its output, putting critical items at the top.
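The same thresholds can be encoded directly in a skill's gathering script, so severity is computed at the source. A sketch for the disk thresholds above; the function name is illustrative:

```shell
# disk_severity maps a used-disk percentage to the flags defined above:
# > 95% = CRITICAL, > 80% = WARNING, otherwise OK.
disk_severity() {
  pct="$1"                      # used disk percentage, e.g. 83
  if [ "$pct" -gt 95 ]; then echo "CRITICAL: disk at ${pct}%"
  elif [ "$pct" -gt 80 ]; then echo "WARNING: disk at ${pct}%"
  else echo "OK: disk at ${pct}%"
  fi
}
```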

Pattern 6: Cross-Skill References

Skills can suggest other skills for deeper investigation:

| LAGGED | Agent lagged out | Network issue -- use /lagged skill |
If audio quality issues are found, suggest using /listen-recording
for detailed NISQA analysis.

Pattern 7: Include Both URL and SQL Approaches

For report-type skills, provide both the web URL (for sharing with non-technical users) and the direct SQL (for programmatic access):

### Agent Performance
URL: http://YOUR_SERVER_IP/vicidial/AST_agent_performance_detail.php?query_date=...

### Direct SQL equivalent:
```sql
SELECT user, COUNT(*) as calls, SUM(length_in_sec) as total_talk
FROM vicidial_closer_log WHERE call_date >= CURDATE()
GROUP BY user ORDER BY calls DESC;
```

Pattern 8: Cleanup After Investigation

Skills that download files should include cleanup instructions:

## Cleanup
```bash
rm -f /tmp/call*.mp3 /tmp/call*.wav /tmp/trimmed.wav /tmp/spectrogram.png
```

Tip: Test Skills Incrementally

When building a new skill:

  1. Start with the frontmatter and one step
  2. Test it with /your-skill
  3. Add more steps, testing after each
  4. Add interpretation tables last
  5. Add server-specific notes as you discover edge cases

Tip: Keep Skills Focused

Each skill should do one thing well. If a skill is growing beyond ~150 lines, consider splitting it. The /audio-quality skill is one of the largest at ~140 lines because audio investigation genuinely requires that many tools. But /health is only ~40 lines because it has a focused purpose.

Tip: Use Allowed-Tools as Documentation

The allowed-tools field is not just a security boundary -- it tells you (and future maintainers) what resources a skill needs:

# This skill only needs SSH -- it is a pure remote investigation
allowed-tools: Bash(ssh *)

# This skill needs SSH + Docker + HTTP -- it correlates multiple data sources
allowed-tools: Bash(ssh *), Bash(docker *), Bash(curl *), Bash(ping *)

# This skill needs SSH + local audio tools -- it processes files locally
allowed-tools: Bash(ssh *), Bash(curl *), Bash(sox *), Bash(soxi *), Bash(ffprobe *)

13. Investigation Workflow Patterns

The Funnel Pattern

Start broad, narrow down. Most investigation skills follow this pattern:

Step 1: Find the records (broad SQL query)
    |
Step 2: Get SIP/carrier-level detail (narrower)
    |
Step 3: Check infrastructure state (Asterisk logs, SIP peers)
    |
Step 4: Check external data (Homer, Smokeping, recordings)
    |
Step 5: Correlate and diagnose

The Correlation Pattern

The real power of AI-assisted investigation is correlation. A human checking Homer RTCP, then Asterisk logs, then ViciDial tables has to hold all that context in their head. Claude does this naturally:

Agent LAGGED at 14:32:15
  + Homer RTCP shows jitter spike at 14:32:10 from agent IP
  + 3 other agents on same IP also LAGGED within 30 seconds
  = Diagnosis: Office internet dropout

Design skills to gather correlated data points and let Claude connect the dots.

The Baseline Pattern

For drop/failure analysis, always compare against historical baselines:

-- Is today's drop rate abnormal?
SELECT DATE(call_date) as day,
       COUNT(*) as total,
       SUM(CASE WHEN status='DROP' THEN 1 ELSE 0 END) as drops,
       ROUND(SUM(CASE WHEN status='DROP' THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) as pct
FROM vicidial_closer_log
WHERE call_date >= DATE_SUB(CURDATE(), INTERVAL 7 DAY)
GROUP BY day ORDER BY day;

Without a baseline, "50 drops today" means nothing. With a baseline, "50 drops today vs. average 12" is a clear alarm.
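The comparison itself can be automated. A sketch, assuming the SQL output above is fed in as whitespace-separated `day total drops` rows with today last; the 2x-average alert threshold is an illustrative choice:

```shell
# drop_baseline reads "day total drops" rows on stdin (today last) and
# flags today's drop count against the average of the preceding days.
drop_baseline() {
  awk '{ drops[NR] = $3 }
       END {
         today = drops[NR]
         for (i = 1; i < NR; i++) sum += drops[i]
         avg = sum / (NR - 1)
         printf "today=%d avg=%.1f %s\n", today, avg,
                (today > 2 * avg) ? "ALERT" : "ok"
       }'
}
```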


14. Permission Management

Three Layers of Permission Control

  1. settings.json deny list -- Hard blocks. Cannot be overridden.
  2. Safety hook -- Context-aware blocks (only on production servers).
  3. Skill allowed-tools -- Per-skill tool whitelist.

Permission Flow

User types /health
  |
  v
Claude reads SKILL.md
  -> allowed-tools: Bash(ssh *)
  |
  v
Claude generates: ssh server-a "hostname; uptime; ..."
  |
  v
PreToolUse hook runs (protect-production.sh)
  -> Command targets production? Yes
  -> Is it dangerous? (rm -rf, DROP TABLE, etc.)
    -> No: exit 0 (allow)
  |
  v
settings.json permissions check
  -> Bash(ssh *) matches allow list? Yes
  |
  v
Command executes

When to Allow vs. Deny

Allow broadly in settings.json for your normal workflow tools:

"allow": ["Bash(*)", "Read(*)", "Write(*)", "Edit(*)"]

Deny specifically for catastrophic operations:

"deny": ["Bash(rm -rf /*)", "Bash(dd if=*)", "Bash(mkfs*)"]

Use hooks for context-dependent blocking (same command might be fine on a test server but dangerous on production).
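A minimal sketch of such a context-aware guard -- this is not the tutorial's full protect-production.sh, and the string matching is deliberately simplistic. Claude Code passes the tool call as JSON on stdin; a PreToolUse hook that exits 0 allows the call, while exit code 2 blocks it and feeds the stderr message back to Claude:

```shell
# guard reads the tool-call payload on stdin and mirrors the hook contract:
# return 0 = allow, return 2 = block (with a reason on stderr).
guard() {
  input=$(cat)
  case "$input" in
    *production*)                      # context: command touches production
      case "$input" in
        *'rm -rf'*|*'DROP TABLE'*|*mkfs*)
          echo "Blocked: destructive command aimed at production" >&2
          return 2 ;;
      esac ;;
  esac
  return 0                             # same command is fine elsewhere
}
```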

Use allowed-tools in skills to prevent scope creep (a health check skill should not need to write files).


15. Putting It All Together

Quick Start: Build Your First Three Skills

If you are starting from scratch, build these three skills first:

  1. /health -- Gives you immediate value. One command to check all servers.
  2. /calls -- Real-time situational awareness.
  3. /call-investigate -- The workhorse for any incident.

The Feedback Loop

The best skills come from real incidents. Every time you investigate a problem manually, ask yourself which commands you ran, which data sources you checked, and how you correlated the results into a diagnosis. Write that down as a SKILL.md. Next time, you type a slash command instead of spending 30 minutes.

Scaling to New Infrastructure

The skill pattern works for any infrastructure that Claude Code can reach via SSH, Docker, HTTP, or MCP.

The framework is the same: a SKILL.md file that teaches Claude your operational procedures, reference data, and interpretation rules. The tools change; the pattern does not.


Appendix A: All 15 Skills at a Glance

| # | Skill | Type | Tools | Key Data Sources |
|---|-------|------|-------|------------------|
| 1 | /health | Operations | SSH | Asterisk, MySQL, disk, fail2ban |
| 2 | /calls | Operations | SSH | Asterisk channels, live_agents, auto_calls |
| 3 | /agents | Operations | SSH | vicidial_live_agents, vicidial_users |
| 4 | /replication | Operations | SSH | SHOW ALL SLAVES STATUS |
| 5 | /audit-server | Operations | SSH | System, Asterisk, DB, security, logs |
| 6 | /trunk-status | Operations | SSH | SIP peers, carrier_log, iptables |
| 7 | /audio-quality | Investigation | SSH, Docker, curl, ping | Homer RTCP, NISQA, Asterisk logs, codecs |
| 8 | /call-investigate | Investigation | SSH, Docker, curl | closer_log, carrier_log, DIDs, Homer, recordings |
| 9 | /call-drops | Investigation | SSH, Docker | Problem statuses, carrier detail, baselines |
| 10 | /lagged | Investigation | SSH, Docker | agent_log, Homer RTCP, SIP peers |
| 11 | /network-check | Investigation | SSH, Docker, curl, ping | Homer RTCP, Smokeping, UDP buffers, MTR |
| 12 | /agent-ranks | Lookup | SSH | inbound_group_agents, routing config |
| 13 | /did-lookup | Lookup | SSH | inbound_dids, company mapping, dialplan |
| 14 | /reports | Lookup | SSH, curl | ViciDial PHP reports + direct SQL |
| 15 | /listen-recording | Lookup | SSH, curl, sox, ffprobe | recording_log, NISQA, Silero VAD, SoX |

Appendix B: Common ViciDial Status Codes

For reference in your skills:

Call Disposition Statuses

| Status | Meaning |
|--------|---------|
| SALE | Successful sale/conversion |
| NI | Not interested |
| A | Answering machine |
| CALLBK | Callback scheduled |
| DNC | Do not call |
| XFER | Transferred |
| DROP | Dropped from queue (no agent) |
| DISMX | Disconnect mid-call (inbound) |
| DCMX | Disconnect mid-call (outbound) |
| TIMEOT | Agent timeout |
| ADCT | Auto-disconnect (dead channel) |
| AFTHRS | After hours |
| NANQUE | No agent, no queue |
| HXFER | Hangup during transfer |
| XDROP | External drop |
| LAGGED | Agent heartbeat failure |

SIP Hangup Causes

| Code | Meaning |
|------|---------|
| 16 | Normal clearing |
| 17 | User busy |
| 18 | No user responding |
| 20 | Subscriber absent |
| 21 | Call rejected |
| 31 | Normal, unspecified |
| 34 | No circuit available |
| 38 | Network out of order |
| 127 | Internal error |

RTCP Quality Thresholds

| Metric | Good | Warning | Critical |
|--------|------|---------|----------|
| Packet loss (fraction_lost) | 0 | >5 (of 255) | >25 |
| Jitter | <20ms | >50ms | >100ms |
| Latency (RTT) | <100ms | >200ms | >300ms |
| UDP RcvbufErrors | 0 | >0 | Increasing |

NISQA Audio Quality Scores

| Score | Quality |
|-------|---------|
| 4.0+ | Excellent |
| 3.5-4.0 | Good |
| 3.0-3.5 | Fair |
| 2.5-3.0 | Poor |
| < 2.5 | Bad |

Appendix C: SSH Configuration for Multi-Server Access

For skills to work efficiently, configure SSH with ControlMaster for persistent connections:

# ~/.ssh/config

Host server-a
    HostName YOUR_SERVER_IP
    Port 9322
    User root
    IdentityFile ~/.ssh/id_ed25519
    ControlMaster auto
    ControlPath ~/.ssh/sockets/%r@%h-%p
    ControlPersist 600

Host server-b
    HostName YOUR_SERVER_IP
    Port 9322
    User root
    IdentityFile ~/.ssh/id_ed25519
    ControlMaster auto
    ControlPath ~/.ssh/sockets/%r@%h-%p
    ControlPersist 600

# Repeat for each server...

Host replica
    HostName YOUR_REPLICA_IP
    Port 9322
    User root
    IdentityFile ~/.ssh/id_ed25519
    ControlMaster auto
    ControlPath ~/.ssh/sockets/%r@%h-%p
    ControlPersist 600

Create the sockets directory:

mkdir -p ~/.ssh/sockets

ControlPersist 600 keeps SSH connections open for 10 minutes after the last use. This means the first /health check opens connections to all servers, and subsequent skill invocations reuse them -- making everything feel instant.
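You can inspect the shared connection with OpenSSH's standard control commands. A sketch -- the helper name is illustrative:

```shell
# master_status asks the ControlMaster socket for status without opening a
# new session; it prints "Master running (pid=...)" when the socket is live.
master_status() {
  ssh -O check "$1" 2>&1
}
# To force a fresh connection before ControlPersist expires:
#   ssh -O exit server-a
```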


This tutorial documents a production system managing 5 VoIP servers, 1,500+ DIDs, 50+ agents, and thousands of daily calls. The skills were developed iteratively over weeks of real operations, each one born from an actual incident or repeated manual investigation. The framework scales to any infrastructure that Claude Code can reach.
