
MariaDB Multi-Source Replication for VoIP Call Centers



3 Masters to 1 Read-Only Replica with Database Renaming


Table of Contents

  1. Introduction
  2. Architecture Overview
  3. Prerequisites
  4. Master-Side Configuration
  5. Replica-Side Configuration
  6. Multi-Source Filter Configuration
  7. Initial Data Load
  8. Starting Replication
  9. Health Monitoring
  10. Cron Setup
  11. Firewall and Security
  12. Parallel Replication Tuning
  13. Handling Schema Differences
  14. Error Skipping Strategy
  15. Troubleshooting
  16. Maintenance Operations
  17. Performance Considerations

Introduction

VoIP call centers running ViciDial or similar Asterisk-based dialers present a unique database challenge: each production server maintains its own asterisk database containing call detail records (CDRs), agent performance logs, recording metadata, and campaign statistics. These databases are under constant heavy write pressure from live call traffic -- thousands of INSERT and UPDATE operations per minute during peak hours.

Running analytical queries directly on production is dangerous. A slow report query can lock MyISAM tables, block agent screen updates, and cause call handling delays. Multiply that risk across three or more production servers and you have a recipe for outages.

Multi-source replication solves this cleanly. By replicating all production databases into a single read-only replica, you get:

  - Complete isolation of analytical load from live call handling
  - A single server to point dashboards and report scripts at
  - Freedom to run long, heavy queries without risking a production table lock

MariaDB's multi-source replication (available since 10.0) is purpose-built for this fan-in topology. Each master connection is independently configured, monitored, and recoverable. Combined with replicate_rewrite_db, you can map every master's identically-named asterisk database to a unique name on the replica -- no schema changes needed on production.

This tutorial walks through the complete setup as deployed in a production VoIP environment: three ViciDial servers (running MariaDB 10.6 through 10.11) replicating to a Debian 12 replica running MariaDB 10.11. Every configuration file, monitoring script, and troubleshooting procedure comes from a working deployment handling tens of millions of replicated events.


Architecture Overview

 ┌───────────────────────┐    ┌───────────────────────┐    ┌───────────────────────┐
 │   MASTER 1 (master1)  │    │   MASTER 2 (master2)  │    │   MASTER 3 (master3)  │
 │   MariaDB 10.11       │    │   MariaDB 10.11       │    │   MariaDB 10.6        │
 │   server_id = 1       │    │   server_id = 2       │    │   server_id = 3       │
 │   DB: asterisk        │    │   DB: asterisk        │    │   DB: asterisk        │
 │   binlog: MIXED       │    │   binlog: MIXED       │    │   binlog: MIXED       │
 │                       │    │                       │    │                       │
 │  ViciDial + Asterisk  │    │  ViciDial + Asterisk  │    │  ViciDial + Asterisk  │
 │  (Live call traffic)  │    │  (Live call traffic)  │    │  (Live call traffic)  │
 └──────────┬────────────┘    └──────────┬────────────┘    └──────────┬────────────┘
            │ :3306                      │ :3306                      │ :3306
            │ binlog stream              │ binlog stream              │ binlog stream
            │ (compressed)               │ (compressed)               │ (compressed)
            │                            │                            │
            └────────────┐               │               ┌───────────┘
                         │               │               │
                         ▼               ▼               ▼
              ┌──────────────────────────────────────────────────┐
              │              REPLICA SERVER                      │
              │              MariaDB 10.11.14                    │
              │              Debian 12                           │
              │              server_id = 10                      │
              │              read_only = ON                      │
              │                                                  │
              │   Connection "master1":                          │
              │     asterisk  ──rewrite──▶  master1_asterisk     │
              │     IO thread + 4 parallel SQL threads           │
              │                                                  │
              │   Connection "master2":                          │
              │     asterisk  ──rewrite──▶  master2_asterisk     │
              │     IO thread + 4 parallel SQL threads           │
              │                                                  │
              │   Connection "master3":                          │
              │     asterisk  ──rewrite──▶  master3_asterisk     │
              │     IO thread + 4 parallel SQL threads           │
              │                                                  │
              │   No binary log (terminal slave)                 │
              │   log_slave_updates = OFF                        │
              │                                                  │
              │   Consumers:                                     │
              │     • Grafana dashboards                         │
              │     • Report scripts (PHP/Python)                │
              │     • Read-only user accounts                    │
              │                                                  │
              └──────────────────────────────────────────────────┘

Key design decisions:

| Decision | Rationale |
|----------|-----------|
| Position-based replication (not GTID) | Simpler to set up; GTID adds complexity with no benefit for a terminal read-only slave |
| log_slave_updates = OFF | Replica never needs to act as a master; disabling saves disk I/O |
| skip_slave_start = ON | Prevents broken connections from auto-starting after a crash; manual start allows inspection |
| slave_compressed_protocol = ON | Reduces bandwidth 60-70% for WAN replication between data centers |
| 4 parallel SQL threads, optimistic mode | Maximizes apply throughput on multi-core hardware without strict ordering overhead |
| slave_skip_errors for transient errors | ViciDial's MEMORY tables and non-transactional MyISAM writes produce harmless conflicts |
| 12 tables filtered per connection | Transient/MEMORY tables that reset on restart; replicating them wastes resources and causes errors |

Prerequisites

Software Versions

| Role | Version |
|------|---------|
| Masters 1-2 | MariaDB 10.11 |
| Master 3 | MariaDB 10.6 |
| Replica | MariaDB 10.11.14 on Debian 12 |

Version compatibility note: MariaDB handles cross-version replication well when the replica runs the same or newer version than the masters. Replicating from 10.6 masters to a 10.11 replica works without issues. Avoid replicating from a newer master to an older replica.
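
The "same or newer" rule can be checked mechanically with sort -V. A small sketch (the patch-level version strings below are illustrative): the replica is treated as compatible only if the master's version sorts first or the two are equal.

```shell
# Succeeds (exit 0) when the replica is the same version as the master
# or newer -- i.e. the master's version sorts first under sort -V.
replica_can_replicate_from() {
    master_ver="$1"; replica_ver="$2"
    oldest=$(printf '%s\n%s\n' "$master_ver" "$replica_ver" | sort -V | head -n1)
    [ "$oldest" = "$master_ver" ]
}

replica_can_replicate_from "10.6.22" "10.11.14" && echo "10.6 master -> 10.11 replica: supported"
replica_can_replicate_from "11.4.2"  "10.11.14" || echo "11.4 master -> 10.11 replica: avoid"
```

Collect the real version strings with mariadb -e "SELECT VERSION();" on each server before wiring anything up.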

Hardware Sizing (Replica)

For a call center replicating ~100 million events across 3 connections:

| Resource | Minimum | Recommended |
|----------|---------|-------------|
| CPU | 4 cores | 8 cores |
| RAM | 8 GB | 16 GB |
| Storage | 2x production data size | 3x production data size (SSD) |
| Network | 100 Mbps | 1 Gbps |

The replica needs enough RAM for MyISAM key buffers (ViciDial uses MyISAM extensively) and InnoDB buffer pool (for any InnoDB tables). Budget 2 GB for key_buffer_size and 1 GB for innodb_buffer_pool_size as starting points.

Network Requirements

  - TCP port 3306 must be reachable from the replica to every master (see Firewall and Security below)
  - A stable link between sites; slave_compressed_protocol cuts the binlog stream's bandwidth needs by roughly 60-70%
  - Latency is not critical -- replication is asynchronous and tolerates WAN round-trips

Master-Side Configuration

Each master server needs three things: binary logging enabled, a unique server ID, and a dedicated replication user.

1. Enable Binary Logging

Add or verify these settings in your master's MariaDB configuration (typically /etc/my.cnf or /etc/mysql/mariadb.conf.d/50-server.cnf):

[mysqld]
server-id              = 1                  # Unique per master: 1, 2, 3, etc.
log_bin                = mysql-bin
binlog_format          = MIXED              # MIXED is the ViciDial default
expire_logs_days       = 7                  # Keep binlogs for 7 days
log_bin_compress       = ON                 # Compress binlog events (10.2.3+)
log_bin_compress_min_len = 256              # Only compress events > 256 bytes

Important: Every server in the replication topology (masters and replica) must have a globally unique server_id. A common scheme:

| Server | server_id |
|--------|-----------|
| Master 1 | 1 |
| Master 2 | 2 |
| Master 3 | 3 |
| Replica | 10 |
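
A duplicate server_id makes replication misbehave in confusing ways, so it is worth verifying mechanically. A small sketch; the id list is what you would collect from each server with mariadb -e "SELECT @@server_id;":

```shell
# server_id values gathered from the three masters and the replica
ids="1 2 3 10"

# uniq -d prints only duplicated values; empty output means all unique
dupes=$(printf '%s\n' $ids | sort -n | uniq -d)
if [ -z "$dupes" ]; then
    echo "server_id values are unique"
else
    echo "DUPLICATE server_id found: $dupes" >&2
fi
```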

Restart MariaDB after changing the configuration:

systemctl restart mariadb

Verify binary logging is active:

SHOW MASTER STATUS\G

You should see output like:

File: mysql-bin.000042
Position: 12345678
Binlog_Do_DB:
Binlog_Ignore_DB:

2. Create the Replication User

On each master, create a dedicated user with only REPLICATION SLAVE privilege. Lock it down to the replica's IP address:

CREATE USER 'repl_replica'@'REPLICA_IP'
  IDENTIFIED BY 'REPL_PASSWORD';

GRANT REPLICATION SLAVE ON *.*
  TO 'repl_replica'@'REPLICA_IP';

FLUSH PRIVILEGES;

Security notes:

  - Grant only REPLICATION SLAVE; the replication user needs no other privilege
  - Restrict the account to the replica's IP address rather than a '%' wildcard host
  - Use a long, randomly generated password; the replica stores it on disk in its connection metadata

Verify the grant:

SHOW GRANTS FOR 'repl_replica'@'REPLICA_IP';

Expected output:

GRANT REPLICATION SLAVE ON *.* TO `repl_replica`@`REPLICA_IP`

3. Record Binary Log Position

Before setting up replication, note each master's current binary log file and position. You will need these values during the CHANGE MASTER step.

SHOW MASTER STATUS\G

Record the File and Position values for each master. If you are doing an initial data load with mysqldump --master-data, the dump file will contain the position automatically.


Replica-Side Configuration

The replica configuration is split across two files for clarity:

  - 50-server.cnf -- server identity, tuning, and global replication settings
  - 60-multisource.cnf -- the per-connection rewrite and ignore filters

Main Server Configuration: 50-server.cnf

Place this file at /etc/mysql/mariadb.conf.d/50-server.cnf (Debian/Ubuntu) or include it from your main my.cnf.

[mysqld]

# ─── Identity ───
server-id               = 10
pid-file                = /run/mysqld/mysqld.pid
basedir                 = /usr

# ─── Network ───
bind-address            = 0.0.0.0
skip-name-resolve
port                    = 3306
max_connections         = 200
connect_timeout         = 60
wait_timeout            = 28800
interactive_timeout     = 28800
max_allowed_packet      = 16M

# ─── Security ───
read_only               = ON

# ─── Character Set ───
character_set_server    = utf8mb4
collation_server        = utf8mb4_unicode_520_ci

# ─── MyISAM Tuning ───
# ViciDial uses MyISAM extensively. Size key_buffer to ~25% of RAM.
default_storage_engine  = MyISAM
key_buffer_size         = 2G
myisam_sort_buffer_size = 128M
myisam_recover_options  = BACKUP,FORCE
myisam_use_mmap         = 1

# ─── InnoDB (for any InnoDB tables that replicate) ───
innodb_buffer_pool_size = 1G
innodb_file_per_table   = ON
innodb_flush_method     = O_DIRECT
innodb_flush_log_at_trx_commit = 2
innodb_log_buffer_size  = 8M

# ─── Memory & Cache ───
table_open_cache        = 4096
table_definition_cache  = 16384
sort_buffer_size        = 4M
read_buffer_size        = 4M
read_rnd_buffer_size    = 16M
join_buffer_size        = 1M
thread_cache_size       = 50
max_heap_table_size     = 512M
tmp_table_size          = 256M
open_files_limit        = 24576
query_cache_size        = 0
query_cache_type        = 0

# ─── Replication — Relay Logs ───
relay_log               = /var/lib/mysql/relay-bin
relay_log_recovery      = ON
relay_log_purge         = ON
max_relay_log_size      = 256M

# ─── Replication — Parallel Apply ───
slave_parallel_threads  = 4
slave_parallel_mode     = optimistic

# ─── Replication — Network ───
slave_net_timeout       = 60
slave_compressed_protocol = ON

# ─── Replication — Safety ───
skip_slave_start        = ON
slave_skip_errors       = 1032,1062,1146,1269,1396,1932

# ─── Replication — GTID ───
gtid_strict_mode        = OFF

# ─── Replication — No binlog on replica (terminal slave) ───
log_slave_updates       = OFF

# ─── Binary Log Expiry ───
expire_logs_days        = 3

# ─── Slow Query Log ───
slow_query_log          = 1
slow_query_log_file     = /var/lib/mysql/replica-slow.log
long_query_time         = 5
log_slow_verbosity      = query_plan,explain

# ─── Query Timeout Safety Net ───
max_statement_time      = 300

# ─── Error Log ───
log_error               = /var/log/mysql/error.log
log_warnings            = 2

Key settings explained:

| Setting | Value | Purpose |
|---------|-------|---------|
| read_only | ON | Prevents accidental writes from non-SUPER users; safety net for a reporting replica |
| skip_slave_start | ON | Replication does not auto-start on server boot; lets you inspect state before starting |
| slave_compressed_protocol | ON | Compresses the replication stream; reduces WAN bandwidth by 60-70% |
| slave_parallel_threads | 4 | Pool of parallel SQL applier threads (shared across connections); increases apply throughput |
| slave_parallel_mode | optimistic | Threads apply in parallel, roll back on conflict; best throughput for MyISAM workloads |
| log_slave_updates | OFF | Do not write applied events to the local binlog; saves disk I/O on a terminal slave |
| relay_log_recovery | ON | Reprocess relay logs after a crash; prevents corruption from incomplete relay logs |
| max_statement_time | 300 | Kill queries running longer than 5 minutes; protects against runaway report queries |
| key_buffer_size | 2G | MyISAM index cache; must be large enough for ViciDial's index-heavy tables |
| query_cache_size | 0 | Disabled; the query cache causes mutex contention under replication writes |
| slave_skip_errors | 1032,1062,1146,... | Skip transient errors from MEMORY tables and MyISAM quirks |

Multi-Source Filter Configuration

Filter Configuration: 60-multisource.cnf

Place this file at /etc/mysql/mariadb.conf.d/60-multisource.cnf:

[mysqld]
# ─── Multi-Source Replication Filters ───
# Map each master's 'asterisk' database to a unique local database

master1.replicate_rewrite_db = asterisk->master1_asterisk
master2.replicate_rewrite_db = asterisk->master2_asterisk
master3.replicate_rewrite_db = asterisk->master3_asterisk

# Skip transient/MEMORY tables for all connections
master1.replicate_ignore_table = asterisk.vicidial_live_agents
master1.replicate_ignore_table = asterisk.live_sip_channels
master1.replicate_ignore_table = asterisk.live_channels
master1.replicate_ignore_table = asterisk.vicidial_auto_calls
master1.replicate_ignore_table = asterisk.server_updater
master1.replicate_ignore_table = asterisk.web_client_sessions
master1.replicate_ignore_table = asterisk.vicidial_hopper
master1.replicate_ignore_table = asterisk.vicidial_campaign_server_status
master1.replicate_ignore_table = asterisk.parked_channels
master1.replicate_ignore_table = asterisk.vicidial_manager
master1.replicate_ignore_table = asterisk.cid_channels_recent
master1.replicate_ignore_table = asterisk.sip_packets

master2.replicate_ignore_table = asterisk.vicidial_live_agents
master2.replicate_ignore_table = asterisk.live_sip_channels
master2.replicate_ignore_table = asterisk.live_channels
master2.replicate_ignore_table = asterisk.vicidial_auto_calls
master2.replicate_ignore_table = asterisk.server_updater
master2.replicate_ignore_table = asterisk.web_client_sessions
master2.replicate_ignore_table = asterisk.vicidial_hopper
master2.replicate_ignore_table = asterisk.vicidial_campaign_server_status
master2.replicate_ignore_table = asterisk.parked_channels
master2.replicate_ignore_table = asterisk.vicidial_manager
master2.replicate_ignore_table = asterisk.cid_channels_recent
master2.replicate_ignore_table = asterisk.sip_packets

master3.replicate_ignore_table = asterisk.vicidial_live_agents
master3.replicate_ignore_table = asterisk.live_sip_channels
master3.replicate_ignore_table = asterisk.live_channels
master3.replicate_ignore_table = asterisk.vicidial_auto_calls
master3.replicate_ignore_table = asterisk.server_updater
master3.replicate_ignore_table = asterisk.web_client_sessions
master3.replicate_ignore_table = asterisk.vicidial_hopper
master3.replicate_ignore_table = asterisk.vicidial_campaign_server_status
master3.replicate_ignore_table = asterisk.parked_channels
master3.replicate_ignore_table = asterisk.vicidial_manager
master3.replicate_ignore_table = asterisk.cid_channels_recent
master3.replicate_ignore_table = asterisk.sip_packets

Understanding the Filters

replicate_rewrite_db is the cornerstone of this setup. Every master has a database called asterisk. Without rewriting, all three masters would try to write into the same asterisk database on the replica, causing collisions. The rewrite rule transparently maps:

master1: asterisk  →  master1_asterisk
master2: asterisk  →  master2_asterisk
master3: asterisk  →  master3_asterisk

The connection name prefix (e.g., master1.) scopes each filter to its specific replication connection. This is the MariaDB multi-source syntax -- each connection has its own independent filter set.

replicate_ignore_table filters out tables that should never be replicated:

| Table | Type | Why Skip |
|-------|------|----------|
| vicidial_live_agents | MEMORY | Volatile agent state; rebuilt on Asterisk restart |
| live_sip_channels | MEMORY | Active SIP channel tracking; changes thousands of times per minute |
| live_channels | MEMORY | Active Asterisk channel list |
| vicidial_auto_calls | MEMORY | Predictive dialer's active call queue |
| server_updater | MyISAM | Server heartbeat timestamps; master-specific |
| web_client_sessions | MEMORY | Agent web UI sessions |
| vicidial_hopper | MyISAM | Dialer lead hopper; extremely high churn |
| vicidial_campaign_server_status | MEMORY | Campaign runtime stats |
| parked_channels | MEMORY | Parked call tracking |
| vicidial_manager | MEMORY | Asterisk AMI command queue |
| cid_channels_recent | MEMORY | Recent caller ID cache |
| sip_packets | MyISAM | SIP capture data; extremely high volume (can generate GB/day) |

Critical detail: The replicate_ignore_table filter uses the original database name (asterisk), not the rewritten name. MariaDB applies ignore filters before the rewrite. If you used master1_asterisk.vicidial_live_agents, the filter would not match.
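
The 36 ignore lines in 60-multisource.cnf are identical apart from the connection prefix. Rather than hand-editing them, a short generator loop can emit the whole block (a sketch; adjust the connection names and table list to your deployment):

```shell
# Emit one replicate_ignore_table line per connection/table pair.
TABLES="vicidial_live_agents live_sip_channels live_channels vicidial_auto_calls
server_updater web_client_sessions vicidial_hopper vicidial_campaign_server_status
parked_channels vicidial_manager cid_channels_recent sip_packets"

for conn in master1 master2 master3; do
    for t in $TABLES; do
        echo "${conn}.replicate_ignore_table = asterisk.${t}"
    done
    echo   # blank line between connection groups
done
```

Redirect the output into /etc/mysql/mariadb.conf.d/60-multisource.cnf (alongside the rewrite rules) and restart MariaDB to apply.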

Why These Tables Cause Replication Errors

MEMORY tables are emptied when the master's MariaDB restarts. The master then logs DELETE or TRUNCATE statements in the binlog. If the replica has already lost those rows (or never had them), you get:

  - Error 1032 (Can't find record) when a DELETE or UPDATE targets a row the replica does not have
  - Error 1062 (Duplicate entry) when the master re-inserts rows the replica still holds

The vicidial_hopper table sees thousands of INSERT/DELETE cycles per minute as the dialer loads and processes leads. Replicating it is pointless for reporting and a constant source of errors.


Initial Data Load

Before starting replication, you need to seed the replica with a snapshot of each master's data. This establishes the baseline from which replication will continue.

Step 1: Dump Each Master

On each master, run mysqldump with --master-data to embed the binary log position in the dump file:

# On Master 1
mysqldump \
  --master-data=2 \
  --single-transaction \
  --routines \
  --triggers \
  --databases asterisk \
  --ignore-table=asterisk.vicidial_live_agents \
  --ignore-table=asterisk.live_sip_channels \
  --ignore-table=asterisk.live_channels \
  --ignore-table=asterisk.vicidial_auto_calls \
  --ignore-table=asterisk.server_updater \
  --ignore-table=asterisk.web_client_sessions \
  --ignore-table=asterisk.vicidial_hopper \
  --ignore-table=asterisk.vicidial_campaign_server_status \
  --ignore-table=asterisk.parked_channels \
  --ignore-table=asterisk.vicidial_manager \
  --ignore-table=asterisk.cid_channels_recent \
  --ignore-table=asterisk.sip_packets \
  | gzip > /tmp/master1_dump.sql.gz

Important flags:

For large databases (50GB+): Consider using mariadb-backup (Mariabackup) instead of mysqldump. It performs a physical backup that is significantly faster for large datasets. The binary log position is recorded in xtrabackup_binlog_info.

Step 2: Transfer Dumps to Replica

scp -P 9322 /tmp/master1_dump.sql.gz REPLICA_IP:/tmp/

Repeat for each master.

Step 3: Create Target Databases and Import

On the replica, create the renamed databases and import each dump:

# Create the target databases
mariadb -e "CREATE DATABASE IF NOT EXISTS master1_asterisk;"
mariadb -e "CREATE DATABASE IF NOT EXISTS master2_asterisk;"
mariadb -e "CREATE DATABASE IF NOT EXISTS master3_asterisk;"

# Import Master 1 (rewrite database name during import)
zcat /tmp/master1_dump.sql.gz \
  | sed 's/`asterisk`/`master1_asterisk`/g' \
  | mariadb

# Import Master 2
zcat /tmp/master2_dump.sql.gz \
  | sed 's/`asterisk`/`master2_asterisk`/g' \
  | mariadb

# Import Master 3
zcat /tmp/master3_dump.sql.gz \
  | sed 's/`asterisk`/`master3_asterisk`/g' \
  | mariadb

Step 4: Extract Binary Log Positions

Find the binary log coordinates embedded in each dump file:

zcat /tmp/master1_dump.sql.gz | head -30 | grep "CHANGE MASTER"

You will see a line like:

-- CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000042', MASTER_LOG_POS=12345678;

Record the MASTER_LOG_FILE and MASTER_LOG_POS values for each master. You will need them in the next section.
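
Turning that comment line into shell variables avoids copy/paste mistakes. A sketch using the sample line above:

```shell
# The header line as extracted from the dump in the previous step
line="-- CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000042', MASTER_LOG_POS=12345678;"

# Pull out the two coordinates with sed capture groups
file=$(printf '%s\n' "$line" | sed "s/.*MASTER_LOG_FILE='\([^']*\)'.*/\1/")
pos=$(printf '%s\n' "$line"  | sed "s/.*MASTER_LOG_POS=\([0-9]*\).*/\1/")

echo "MASTER_LOG_FILE = '$file', MASTER_LOG_POS = $pos"
# -> MASTER_LOG_FILE = 'mysql-bin.000042', MASTER_LOG_POS = 12345678
```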


Starting Replication

Step 1: Configure Each Connection

MariaDB multi-source replication uses named connections. Each CHANGE MASTER statement includes a connection name that matches the prefixes in your 60-multisource.cnf filter configuration.

-- Connection: master1
CHANGE MASTER 'master1' TO
  MASTER_HOST     = 'MASTER1_IP',
  MASTER_PORT     = 3306,
  MASTER_USER     = 'repl_replica',
  MASTER_PASSWORD = 'REPL_PASSWORD',
  MASTER_LOG_FILE = 'mysql-bin.000042',
  MASTER_LOG_POS  = 12345678,
  MASTER_CONNECT_RETRY = 10,
  MASTER_HEARTBEAT_PERIOD = 30;

-- Connection: master2
CHANGE MASTER 'master2' TO
  MASTER_HOST     = 'MASTER2_IP',
  MASTER_PORT     = 3306,
  MASTER_USER     = 'repl_replica',
  MASTER_PASSWORD = 'REPL_PASSWORD',
  MASTER_LOG_FILE = 'mysql-bin.000055',
  MASTER_LOG_POS  = 87654321,
  MASTER_CONNECT_RETRY = 10,
  MASTER_HEARTBEAT_PERIOD = 30;

-- Connection: master3
CHANGE MASTER 'master3' TO
  MASTER_HOST     = 'MASTER3_IP',
  MASTER_PORT     = 3306,
  MASTER_USER     = 'repl_replica',
  MASTER_PASSWORD = 'REPL_PASSWORD',
  MASTER_LOG_FILE = 'mysql-bin.000371',
  MASTER_LOG_POS  = 44556677,
  MASTER_CONNECT_RETRY = 10,
  MASTER_HEARTBEAT_PERIOD = 30;

Parameters explained:

| Parameter | Value | Purpose |
|-----------|-------|---------|
| MASTER_CONNECT_RETRY | 10 | Retry the connection every 10 seconds if it drops |
| MASTER_HEARTBEAT_PERIOD | 30 | Master sends a heartbeat every 30 seconds if idle, preventing timeout disconnects |
| MASTER_LOG_FILE / MASTER_LOG_POS | From dump | The exact position to start reading binlog events |

Step 2: Start All Connections

-- Start all replication connections at once
START ALL SLAVES;

Or start them individually for more control:

START SLAVE 'master1';
START SLAVE 'master2';
START SLAVE 'master3';

Step 3: Verify Replication Status

Check the status of all connections:

SHOW ALL SLAVES STATUS\G

For each connection, verify:

Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 0
Last_IO_Error:
Last_SQL_Error:
Replicate_Rewrite_DB: asterisk->master1_asterisk

The key indicators:

| Field | Expected | Problem If |
|-------|----------|------------|
| Slave_IO_Running | Yes | Connecting = network/auth issue; No = fatal error |
| Slave_SQL_Running | Yes | No = SQL apply error (check Last_SQL_Error) |
| Seconds_Behind_Master | 0 | High number = replica is lagging |
| Last_IO_Error | (empty) | Any text here indicates a connection problem |
| Last_SQL_Error | (empty) | Any text here indicates a data apply problem |
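
For a quick terminal check, you can filter the verbose \G output down to just those indicator fields (a convenience sketch):

```shell
# Keep only the health-relevant fields from SHOW ALL SLAVES STATUS\G
status_summary() {
    grep -E 'Connection_name|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Last_IO_Error|Last_SQL_Error'
}

command -v mariadb >/dev/null &&
    mariadb -e "SHOW ALL SLAVES STATUS\G" | status_summary
```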

You can also check a single connection:

SHOW SLAVE 'master1' STATUS\G

Health Monitoring

Replication Health Check Script

Save this script as /usr/local/bin/check_replication.sh:

#!/bin/bash
# ─────────────────────────────────────────────────────────
# MariaDB Multi-Source Replication Health Monitor
# Checks all named replication connections and logs alerts
# ─────────────────────────────────────────────────────────

# Configuration
CONNECTIONS=("master1" "master2" "master3")
ALERT_LAG=60          # seconds behind master before alerting
LOG_FILE="/var/log/replication_alerts.log"
ERRORS=0

for CONN in "${CONNECTIONS[@]}"; do
    STATUS=$(mariadb -e "SHOW SLAVE '$CONN' STATUS\G" 2>/dev/null)

    if [ -z "$STATUS" ]; then
        continue  # Connection not configured yet
    fi

    IO=$(echo "$STATUS" | grep 'Slave_IO_Running:' | awk '{print $2}')
    SQL=$(echo "$STATUS" | grep 'Slave_SQL_Running:' | awk '{print $2}')
    LAG=$(echo "$STATUS" | grep 'Seconds_Behind_Master:' | awk '{print $2}')
    ERR=$(echo "$STATUS" | grep 'Last_SQL_Error:' | sed 's/.*Last_SQL_Error: //')

    # Check IO and SQL threads
    if [ "$IO" != "Yes" ] || [ "$SQL" != "Yes" ]; then
        echo "$(date): REPLICATION BROKEN [$CONN] — IO=$IO SQL=$SQL Error: $ERR" \
            >> "$LOG_FILE"
        ERRORS=1
    fi

    # Check replication lag
    if [ "$LAG" != "NULL" ] && [ "$LAG" -gt "$ALERT_LAG" ] 2>/dev/null; then
        echo "$(date): REPLICATION LAG [$CONN] — ${LAG}s behind" \
            >> "$LOG_FILE"
    fi
done

if [ $ERRORS -eq 1 ]; then
    exit 1
fi
exit 0

Make it executable:

chmod +x /usr/local/bin/check_replication.sh

What the Script Checks

For each named connection:

  1. IO Thread -- Is the connection to the master alive and receiving binlog events?
  2. SQL Thread -- Is the replica successfully applying events to local tables?
  3. Replication Lag -- Is the replica falling behind? Alerts if lag exceeds 60 seconds.
  4. SQL Errors -- Captures and logs the specific error text for broken connections.

The script exits with code 1 if any connection is broken, making it compatible with monitoring systems (Nagios, Zabbix, Prometheus node_exporter textfile collector) that check exit codes.

Extending the Monitor

For production deployments, consider adding:

# Email alert (requires mailutils or msmtp; substitute your alert address)
if [ $ERRORS -eq 1 ]; then
    tail -5 "$LOG_FILE" | mail -s "REPLICATION ALERT: $(hostname)" ADMIN_EMAIL
fi

# Prometheus textfile metric -- emit one line per connection from inside
# the for-loop (truncate the file once before the loop with: > "$PROM_FILE")
PROM_FILE="/var/lib/prometheus/node-exporter/replication.prom"
OK=0
[ "$IO" = "Yes" ] && [ "$SQL" = "Yes" ] && OK=1
echo "mariadb_replication_ok{connection=\"$CONN\"} $OK" >> "$PROM_FILE"

Cron Setup

Schedule the health monitor to run every 5 minutes:

crontab -e

Add the following line:

*/5 * * * * /usr/local/bin/check_replication.sh

Log Rotation

Create /etc/logrotate.d/replication-alerts:

/var/log/replication_alerts.log {
    weekly
    rotate 8
    compress
    missingok
    notifempty
    create 640 root root
}

This keeps 8 weeks of alert history before rotating out old logs.


Firewall and Security

Replica-Side Firewall (UFW)

The replica should only accept MariaDB connections from the master servers and your monitoring/reporting infrastructure:

# Allow SSH (adjust port as needed)
ufw allow 9322/tcp comment 'SSH'

# Allow MariaDB from each master
ufw allow from MASTER1_IP to any port 3306 comment 'Master1 MariaDB'
ufw allow from MASTER2_IP to any port 3306 comment 'Master2 MariaDB'
ufw allow from MASTER3_IP to any port 3306 comment 'Master3 MariaDB'

# Allow MariaDB from reporting tools (Grafana, report servers, etc.)
ufw allow from GRAFANA_IP to any port 3306 comment 'Grafana'

# Enable the firewall
ufw enable

Verify:

ufw status numbered

Expected output:

     To                         Action      From
     --                         ------      ----
[ 1] 9322/tcp                   ALLOW IN    Anywhere
[ 2] 3306                       ALLOW IN    MASTER1_IP
[ 3] 3306                       ALLOW IN    MASTER2_IP
[ 4] 3306                       ALLOW IN    MASTER3_IP
[ 5] 3306                       ALLOW IN    GRAFANA_IP

Master-Side Firewall

On each master, allow the replica to connect on port 3306:

# On each master server
ufw allow from REPLICA_IP to any port 3306 comment 'Replication replica'

Database User Security

Create separate users on the replica for different consumers:

-- Read-only reporting user with query timeout
CREATE USER 'report_user'@'ALLOWED_IP'
  IDENTIFIED BY 'REPORT_PASSWORD';

GRANT SELECT ON master1_asterisk.* TO 'report_user'@'ALLOWED_IP';
GRANT SELECT ON master2_asterisk.* TO 'report_user'@'ALLOWED_IP';
GRANT SELECT ON master3_asterisk.* TO 'report_user'@'ALLOWED_IP';

-- Apply a per-user query timeout (prevents runaway queries)
ALTER USER 'report_user'@'ALLOWED_IP'
  WITH MAX_STATEMENT_TIME 5;

FLUSH PRIVILEGES;

User separation guidelines:

| User | Access | Timeout | Purpose |
|------|--------|---------|---------|
| repl_replica (on masters) | REPLICATION SLAVE only | N/A | Replication connection |
| report_user (on replica) | SELECT on all replica DBs | 5 seconds | Automated report scripts |
| grafana_ro (on replica) | SELECT on all replica DBs | 5 seconds | Grafana dashboards |
| root (on replica) | Full local access | 300 seconds | Administration only |

Parallel Replication Tuning

MariaDB's parallel replication applies binlog events using multiple threads, significantly improving throughput for write-heavy workloads like ViciDial.

Configuration

The two key settings in 50-server.cnf:

slave_parallel_threads  = 4
slave_parallel_mode     = optimistic

Parallel Mode Options

| Mode | Behavior | Best For |
|------|----------|----------|
| conservative | Only parallelizes transactions from the same group commit | InnoDB-heavy workloads; safest |
| optimistic | Applies transactions in parallel, retries on conflict | Mixed MyISAM/InnoDB; best throughput |
| aggressive | Maximizes parallelism; higher conflict rate | Benchmarking; not recommended for production |
| minimal | Very limited parallelism | Debugging |

Why optimistic for ViciDial:

ViciDial's workload is predominantly MyISAM with non-transactional writes. The optimistic mode allows the replica to apply events in parallel and only serializes when it detects a conflict (e.g., two events modifying the same row). With ViciDial's write patterns -- different agents, different calls, different campaigns -- conflicts are rare, and throughput is excellent.

Thread Count Guidelines

| Replica CPU Cores | Suggested slave_parallel_threads | Notes |
|-------------------|----------------------------------|-------|
| 2-4 | 2 | Conservative; leave cores for queries |
| 4-8 | 4 | Good balance for 3 master connections |
| 8-16 | 8 | Higher throughput if lag is a concern |
| 16+ | 12 | Diminishing returns above 12 |

The slave_parallel_threads setting is global, not per-connection. With 3 connections and 4 threads, each connection's events are processed by the shared thread pool. In practice, this provides excellent throughput because the connections rarely all have heavy traffic simultaneously.
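
If lag keeps climbing during peak hours, the pool can be resized at runtime. MariaDB refuses to change slave_parallel_threads while any slave is running, so the sequence looks like this (a sketch against a live server; run as a privileged user):

```shell
# Raise the shared applier pool from 4 to 8 threads at runtime.
# slave_parallel_threads can only be changed while all slaves are stopped.
mariadb <<'SQL'
STOP ALL SLAVES;
SET GLOBAL slave_parallel_threads = 8;
START ALL SLAVES;
SQL
```

Remember to update slave_parallel_threads in 50-server.cnf as well, or the new value is lost on the next restart.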

Monitoring Parallel Performance

Check how many transactions have been retried due to conflicts:

SHOW ALL SLAVES STATUS\G

Look at the Retried_transactions field. A small number of retries is normal and expected in optimistic mode. If retries are growing rapidly (thousands per hour), consider switching to conservative mode or investigating the conflicting queries.


Handling Schema Differences

Cross-Version Replication

When masters run different MariaDB versions (e.g., 10.6 and 10.11), keep these points in mind:

  1. The replica should run the same or newer version than the oldest master. MariaDB maintains backward compatibility in the binlog format, so a 10.11 replica can apply events from a 10.6 master.

  2. New system table columns introduced in newer MariaDB versions may appear in DDL statements. The slave_skip_errors setting handles most of these gracefully.

  3. Storage engine differences are generally not an issue. ViciDial uses MyISAM on all servers. If a master has some InnoDB tables, they replicate as InnoDB on the replica.

Table Structure Differences Between Masters

In a multi-server call center, each server may be running a slightly different ViciDial version or have custom tables. This is fine:

  - Each connection replicates into its own database, so the three schemas never need to match one another
  - Replication only requires that each table on the replica is compatible with its own master's copy
  - A table that exists on only one master simply exists in only that master's database on the replica

Adding Custom Tables

If you create custom tables on a master (e.g., for a CRM integration), they will automatically replicate to the corresponding database on the replica, as long as they are in the asterisk database and not in the ignore list.


Error Skipping Strategy

Configured Skip Errors

The slave_skip_errors setting in 50-server.cnf tells MariaDB to silently skip specific error codes instead of stopping replication:

slave_skip_errors = 1032,1062,1146,1269,1396,1932

| Error Code | Message | Common Cause |
|------------|---------|--------------|
| 1032 | Can't find record | DELETE/UPDATE for a row that does not exist on the replica (MEMORY table cleared, or initial sync mismatch) |
| 1062 | Duplicate entry | INSERT for a row that already exists (MEMORY table rebuild, or initial sync overlap) |
| 1146 | Table doesn't exist | Statement references a table not present on the replica (custom table on one master, or filtered table) |
| 1269 | Can't revoke all privileges | User management statement that does not apply cleanly |
| 1396 | Operation CREATE USER failed | User already exists or doesn't exist for CREATE/DROP |
| 1932 | Table ... doesn't exist in engine | Table file missing on disk but exists in metadata (crash recovery artifact) |

When Is Error Skipping Safe?

Error skipping is safe when:

  - The affected tables are ephemeral real-time tables (MEMORY tables, or live-status tables already in the ignore list).
  - The errors stem from the initial sync window (rows that changed while the dump was being taken).
  - The skipped codes are limited to the known-benign list above and fire only occasionally.

Error skipping is NOT safe when:

  - The errors hit CDR or log tables (vicidial_log, vicidial_closer_log, recording_log) -- skipped events there mean lost call records.
  - The same error fires continuously, which indicates real data drift rather than a transient mismatch.
  - You cannot explain why the error is occurring.

Manual Error Skipping

If replication stops on an error that is not in the skip list, investigate first:

-- Check the error
SHOW SLAVE 'master1' STATUS\G

If the error is safe to skip (e.g., a one-time DDL issue), skip it manually:

-- Stop the specific connection
STOP SLAVE 'master1';

-- Skip one event (select the connection, then set the counter)
SET @@default_master_connection = 'master1';
SET GLOBAL sql_slave_skip_counter = 1;

-- Restart
START SLAVE 'master1';

Warning: Never blindly skip errors on critical data tables. If vicidial_closer_log or vicidial_log (your CDR tables) are throwing errors, investigate the root cause. Skipping CDR events means lost call records.


Troubleshooting

Problem: Slave_IO_Running: Connecting

The IO thread is trying to connect to the master but cannot establish a connection.

Diagnostic steps:

# 1. Test network connectivity from replica to master
mariadb -h MASTER1_IP -P 3306 -u repl_replica -p -e "SELECT 1;"

# 2. Check if MariaDB is listening on the master
ssh MASTER1 "ss -tlnp | grep 3306"

# 3. Check master's bind-address (must be 0.0.0.0, not 127.0.0.1)
ssh MASTER1 "grep bind-address /etc/my.cnf /etc/mysql/mariadb.conf.d/*.cnf"

# 4. Check firewall on the master
ssh MASTER1 "ufw status | grep 3306"

# 5. Verify the replication user exists and has the right password
ssh MASTER1 "mariadb -e \"SELECT User, Host FROM mysql.user WHERE User = 'repl_replica';\""

Common fixes:

| Cause | Fix |
|-------|-----|
| Master's bind-address = 127.0.0.1 | Change to 0.0.0.0 and restart MariaDB |
| Firewall blocking port 3306 | ufw allow from REPLICA_IP to any port 3306 |
| Wrong password | Recreate the user with the correct password |
| Master not running | systemctl start mariadb on the master |
| DNS resolution failure | Use IP addresses, not hostnames; ensure skip-name-resolve is set |

Problem: Slave_SQL_Running: No

The SQL thread has stopped due to an error applying a binlog event.

SHOW SLAVE 'master1' STATUS\G
-- Look at: Last_SQL_Error, Last_SQL_Errno

Common errors and fixes:

| Error | Typical Cause | Fix |
|-------|---------------|-----|
| 1032: Can't find record | Row deleted on master, never existed on replica | Add 1032 to slave_skip_errors or skip manually |
| 1062: Duplicate entry | Row inserted on master, already exists on replica | Add 1062 to slave_skip_errors or skip manually |
| 1146: Table doesn't exist | Table exists on master but not on replica | Create the table on the replica, or add 1146 to skip list |
| 1007: Database already exists | CREATE DATABASE replayed | Skip the event |
| Schema mismatch | Column added on master but not on replica | Usually self-resolves (the ALTER will replicate); skip if stuck |

Problem: High Seconds_Behind_Master

The replica is lagging behind one or more masters.

Diagnostic steps:

-- Check which connection is lagging
SHOW ALL SLAVES STATUS\G

-- Check for long-running queries on the replica that might be blocking
SHOW PROCESSLIST;

-- Check for lock waits
SHOW ENGINE INNODB STATUS\G

Common causes and fixes:

| Cause | Fix |
|-------|-----|
| Slow disk I/O on replica | Move to SSD; check iostat -x 1 |
| Report query holding table locks | Add max_statement_time to limit query duration |
| Single-threaded apply bottleneck | Increase slave_parallel_threads |
| Master had a burst of large operations | Wait; the replica will catch up |
| Network saturation | Check bandwidth; slave_compressed_protocol helps |
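To tell whether a lagging connection is catching up or falling further behind, sample Seconds_Behind_Master twice a minute or so apart. A minimal parsing sketch (the field and connection names match the SHOW ALL SLAVES STATUS output used throughout this tutorial):

```shell
# Sketch: extract Seconds_Behind_Master for one connection from
# `SHOW ALL SLAVES STATUS\G` output read on stdin. Shrinking values
# between two samples mean the replica is catching up.
parse_lag() {  # parse_lag <connection_name>
  awk -v conn="$1" '
    $1 == "Connection_name:"        { current = $2 }
    $1 == "Seconds_Behind_Master:" && current == conn { print $2 }
  '
}

# Live usage (hypothetical invocation):
#   mariadb -e "SHOW ALL SLAVES STATUS\G" | parse_lag master1
```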

Problem: Relay Log Corruption After Crash

If the replica crashes (power loss, OOM kill), relay logs may be incomplete.

-- relay_log_recovery = ON handles this automatically
-- But if it doesn't, manually reset:

STOP SLAVE 'master1';

-- Reset and re-fetch from the last known good position
RESET SLAVE 'master1';

-- Reconfigure with the current master position
CHANGE MASTER 'master1' TO
  MASTER_HOST     = 'MASTER1_IP',
  MASTER_USER     = 'repl_replica',
  MASTER_PASSWORD = 'REPL_PASSWORD',
  MASTER_LOG_FILE = 'mysql-bin.000042',
  MASTER_LOG_POS  = 12345678;

START SLAVE 'master1';

To find the last executed position after a crash, check the replication state:

-- Exec_Master_Log_Pos is persisted in the connection's relay-log.info
-- state file in the datadir (for named connections MariaDB appends the
-- connection name to the file name), and is also shown here:
SHOW SLAVE 'master1' STATUS\G
-- Look at: Relay_Master_Log_File, Exec_Master_Log_Pos

Problem: Binlog File No Longer Available on Master

If the replica was offline too long, the master may have purged the binlog file that the replica needs.

Last_IO_Error: Got fatal error 1236 from master ... binlog truncated

Fix: You need to re-initialize that connection with a fresh dump:

  1. Stop the affected connection: STOP SLAVE 'master1';
  2. Take a new dump from the master with --master-data=2
  3. Import it into the replica database
  4. CHANGE MASTER with the new position from the dump
  5. START SLAVE 'master1';
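For step 4, a dump taken with --master-data=2 contains the coordinates you need as a commented-out CHANGE MASTER line near the top of the file. A small helper can pull them out (a sketch -- the file path in the usage comment is illustrative):

```shell
# Sketch: extract MASTER_LOG_FILE and MASTER_LOG_POS from the
# commented CHANGE MASTER header that --master-data=2 writes.
dump_coords() {  # dump_coords <dumpfile>  -> "binlog_file position"
  grep -m1 '^-- CHANGE MASTER TO' "$1" \
    | sed -E "s/.*MASTER_LOG_FILE='([^']+)', MASTER_LOG_POS=([0-9]+).*/\1 \2/"
}

# Usage (hypothetical path):
#   read -r logfile logpos < <(dump_coords /root/master1.sql)
```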

Prevention: Set expire_logs_days = 7 (or higher) on masters so binlogs are retained long enough to survive replica downtime.
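In each master's server config that looks like the fragment below (the 7-day value is this tutorial's suggestion; size it to the longest replica outage you expect to survive):

```ini
[mysqld]
# Keep at least a week of binlogs so a replica can be offline
# for several days without losing its replication position.
expire_logs_days = 7
```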


Maintenance Operations

Stopping and Starting Replication

-- Stop all connections (e.g., before replica maintenance)
STOP ALL SLAVES;

-- Start all connections
START ALL SLAVES;

-- Stop/start a single connection
STOP SLAVE 'master1';
START SLAVE 'master1';

After Replica Server Reboot

Because skip_slave_start = ON, replication does not auto-start. After rebooting the replica:

# 1. Check that MariaDB is running
systemctl status mariadb

# 2. Inspect replication state before starting
mariadb -e "SHOW ALL SLAVES STATUS\G" | grep -E 'Connection_name|IO_Running|SQL_Running|Seconds_Behind|Last.*Error'

# 3. Start all connections
mariadb -e "START ALL SLAVES;"

# 4. Verify
mariadb -e "SHOW ALL SLAVES STATUS\G" | grep -E 'Connection_name|IO_Running|SQL_Running|Seconds_Behind'

Adding a New Master

To add a fourth master to an existing setup:

  1. Add filter rules to 60-multisource.cnf:
master4.replicate_rewrite_db = asterisk->master4_asterisk

master4.replicate_ignore_table = asterisk.vicidial_live_agents
master4.replicate_ignore_table = asterisk.live_sip_channels
# ... (same 12 tables as other connections)
  2. Restart MariaDB on the replica:
systemctl restart mariadb
  3. Create the database, import the initial dump, and configure the connection:
CREATE DATABASE IF NOT EXISTS master4_asterisk;

CHANGE MASTER 'master4' TO
  MASTER_HOST     = 'MASTER4_IP',
  MASTER_USER     = 'repl_replica',
  MASTER_PASSWORD = 'REPL_PASSWORD',
  MASTER_LOG_FILE = 'mysql-bin.000001',
  MASTER_LOG_POS  = 12345;

START SLAVE 'master4';
  4. Update the monitoring script's CONNECTIONS array.
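If your health-check script follows the pattern from the Health Monitoring section and keeps its connection list in a bash array, the update is a one-line change (the array name is assumed from this tutorial's script; adjust to yours):

```shell
# Connection list consumed by check_replication.sh -- one entry per
# configured master connection (array name assumed from this tutorial).
CONNECTIONS=("master1" "master2" "master3" "master4")
```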

Removing a Master

STOP SLAVE 'master3';
RESET SLAVE 'master3' ALL;

The ALL keyword removes the connection configuration entirely. The replicated data in master3_asterisk remains on disk until you explicitly drop it.


Performance Considerations

Disk I/O

The replica receives write traffic from all masters simultaneously. With 3 masters, expect 3x the write I/O of a single server. Put the datadir on SSD or NVMe storage, keep relay logs on the same fast volume, and watch iostat -x 1 for sustained device utilization during peak call hours.

Memory

With ViciDial's MyISAM tables, the key_buffer_size is critical. The key buffer caches index blocks for all MyISAM tables across all databases:

-- Check key buffer utilization
SHOW STATUS LIKE 'Key%';

Calculate hit ratio:

Key cache hit ratio (%) = (1 - Key_reads / Key_read_requests) * 100

Target 99%+ hit ratio. If Key_reads is growing relative to Key_read_requests, increase key_buffer_size.
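A small helper makes the formula concrete (pure arithmetic -- in production the two inputs come from the SHOW STATUS output above):

```shell
# Key cache hit ratio, in percent, from the two MyISAM counters.
key_hit_ratio() {  # key_hit_ratio <Key_reads> <Key_read_requests>
  LC_ALL=C awk -v r="$1" -v q="$2" \
    'BEGIN { printf "%.2f\n", (1 - r / q) * 100 }'
}

# key_hit_ratio 1000 1000000 prints 99.90 -- comfortably above target
```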

Network Bandwidth

With slave_compressed_protocol = ON, a busy ViciDial server generating 10 MB/s of binlog data compresses to roughly 3-4 MB/s over the wire. For 3 masters, budget ~10-12 MB/s sustained bandwidth to the replica.

Table Open Cache

Each replicated database has its own set of tables. With 3 databases of ~400 tables each, you need table_open_cache = 4096 or higher to avoid table open/close churn:

-- Check if the cache is too small
SHOW STATUS LIKE 'Opened_tables';

If Opened_tables is growing rapidly, increase table_open_cache.

Query Performance on Replica

Report queries on the replica compete with replication writes for disk I/O and locks. Best practices:

  - Set max_statement_time for report users so a runaway query cannot hold table locks indefinitely.
  - Schedule heavy aggregate reports outside peak call hours, when replication write pressure is highest.
  - Add replica-only indexes for frequent report filters.

Replica-only indexes: You can safely CREATE INDEX on the replica for report optimization. Since read_only = ON prevents application writes and log_slave_updates = OFF prevents the DDL from propagating, the index stays local. However, be aware that if the master runs ALTER TABLE on the same table, the replica's version of the table will be rebuilt, potentially dropping your custom index. Monitor this and recreate indexes as needed.
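As a concrete (hypothetical) example, an index supporting per-campaign date-range reports -- the table and column names assume the stock ViciDial schema, so verify them against your version first:

```sql
-- Replica-only: speeds up per-campaign date-range reports.
-- Stays local because read_only blocks application writes and
-- log_slave_updates = OFF keeps it out of any downstream binlog.
CREATE INDEX idx_campaign_date
    ON master1_asterisk.vicidial_log (campaign_id, call_date);
```

After any ALTER TABLE replicates from the master, SHOW INDEX FROM master1_asterisk.vicidial_log is a quick way to confirm the index survived.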


Appendix A: Complete File Reference

Files on Each Master

| File | Contents |
|------|----------|
| /etc/my.cnf or /etc/mysql/mariadb.conf.d/50-server.cnf | server_id, log_bin, binlog_format, expire_logs_days |

Files on Replica

| File | Contents |
|------|----------|
| /etc/mysql/mariadb.conf.d/50-server.cnf | Server config, replication engine, resource tuning |
| /etc/mysql/mariadb.conf.d/60-multisource.cnf | Per-connection rewrite and ignore rules |
| /usr/local/bin/check_replication.sh | Health monitoring script |
| /var/log/replication_alerts.log | Alert log written by health monitor |
| /var/lib/mysql/replica-slow.log | Slow query log |
| /var/log/mysql/error.log | MariaDB error log |
| /etc/logrotate.d/replication-alerts | Log rotation config |

User Accounts

| User | Server | Privileges | Purpose |
|------|--------|------------|---------|
| repl_replica | Each master | REPLICATION SLAVE | Binlog streaming |
| report_user | Replica | SELECT on all *_asterisk DBs | Report scripts |
| grafana_ro | Replica | SELECT on all *_asterisk DBs | Grafana dashboards |

Appendix B: Quick Reference Commands

-- Show all replication connections
SHOW ALL SLAVES STATUS\G

-- Show one connection
SHOW SLAVE 'master1' STATUS\G

-- Start/stop all
START ALL SLAVES;
STOP ALL SLAVES;

-- Start/stop one
START SLAVE 'master1';
STOP SLAVE 'master1';

-- Skip one event on a connection
STOP SLAVE 'master1';
SET @@default_master_connection = 'master1';
SET GLOBAL sql_slave_skip_counter = 1;
START SLAVE 'master1';

-- Remove a connection entirely
STOP SLAVE 'master3';
RESET SLAVE 'master3' ALL;

-- Check relay log space usage
SHOW ALL SLAVES STATUS\G
-- Look at: Relay_Log_Space (bytes per connection)

-- Check replication filters in effect
SHOW ALL SLAVES STATUS\G
-- Look at: Replicate_Rewrite_DB, Replicate_Ignore_Table

-- Quick health one-liner
mariadb -e "SHOW ALL SLAVES STATUS\G" \
  | grep -E 'Connection_name|IO_Running|SQL_Running|Seconds_Behind|Last.*Error:'

Appendix C: Capacity Planning

Estimating Replica Storage

Replica disk = (Master1 data size) + (Master2 data size) + (Master3 data size)
             + relay log space (~1 GB per connection)
             + temp tables for queries
             + 20% headroom
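The formula can be run as a quick shell sketch; the ~3 GB relay-log (1 GB per connection, 3 connections) and ~5 GB temp-space allowances are rough figures from this tutorial, and the inputs are illustrative:

```shell
# Estimate replica disk (GB) for three masters: sum the data sizes,
# add ~3 GB relay logs + ~5 GB temp space, then 20% headroom.
estimate_replica_disk() {  # estimate_replica_disk <m1_gb> <m2_gb> <m3_gb>
  awk -v a="$1" -v b="$2" -v c="$3" \
    'BEGIN { printf "%d\n", (a + b + c + 3 + 5) * 1.2 }'
}

# estimate_replica_disk 40 40 40 prints 153 (GB)
```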

For a typical ViciDial deployment with 2 years of CDR history:

| Component | Size |
|-----------|------|
| master1_asterisk | 15-40 GB |
| master2_asterisk | 15-40 GB |
| master3_asterisk | 15-40 GB |
| Relay logs | ~3 GB |
| Slow query log | < 1 GB |
| System + overhead | ~5 GB |
| Total | 60-130 GB |

When to Archive

ViciDial's vicidial_log, vicidial_closer_log, and recording_log tables grow indefinitely. On the replica, consider:

  1. Partitioning by month (replica-only) for fast DROP PARTITION cleanup.
  2. Archive cron that moves records older than N days to archive tables.
  3. Setting up data retention on the masters (ViciDial's built-in archiving moves old records to _archive tables).

This tutorial documents a production multi-source replication deployment for VoIP call centers. Adapt connection names, IP addresses, and table lists to your specific environment. Always test configuration changes on a staging system before applying to production.
