MariaDB Galera Cluster Monitoring:

Implementing MariaDB Galera Cluster Observability: Complete Guide to Monitoring with Grafana and Prometheus for SRE Excellence



Introduction: Building a Bulletproof MariaDB SRE Ecosystem

In today’s data-driven landscape, MariaDB Galera Cluster observability is crucial for maintaining high-performance database operations. This comprehensive guide demonstrates how to implement a robust monitoring stack using Grafana and Prometheus to create a highly responsive MariaDB Database SRE ecosystem with proactive alerting capabilities.

Understanding MariaDB Galera Cluster Architecture

Galera Cluster Fundamentals

MariaDB Galera Cluster provides:

  • Synchronous Multi-Master Replication: All nodes are writable with automatic conflict resolution
  • Automatic Node Provisioning: New nodes automatically sync with the cluster
  • True Parallel Replication: Enhanced performance through parallel slave threads
  • Automatic Node Failover: Seamless failover without data loss

Key Metrics for Galera Monitoring

Essential metrics for Galera Cluster performance monitoring:

  • wsrep_cluster_size: Number of active cluster nodes
  • wsrep_cluster_status: Cluster operational status
  • wsrep_ready: Node readiness state
  • wsrep_local_state: Node synchronization status
  • wsrep_flow_control_paused: Flow control events indicating performance bottlenecks

Setting Up Prometheus for MariaDB Monitoring

Installing and Configuring Prometheus

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "mariadb_rules.yml"
  - "galera_rules.yml"

scrape_configs:
  - job_name: 'mariadb-galera'
    static_configs:
      - targets: 
          - 'mariadb-node1:9104'
          - 'mariadb-node2:9104'
          - 'mariadb-node3:9104'
    scrape_interval: 10s
    metrics_path: /metrics

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

MariaDB Exporter Configuration

# Install mysqld_exporter
wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.14.0/mysqld_exporter-0.14.0.linux-amd64.tar.gz
tar xvfz mysqld_exporter-0.14.0.linux-amd64.tar.gz
sudo mv mysqld_exporter-0.14.0.linux-amd64/mysqld_exporter /usr/local/bin/

# Create monitoring user
mysql -u root -p << EOF
CREATE USER 'prometheus'@'localhost' IDENTIFIED BY 'secure_password';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'prometheus'@'localhost';
FLUSH PRIVILEGES;
EOF

Exporter Service Configuration

# /etc/systemd/system/mysqld_exporter.service
[Unit]
Description=MariaDB Exporter
After=network.target

[Service]
Type=simple
Restart=always
User=prometheus
Environment=DATA_SOURCE_NAME="prometheus:secure_password@(localhost:3306)/"
ExecStart=/usr/local/bin/mysqld_exporter \
  --collect.global_status \
  --collect.global_variables \
  --collect.slave_status \
  --collect.info_schema.innodb_metrics \
  --collect.info_schema.innodb_tablespaces \
  --collect.info_schema.innodb_cmp \
  --collect.info_schema.innodb_cmpmem \
  --collect.info_schema.processlist \
  --collect.info_schema.query_response_time \
  --web.listen-address=0.0.0.0:9104

[Install]
WantedBy=multi-user.target

Advanced Galera-Specific Monitoring Configuration

Custom Galera Metrics Collection

-- Create custom monitoring views for Galera metrics
CREATE OR REPLACE VIEW galera_cluster_metrics AS
SELECT 
  VARIABLE_NAME,
  VARIABLE_VALUE,
  NOW() as timestamp
FROM INFORMATION_SCHEMA.GLOBAL_STATUS 
WHERE VARIABLE_NAME LIKE 'wsrep_%'
   OR VARIABLE_NAME LIKE 'galera_%';

-- Grant access to monitoring user
GRANT SELECT ON performance_schema.* TO 'prometheus'@'localhost';
GRANT SELECT ON information_schema.* TO 'prometheus'@'localhost';

Enhanced Exporter Configuration

# Enhanced mysqld_exporter with Galera-specific flags
ExecStart=/usr/local/bin/mysqld_exporter \
  --collect.global_status \
  --collect.global_variables \
  --collect.slave_status \
  --collect.info_schema.innodb_metrics \
  --collect.info_schema.processlist \
  --collect.info_schema.tables \
  --collect.info_schema.tablestats \
  --collect.info_schema.userstats \
  --collect.perf_schema.eventswaits \
  --collect.perf_schema.file_events \
  --collect.perf_schema.indexiowaits \
  --collect.perf_schema.tableiowaits \
  --web.listen-address=0.0.0.0:9104 \
  --log.level=info

Grafana Dashboard Implementation

Installing and Configuring Grafana

# Install Grafana
sudo apt-get install -y software-properties-common
sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
sudo apt-get update
sudo apt-get install grafana

# Configure Grafana datasource
sudo systemctl start grafana-server
sudo systemctl enable grafana-server

Grafana Datasource Configuration

{
  "name": "Prometheus-MariaDB",
  "type": "prometheus",
  "url": "http://localhost:9090",
  "access": "proxy",
  "basicAuth": false,
  "isDefault": true,
  "jsonData": {
    "timeInterval": "5s",
    "queryTimeout": "60s"
  }
}

Essential Grafana Dashboard Panels

Galera Cluster Health Panel

{
  "title": "Galera Cluster Status",
  "type": "stat",
  "targets": [
    {
      "expr": "mysql_global_status_wsrep_cluster_size",
      "legendFormat": "Cluster Size"
    },
    {
      "expr": "mysql_global_status_wsrep_ready",
      "legendFormat": "Node Ready"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "steps": [
          {"color": "red", "value": 0},
          {"color": "yellow", "value": 1},
          {"color": "green", "value": 3}
        ]
      }
    }
  }
}

Performance Metrics Dashboard

{
  "title": "MariaDB Performance Metrics",
  "panels": [
    {
      "title": "Queries Per Second",
      "type": "graph",
      "targets": [
        {
          "expr": "rate(mysql_global_status_queries[5m])",
          "legendFormat": "QPS - {{instance}}"
        }
      ]
    },
    {
      "title": "Connection Usage",
      "type": "graph",
      "targets": [
        {
          "expr": "mysql_global_status_threads_connected",
          "legendFormat": "Connected - {{instance}}"
        },
        {
          "expr": "mysql_global_variables_max_connections",
          "legendFormat": "Max Connections - {{instance}}"
        }
      ]
    }
  ]
}

Comprehensive Alerting Strategy

Prometheus Alerting Rules

# mariadb_galera_rules.yml
groups:
- name: mariadb_galera_alerts
  rules:

  # Cluster Health Alerts
  - alert: GaleraClusterSizeReduced
    expr: mysql_global_status_wsrep_cluster_size < 3
    for: 30s
    labels:
      severity: critical
      service: mariadb-galera
    annotations:
      summary: "Galera cluster size reduced on {{ $labels.instance }}"
      description: "Galera cluster size is {{ $value }}, expected 3 nodes"

  - alert: GaleraNodeNotReady
    expr: mysql_global_status_wsrep_ready != 1
    for: 15s
    labels:
      severity: critical
      service: mariadb-galera
    annotations:
      summary: "Galera node not ready on {{ $labels.instance }}"
      description: "Node {{ $labels.instance }} is not ready for operations"

  # Performance Alerts
  - alert: MariaDBHighConnections
    expr: (mysql_global_status_threads_connected / mysql_global_variables_max_connections) > 0.8
    for: 2m
    labels:
      severity: warning
      service: mariadb
    annotations:
      summary: "High connection usage on {{ $labels.instance }}"
      description: "Connection usage is {{ $value | humanizePercentage }}"

  - alert: MariaDBSlowQueries
    expr: rate(mysql_global_status_slow_queries[5m]) > 10
    for: 2m
    labels:
      severity: warning
      service: mariadb
    annotations:
      summary: "High slow query rate on {{ $labels.instance }}"
      description: "Slow query rate is {{ $value }} queries/second"

  # Replication Alerts
  - alert: GaleraFlowControlActive
    expr: mysql_global_status_wsrep_flow_control_paused > 0.1
    for: 1m
    labels:
      severity: warning
      service: mariadb-galera
    annotations:
      summary: "Galera flow control active on {{ $labels.instance }}"
      description: "Flow control paused {{ $value | humanizePercentage }} of the time"

  - alert: GaleraReplicationLag
    expr: mysql_global_status_wsrep_local_recv_queue > 100
    for: 2m
    labels:
      severity: warning
      service: mariadb-galera
    annotations:
      summary: "Galera replication lag on {{ $labels.instance }}"
      description: "Receive queue size is {{ $value }} transactions"

AlertManager Configuration

# alertmanager.yml
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@company.com'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'
  routes:
  - match:
      severity: critical
    receiver: 'critical-alerts'
  - match:
      service: mariadb-galera
    receiver: 'database-team'

receivers:
- name: 'default'
  email_configs:
  - to: 'admin@company.com'
    subject: 'MariaDB Alert: {{ .GroupLabels.alertname }}'
    body: |
      {{ range .Alerts }}
      Alert: {{ .Annotations.summary }}
      Description: {{ .Annotations.description }}
      Instance: {{ .Labels.instance }}
      {{ end }}

- name: 'critical-alerts'
  email_configs:
  - to: 'oncall@company.com'
    subject: 'CRITICAL: MariaDB Alert'
  slack_configs:
  - api_url: 'YOUR_SLACK_WEBHOOK_URL'
    channel: '#database-alerts'
    title: 'Critical MariaDB Alert'
    text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

- name: 'database-team'
  email_configs:
  - to: 'dba-team@company.com'
    subject: 'MariaDB Galera Alert'

Advanced SRE Monitoring Strategies

Custom Metrics for SRE Excellence

-- Create custom SLI/SLO tracking
CREATE TABLE sre_metrics (
  timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
  metric_name VARCHAR(100),
  metric_value DECIMAL(10,4),
  instance_name VARCHAR(50),
  INDEX idx_timestamp_metric (timestamp, metric_name)
);

-- Procedure to calculate availability SLI
DELIMITER //
CREATE PROCEDURE CalculateAvailabilitySLI()
BEGIN
  DECLARE availability_sli DECIMAL(10,4);

  SELECT 
    (SUM(CASE WHEN wsrep_ready = 1 THEN 1 ELSE 0 END) / COUNT(*)) * 100
  INTO availability_sli
  FROM (
    SELECT 
      CASE WHEN VARIABLE_VALUE = 'ON' THEN 1 ELSE 0 END as wsrep_ready
    FROM INFORMATION_SCHEMA.GLOBAL_STATUS 
    WHERE VARIABLE_NAME = 'wsrep_ready'
  ) t;

  INSERT INTO sre_metrics (metric_name, metric_value, instance_name)
  VALUES ('availability_sli', availability_sli, @@hostname);
END //
DELIMITER ;

Automated Remediation Scripts

#!/bin/bash
# galera_auto_recovery.sh

check_galera_health() {
    local node=$1
    mysql -h $node -u monitoring -p$MONITORING_PASSWORD \
          -e "SHOW STATUS LIKE 'wsrep_ready';" 2>/dev/null | grep -q "ON"
    return $?
}

recover_galera_node() {
    local node=$1
    echo "Attempting to recover Galera node: $node"

    # Stop MariaDB
    ssh $node "sudo systemctl stop mariadb"

    # Start with bootstrap if primary node
    if [[ $node == $PRIMARY_NODE ]]; then
        ssh $node "sudo galera_new_cluster"
    else
        ssh $node "sudo systemctl start mariadb"
    fi

    # Wait and verify
    sleep 30
    if check_galera_health $node; then
        echo "Node $node recovered successfully"
        # Send success notification
        curl -X POST $SLACK_WEBHOOK \
             -d "{\"text\":\"✅ Galera node $node recovered automatically\"}"
    else
        echo "Failed to recover node $node - manual intervention required"
        # Send failure notification
        curl -X POST $SLACK_WEBHOOK \
             -d "{\"text\":\"❌ Failed to recover Galera node $node - manual intervention required\"}"
    fi
}

# Main monitoring loop
NODES=("mariadb-node1" "mariadb-node2" "mariadb-node3")
PRIMARY_NODE="mariadb-node1"

for node in "${NODES[@]}"; do
    if ! check_galera_health $node; then
        echo "Node $node is unhealthy - initiating recovery"
        recover_galera_node $node
    fi
done

Performance Optimization Through Monitoring

Query Performance Analysis

-- Enable performance schema for detailed monitoring
UPDATE performance_schema.setup_consumers 
SET ENABLED = 'YES' 
WHERE NAME LIKE '%events_statements%';

-- Create view for slow query analysis
CREATE VIEW slow_query_analysis AS
SELECT 
  DIGEST_TEXT,
  COUNT_STAR as execution_count,
  AVG_TIMER_WAIT/1000000000 as avg_execution_time_sec,
  MAX_TIMER_WAIT/1000000000 as max_execution_time_sec,
  SUM_ROWS_EXAMINED/COUNT_STAR as avg_rows_examined,
  SUM_ROWS_SENT/COUNT_STAR as avg_rows_sent
FROM performance_schema.events_statements_summary_by_digest
WHERE COUNT_STAR > 10
ORDER BY AVG_TIMER_WAIT DESC
LIMIT 20;

Capacity Planning Metrics

# Additional Prometheus queries for capacity planning
- record: mariadb:connection_utilization
  expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections

- record: mariadb:innodb_buffer_pool_utilization
  expr: mysql_global_status_innodb_buffer_pool_pages_data / mysql_global_status_innodb_buffer_pool_pages_total

- record: mariadb:query_cache_hit_rate
  expr: mysql_global_status_qcache_hits / (mysql_global_status_qcache_hits + mysql_global_status_qcache_inserts)

Implementing SRE Best Practices

Error Budget Tracking

# SLO definitions for MariaDB Galera
- record: slo:mariadb_availability_4w
  expr: avg_over_time(up{job="mariadb-galera"}[4w])

- record: slo:mariadb_latency_4w
  expr: histogram_quantile(0.99, rate(mysql_global_status_queries[4w]))

- alert: SLOBudgetExhausted
  expr: slo:mariadb_availability_4w < 0.999
  labels:
    severity: critical
    slo: availability
  annotations:
    summary: "MariaDB availability SLO budget exhausted"
    description: "4-week availability is {{ $value | humanizePercentage }}, below 99.9% SLO"

Incident Response Automation

#!/usr/bin/env python3
# mariadb_incident_response.py

import requests
import mysql.connector
import json
from datetime import datetime

class MariaDBIncidentResponse:
    def __init__(self, config):
        self.config = config
        self.slack_webhook = config['slack_webhook']

    def check_cluster_health(self):
        """Check overall cluster health"""
        healthy_nodes = 0
        total_nodes = len(self.config['nodes'])

        for node in self.config['nodes']:
            try:
                conn = mysql.connector.connect(
                    host=node['host'],
                    user=self.config['monitoring_user'],
                    password=self.config['monitoring_password']
                )
                cursor = conn.cursor()
                cursor.execute("SHOW STATUS LIKE 'wsrep_ready'")
                result = cursor.fetchone()

                if result and result[1] == 'ON':
                    healthy_nodes += 1

                conn.close()
            except Exception as e:
                self.send_alert(f"Failed to connect to {node['host']}: {str(e)}")

        cluster_health = healthy_nodes / total_nodes
        return cluster_health, healthy_nodes, total_nodes

    def send_alert(self, message):
        """Send alert to Slack"""
        payload = {
            "text": f"🚨 MariaDB Alert: {message}",
            "timestamp": datetime.now().isoformat()
        }
        requests.post(self.slack_webhook, json=payload)

    def run_health_check(self):
        """Main health check routine"""
        health_ratio, healthy, total = self.check_cluster_health()

        if health_ratio < 0.67:  # Less than 2/3 nodes healthy
            self.send_alert(f"Cluster degraded: {healthy}/{total} nodes healthy")
            return False
        elif health_ratio < 1.0:
            self.send_alert(f"Cluster warning: {healthy}/{total} nodes healthy")

        return True

if __name__ == "__main__":
    config = {
        'nodes': [
            {'host': 'mariadb-node1'},
            {'host': 'mariadb-node2'},
            {'host': 'mariadb-node3'}
        ],
        'monitoring_user': 'prometheus',
        'monitoring_password': 'secure_password',
        'slack_webhook': 'YOUR_SLACK_WEBHOOK_URL'
    }

    incident_response = MariaDBIncidentResponse(config)
    incident_response.run_health_check()

Conclusion: Achieving MariaDB SRE Excellence

Implementing comprehensive MariaDB Galera Cluster observability with Grafana and Prometheus creates a robust foundation for database SRE operations. This monitoring stack provides:

Key Benefits Achieved:

  • Proactive Issue Detection: Early warning systems prevent outages before they impact users
  • Automated Remediation: Reduces MTTR through intelligent automation
  • Performance Optimization: Data-driven insights enable continuous performance improvements
  • SLO Compliance: Measurable service level objectives with error budget tracking

Next Steps for Advanced Implementation:

  1. Implement Chaos Engineering: Test cluster resilience with controlled failures
  2. Advanced Analytics: Machine learning-based anomaly detection
  3. Multi-Region Monitoring: Global cluster monitoring and alerting
  4. Cost Optimization: Resource utilization analysis and right-sizing recommendations

By following this comprehensive guide, your organization will establish a world-class MariaDB Database SRE ecosystem that ensures high availability, optimal performance, and operational excellence.


Ready to implement enterprise-grade MariaDB monitoring? Contact our database experts for customized implementation and ongoing support services.

 

Troubleshooting Galera Cluster for performance

 

Monitoring MySQL Group Replication Performance

 

Troubleshooting Writes in Galera Cluster

About MinervaDB Corporation 83 Articles
Full-stack Database Infrastructure Architecture, Engineering and Operations Consultative Support(24*7) Provider for PostgreSQL, MySQL, MariaDB, MongoDB, ClickHouse, Trino, SQL Server, Cassandra, CockroachDB, Yugabyte, Couchbase, Redis, Valkey, NoSQL, NewSQL, Databricks, Amazon Resdhift, Amazon Aurora, CloudSQL, Snowflake and AzureSQL with core expertize in Performance, Scalability, High Availability, Database Reliability Engineering, Database Upgrades/Migration, and Data Security.

Be the first to comment

Leave a Reply