ProDG Mainframe — Operational Runbook

Version: 1.0.0
Audience: Infrastructure operators, DevOps, CEO (Mitch)


Table of Contents

  1. Daily Operations
  2. Service Management
  3. Health Checks
  4. Backup Verification
  5. Log Analysis
  6. Credential Rotation
  7. Disaster Recovery
  8. Scaling

1. Daily Operations

1.1 Quick Health Dashboard

# On mainframe
ssh root@mainframe.prodg.studio
 
# Full stack status
cd /opt/prodg/compose && docker compose ps
 
# Prometheus targets (all should be UP)
docker exec prometheus wget -qO- http://localhost:9090/api/v1/targets 2>/dev/null | grep -o '"health":"up"' | wc -l
# Expected: 9
 
# API health
curl -s https://api.mainframe.prodg.studio/health
 
# Grafana health
curl -s https://metrics.mainframe.prodg.studio/api/health
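
The target count can also be turned into a pass/fail check. The sketch below assumes the Prometheus targets JSON is piped in on stdin; count_up_targets and check_targets are hypothetical helper names, and the expected count of 9 comes from this runbook.

```shell
# Count targets reporting "health":"up" in a /api/v1/targets response read from stdin.
# grep -o prints one match per line, so the count is correct even though
# the API returns the whole response as a single line of JSON.
count_up_targets() {
  grep -o '"health":"up"' | wc -l | tr -d ' '
}

# Compare the count against the expected number of targets.
# Usage: <targets JSON on stdin> | check_targets <expected>
check_targets() {
  expected=$1
  actual=$(count_up_targets)
  if [ "$actual" -eq "$expected" ]; then
    echo "OK: $actual/$expected targets up"
  else
    echo "WARN: $actual/$expected targets up"
    return 1
  fi
}
```

Possible usage: docker exec prometheus wget -qO- http://localhost:9090/api/v1/targets | check_targets 9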

1.2 Monitoring Alerts (Telegram)

Alerts are sent to the Telegram group with chat_id -5267054745. Watch for:

  • High CPU >80% for 5m (warning)
  • High Memory >85% for 5m (critical)
  • Disk >85% (critical)
  • Any service down for 2m (critical)

1.3 Backup Verification

# Check last backup log
tail -30 /var/log/prodg-backup.log
 
# List B2 remote files
RCLONE_B2_KEY_ID=$(grep '^B2_KEY_ID=' /opt/prodg/compose/.env | cut -d= -f2-) \
RCLONE_B2_KEY=$(grep '^B2_KEY_SECRET=' /opt/prodg/compose/.env | cut -d= -f2-) \
rclone ls :b2:MainframeBackup
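
Reading single values out of .env comes up in several places in this runbook, so it may be worth a small helper. This is a minimal sketch, assuming .env holds unquoted KEY=value lines; env_value is a hypothetical name, not an existing script on the mainframe.

```shell
# Print the value of KEY from a dotenv-style file.
# Only lines that start with "KEY=" match, and any "=" inside the value is kept.
# Assumption: values are not wrapped in quotes.
env_value() {
  key=$1
  file=$2
  grep "^${key}=" "$file" | head -n 1 | cut -d= -f2-
}
```

Possible usage: RCLONE_B2_KEY_ID=$(env_value B2_KEY_ID /opt/prodg/compose/.env)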

2. Service Management

2.1 Start/Stop/Restart Entire Stack

cd /opt/prodg/compose
 
# Start all
docker compose up -d
 
# Restart all
docker compose restart
 
# Stop all
docker compose down
 
# Stop + remove volumes (DATA LOSS — USE WITH CAUTION)
docker compose down -v

2.2 Restart Individual Service

cd /opt/prodg/compose
docker compose restart <service_name>
# e.g., docker compose restart hermes-api

2.3 View Service Logs

# Last 50 lines
docker logs --tail 50 <container_name>
 
# Follow live
docker logs -f <container_name>
 
# Via Loki (Grafana)
# Open https://metrics.mainframe.prodg.studio → Explore → Loki datasource

2.4 Rebuild Custom Image (Hermes API)

cd /opt/prodg/hermes-api
docker build -t prodg/hermes-api:latest .
cd /opt/prodg/compose
docker compose up -d hermes-api
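
After a rebuild the API may take a few seconds to come back, so a single health probe can fail spuriously. The sketch below is one way to poll until it succeeds; retry is a hypothetical helper, and the attempt count and delay in the usage line are illustrative values only.

```shell
# Run a command until it succeeds, up to a maximum number of attempts.
# Usage: retry <attempts> <delay_seconds> <command> [args...]
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "gave up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}
```

Possible usage: retry 10 3 curl -fsS https://api.mainframe.prodg.studio/health
The -f flag makes curl exit non-zero on HTTP errors, which is what lets retry detect a failed probe.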

3. Health Checks

3.1 Service-by-Service

Service      Check Command                                                                          Expected
Caddy        curl -s -o /dev/null -w '%{http_code}' https://headscale.prodg.studio                  200
Headscale    docker exec headscale headscale nodes list                                             mainframe online
PostgreSQL   docker exec postgres pg_isready -U prodg                                               accepting connections
Redis        docker exec redis redis-cli -a "$(grep '^REDIS_PASSWORD=' .env | cut -d= -f2-)" ping  PONG
Infisical    curl -s -o /dev/null -w '%{http_code}' https://secrets.mainframe.prodg.studio          200
Vaultwarden  curl -s -o /dev/null -w '%{http_code}' https://vault.mainframe.prodg.studio            200
MinIO        curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:9000/minio/health/live         200
Grafana      curl -s https://metrics.mainframe.prodg.studio/api/health                              {"database":"ok"}
Prometheus   docker exec prometheus wget -qO- http://localhost:9090/-/healthy                       Prometheus Server is Healthy.
Hermes API   curl -s https://api.mainframe.prodg.studio/health                                      {"status":"ok"}

Run the Redis check from /opt/prodg/compose so that .env resolves. The curl checks that use -w print only the HTTP status code.
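
The HTTP checks in this table can be run in one pass. Note that plain curl -s prints the response body, not the status, so the sketch below asks curl for the status code explicitly. It is a sketch only: http_code, report and check_all are hypothetical helper names, and the 10-second timeout is an arbitrary choice.

```shell
# Fetch only the HTTP status code for a URL (curl prints 000 if the connection fails).
http_code() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}

# Print one result line for a service; return non-zero unless the code is 200.
report() {
  name=$1
  code=$2
  if [ "$code" = "200" ]; then
    echo "$name: OK"
  else
    echo "$name: FAIL (HTTP $code)"
    return 1
  fi
}

# Check every "name url" pair read from stdin; return non-zero if any check failed.
check_all() {
  failed=0
  while read -r svc url; do
    report "$svc" "$(http_code "$url")" || failed=1
  done
  return "$failed"
}

# Example invocation (left commented out so that sourcing this file has no side effects):
# check_all <<'EOF'
# Infisical https://secrets.mainframe.prodg.studio
# Vaultwarden https://vault.mainframe.prodg.studio
# MinIO http://127.0.0.1:9000/minio/health/live
# EOF
```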

3.2 Tailnet Status

# On mainframe
tailscale status
 
# Expected output
# 100.64.0.1   mainframe      mitch@      linux   -
# 100.64.0.2   mitch-laptop   mitch@      macOS   active; direct <public_ip>:41641, tx 1234 rx 5678

4. Backup Verification

4.1 Manual Backup Trigger

/opt/prodg/backups/scripts/backup-all.sh

4.2 Verify Remote Files

source /opt/prodg/compose/.env
RCLONE_B2_KEY_ID="$B2_KEY_ID" RCLONE_B2_KEY="$B2_KEY_SECRET" rclone ls :b2:MainframeBackup

4.3 Restore PostgreSQL from Backup

# Download latest backup
source /opt/prodg/compose/.env
RCLONE_B2_KEY_ID="$B2_KEY_ID" RCLONE_B2_KEY="$B2_KEY_SECRET" \
rclone copy :b2:MainframeBackup/postgres/ /tmp/postgres-restore/
 
# Stop dependent services
cd /opt/prodg/compose
docker compose stop infisical vaultwarden hermes-api
 
# Restore (DANGER — overwrites all DBs)
gunzip -c "$(ls -t /tmp/postgres-restore/*.sql.gz | head -1)" | \
    docker exec -i postgres psql -U prodg
 
# Restart services
docker compose start infisical vaultwarden hermes-api
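
Selecting the newest dump is the step most likely to go wrong silently, for example when the download directory is empty. The sketch below wraps it in a helper that fails loudly instead; latest_backup is a hypothetical name, and it assumes backup filenames contain no newlines.

```shell
# Print the path of the most recently modified file matching <dir>/*<suffix>.
# ls returns full paths when given a glob, so no directory prefix is prepended.
latest_backup() {
  dir=$1
  suffix=$2
  newest=$(ls -t "$dir"/*"$suffix" 2>/dev/null | head -n 1)
  if [ -z "$newest" ]; then
    echo "no *$suffix files in $dir" >&2
    return 1
  fi
  printf '%s\n' "$newest"
}
```

Possible usage: dump=$(latest_backup /tmp/postgres-restore .sql.gz) && gunzip -c "$dump" | docker exec -i postgres psql -U prodg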

4.4 Restore MinIO from Backup

# Download latest backup
source /opt/prodg/compose/.env
RCLONE_B2_KEY_ID="$B2_KEY_ID" RCLONE_B2_KEY="$B2_KEY_SECRET" \
rclone copy :b2:MainframeBackup/minio/ /tmp/minio-restore/

# Stop MinIO
cd /opt/prodg/compose
docker compose stop minio

# Extract
tar xzf "$(ls -t /tmp/minio-restore/*.tar.gz | head -1)" -C /opt/prodg/data/minio/
 
# Restart
docker compose start minio
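
Because extraction overwrites live data, it can be worth confirming that the downloaded archive is intact before stopping MinIO. The sketch below checks only the gzip stream, not the tar structure inside it; verify_gzip is a hypothetical helper name.

```shell
# Succeed only if the file exists, is non-empty, and passes a gzip integrity test.
verify_gzip() {
  archive=$1
  if [ ! -s "$archive" ]; then
    echo "missing or empty: $archive" >&2
    return 1
  fi
  gzip -t "$archive" 2>/dev/null || {
    echo "corrupt gzip stream: $archive" >&2
    return 1
  }
}
```

Possible usage: verify_gzip /tmp/minio-restore/<archive>.tar.gz && tar tzf /tmp/minio-restore/<archive>.tar.gz | head
Listing the contents with tar tzf is a cheap second check that the archive holds the expected paths.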

5. Log Analysis

5.1 Container Logs

# All containers
docker compose logs --tail 100
 
# Specific service
docker compose logs --tail 100 hermes-api
 
# With timestamps
docker compose logs --tail 100 -t hermes-api

5.2 Loki Log Queries (via Grafana)

# All Docker container logs
{job="docker"}

# Specific container
{job="docker", container_name="hermes-api"}

# Error logs only
{job="docker"} |~ "error|ERROR|FATAL"

# Systemd journal
{job="systemd-journal"}

# SSH login attempts
{job="systemd-journal"} |= "sshd" |~ "Accepted|Failed"

5.3 Prometheus Queries

# CPU usage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage
(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100

# Pending Hermes tasks
hermes_tasks_pending

# Container memory by name (MiB)
sum by (name) (container_memory_usage_bytes{name!=""}) / 1024 / 1024
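
These queries can also be run from the shell through the Prometheus HTTP API, which is handy when Grafana is the thing that is down. The sketch below is an assumption-laden convenience, not part of the stack: promql_value and promql_instant are hypothetical helpers, the sed pattern only handles a response with a single series, and the query is passed without URL encoding, so it only suits bare metric names.

```shell
# Extract the sample value from a single-series instant-query response on stdin.
# Prometheus encodes each sample as [<unix_time>, "<value>"].
promql_value() {
  sed -n 's/.*"value":\[[^,]*,"\([^"]*\)"\].*/\1/p'
}

# Run an instant query from inside the prometheus container and print the value.
# The query is not URL-encoded, so pass a bare metric name only.
promql_instant() {
  docker exec prometheus wget -qO- "http://localhost:9090/api/v1/query?query=$1" | promql_value
}
```

Possible usage: promql_instant hermes_tasks_pending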

6. Credential Rotation

6.1 Rotate Hermes API Token

# Generate new token
NEW_TOKEN=$(openssl rand -hex 32)
 
# Update .env
sed -i "s/^HERMES_API_TOKEN=.*/HERMES_API_TOKEN=${NEW_TOKEN}/" /opt/prodg/compose/.env
 
# Restart API
cd /opt/prodg/compose && docker compose up -d hermes-api
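
The .env update can be made a little more defensive: match the key exactly, append it if it is missing, and avoid sed's special handling of / and & in the replacement. The sketch below shows one way to do that; set_env_value is a hypothetical helper, and because awk -v interprets backslash escapes it is meant for hex or base64 tokens rather than arbitrary strings.

```shell
# Set KEY=VALUE in a dotenv-style file, replacing the existing line or appending one.
# The target file is rewritten through redirection, so its permissions are preserved.
set_env_value() {
  key=$1
  value=$2
  file=$3
  tmp=$(mktemp) || return 1
  if awk -F= -v k="$key" -v v="$value" '
       $1 == k { print k "=" v; found = 1; next }
       { print }
       END { if (!found) print k "=" v }
     ' "$file" > "$tmp"; then
    cat "$tmp" > "$file"
    status=$?
  else
    status=1
  fi
  rm -f "$tmp"
  return "$status"
}
```

Possible usage: set_env_value HERMES_API_TOKEN "$(openssl rand -hex 32)" /opt/prodg/compose/.env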

6.2 Rotate PostgreSQL Password

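# 1. Update the PostgreSQL password in /opt/prodg/compose/.env
# 2. Apply to running DB (replace <new_password> with the value from step 1)
docker exec -i postgres psql -U prodg -c "ALTER USER prodg WITH PASSWORD '<new_password>';"

# 3. Recreate dependent containers so they pick up the new value
#    (docker compose restart does not re-read .env)
cd /opt/prodg/compose
docker compose up -d infisical vaultwarden hermes-api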
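
If the new password contains a single quote, the ALTER USER statement breaks or, worse, runs something unintended. The sketch below quotes a value as a SQL string literal by doubling embedded single quotes; sql_literal is a hypothetical helper, and it assumes PostgreSQL's default standard_conforming_strings=on, where backslashes are not special.

```shell
# Quote a string as a SQL literal by doubling any embedded single quotes.
sql_literal() {
  printf "'%s'" "$(printf '%s' "$1" | sed "s/'/''/g")"
}
```

Possible usage: docker exec -i postgres psql -U prodg -c "ALTER USER prodg WITH PASSWORD $(sql_literal "$NEW_PASSWORD");"
Here NEW_PASSWORD is an illustrative variable holding the value written to .env.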

6.3 Rotate Backblaze B2 Key

  1. Generate new Application Key in Backblaze console
  2. Update B2_KEY_ID and B2_KEY_SECRET in .env
  3. Run /opt/prodg/backups/scripts/update-rclone-conf.sh
  4. Trigger test backup: /opt/prodg/backups/scripts/backup-all.sh

7. Disaster Recovery

7.1 Complete Server Rebuild (from B2 backups)

# 1. Provision new Ubuntu 24.04 server
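# 2. Install Docker, Docker Compose, and rclone
#    (docker-compose-v2 is the Ubuntu archive package; docker-compose-plugin needs Docker's own apt repository)
apt-get update && apt-get install -y docker.io docker-compose-v2 rclone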
 
# 3. Restore configs from B2
mkdir -p /opt/prodg
RCLONE_B2_KEY_ID=<key_id> RCLONE_B2_KEY=<key_secret> \
rclone copy :b2:MainframeBackup/configs/ /tmp/configs/
tar xzf "$(ls -t /tmp/configs/*.tar.gz | head -1)" -C /opt/prodg
 
# 4. Restore .env secrets (from Infisical or secure vault)
# Place .env at /opt/prodg/compose/.env
 
# 5. Start data layer first
cd /opt/prodg/compose
docker compose up -d postgres redis minio
 
# 6. Restore PostgreSQL
RCLONE_B2_KEY_ID=<key_id> RCLONE_B2_KEY=<key_secret> \
rclone copy :b2:MainframeBackup/postgres/ /tmp/postgres/
gunzip -c "$(ls -t /tmp/postgres/*.sql.gz | head -1)" | \
    docker exec -i postgres psql -U prodg
 
# 7. Restore MinIO
RCLONE_B2_KEY_ID=<key_id> RCLONE_B2_KEY=<key_secret> \
rclone copy :b2:MainframeBackup/minio/ /tmp/minio/
tar xzf "$(ls -t /tmp/minio/*.tar.gz | head -1)" -C /opt/prodg/data/minio/
 
# 8. Start full stack
docker compose up -d

7.2 Headscale Recovery

If Headscale data is lost:

# Re-initialize
docker exec headscale headscale users create prodg
docker exec headscale headscale preauthkeys create -u prodg -e 24h --reusable
 
# Re-register nodes on each device
tailscale up --login-server https://headscale.prodg.studio --authkey <new-key>

8. Scaling

8.1 Add New Tailscale Node

# On mainframe: generate pre-auth key
docker exec headscale headscale preauthkeys create -u prodg -e 24h
 
# On new device (Linux)
tailscale up --login-server https://headscale.prodg.studio --authkey <key>
 
# On new device (macOS)
tailscale up --login-server https://headscale.prodg.studio --authkey <key>

8.2 Add Prometheus Scrape Target

Edit /opt/prodg/compose/prometheus/prometheus.yml, add the new scrape job under scrape_configs, then send SIGHUP so Prometheus reloads its configuration:

docker exec prometheus kill -HUP 1
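
For reference, a static scrape job has the shape printed by the sketch below. scrape_job is a hypothetical helper, and the job name and target in the usage line are made-up examples (9100 is the usual node_exporter port), so substitute the real values.

```shell
# Print a static scrape job in prometheus.yml syntax,
# indented to sit directly under the "scrape_configs:" key.
scrape_job() {
  job=$1
  target=$2
  cat <<EOF
  - job_name: "$job"
    static_configs:
      - targets: ["$target"]
EOF
}
```

Possible usage: scrape_job agent-01 100.64.0.10:9100
Paste the output under scrape_configs. Before sending SIGHUP, the edited file can be validated with docker exec prometheus promtool check config /etc/prometheus/prometheus.yml, assuming the config is mounted at the image's default path.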

8.3 Add Hermes Agent (Tier 2)

# Register via API
curl -X POST https://api.mainframe.prodg.studio/v1/agents/register \
  -H "X-API-Token: <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "research-agent-01",
    "tier": "tier-2-trusted",
    "capabilities": ["research", "code-review"],
    "host": "agent-01.local",
    "tailscale_ip": "100.64.0.10"
  }'

Runbook Version: 1.0.0 — Generated by Hermes Agent