ProDG Mainframe — Operational Runbook
Version: 1.0.0
Audience: Infrastructure operators, DevOps, CEO (Mitch)
Table of Contents
- Daily Operations
- Service Management
- Health Checks
- Backup Verification
- Log Analysis
- Credential Rotation
- Disaster Recovery
- Scaling
1. Daily Operations
1.1 Quick Health Dashboard
# On mainframe
ssh root@mainframe.prodg.studio
# Full stack status
cd /opt/prodg/compose && docker compose ps
# Prometheus targets (all should be UP)
docker exec prometheus wget -qO- http://localhost:9090/api/v1/targets 2>/dev/null | grep -o '"health":"up"' | wc -l
# Expected: 9
# API health
curl -s https://api.mainframe.prodg.studio/health
# Grafana health
curl -s https://metrics.mainframe.prodg.studio/api/health
1.2 Monitoring Alerts (Telegram)
Alerts are sent to Telegram group chat_id: -5267054745. Monitor for:
- High CPU >80% for 5m (warning)
- High Memory >85% for 5m (critical)
- Disk >85% (critical)
- Any service down for 2m (critical)
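The thresholds above can be sketched as a small shell helper for triaging a reading by hand. This is only an illustration: `alert_severity` is a hypothetical name, the "for 5m" duration is not modeled, and the real alert rules live in the monitoring stack, not in this function.

```shell
# alert_severity METRIC VALUE
# Hypothetical helper: maps a percentage reading to the severity listed above.
# It compares a single sample only; the "for 5m" hold time is not modeled.
alert_severity() (
  metric="$1"; value="$2"
  case "$metric" in
    cpu)    limit=80; level=warning ;;
    memory) limit=85; level=critical ;;
    disk)   limit=85; level=critical ;;
    *) echo "unknown metric: $metric" >&2; exit 2 ;;
  esac
  # awk handles decimal readings, which plain shell arithmetic cannot.
  awk -v v="$value" -v l="$limit" -v s="$level" \
    'BEGIN { print ((v + 0 > l + 0) ? s : "ok") }'
)
```

For example, `alert_severity cpu 83` prints `warning`, while a reading exactly at the limit prints `ok` because the thresholds are strict.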
1.3 Backup Verification
# Check last backup log
tail -30 /var/log/prodg-backup.log
# List B2 remote files
RCLONE_B2_KEY_ID=$(grep '^B2_KEY_ID=' /opt/prodg/compose/.env | cut -d= -f2-) \
RCLONE_B2_KEY=$(grep '^B2_KEY_SECRET=' /opt/prodg/compose/.env | cut -d= -f2-) \
rclone ls :b2:MainframeBackup
2. Service Management
2.1 Start/Stop/Restart Entire Stack
cd /opt/prodg/compose
# Start all
docker compose up -d
# Restart all
docker compose restart
# Stop all
docker compose down
# Stop + remove volumes (DATA LOSS — USE WITH CAUTION)
docker compose down -v
2.2 Restart Individual Service
cd /opt/prodg/compose
docker compose restart <service_name>
# e.g., docker compose restart hermes-api
2.3 View Service Logs
# Last 50 lines
docker logs --tail 50 <container_name>
# Follow live
docker logs -f <container_name>
# Via Loki (Grafana)
# Open https://metrics.mainframe.prodg.studio → Explore → Loki datasource
2.4 Rebuild Custom Image (Hermes API)
cd /opt/prodg/hermes-api
docker build -t prodg/hermes-api:latest .
cd /opt/prodg/compose
docker compose up -d hermes-api
3. Health Checks
3.1 Service-by-Service
| Service | Check Command | Expected |
|---|---|---|
| Caddy | curl -s -o /dev/null -w '%{http_code}' https://headscale.prodg.studio | HTTP 200 |
| Headscale | headscale nodes list | mainframe online |
| PostgreSQL | docker exec postgres pg_isready -U prodg | accepting connections |
| Redis | `docker exec redis redis-cli -a "$(grep '^REDIS_PASSWORD=' /opt/prodg/compose/.env \| cut -d= -f2-)" ping` | PONG |
| Infisical | curl -s -o /dev/null -w '%{http_code}' https://secrets.mainframe.prodg.studio | HTTP 200 |
| Vaultwarden | curl -s -o /dev/null -w '%{http_code}' https://vault.mainframe.prodg.studio | HTTP 200 |
| MinIO | curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:9000/minio/health/live | HTTP 200 |
| Grafana | curl -s https://metrics.mainframe.prodg.studio/api/health | {"database":"ok"} |
| Prometheus | docker exec prometheus wget -qO- http://localhost:9090/-/healthy | Prometheus Server is Healthy. |
| Hermes API | curl -s https://api.mainframe.prodg.studio/health | {"status":"ok"} |
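The table can be run as a batch with a small wrapper that executes a command and looks for the expected text in its output. This is a minimal sketch: `check` is a hypothetical helper, and the commented usage lines simply reuse commands and expected strings from the table.

```shell
# check NAME EXPECTED COMMAND...
# Hypothetical helper: runs COMMAND and prints "OK NAME" if its output
# contains EXPECTED as a literal substring, otherwise "FAIL NAME" (exit 1).
check() (
  name="$1"; expected="$2"; shift 2
  output="$("$@" 2>&1)"
  case "$output" in
    *"$expected"*) echo "OK $name" ;;
    *)             echo "FAIL $name"; exit 1 ;;
  esac
)

# Usage on the mainframe (commented out so the sketch runs anywhere):
# check "PostgreSQL" "accepting connections" docker exec postgres pg_isready -U prodg
# check "Prometheus" "Prometheus Server is Healthy." \
#   docker exec prometheus wget -qO- http://localhost:9090/-/healthy
```

Because each check returns non-zero on failure, the calls can be chained with `&&` or counted in a loop to produce a single pass/fail result.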
3.2 Tailnet Status
# On mainframe
tailscale status
# Expected output
# 100.64.0.1 mainframe mitch@ linux -
# 100.64.0.2 mitch-laptop mitch@ macOS active; direct <public_ip>:41641, tx 1234 rx 5678
4. Backup Verification
4.1 Manual Backup Trigger
/opt/prodg/backups/scripts/backup-all.sh
4.2 Verify Remote Files
source /opt/prodg/compose/.env
RCLONE_B2_KEY_ID="$B2_KEY_ID" RCLONE_B2_KEY="$B2_KEY_SECRET" rclone ls :b2:MainframeBackup
4.3 Restore PostgreSQL from Backup
# Download latest backup
source /opt/prodg/compose/.env
RCLONE_B2_KEY_ID="$B2_KEY_ID" RCLONE_B2_KEY="$B2_KEY_SECRET" \
rclone copy :b2:MainframeBackup/postgres/ /tmp/postgres-restore/
# Stop dependent services
cd /opt/prodg/compose
docker compose stop infisical vaultwarden hermes-api
# Restore (DANGER — overwrites all DBs)
# ls already prints the full path, so do not prefix the directory again
gunzip -c "$(ls -t /tmp/postgres-restore/*.sql.gz | head -1)" | \
docker exec -i postgres psql -U prodg
# Restart services
docker compose start infisical vaultwarden hermes-api
4.4 Restore MinIO from Backup
# Download latest backup (load B2 credentials first if this is a new shell)
source /opt/prodg/compose/.env
RCLONE_B2_KEY_ID="$B2_KEY_ID" RCLONE_B2_KEY="$B2_KEY_SECRET" \
rclone copy :b2:MainframeBackup/minio/ /tmp/minio-restore/
# Stop MinIO
cd /opt/prodg/compose
docker compose stop minio
# Extract
tar xzf "$(ls -t /tmp/minio-restore/*.tar.gz | head -1)" -C /opt/prodg/data/minio/
# Restart
docker compose start minio
5. Log Analysis
5.1 Container Logs
# All containers
docker compose logs --tail 100
# Specific service
docker compose logs --tail 100 hermes-api
# With timestamps
docker compose logs --tail 100 -t hermes-api
5.2 Loki Log Queries (via Grafana)
# All Docker container logs
{job="docker"}
# Specific container
{job="docker", container_name="hermes-api"}
# Error logs only
{job="docker"} |~ "error|ERROR|FATAL"
# Systemd journal
{job="systemd-journal"}
# SSH login attempts
{job="systemd-journal"} |= "sshd" |~ "Accepted|Failed"
5.3 Prometheus Queries
# CPU usage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk usage
(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100
# Pending Hermes tasks
hermes_tasks_pending
# Container memory usage in MiB (one series per container)
container_memory_usage_bytes / 1024 / 1024
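The memory and disk expressions above share one form, (1 - available / total) * 100. A minimal shell sketch of the same arithmetic can help when sanity-checking a dashboard value against byte counts read directly on the host; `used_percent` is a hypothetical helper name, not part of the stack.

```shell
# used_percent AVAILABLE TOTAL
# Hypothetical helper: prints (1 - AVAILABLE / TOTAL) * 100 with one decimal,
# mirroring the memory and disk usage queries. Fails if TOTAL is not positive.
used_percent() (
  awk -v a="$1" -v t="$2" 'BEGIN {
    if (t + 0 <= 0) exit 1
    printf "%.1f\n", (1 - a / t) * 100
  }'
)
```

With 2147483648 bytes available out of 8589934592, `used_percent 2147483648 8589934592` prints `75.0`, which is below the 85% memory threshold from section 1.2.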
6. Credential Rotation
6.1 Rotate Hermes API Token
# Generate new token
NEW_TOKEN=$(openssl rand -hex 32)
# Update .env
sed -i "s/HERMES_API_TOKEN=.*/HERMES_API_TOKEN=${NEW_TOKEN}/" /opt/prodg/compose/.env
# Restart API
cd /opt/prodg/compose && docker compose up -d hermes-api
6.2 Rotate PostgreSQL Password
# 1. Update .env
# 2. Apply to running DB
docker exec -i postgres psql -U prodg -c "ALTER USER prodg WITH PASSWORD 'newpassword';"
# 3. Restart all dependent containers
cd /opt/prodg/compose
docker compose restart infisical vaultwarden hermes-api
6.3 Rotate Backblaze B2 Key
- Generate new Application Key in Backblaze console
- Update B2_KEY_ID and B2_KEY_SECRET in .env
- Run /opt/prodg/backups/scripts/update-rclone-conf.sh
- Trigger test backup: /opt/prodg/backups/scripts/backup-all.sh
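The .env update in this procedure can be scripted instead of edited by hand. The sketch below assumes plain KEY=value lines in the file; `set_env_var` is a hypothetical helper, not one of the existing backup scripts. It avoids sed so that values containing `/`, `+` or `=` need no escaping.

```shell
# set_env_var FILE KEY VALUE
# Hypothetical helper: replaces the KEY=... line in FILE, or appends it if absent.
# Assumes plain KEY=value lines (no "export" prefix, no quoting).
set_env_var() (
  file="$1"; key="$2"; value="$3"
  tmp="$(mktemp)" || exit 1
  trap 'rm -f "$tmp"' EXIT
  # The value travels through the environment so awk does not
  # interpret backslashes or other special characters in it.
  ENV_VALUE="$value" awk -v k="$key" '
    index($0, k "=") == 1 { print k "=" ENVIRON["ENV_VALUE"]; found = 1; next }
    { print }
    END { if (!found) print k "=" ENVIRON["ENV_VALUE"] }
  ' "$file" > "$tmp" || exit 1
  # Overwrite in place so the file keeps its owner and permissions.
  cat "$tmp" > "$file"
)

# Usage on the mainframe (commented out; placeholders are not real keys):
# set_env_var /opt/prodg/compose/.env B2_KEY_ID "<new_key_id>"
# set_env_var /opt/prodg/compose/.env B2_KEY_SECRET "<new_key_secret>"
```

The remaining steps (regenerating the rclone config and running a test backup) still use the scripts listed above.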
7. Disaster Recovery
7.1 Complete Server Rebuild (from B2 backups)
# 1. Provision new Ubuntu 24.04 server
# 2. Install Docker + Docker Compose
apt-get update && apt-get install -y docker.io docker-compose-plugin rclone
# 3. Restore configs from B2
mkdir -p /opt/prodg
RCLONE_B2_KEY_ID=<key_id> RCLONE_B2_KEY=<key_secret> \
rclone copy :b2:MainframeBackup/configs/ /tmp/configs/
tar xzf "$(ls -t /tmp/configs/*.tar.gz | head -1)" -C /opt/prodg
# 4. Restore .env secrets (from Infisical or secure vault)
# Place .env at /opt/prodg/compose/.env
# 5. Start data layer first
cd /opt/prodg/compose
docker compose up -d postgres redis minio
# 6. Restore PostgreSQL
RCLONE_B2_KEY_ID=<key_id> RCLONE_B2_KEY=<key_secret> \
rclone copy :b2:MainframeBackup/postgres/ /tmp/postgres/
gunzip -c "$(ls -t /tmp/postgres/*.sql.gz | head -1)" | \
docker exec -i postgres psql -U prodg
# 7. Restore MinIO
RCLONE_B2_KEY_ID=<key_id> RCLONE_B2_KEY=<key_secret> \
rclone copy :b2:MainframeBackup/minio/ /tmp/minio/
tar xzf "$(ls -t /tmp/minio/*.tar.gz | head -1)" -C /opt/prodg/data/minio/
# 8. Start full stack
docker compose up -d
7.2 Headscale Recovery
If Headscale data is lost:
# Re-initialize
# Recent Headscale releases call namespaces "users" (matching the -u flag below)
docker exec headscale headscale users create prodg
docker exec headscale headscale preauthkeys create -u prodg -e 24h --reusable
# Re-register nodes on each device
tailscale up --login-server https://headscale.prodg.studio --authkey <new-key>
8. Scaling
8.1 Add New Tailscale Node
# On mainframe: generate pre-auth key
docker exec headscale headscale preauthkeys create -u prodg -e 24h
# On new device (Linux)
tailscale up --login-server https://headscale.prodg.studio --authkey <key>
# On new device (macOS)
tailscale up --login-server https://headscale.prodg.studio --authkey <key>
8.2 Add Prometheus Scrape Target
Edit /opt/prodg/compose/prometheus/prometheus.yml, add job, then:
docker exec prometheus kill -HUP 1
8.3 Add Hermes Agent (Tier 2)
# Register via API
curl -X POST https://api.mainframe.prodg.studio/v1/agents/register \
-H "X-API-Token: <token>" \
-H "Content-Type: application/json" \
-d '{
"name": "research-agent-01",
"tier": "tier-2-trusted",
"capabilities": ["research", "code-review"],
"host": "agent-01.local",
"tailscale_ip": "100.64.0.10"
}'
Runbook Version: 1.0.0, generated by Hermes Agent