ProDG Mainframe — End-to-End Evaluation

Date: 2026-04-27
Scope: Phases 0–8 (Complete Infrastructure Deployment)


1. Process Overview

The ProDG Mainframe deployment was executed across 8 phases over multiple sessions, transitioning from a bare Ubuntu VPS to a fully operational infrastructure platform. This evaluation covers what worked, what didn’t, and lessons learned.


2. Phase-by-Phase Assessment

Phase 0: Foundation (SSH, Users, Docker)

Status: ✅ Complete
Assessment: Clean baseline. Docker + Compose installed without issues. UFW configured correctly. prodg user created for future service migration.

What Worked:

  • Standard Ubuntu package installs
  • Docker rootless setup deferred correctly (service user migration is Phase 9)

What Didn’t:

  • None significant

Phase 1: Tailscale Network (Headscale)

Status: ✅ Complete
Assessment: The most complex phase. Self-hosted Headscale control plane required significant troubleshooting.

What Worked:

  • Headscale container deployed and initialized
  • Pre-auth key generation worked
  • Linux node (mainframe) registered successfully

What Didn’t (Initially):

  • macOS client refused to connect to HTTP control plane (no --accept-risk-of-no-https flag on macOS)
  • Cloudflare orange-cloud broke Let’s Encrypt HTTP-01 challenges
  • SSH tunnel approach failed due to local SSH agent issues

Resolution:

  • Deployed Caddy reverse proxy with TLS termination
  • Switched Cloudflare records to DNS-only (grey cloud)
  • Let Caddy’s automatic HTTPS complete the ACME challenges directly once the records were DNS-only
  • macOS client connected successfully once https://headscale.prodg.studio was available

Lesson Learned: macOS Tailscale client enforces HTTPS for all non-localhost control planes. Always provision TLS before attempting macOS enrollment.
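
In Caddyfile terms, the working pattern is a plain reverse-proxy site block; Caddy’s automatic HTTPS handles certificate issuance once the record is DNS-only. A minimal sketch (the upstream port matches the Headscale listener noted in the risk table below; adjust if the deployment differs):

    headscale.prodg.studio {
        reverse_proxy localhost:8080
    }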


Phase 2: Core Data Layer (Postgres, Redis, MinIO)

Status: ✅ Complete
Assessment: Straightforward container deployment. All services healthy.

What Worked:

  • PostgreSQL 16 with persistent volume
  • Redis 7 with AUTH and AOF persistence
  • MinIO with dual-port (API + Console)

What Didn’t:

  • None
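
For orientation, a minimal docker-compose sketch of this layer (image tags match the versions above; volume paths, variable names, and port mappings are illustrative, not the deployed config):

    services:
      postgres:
        image: postgres:16
        environment:
          POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
        volumes:
          - /opt/prodg/data/postgres:/var/lib/postgresql/data
      redis:
        image: redis:7
        command: redis-server --requirepass ${REDIS_PASSWORD} --appendonly yes
      minio:
        image: minio/minio
        command: server /data --console-address ":9001"
        ports:
          - "9000:9000"  # S3 API
          - "9001:9001"  # Console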

Phase 3: Secrets Layer (Infisical + Vaultwarden)

Status: ✅ Complete
Assessment: Both services deployed and accessible. Infisical super admin created by Mitch.

What Worked:

  • Infisical connected to existing Postgres + Redis
  • Vaultwarden connected to Postgres
  • Both reverse-proxied through Caddy with TLS

What Didn’t:

  • Grafana Telegram alerting initially failed because TELEGRAM_BOT_TOKEN was empty in provisioning
  • Grafana exits hard if any part of alerting provisioning fails — no graceful degradation

Resolution:

  • Used placeholder token to allow Grafana to boot
  • Recreated the container with docker compose up -d so the corrected variable was injected (docker compose restart does not pick up environment changes)

Lesson Learned: Grafana’s alerting provisioning is all-or-nothing. Missing any required field in a contact point causes the entire provisioning module to fail, preventing Grafana startup.
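
To illustrate the failure mode, a contact-point provisioning file shaped like the sketch below aborts Grafana startup whenever the token expands to an empty string (names and file path are illustrative):

    # provisioning/alerting/contact-points.yml
    apiVersion: 1
    contactPoints:
      - orgId: 1
        name: telegram
        receivers:
          - uid: telegram-default
            type: telegram
            settings:
              bottoken: ${TELEGRAM_BOT_TOKEN}  # empty value fails provisioning; Grafana exits
              chatid: "${TELEGRAM_CHAT_ID}"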


Phase 4: Caddy + TLS + DNS

Status: ✅ Complete
Assessment: Critical infrastructure layer. Took multiple iterations.

What Worked:

  • Caddy container serving all 7 domains
  • Let’s Encrypt certificates issued automatically
  • HSTS headers, compression, security headers applied

What Didn’t:

  • Caddy admin API default localhost-only blocked Prometheus scraping from Docker network
  • Caddy reload rejects admin listener mutations from non-localhost origins

Resolution:

  • Changed the Caddyfile global options block from debug to admin 0.0.0.0:2019
  • Required container restart (not just caddy reload) because the admin listener change is treated as a security-sensitive mutation

Lesson Learned: Caddy’s live reload has security guardrails for admin listener changes. Plan for container restarts when modifying the admin socket.
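
The change itself is one line in the Caddyfile global options block, applied with a container restart (docker compose restart caddy) rather than a caddy reload:

    {
        admin 0.0.0.0:2019
    }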


Phase 5: Observability (Prometheus + Grafana)

Status: ✅ Complete
Assessment: Full metrics visibility achieved.

What Worked:

  • All 6 initial scrape targets healthy
  • Grafana auto-provisioned Prometheus datasource
  • External HTTPS access confirmed

What Didn’t:

  • Prometheus config validation passed but caddy target showed DOWN
  • Root cause: Caddy admin API bound to localhost only

Lesson Learned: Prometheus target DOWN doesn’t always mean the service is down — check scrape URL accessibility from within the Prometheus container network.
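
The scrape job in question, as a sketch (Caddy 2 serves Prometheus metrics from its admin endpoint under /metrics; the job name and container hostname are assumptions):

    scrape_configs:
      - job_name: caddy
        metrics_path: /metrics
        static_configs:
          - targets: ["caddy:2019"]

A quick check from inside the Prometheus container (for example wget -qO- http://caddy:2019/metrics) distinguishes "service down" from "target unreachable from this network".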


Phase 5d/e: Loki + Promtail + Telegram Alerting

Status: ✅ Complete
Assessment: Extended observability successfully.

What Worked:

  • Loki deployed with filesystem storage
  • Promtail tailing Docker logs and systemd journal
  • Loki datasource auto-provisioned in Grafana
  • 4 alert rules active (CPU, RAM, Disk, Service Down)

What Didn’t:

  • Loki latest (v3.7.1) rejected schema v11 config
  • Required allow_structured_metadata: false in limits_config
  • Permission denied on /loki/rules — container UID mismatch

Resolution:

  • Added allow_structured_metadata: false
  • chown -R 10001:10001 /opt/prodg/data/loki (Loki container runs as UID 10001)

Lesson Learned: Always check container UIDs for volume mounts. Loki v3.x has stricter config validation than v2.x.
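
Both fixes are small. In config and shell form (paths as cited above):

    # Loki config fragment
    limits_config:
      allow_structured_metadata: false

    # On the host, match the volume owner to the container UID
    chown -R 10001:10001 /opt/prodg/data/loki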


Phase 6: Multi-Agent Orchestration

Status: ✅ Complete
Assessment: Custom FastAPI application built and deployed.

What Worked:

  • FastAPI app with asyncpg for database operations
  • Agent registry, task queue, and dispatch endpoints
  • Docker socket mounted read-only for local container spawning
  • Prometheus custom metrics endpoint

What Didn’t:

  • SyntaxError in Python f-string for SSH command construction
  • Docker compose *** password masking broke database URL parsing
  • FastAPI PlainTextResponse missing from imports caused /metrics to return JSON

Resolution:

  • Replaced the triple-quoted f-string with a single-quoted outer string so the inner quotes no longer collided
  • Changed compose env interpolation from *** to actual ${VAR} reference
  • Added PlainTextResponse import for Prometheus content-type compliance

Lesson Learned: Docker Compose’s automatic secret masking in env vars (***) is helpful for logs but breaks string interpolation if the masked value appears in a URL.
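
A sketch of the /metrics fix, assuming the standard prometheus_client library generates the payload (handler shape is illustrative, not the deployed code):

    from fastapi import FastAPI
    from fastapi.responses import PlainTextResponse
    from prometheus_client import CONTENT_TYPE_LATEST, generate_latest

    app = FastAPI()

    @app.get("/metrics")
    def metrics():
        # Without an explicit PlainTextResponse, FastAPI serializes the
        # return value as JSON, which Prometheus will not parse.
        return PlainTextResponse(generate_latest(), media_type=CONTENT_TYPE_LATEST)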


Phase 7: Backblaze B2 Backups

Status: ✅ Complete
Assessment: Backup infrastructure deployed and verified.

What Worked:

  • rclone installed and configured
  • All 3 backup scripts (Postgres, MinIO, Configs) functional
  • Master orchestrator with Telegram notifications
  • Cron job scheduled

What Didn’t:

  • rclone v1.60.1 doesn’t support the --b2-key-id CLI flag
  • On-the-fly :b2: backend syntax unsupported in this version
  • B2 Application Key restricted to single bucket (MainframeBackup)

Resolution:

  • Switched to RCLONE_B2_KEY_ID / RCLONE_B2_KEY environment variables
  • Created dynamic rclone.conf generator from .env
  • Discovered bucket restriction and updated all paths to MainframeBackup/

Lesson Learned: rclone CLI flags for backends vary by version. Environment variables are the most portable approach. Always verify B2 key permissions (bucket-restricted vs account-wide).
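
A sketch of the generator approach (the remote name, file locations, and .env variable names are assumptions based on the resolution above; account and key are the b2 backend’s config fields):

    #!/usr/bin/env bash
    set -euo pipefail
    source /opt/prodg/.env

    cat > /root/.config/rclone/rclone.conf <<EOF
    [b2]
    type = b2
    account = ${RCLONE_B2_KEY_ID}
    key = ${RCLONE_B2_KEY}
    EOF

    # All destinations live under the single permitted bucket:
    rclone copy /opt/prodg/backups/postgres b2:MainframeBackup/postgres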


3. Overall Metrics

Metric                         Value
-----------------------------  ----------------------
Total Phases                   8
Total Services Deployed        14 containers
Configuration Files Created    25+
Docker Images Built            1 (hermes-api)
DNS Records Configured         2 (1 apex, 1 wildcard)
Backup Targets                 3
Prometheus Targets             9
Alert Rules                    4
Cron Jobs                      1
Documentation Files            4

4. Risk Assessment

Risk                                      Severity  Mitigation
----------------------------------------  --------  ------------------------------------------
.env file on host contains all secrets    High      Planned migration to Infisical (Phase 9)
Root user used for all operations         Medium    Planned migration to prodg service user
Single point of failure (1 VPS)           Medium    B2 backups enable DR to a new host
Caddy admin API exposed to Docker net     Low       Network isolation; no host binding
Headscale exposed on 0.0.0.0:8080         Low       UFW allows it; required for control plane
No log encryption for B2 backups          Low       Gzip compressed but not encrypted at rest

5. Recommendations for Phase 9

  1. Migrate .env to Infisical — Eliminate plaintext secrets on disk
  2. Service user migration — Run containers as prodg user, not root
  3. Modal integration — Complete the /v1/dispatch/modal stub
  4. Dashboard provisioning — Auto-provision Grafana dashboards as JSON
  5. Log encryption — Add age/sops encryption to B2 backups (a sketch follows this list)
  6. Health dashboard — Build a /health endpoint for the entire stack
  7. Automated tests — Post-deployment smoke tests for each service
  8. Blue-green deploys — Hermes API rolling updates without downtime
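
For item 5, a minimal sketch of how age could slot into the existing backup pipeline (the recipient variable and file names are hypothetical):

    # Encrypt the compressed dump before upload (AGE_RECIPIENT is an age public key)
    gzip -c dump.sql | age -r "$AGE_RECIPIENT" > dump.sql.gz.age
    rclone copy dump.sql.gz.age b2:MainframeBackup/postgres/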

6. What Went Well

  1. Modular phase approach — Each phase built on the previous without major rework
  2. Early TLS — Getting Caddy + HTTPS working early prevented later security debt
  3. Observability first — Prometheus/Grafana deployed before custom code, enabling debugging
  4. Documentation discipline — Configs, scripts, and docs written as we went
  5. User involvement — Mitch’s active participation (DNS changes, credential provisioning) kept momentum

7. What Could Be Improved

  1. Pre-flight validation — Some issues (Caddy admin, Loki schema) could have been caught before deployment
  2. Version pinning discipline — latest tags used for some services; should pin to specific versions
  3. Secret rotation automation — Currently manual; should be API-driven via Infisical
  4. Testing harness — No automated integration tests for the full stack

Evaluation Version: 1.0.0 — Generated by Hermes Agent