ProDG Mainframe — End-to-End Evaluation
Date: 2026-04-27
Scope: Phases 0–8 (Complete Infrastructure Deployment)
1. Process Overview
The ProDG Mainframe deployment was executed across 8 phases over multiple sessions, transitioning from a bare Ubuntu VPS to a fully operational infrastructure platform. This evaluation covers what worked, what didn’t, and lessons learned.
2. Phase-by-Phase Assessment
Phase 0: Foundation (SSH, Users, Docker)
Status: ✅ Complete
Assessment: Clean baseline. Docker + Compose installed without issues. UFW configured correctly. prodg user created for future service migration.
What Worked:
- Standard Ubuntu package installs
- Docker rootless setup deferred correctly (service user migration is Phase 9)
What Didn’t:
- None significant
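For reference, a minimal sketch of the firewall baseline this phase established (the exact rule set is not recorded in this evaluation, so the non-SSH ports below are illustrative):

```bash
# Illustrative UFW baseline, not the recorded Phase 0 rule set
ufw default deny incoming
ufw default allow outgoing
ufw allow OpenSSH        # keep the management session alive
ufw allow 80,443/tcp     # Caddy HTTP/HTTPS (Phase 4)
ufw allow 8080/tcp       # Headscale control plane (Phase 1)
ufw --force enable
```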
Phase 1: Tailscale Network (Headscale)
Status: ✅ Complete
Assessment: The most complex phase. Self-hosted Headscale control plane required significant troubleshooting.
What Worked:
- Headscale container deployed and initialized
- Pre-auth key generation worked
- Linux node (`mainframe`) registered successfully (enrollment sketch below)
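A sketch of that enrollment flow, assuming the container and the Headscale user are named `headscale` and `prodg` (both assumptions; flag spelling also varies across Headscale versions):

```bash
# Generate a pre-auth key inside the Headscale container
docker exec headscale headscale preauthkeys create --user prodg --expiration 1h

# Register the Linux node against the self-hosted control plane
tailscale up --login-server https://headscale.prodg.studio --authkey <key>
```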
What Didn’t (Initially):
- macOS client refused to connect to the HTTP control plane (there is no `--accept-risk-of-no-https` flag on macOS)
- Cloudflare orange-cloud proxying broke Let's Encrypt HTTP-01 challenges
- SSH tunnel approach failed due to local SSH agent issues
Resolution:
- Deployed Caddy reverse proxy with TLS termination
- Switched Cloudflare records to DNS-only (grey cloud)
- Let Caddy's automatic HTTPS complete the ACME challenges once the records were DNS-only
- macOS client connected successfully once https://headscale.prodg.studio was available
Lesson Learned: macOS Tailscale client enforces HTTPS for all non-localhost control planes. Always provision TLS before attempting macOS enrollment.
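A minimal sketch of the resulting Caddy site block, assuming Headscale listens at `headscale:8080` on the Docker network (the upstream name and port are assumptions):

```caddyfile
# Caddy's automatic HTTPS obtains the Let's Encrypt certificate once the
# DNS record is grey-clouded (DNS-only)
headscale.prodg.studio {
    reverse_proxy headscale:8080
}
```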
Phase 2: Core Data Layer (Postgres, Redis, MinIO)
Status: ✅ Complete
Assessment: Straightforward container deployment. All services healthy.
What Worked:
- PostgreSQL 16 with persistent volume
- Redis 7 with AUTH and AOF persistence
- MinIO with dual-port (API + Console)
What Didn’t:
- None
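A condensed compose sketch of this layer; the image tags match the versions above, but the service names, volume paths, and variable names are illustrative rather than the actual compose file:

```yaml
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - /opt/prodg/data/postgres:/var/lib/postgresql/data

  redis:
    image: redis:7
    # AUTH via requirepass, persistence via append-only file (AOF)
    command: ["redis-server", "--requirepass", "${REDIS_PASSWORD}", "--appendonly", "yes"]
    volumes:
      - /opt/prodg/data/redis:/data

  minio:
    image: minio/minio
    command: ["server", "/data", "--console-address", ":9001"]
    ports:
      - "9000:9000"   # S3 API
      - "9001:9001"   # web console
    volumes:
      - /opt/prodg/data/minio:/data
```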
Phase 3: Secrets Layer (Infisical + Vaultwarden)
Status: ✅ Complete
Assessment: Both services deployed and accessible. Infisical super admin created by Mitch.
What Worked:
- Infisical connected to existing Postgres + Redis
- Vaultwarden connected to Postgres
- Both reverse-proxied through Caddy with TLS
What Didn’t:
- Grafana Telegram alerting initially failed because `TELEGRAM_BOT_TOKEN` was empty in provisioning
- Grafana exits hard if any alerting provisioning fails; there is no graceful degradation
Resolution:
- Used placeholder token to allow Grafana to boot
- Required re-creating the container with `docker compose up -d` (not just `restart`), since `restart` does not pick up changed environment variables
Lesson Learned: Grafana’s alerting provisioning is all-or-nothing. Missing any required field in a contact point causes the entire provisioning module to fail, preventing Grafana startup.
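A sketch of the contact-point provisioning file at fault; Grafana expands `$VAR` references from the container environment at startup, so the token must be non-empty when the container boots (the file path, receiver name, and chat ID are placeholders):

```yaml
# provisioning/alerting/contact-points.yaml (typical path, not confirmed)
apiVersion: 1
contactPoints:
  - orgId: 1
    name: telegram
    receivers:
      - uid: telegram-ops                 # placeholder uid
        type: telegram
        settings:
          bottoken: $TELEGRAM_BOT_TOKEN   # empty value aborts provisioning
          chatid: "-1001234567890"        # placeholder chat ID
```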
Phase 4: Caddy + TLS + DNS
Status: ✅ Complete
Assessment: Critical infrastructure layer. Took multiple iterations.
What Worked:
- Caddy container serving all 7 domains
- Let’s Encrypt certificates issued automatically
- HSTS headers, compression, security headers applied
What Didn’t:
- The Caddy admin API's default localhost-only binding blocked Prometheus from scraping it over the Docker network
- `caddy reload` rejects admin listener mutations issued from non-localhost origins
Resolution:
- Changed the global block from `debug` to `admin 0.0.0.0:2019`
- Required a container restart (not just `caddy reload`) because the admin listener change is treated as a security-sensitive mutation
Lesson Learned: Caddy’s live reload has security guardrails for admin listener changes. Plan for container restarts when modifying the admin socket.
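The change itself is a one-line global options block (a sketch; the rest of the Caddyfile is unchanged):

```caddyfile
{
    # Bind the admin API on all interfaces so Prometheus can reach
    # http://caddy:2019/metrics over the Docker network
    admin 0.0.0.0:2019
}
```

Because the container publishes no host port for 2019, the listener stays reachable only from the Docker network (see the risk table in section 4).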
Phase 5: Observability (Prometheus + Grafana)
Status: ✅ Complete
Assessment: Full metrics visibility achieved.
What Worked:
- All 6 initial scrape targets healthy
- Grafana auto-provisioned Prometheus datasource
- External HTTPS access confirmed
What Didn’t:
- Prometheus config validation passed, but the `caddy` target showed DOWN
- Root cause: the Caddy admin API was bound to localhost only
Lesson Learned: Prometheus target DOWN doesn’t always mean the service is down — check scrape URL accessibility from within the Prometheus container network.
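A quick way to run that check from the scraper's own network position (the compose service names `prometheus` and `caddy` are assumptions; the official Prometheus image ships a busybox `wget`):

```bash
# Does the scrape URL resolve and respond from inside the Prometheus container?
docker compose exec prometheus wget -qO- http://caddy:2019/metrics | head
```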
Phase 5d/e: Loki + Promtail + Telegram Alerting
Status: ✅ Complete
Assessment: Extended observability successfully.
What Worked:
- Loki deployed with filesystem storage
- Promtail tailing Docker logs and systemd journal
- Loki datasource auto-provisioned in Grafana
- 4 alert rules active (CPU, RAM, Disk, Service Down)
What Didn’t:
- Loki latest (v3.7.1) rejected the schema v11 config
- Required `allow_structured_metadata: false` in `limits_config`
- Permission denied on `/loki/rules` due to a container UID mismatch
Resolution:
- Added `allow_structured_metadata: false` to `limits_config`
- Ran `chown -R 10001:10001 /opt/prodg/data/loki` (the Loki container runs as UID 10001)
Lesson Learned: Always check container UIDs for volume mounts. Loki v3.x has stricter config validation than v2.x.
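The config side of the fix, as a sketch (the file name and any surrounding keys are assumptions; only the `limits_config` entry is taken from this deployment):

```yaml
# loki-config.yaml excerpt: keeps an existing schema v11 store working on
# Loki v3.x, which otherwise demands schema v13 for structured metadata
limits_config:
  allow_structured_metadata: false
```

Paired with the `chown -R 10001:10001` above, this let Loki start cleanly.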
Phase 6: Multi-Agent Orchestration
Status: ✅ Complete
Assessment: Custom FastAPI application built and deployed.
What Worked:
- FastAPI app with asyncpg for database operations
- Agent registry, task queue, and dispatch endpoints
- Docker socket mounted read-only for local container spawning
- Prometheus custom metrics endpoint
What Didn’t:
- SyntaxError in Python f-string for SSH command construction
- Docker Compose `***` password masking broke database URL parsing
- `PlainTextResponse` missing from the FastAPI imports caused `/metrics` to return JSON
Resolution:
- Fixed triple-quoted f-string with single-quoted outer string
- Changed the compose env interpolation from the literal `***` to an actual `${VAR}` reference
- Added the `PlainTextResponse` import for Prometheus content-type compliance
Lesson Learned: Docker Compose’s automatic secret masking in env vars (***) is helpful for logs but breaks string interpolation if the masked value appears in a URL.
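A self-contained sketch of the `/metrics` fix, assuming the app exposes metrics via `prometheus_client` (the endpoint shape is illustrative, not the actual hermes-api code):

```python
# FastAPI serializes return values to JSON by default, so the Prometheus
# exposition format needs an explicit plain-text response.
from fastapi import FastAPI
from fastapi.responses import PlainTextResponse
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest

app = FastAPI()

@app.get("/metrics")
def metrics() -> PlainTextResponse:
    # generate_latest() renders every registered collector in the
    # text exposition format that Prometheus expects.
    return PlainTextResponse(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```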
Phase 7: Backblaze B2 Backups
Status: ✅ Complete
Assessment: Backup infrastructure deployed and verified.
What Worked:
- rclone installed and configured
- All 3 backup scripts (Postgres, MinIO, Configs) functional
- Master orchestrator with Telegram notifications
- Cron job scheduled
What Didn’t:
- rclone v1.60.1 doesn't support the `--b2-key-id` CLI flag
- On-the-fly `:b2:` backend syntax is unsupported in this version
- The B2 application key is restricted to a single bucket (`MainframeBackup`)
Resolution:
- Switched to `RCLONE_B2_KEY_ID`/`RCLONE_B2_KEY` environment variables
- Created a dynamic `rclone.conf` generator from `.env`
- Discovered the bucket restriction and updated all paths to `MainframeBackup/`
Lesson Learned: rclone CLI flags for backends vary by version. Environment variables are the most portable approach. Always verify B2 key permissions (bucket-restricted vs account-wide).
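A sketch of the generator approach, assuming the credentials live in `/opt/prodg/.env` under the names `B2_KEY_ID` and `B2_APP_KEY` (path and variable names are assumptions). Note that rclone's b2 backend stores the application key ID in its `account` field:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Load credentials from .env (path and variable names are assumptions)
set -a; source /opt/prodg/.env; set +a

# Generate rclone.conf on the fly so no long-lived copy sits on disk
mkdir -p "$HOME/.config/rclone"
cat > "$HOME/.config/rclone/rclone.conf" <<EOF
[b2]
type = b2
account = ${B2_KEY_ID}
key = ${B2_APP_KEY}
EOF

# Every path must stay inside the bucket the key is restricted to
rclone copy /opt/prodg/backups/postgres b2:MainframeBackup/postgres
```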
3. Overall Metrics
| Metric | Value |
|---|---|
| Total Phases | 8 |
| Total Services Deployed | 14 containers |
| Configuration Files Created | 25+ |
| Docker Images Built | 1 (hermes-api) |
| DNS Records Configured | 2 (1 apex, 1 wildcard) |
| Backup Targets | 3 |
| Prometheus Targets | 9 |
| Alert Rules | 4 |
| Cron Jobs | 1 |
| Documentation Files | 4 |
4. Risk Assessment
| Risk | Severity | Mitigation |
|---|---|---|
| `.env` file on host contains all secrets | High | Planned migration to Infisical (Phase 9) |
| Root user used for all operations | Medium | Planned migration to prodg service user |
| Single point of failure (1 VPS) | Medium | B2 backups enable DR to new host |
| Caddy admin API exposed to Docker net | Low | Network isolation; no host binding |
| Headscale exposed on 0.0.0.0:8080 | Low | UFW allows it; required for control plane |
| No encryption at rest for B2 backups | Low | Backups are gzip-compressed but not encrypted at rest |
5. Recommendations for Phase 9
- Migrate `.env` to Infisical — Eliminate plaintext secrets on disk
- Service user migration — Run containers as the `prodg` user, not root
- Modal integration — Complete the `/v1/dispatch/modal` stub
- Dashboard provisioning — Auto-provision Grafana dashboards as JSON
- Backup encryption — Add age/sops encryption to B2 backups (see the sketch after this list)
- Health dashboard — Build a `/health` endpoint for the entire stack
- Automated tests — Post-deployment smoke tests for each service
- Blue-green deploys — Hermes API rolling updates without downtime
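For the backup-encryption item, a minimal sketch using age (the recipient key and file names are placeholders):

```bash
# Encrypt each archive to a public age recipient before upload; only the
# matching private key, stored off-host, can decrypt during recovery
age -r age1qexampleexampleexampleexample... -o pg_dump.tar.gz.age pg_dump.tar.gz
rclone copy pg_dump.tar.gz.age b2:MainframeBackup/postgres
```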
6. What Went Well
- Modular phase approach — Each phase built on the previous without major rework
- Early TLS — Getting Caddy + HTTPS working early prevented later security debt
- Observability first — Prometheus/Grafana deployed before custom code, enabling debugging
- Documentation discipline — Configs, scripts, and docs written as we went
- User involvement — Mitch’s active participation (DNS changes, credential provision) kept momentum
7. What Could Be Improved
- Pre-flight validation — Some issues (Caddy admin, Loki schema) could have been caught before deployment
- Version pinning discipline — `latest` tags were used for some services; should pin to specific versions
- Secret rotation automation — Currently manual; should be API-driven via Infisical
- Testing harness — No automated integration tests for the full stack
Evaluation Version: 1.0.0 — Generated by Hermes Agent