Internal PKI & mTLS Infrastructure
A three-tier offline-root Certificate Authority with automated certificate lifecycle management, enforcing mutual TLS across a VLAN-segmented infrastructure.
Problem
Running multiple internal services (Grafana, Prometheus, Loki, Forgejo, Jellyfin) across VLAN-segmented networks creates a core security challenge: every service needs authenticated, encrypted access, but external certificate authorities are inappropriate for internal-only infrastructure. Generating self-signed certificates per service, trades one problem for several: each certificate is its own trust anchor with no centralized revocation, long-lived certificates expand the window of compromise if a key is leaked, and every service requires its own trust store configuration rather than trusting a single root authority. Scaling this across a growing number of services means duplicating certificate management logic on every host rather than centralizing it.
A proper PKI addresses these problems structurally: a single root of trust, centralized issuance and revocation through intermediate CAs, short-lived certificates with automated renewal, and proactive expiry monitoring.
Technical Approach
The PKI uses a three-tier offline-root model:
- Root CA (10-year validity, kept offline): The trust anchor.
- Server Intermediate CA (2-year validity): Signs server certificates for TLS termination at reverse proxies.
- Client Intermediate CA (2-year validity): Signs client certificates for mTLS authentication.
Splitting into two intermediates enables separate trust models. Media streaming services use standard TLS (server certificate only), while administrative interfaces like Grafana enforce mutual TLS, requiring both a valid server certificate and a client certificate before granting access. If a client certificate is compromised, it can be revoked via CRL without affecting server certificates or other clients.
The entire system uses a metadata-driven deployment model. Rather than hardcoding installation paths, each managed file declares its own deployment target:
# deploy-target: /opt/ca/templates/monitoring-grafana-server.cnf
[ req ]
default_bits = 2048
distinguished_name = req_distinguished_name
req_extensions = req_ext
prompt = no
[ req_distinguished_name ]
CN = monitoring.internal
[ req_ext ]
subjectAltName = @alt_names
[ alt_names ]
DNS.1 = monitoring.internal
A single deployment script discovers all managed files, parses their headers, and handles file placement, ownership, and permissions. Adding a new certificate template to the system requires only adding the header and avoids deployment logic changes or complicated deployment configuration management.
Implementation
The initial design used per-service certificates: individual server and client certs for every application (10+ server, 7+ client). It was immediately apparent that this was over-engineered. Services communicate internally via container networks where only the reverse proxies are network-facing. Per-service certificates added complexity without security benefit.
The infrastructure was consolidated to per-host reverse proxy certificates with a single admin client certificate for browser access. Performing the cleanup early the next day avoided accumulating technical debt in the automation that was being built around the certificate structure.
The test suite validates the entire PKI without requiring root access or network connectivity. The framework builds an ephemeral CA hierarchy in a temporary directory, issues test certificates, and validates across seven categories: certificate chain verification, CRL integrity, deployment consistency, renewal automation, alerting, deploy-script validation, and security checks (encryption algorithms, key sizes, hashing algorithms, permission enforcement). The entire suite tears down cleanly, leaving no persistent state.
Results
- Full certificate lifecycle automation: issuance, renewal, revocation, CRL distribution, deployment, health checks, and monitoring managed through a single orchestrated toolchain
- Comprehensive test coverage across seven categories, running in ephemeral environments with CI/CD integration
- Automated renewal cadence: 90-day server certificates renewed on 60-day cycles (30-day safety margin), daily expiry monitoring with email alerts, weekly CRL regeneration
- Zero certificate-related service outages since deployment
- Zero hardcoded secrets in the repository; all sensitive values loaded at runtime from gitignored configuration
Lessons Learned
- Design certificate scope around network boundaries, not application boundaries. Behind a reverse proxy, the proxy certificate is sufficient. Per-service certificates are justified only when services are independently network-exposed.
- Investing in lifecycle automation upfront pays for itself within the first renewal cycle. The monitoring and renewal scripts have run unattended through multiple 60-day cycles with zero manual intervention.