GitOps engine — Terraform → ArgoCD
This page complements the project page with the design choices and the migration sequence.
The architecture

    ┌──────────────────────────────────────────────────────────────────────┐
    │                                                                      │
    │   ┌──────────────────────────────────────────────┐                   │
    │   │  gitops-repo (Git, source of truth)          │                   │
    │   │   - bootstrap/                               │                   │
    │   │   - applicationsets/                         │                   │
    │   │   - cluster-overrides/                       │                   │
    │   └──────────────────────────────────────────────┘                   │
    │                          │                                           │
    │                          ▼                                           │
    │   ┌──────────────────────────────────────────────┐                   │
    │   │  ArgoCD on staging-devops                    │                   │
    │   │  ArgoCD on prod-devops                       │                   │
    │   │   - app-of-apps bootstrap                    │                   │
    │   │   - ApplicationSet controller                │                   │
    │   │   - Server-Side Diff enabled                 │                   │
    │   │   - webhook HA (2 replicas in prod)          │                   │
    │   │   - JumpCloud SSO                            │                   │
    │   └──────────────────────────────────────────────┘                   │
    │        │         │         │         │         │                     │
    │        ▼         ▼         ▼         ▼         ▼                     │
    │   ┌────────┐┌────────┐┌────────┐┌────────┐┌────────┐                 │
    │   │  lab   ││ shared ││platform││ engage ││ survey │                 │
    │   │workload││workload││workload││workload││workload│                 │
    │   │  EKS   ││  EKS   ││  EKS   ││  EKS   ││  EKS   │                 │
    │   └────────┘└────────┘└────────┘└────────┘└────────┘                 │
    │                                                                      │
    │   Each workload cluster is registered to ArgoCD via                  │
    │   Pod Identity + AssumeRole — NO bearer tokens.                      │
    │                                                                      │
    └──────────────────────────────────────────────────────────────────────┘

Why two devops clusters
staging-devops and prod-devops are isolated from workload clusters intentionally:
- A workload-cluster outage (k8s upgrade, capacity blip, blast-radius incident) doesn’t take down the controller for the rest of the fleet.
- Upgrade cadence of the engine cluster is decoupled from workload upgrades — I upgrade the engine on a slower, more conservative schedule.
- IAM, secrets, SSO, RBAC live in one cluster per environment instead of being smeared across 5+ workload clusters.
There are two engine clusters rather than one because the staging engine should never hold IAM permissions that can sync prod targets. The prod engine cluster lives in a separate AWS account with its own trust relationship to the prod target accounts, so the blast radius is enforced at the AWS IAM layer.
Self-managing app-of-apps
The bootstrap looks like:

    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: root
      namespace: argocd
    spec:
      project: default
      source:
        repoURL: https://gitlab.com/<your-org>/gitops-repo.git
        path: applicationsets/
        targetRevision: main
      destination:
        server: https://kubernetes.default.svc
        namespace: argocd
      syncPolicy:
        automated: { prune: true, selfHeal: true }

Then applicationsets/ contains:
- argocd-self.yaml — an ApplicationSet that manages ArgoCD itself (the controller’s own Helm chart)
- addons-foundation.yaml — cert-manager, external-secrets, external-dns, AWS LBC
- addons-policy.yaml — Kyverno, VPA
- addons-observability.yaml — datadog-operator (dormant scaffolding), metrics-server, node-local-dns
After the very first terraform apply to lay down bootstrap/root.yaml, the terraform import block was removed. The platform now reconciles its own desired state — change a chart version in Git, ArgoCD picks it up, applies fleet-wide.
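As a rough illustration of what argocd-self.yaml could contain, the sketch below pins ArgoCD's own Helm chart through an ApplicationSet. The list generator, the upstream argo-helm repository, the release name, the sync settings, and the chart-version placeholder are all assumptions made for the example, not details taken from the actual repo.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: argocd-self
  namespace: argocd
spec:
  generators:
    # Assumption: a single-element list generator targeting the local (engine) cluster.
    - list:
        elements:
          - cluster: in-cluster
            url: https://kubernetes.default.svc
  template:
    metadata:
      name: 'argocd-{{cluster}}'
    spec:
      project: default
      source:
        # Assumption: the community argo-helm chart; the real repo may vendor its own copy.
        repoURL: https://argoproj.github.io/argo-helm
        chart: argo-cd
        targetRevision: '<chart-version>'  # placeholder; bumping it in Git triggers the upgrade
        helm:
          releaseName: argocd
      destination:
        server: '{{url}}'
        namespace: argocd
      syncPolicy:
        # Assumption: prune is left off so a bad commit cannot delete the engine's own resources.
        automated: { prune: false, selfHeal: true }
```

With something like this in place, an ArgoCD upgrade becomes a one-line change to targetRevision in Git.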
ApplicationSet fan-out
Example: cert-manager across all workload clusters

    apiVersion: argoproj.io/v1alpha1
    kind: ApplicationSet
    metadata:
      name: cert-manager
      namespace: argocd
    spec:
      generators:
        - clusters:
            selector:
              matchLabels:
                workload-cluster: "true"
      template:
        metadata:
          name: 'cert-manager-{{name}}'
        spec:
          project: addons
          source:
            repoURL: https://gitlab.com/<your-org>/gitops-repo.git
            path: charts/cert-manager
            targetRevision: main
            helm:
              valueFiles:
                - 'values.yaml'
                - 'cluster-overrides/{{name}}/cert-manager.yaml'  # optional per-cluster
          destination:
            server: '{{server}}'
            namespace: cert-manager
          syncPolicy:
            automated: { prune: true, selfHeal: true }
            syncOptions:
              - ServerSideApply=true
              - CreateNamespace=true

When a new cluster gets the workload-cluster: "true" label in ArgoCD’s cluster registry, cert-manager appears in that cluster automatically — no PR, no manual helm install.
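One caveat on the optional per-cluster override: by default ArgoCD fails manifest generation when a listed values file does not exist. A hedged sketch of the source block with ignoreMissingValueFiles set, assuming the override file really is absent for some clusters and nothing else in the setup already handles that:

```yaml
source:
  repoURL: https://gitlab.com/<your-org>/gitops-repo.git
  path: charts/cert-manager
  targetRevision: main
  helm:
    # Skip any listed values file that is not present in the repo
    # instead of failing the render.
    ignoreMissingValueFiles: true
    valueFiles:
      - 'values.yaml'
      - 'cluster-overrides/{{name}}/cert-manager.yaml'
```

Relative valueFiles entries are resolved against the chart directory given in path, so this assumes the overrides are reachable from charts/cert-manager.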
Server-Side Diff
ArgoCD’s classic diff strategy compares full rendered manifests. That generates noise on resources that admission webhooks or CRD controllers mutate after creation — for example, the caBundle field on a MutatingWebhookConfiguration, or controller-managed annotations.
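For contrast, the usual way to quiet this noise under the classic strategy is a per-Application ignoreDifferences entry. A minimal sketch for the caBundle case, with the jq path chosen for illustration:

```yaml
spec:
  ignoreDifferences:
    # Ignore the CA bundle that a controller injects after creation.
    - group: admissionregistration.k8s.io
      kind: MutatingWebhookConfiguration
      jqPathExpressions:
        - '.webhooks[]?.clientConfig.caBundle'
```

Rules like this have to be written and maintained by hand for each mutated field.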
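Server-Side Diff defers diffing to the API server: ArgoCD runs a Server-Side Apply in dry-run mode for each resource and compares the predicted result with the live state, so server-side defaulting and fields owned by other managers no longer show up as drift. Result: no more OutOfSync on cosmetic mutations. Enabled at the controller level, in the argocd-cmd-params-cm ConfigMap: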

    data:
      controller.diff.server.side: "true"
      controller.sync.server.side.apply: "true"

Cluster registration without bearer tokens
The default argocd cluster add flow:
- Creates a ServiceAccount in the target cluster
- Generates a Secret containing the SA’s bearer token
- Stores the token in ArgoCD’s cluster registry as a Secret
Three problems: long-lived credentials, cluster-bound credentials living in another cluster, and no rotation story. So I built the alternative:
1. Workload cluster has an EKS Access Entry mapped to a cluster-admin (or namespace-scoped) k8s RBAC role:

        resource "aws_eks_access_entry" "argocd" {
          cluster_name  = module.eks.cluster_name
          principal_arn = aws_iam_role.argocd_cluster_role.arn
          # EKS rejects access-entry groups that start with "system:", so
          # "argocd-admins" is a placeholder group bound to cluster-admin
          # (or a scoped role) by a ClusterRoleBinding in the cluster.
          kubernetes_groups = ["argocd-admins"]
        }

2. The role in the workload account (argocd_cluster_role) has a trust policy permitting the ArgoCD Pod Identity role in the engine account to assume it:

        {
          "Effect": "Allow",
          "Principal": { "AWS": "arn:aws:iam::ENGINE_ACCOUNT:role/argocd-pod-identity" },
          "Action": "sts:AssumeRole"
        }

3. ArgoCD pods in the engine cluster run with a Pod Identity association mapped to argocd-pod-identity: no static creds, no IRSA OIDC fan-out.

4. Register the cluster in ArgoCD with the AssumeRole config (via the argocd cluster add --aws-cluster-name + --aws-role-arn flags or, equivalently, a Secret labeled argocd.argoproj.io/secret-type: cluster with awsAuthConfig set).

5. Result: No bearer token is ever written. EKS auth flows through STS AssumeRole, expiring tokens are refreshed by the AWS SDK transparently, and access can be revoked by deleting the IAM role / Access Entry, with no Secret rotation needed.
I packaged this end-to-end as the argocd-eks-cluster-onboard Claude Code skill so it’s runnable in 5 minutes for a new cluster.
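The declarative form of the cluster registration step could look roughly like the Secret below. The Secret name, cluster name, API endpoint, account placeholder, IAM role name, and CA data are illustrative; the workload-cluster label is the one the cert-manager ApplicationSet selector matches on.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: cluster-lab                            # hypothetical name
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster    # marks this Secret as a cluster entry
    workload-cluster: "true"                   # picked up by the clusters generator
type: Opaque
stringData:
  name: lab
  server: https://<eks-api-endpoint>
  config: |
    {
      "awsAuthConfig": {
        "clusterName": "lab",
        "roleARN": "arn:aws:iam::WORKLOAD_ACCOUNT:role/argocd-cluster-role"
      },
      "tlsClientConfig": {
        "insecure": false,
        "caData": "<base64-encoded cluster CA>"
      }
    }
```

The config block carries only a role ARN and the cluster CA, so the Secret holds nothing that would need rotating.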
Migration sequence
Migrating an addon from Terraform Helm to ArgoCD without downtime:
1. Author the chart under gitops-repo/charts/<addon>/, matching the current Terraform helm_release values.
2. Add the ApplicationSet with syncPolicy.automated omitted, so it doesn’t auto-apply yet.
3. Let the generator (cluster-list / git-files) produce the Application for the cluster, then confirm it appears in ArgoCD with OutOfSync status and a diff of zero (because Terraform already applied the same chart).
4. terraform state rm the helm_release so Terraform stops managing it, and delete its resource block from the configuration. Verify terraform plan is clean.
5. Enable automated sync on the ApplicationSet’s template, so ArgoCD takes over reconciliation.
6. Bump the chart version in Git and verify it flows to the cluster via ArgoCD.
Step 4 is the moment of truth. If the chart values in the Terraform module don’t exactly match what’s rendered into the cluster, ArgoCD will see a diff and try to apply it on takeover, potentially restarting workloads. Carefully audit values before this step.
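In template terms, steps 2 and 5 differ only in the syncPolicy block. A sketch, reusing the sync options from the cert-manager example and assuming the same prune and selfHeal settings are wanted after takeover:

```yaml
# Step 2: no `automated` key, so ArgoCD computes the diff but never syncs on its own.
syncPolicy:
  syncOptions:
    - ServerSideApply=true
    - CreateNamespace=true
---
# Step 5: adding `automated` hands reconciliation over to ArgoCD.
syncPolicy:
  automated: { prune: true, selfHeal: true }
  syncOptions:
    - ServerSideApply=true
    - CreateNamespace=true
```

An empty automated: {} block still enables automated sync (just without prune or self-heal), so the key has to be absent entirely during step 2.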
Where it landed
As of mid-May 2026, only karpenter and nginx-ingress (which is being deprecated) remained on Terraform-managed Helm. Every other addon is reconciled by ArgoCD ApplicationSets from gitops-repo. Karpenter is intentionally last because its helm_release is referenced by the EKS module, so moving it requires module refactoring rather than a clean addon swap.
Related
- GitOps engine project page — the numbers and where the migration landed
- Cross-account TGB architecture — the AWS LBC in workload clusters that this engine deploys
- GitHub: devops-claude-skills — gitops-helm-onboard + argocd-eks-cluster-onboard skills package this flow