Ahmad Asmar · Portfolio
A platform,
by hand.
Staff DevOps Platform Lead · CKA · Open to opportunities · GMT+3
Six+ years scaling cloud-native infrastructure across AWS, GCP, and Azure. The last three of those as senior platform owner for a SaaS-scale AWS-native platform: grown from a hybrid Azure+AWS estate into a 20-account, 4-region, 20-EKS-cluster system serving 25+ microservices and 20+ government customers across the US and UK. Sole DevOps engineer for the first 18 months, then primary IC for the next 18. The Azure-to-AWS migration close-out shipped with zero customer-visible downtime; the last fleet-wide EKS upgrade landed in two working days.
Before that: Freightos (2021–2023) — GCP, Kubernetes, GitOps with ArgoCD, monitoring; PDF Solutions (2020–2021) — cloud infrastructure-as-code; plus two years of IT-support roots at Partners for Sustainable Development and Palestine Telecommunications. Open-source maintainer of a Terraform module powering self-service GitLab runner fleets (3,767 downloads on the Terraform Registry, 2 years of maintenance) and a Claude Code skills marketplace for DevOps workflows (158 ★ · 32 forks).
Self-authored Terraform module powering self-service GitLab runner fleets: 100% Spot, attribute-based instance selection, Fleeting plugin for the Docker Autoscaler executor, multi-arch. It runs a production CI fleet at SaaS scale, and 3,767 downloads suggest other organisations had the same problem.
Curated Claude Code skills for DevOps workflows: ArgoCD cluster onboarding via Pod Identity + AssumeRole, AWS SSO auth recovery, Terraform → ArgoCD migration, k8s triage, FinOps. 158 ★ and 32 forks in 7 months.
ALB in one AWS account, EKS workloads in another — the textbook answer puts an NLB in each spoke account. With AWS LBC v3, you can register pod IPs directly into a cross-account target group. Validated on a pilot cluster, then rolled to all six staging clusters in a single day.
3-month epic standing up dedicated devops clusters, the ApplicationSet pattern for multi-cluster fan-out, and cluster onboarding via Pod Identity + cross-account AssumeRole — no bearer tokens. Authored ~70% of the underlying GitOps repo.
Auto-PDB ClusterPolicy on every multi-replica Deployment fleet-wide. Observation-VPA generation via a separate-by-kind rule design. CRD drift suppression for ArgoCD. Migrated from Terraform Helm onto an ApplicationSet.
Crossplane v2.5.3 (Upbound) bootstrapped on dedicated devops clusters in Pipeline mode. The first usable Composition, AppWorkloadBucket with S3 versioning + lifecycle + CORS, shipped to prod. Declarative AWS resources reconciled by k8s controllers instead of terraform apply.
Multi-account AWS at scale
20 SSO-managed accounts across staging + prod orgs. Transit Gateway as the only inter-VPC primitive (no peering). IAM Identity Center via JumpCloud SAML. SCPs for guardrails. AWS provider 4 → 6 lifecycle managed in shared modules.
Kubernetes platform
20 EKS clusters on Bottlerocket 1.59 + Karpenter 1.12 with >=gen5 instance generation gating. 6 major k8s upgrades over 3 years with zero rollbacks. Pod Identity replacing IRSA fleet-wide. AWS LBC v3 with cross-account TargetGroupBinding.
Policy & governance
Kyverno fleet-wide with auto-PDB generation for every workload missing one, ClusterPolicy generating observation VPAs across all deployments + statefulsets, and ignoreDifferences patterns for CRD cosmetic-drift suppression.
IaC v2 — Crossplane
Crossplane v2.5.3 (Upbound) bootstrapped on shared devops clusters in Pipeline mode. AppWorkloadBucket Composition with S3 versioning, lifecycle, and CORS shipped. Webhook scaled to 2 replicas in prod; requests == limits for predictable footprint.
Autoscaling — VPA + Karpenter
VPA live on every staging and prod cluster — admission controller + a Kyverno ClusterPolicy that generates observation VPAs for every Deployment and StatefulSet (via two separate rules to keep the design clean). Karpenter NodePools restricted to gen5+ instances after the 2024 NLB-m3 incident.
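The gen5+ restriction comes down to one comparison on the instance type name. The sketch below only models that comparison; it is not the NodePool configuration itself (Karpenter exposes instance generation as a well-known label that a NodePool requirement can filter on). The helper names and the name-parsing regex are illustrative assumptions, and unusual family names that do not parse are rejected.

```python
import re
from typing import Optional

# Assumption: the generation is the first run of digits after the family
# letters, e.g. "m5.large" -> 5, "c7g.xlarge" -> 7.
_GENERATION = re.compile(r"^[a-z]+(\d+)")


def instance_generation(instance_type: str) -> Optional[int]:
    """Return the generation parsed from an EC2 instance type name, if any."""
    family = instance_type.split(".", 1)[0]
    match = _GENERATION.match(family)
    return int(match.group(1)) if match else None


def allowed(instance_type: str, min_generation: int = 5) -> bool:
    """Hypothetical gate: accept only types at or above min_generation."""
    generation = instance_generation(instance_type)
    return generation is not None and generation >= min_generation


if __name__ == "__main__":
    for name in ("m3.medium", "t3.micro", "m5.large", "c7g.xlarge"):
        print(name, "allowed" if allowed(name) else "rejected")
```

With the default threshold, m3 and t3 types are rejected while m5 and c7g pass.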
Service mesh — Istio Ambient + Gateway API
12-phase migration plan; shipped Phases 0/1/2.0/2.1/2.2 to all 6 staging clusters in a 2-week sprint (125 commits across 8 repos). Sidecar-less Ambient over per-pod sidecar — no restart-on-upgrade. HTTPRoute auto-generation from monochart.
Modern Terraform CI/CD
Validate → Checkov → plan-in-MR → Infracost → auto-apply on merge. Plan and cost diff posted as MR comments so reviewers see what's about to change in AWS and how much it costs. Cut fleet pipelines from 8:52 to 5:36.
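As a quick check of the arithmetic behind that cut, assuming 8:52 and 5:36 are minutes:seconds per pipeline run (the helper name is illustrative):

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'minutes:seconds' string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


before = to_seconds("8:52")  # 532 seconds
after = to_seconds("5:36")   # 336 seconds
saved = before - after       # 196 seconds
reduction_pct = round(100 * saved / before)

print(f"{saved} s saved per run, {reduction_pct}% shorter")
# prints "196 s saved per run, 37% shorter"
```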
Observability
Datadog primary — APM, logs, synthetics, Operator-managed agents on EKS. SKU renegotiation (Ephemeral Infra + APM, Infra vCPU) eliminated the container commit floor. Custom monitors as Terraform.
How a SaaS platform used Kyverno to auto-generate PDBs for every microservice that lacked one — preventing service downtime when Karpenter consolidates nodes and closing the operational gap of "we forgot to add a PDB."
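The generate-if-missing decision can be modelled in a few lines. This is a plain-Python sketch of the logic, not the Kyverno ClusterPolicy itself; the function names, the "-pdb" name suffix, and the maxUnavailable value of 1 are assumptions made for illustration.

```python
from typing import List, Optional


def labels_match(selector: dict, labels: dict) -> bool:
    """True when every selector key/value pair is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def generate_pdb(deployment: dict, existing_pdbs: List[dict]) -> Optional[dict]:
    """Return a PDB manifest for a multi-replica Deployment that lacks one."""
    meta = deployment["metadata"]
    spec = deployment["spec"]
    pod_labels = spec["template"]["metadata"]["labels"]

    # Only multi-replica Deployments are in scope.
    if spec.get("replicas", 1) < 2:
        return None

    # Skip workloads already covered by a PDB in the same namespace.
    for pdb in existing_pdbs:
        same_namespace = pdb["metadata"]["namespace"] == meta["namespace"]
        selector = pdb["spec"]["selector"].get("matchLabels", {})
        if same_namespace and selector and labels_match(selector, pod_labels):
            return None

    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {
            "name": f"{meta['name']}-pdb",  # assumed naming scheme
            "namespace": meta["namespace"],
        },
        "spec": {
            "maxUnavailable": 1,  # assumed budget, not from the source
            "selector": {"matchLabels": spec["selector"]["matchLabels"]},
        },
    }
```

A Deployment with three replicas and no matching PDB yields a manifest; the same Deployment with an existing matching PDB, or with a single replica, yields None.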
A local dev platform that mirrors production: k3d + Traefik locally, EKS in the cloud, identical Kustomize bases and overlays across environments. Tilt for rapid feedback; GitLab CI for immutable image promotion from staging to prod.
How a CI job arrives, an EC2 Spot instance boots and runs it, and the ASG returns to zero. Attribute-based selection, price-capacity-optimized allocation, deliberate capacity_rebalance = false, Packer AMI builder.
The IAM trust, the security-group topology, and the readiness-gate ordering that make the NLB hop go away. Three IAM gotchas account for over 90% of the debugging time; they're catalogued here so the next person doesn't have to find them live.
Dedicated devops clusters. Self-managing app-of-apps bootstrap. ApplicationSet fan-out. Server-Side Diff at the controller level. The migration sequence that takes an addon off Terraform Helm without restarting workloads.
Honed across six k8s versions (1.22 → 1.34) on roughly fifteen production clusters. Module bumps first, GitOps versions separate, cluster_version last, one cluster at a time. The Kyverno PDB gotcha. The ArgoCD OOM gotcha. The NLB-vs-gen-3-instance gotcha. The unlock-runbook you need ready.
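One way to read that ordering is as a nested loop: every phase completes on one cluster before the next cluster starts, and the rollout halts at the first failure. The sketch below assumes that per-cluster reading; the cluster names and the apply callback are hypothetical.

```python
from typing import Callable, Iterator, List, Tuple

# Phase order taken from the text: module bumps, then GitOps versions,
# then cluster_version last.
PHASES = ("module bump", "GitOps versions", "cluster_version")

Step = Tuple[str, str]


def upgrade_plan(clusters: List[str]) -> Iterator[Step]:
    """Yield (cluster, phase) steps, finishing one cluster before the next."""
    for cluster in clusters:
        for phase in PHASES:
            yield cluster, phase


def run(clusters: List[str], apply: Callable[[str, str], bool]) -> List[Step]:
    """Apply steps in order and stop at the first one that fails."""
    completed: List[Step] = []
    for cluster, phase in upgrade_plan(clusters):
        if not apply(cluster, phase):
            break
        completed.append((cluster, phase))
    return completed


if __name__ == "__main__":
    # Hypothetical cluster names; the callback here always succeeds.
    for step in run(["staging-a", "staging-b"], lambda cluster, phase: True):
        print(step)
```

Stopping at the first failure keeps the blast radius to a single cluster in a single phase.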
Validate → Checkov → plan → Infracost → auto-apply. The OIDC dance, the resource_group: trick that makes concurrent merges safe, and the cache-key fix that cut pipeline time by 37% fleet-wide.
Replacing IRSA with Pod Identity across 10 production + 5 staging EKS clusters in two days. Why before_compute = true is non-optional. The cross-account variant. Why AWS LBC v3 needs the SA annotation removed.