Blog · 2026-04-10 · 6 min read

Three mistakes we see in every Kubernetes migration.

Over the last few years, the InfraZen team has helped dozens of teams move from VMs, ECS, and legacy orchestrators onto Kubernetes. The technical details are always different — the app, the cloud, the compliance posture. But the failure patterns are almost always the same three.

If you're planning a migration, or you're six months in and wondering why things feel harder than they should, odds are at least one of these is playing out on your cluster right now.

1. Lifting workloads without rewriting health checks

The most common mistake: teams containerize their app, deploy it, and skip the part where they translate "is this process alive" from their old orchestrator into Kubernetes' liveness and readiness probes.

Kubernetes is unforgiving here. Without a proper readinessProbe, traffic hits your pod before it's ready, you see connection resets at deploy time, and everyone blames the ingress controller. Without a proper livenessProbe, a wedged process stays in the rotation until a human gets paged.

The fix: spend a full day before your first cutover writing real probes. Not bare TCP probes, but HTTP endpoints backed by real logic: a readiness endpoint that exercises the app's dependency graph (DB, cache, upstream services), and a liveness endpoint that checks only the process itself, so a flaky upstream doesn't trigger restart loops. Treat the probe endpoints as first-class features.
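As a rough sketch, the container spec might look like the fragment below. The endpoint paths, port, and timings are placeholders, not a prescription; tune them to your app's real startup and failure behavior:

```yaml
# Hypothetical Deployment snippet; /readyz, /healthz, port 8080, and
# all timings are illustrative assumptions.
containers:
  - name: api
    image: registry.example.com/api:1.4.2
    ports:
      - containerPort: 8080
    readinessProbe:
      httpGet:
        path: /readyz        # checks DB, cache, upstream reachability
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
      failureThreshold: 3    # ~30s of failures before leaving rotation
    livenessProbe:
      httpGet:
        path: /healthz       # process-local check only; no dependencies
        port: 8080
      periodSeconds: 15
      failureThreshold: 3    # ~45s wedged before a restart
```

Keeping the liveness check dependency-free is deliberate: if /healthz checked the database, a database outage would restart every pod in the fleet for no benefit.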

2. Trusting "default" resource limits

Nobody sets resource requests and limits thoughtfully on day one. They copy a value from a tutorial, ship it, and move on. Six weeks later the cluster is full of pods that either get OOMKilled under load or hold 10x the memory they actually need, and the node bill has doubled.

The core trap: CPU limits in Kubernetes are enforced by the cgroup CPU quota (CFS throttling), so an aggressive CPU limit can cause latency spikes even when the node has spare capacity. Memory limits are harsher still: exceed one and the kernel OOM-kills the process.

The fix: set memory requests equal to memory limits, so the pod is guaranteed its budget and sits at the back of the eviction queue under node memory pressure. For CPU, set a request based on observed p95 usage and skip the limit entirely unless you have a clear noisy-neighbor problem. Revisit these numbers monthly using actual metrics, not guesses.
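Putting that advice into a manifest looks something like this; the numbers are placeholders you'd replace with your own observed usage:

```yaml
# Hypothetical resources block: memory request == limit, CPU request
# taken from observed p95 usage, and no CPU limit. Values are examples.
resources:
  requests:
    cpu: 250m          # ~p95 observed CPU usage for this workload
    memory: 512Mi
  limits:
    memory: 512Mi      # equal to the request; omit cpu on purpose
```

With memory request and limit equal, the pod can never be scheduled onto a node that can't actually honor its budget; omitting the CPU limit lets it burst into idle node capacity instead of being throttled.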

3. Treating cluster upgrades as a "later" problem

Kubernetes ships a new minor version every four months. Managed offerings (EKS, GKE, AKS) support roughly the last three. That means if you stand up a cluster and don't touch it for a year, you're already out of support.

We've walked into engagements where a team was stuck on a two-year-old version, the ingress controller's CRDs were incompatible with anything newer, and the upgrade path required taking the cluster down and rebuilding it from scratch. That's a full migration, not an upgrade.

The fix: from day one, run quarterly upgrade drills on a staging cluster. Pin your add-ons (ingress, cert-manager, external-dns, metrics-server) to versions that have a known upgrade path. Automate the drain-and-replace so upgrades are boring, not heroic.
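One way to make the add-on pinning concrete is a declarative release file, for example a helmfile. The chart names below are the common upstream charts, but the version numbers are purely illustrative; pick versions you've actually tested an upgrade path for:

```yaml
# Hypothetical helmfile.yaml pinning add-on chart versions so every
# cluster (and every quarterly upgrade drill) installs the same,
# known-good set. Versions shown are placeholders, not recommendations.
releases:
  - name: ingress-nginx
    namespace: ingress
    chart: ingress-nginx/ingress-nginx
    version: 4.10.1        # placeholder: verify against your K8s minor
  - name: cert-manager
    namespace: cert-manager
    chart: jetstack/cert-manager
    version: v1.14.5       # placeholder
  - name: metrics-server
    namespace: kube-system
    chart: metrics-server/metrics-server
    version: 3.12.1        # placeholder
```

Because the file is committed to git, an upgrade drill becomes a diff review: bump the cluster version and the pins together on staging, watch what breaks, then replay the exact same change in production.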

The pattern behind the patterns

All three of these have the same root cause: Kubernetes rewards teams that invest in operational discipline upfront and punishes teams that treat it as "just a deployment target." It is not a drop-in replacement for what you had before. It's a platform, and platforms need care.

The good news: if you dodge these three, you're ahead of 80% of the migrations we've seen. The rest is engineering, and engineering is the part that's actually fun.


Running a Kubernetes migration and want a second set of eyes? Book a free 30-minute review — we'll look at your probes, your limits, and your upgrade plan, and tell you honestly where the risk is.

Related: Learn more about our DevOps, SRE & Cloud consulting services, or see our transparent engagement pricing.

More engineering notes?

Browse the rest of the InfraZen blog or talk to a senior engineer about your own infrastructure.
