
Storage Has Gravity: Debugging PVCs & AZ Lock-in

Why Your Pod is Pending in Zone B When Your Disk is Stuck in Zone A

This is Part 4 of the Rack2Cloud Diagnostic Series, where we debug the silent killers of Kubernetes reliability.

A 3D illustration of a floating server rack chained to the ground by a heavy anchor, symbolizing Kubernetes data gravity and availability zone lock-in.
Stateless apps can fly. Stateful apps have gravity.

You have a StatefulSet—maybe it’s Postgres, Redis, or Jenkins. You drain a node for maintenance. The Pod tries to move somewhere else. And then… nothing. It just sits in Pending. Forever.

The error message might be the classic one:

1 node(s) had volume node affinity conflict

But at 3 AM, you might also see these terrors:

  • pod has unbound immediate PersistentVolumeClaims
  • persistentvolumeclaim "data-postgres-0" is not bound
  • failed to provision volume with StorageClass
  • Multi-Attach error for volume "pvc-xxxx": Volume is already used by node

Welcome to Cloud Physics.

We spend so much time treating containers like “cattle” (disposable, movable) that we forget one thing:

Data is not a container.

You can move a microservice in the blink of an eye. You cannot move a 1TB disk in the blink of an eye.

Today’s failure mode: Data Gravity.

The Mental Model: The “Double Scheduler”

Most engineers think Kubernetes schedules a Pod once. Wrong. A Stateful Pod is effectively scheduled twice:

  1. Storage Scheduling: Where should the disk exist?
  2. Compute Scheduling: Where should the code run?
A diagram comparing Immediate Binding vs WaitForFirstConsumer in Kubernetes storage scheduling.
The Double Scheduler: Compute and Storage must agree on the location before the disk is created.

If these two decisions happen independently, you get a deadlock. The Storage Scheduler picks us-east-1a (because it has free disk space). The Compute Scheduler picks us-east-1b (because it has free CPU). Result: The Pod cannot start because the cable doesn’t reach.
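Want to see the two decisions in the API objects themselves? The Pod records its compute choice in spec.nodeName, and the PV records its storage anchor in spec.nodeAffinity. A quick read-only check (resource names are placeholders):

Bash

# Where the Compute Scheduler put (or wants to put) the Pod
kubectl get pod <pod-name> -o jsonpath='{.spec.nodeName}{"\n"}'

# Where the Storage Scheduler anchored the disk (the PV's topology pin)
kubectl get pv <pv-name> -o jsonpath='{.spec.nodeAffinity.required}{"\n"}'

If the second command prints a zone the first node isn’t in, you are looking at the deadlock described above.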

The Trap: “Immediate” Binding

The #1 cause of storage lock-in is a default StorageClass setting: volumeBindingMode: Immediate.

Why does this even exist?

It exists for legacy reasons, local storage, or single-zone clusters where location doesn’t matter. But in a multi-zone cloud, it is a trap.

The Failure Chain:

  1. You create the PVC.
  2. The Storage Driver wakes up immediately. It thinks, “I need to make a disk. I’ll pick us-east-1a.”
  3. The Disk is created in 1a. It is now physically anchored there.
  4. You deploy the Pod.
  5. The Scheduler sees 1a is full of other apps, but 1b has free CPU.
  6. The Conflict: The Scheduler wants to put the Pod in 1b. But the disk is locked in 1a.

The Scheduler cannot move the Pod to the data (no CPU in 1a).

The Cloud Provider cannot move the data to the Pod (EBS volumes do not cross zones).
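Not sure which binding mode your cluster defaults to? A quick, read-only audit of every StorageClass:

Bash

kubectl get storageclass \
  -o custom-columns=NAME:.metadata.name,BINDING:.volumeBindingMode,PROVISIONER:.provisioner

Anything that prints Immediate in the BINDING column is a candidate for the trap above.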

The Fix: WaitForFirstConsumer

You have to teach the storage driver some patience.

Do not make the disk until the Scheduler picks a spot for the Pod.

This is what WaitForFirstConsumer does.

How it works:

  1. You create the PVC.
  2. Storage Driver: “I see the request, but I’m waiting.” (Status: Pending).
  3. You deploy the Pod.
  4. Scheduler: “I’m picking Node X in us-east-1b. Plenty of CPU there.”
  5. Storage Driver: “Alright, Pod’s going to 1b. I’ll make the disk in 1b.”
  6. Result: Success.
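Here is a minimal StorageClass sketch with the patient binding mode. The provisioner and parameters assume the AWS EBS CSI driver (ebs.csi.aws.com with gp3 volumes); substitute your own driver’s values. Because volumeBindingMode cannot be changed on an existing StorageClass, you create a new class rather than editing the old one.

Bash

kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-wait               # assumption: any name your team prefers
provisioner: ebs.csi.aws.com   # assumption: AWS EBS CSI driver
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF

Point your StatefulSet’s volumeClaimTemplates at this class via storageClassName, and every new disk will wait for the Scheduler’s decision before it is carved out of a zone.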

The Hidden Trap:

This only works once.

WaitForFirstConsumer delays the creation of the disk. Once the disk is created in us-east-1b, gravity wins forever. If you later drain nodes for an upgrade and 1b has no spare capacity, your Pod will not fail over to 1a. It will simply hang until space frees up in 1b.

The Multi-Attach Illusion (RWO vs RWX)

Engineers often assume, “If a node dies, the pod will just restart on another node immediately.”

This is the Multi-Attach Illusion.

Most cloud block storage (AWS EBS, Google PD, Azure Disk) is ReadWriteOnce (RWO).

This means the disk can only attach to one node at a time.

| Mode | What Engineers Think | The Reality |
| --- | --- | --- |
| RWO (ReadWriteOnce) | “My pod can move anywhere.” | Zonal Lock. The disk must detach from Node A before attaching to Node B. |
| RWX (ReadWriteMany) | “Like a shared drive.” | Network Filesystem (NFS/EFS). Slower, but can be mounted by multiple nodes. |
| ROX (ReadOnlyMany) | “Read Replicas.” | Rarely used in production databases. |
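
To see which access mode your existing claims actually got (rather than what you assume you requested), the claim spec is the source of truth:

Bash

kubectl get pvc \
  -o custom-columns=NAME:.metadata.name,MODES:.spec.accessModes,CLASS:.spec.storageClassName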

If your node crashes hard, the cloud control plane might still think the disk is “attached” to the dead node.

Result: Multi-Attach error. The new Pod cannot start because the old node hasn’t released the lock.
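
One way to see which node the control plane believes is holding each disk is the cluster-scoped VolumeAttachment objects that the CSI attach/detach flow maintains:

Bash

kubectl get volumeattachments

The NODE and ATTACHED columns tell you who still owns the lock, which is usually all the evidence you need for the Multi-Attach error above.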

The “Slow Restart” (Why DBs Take Forever)

You rescheduled the Pod. The node is in the right zone. But the Pod is stuck in ContainerCreating for 5 minutes.

Why?

The “Detach/Attach” dance is slow.

  1. Kubernetes tells the Cloud API: “Detach volume vol-123 from Node A.”
  2. Cloud API: “Working on it…” (Wait 1–3 minutes).
  3. Cloud API: “Detached.”
  4. Kubernetes tells Cloud API: “Attach vol-123 to Node B.”
  5. Linux Kernel: “I see a new device. Mounting filesystem…”
  6. Database: “Replaying journal logs…”

This is not a bug. This is physics.
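
If you want to put numbers on that dance, the Pod’s event timeline shows the scheduling, attach, and mount timestamps (the attach step is reported by the attach/detach controller with reasons like SuccessfulAttachVolume):

Bash

kubectl get events \
  --field-selector involvedObject.name=<pod-name> \
  --sort-by=.lastTimestamp

The gap between the scheduling event and SuccessfulAttachVolume is your cloud’s detach/attach latency, not anything Kubernetes can speed up.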

The Rack2Cloud Storage Triage Protocol

If your stateful pod is pending, stop guessing. Run this sequence to isolate the failure domain.

Phase 1: The Claim State

Goal: Is Kubernetes even trying to provision storage?

Bash

kubectl get pvc <pvc-name>

  • Result (healthy): Status is Bound. Move to Phase 2.
  • Result (failure): Status is Pending.
  • The Fix: Describe the PVC (kubectl describe pvc <name>). Look for waiting for a volume to be created, either by external provisioner... or cloud quota errors. If using WaitForFirstConsumer, you must deploy a Pod before it will bind.
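
If the claim is Pending, it is also worth confirming which binding mode its StorageClass actually uses (names are placeholders):

Bash

SC=$(kubectl get pvc <pvc-name> -o jsonpath='{.spec.storageClassName}')
kubectl get storageclass "$SC" -o jsonpath='{.volumeBindingMode}{"\n"}'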

Phase 2: The Attachment Lock

Goal: Is an old node holding the disk hostage?

Bash

kubectl describe pod <pod-name> | grep -A 5 Events

  • Result (healthy): Normal scheduling events, eventually pulling image.
  • Result (failure): Multi-Attach error or Volume is already used by node.
  • The Fix: The cloud control plane thinks the disk is still attached to a dead node. Wait 6–10 minutes for the cloud provider timeout, or forcibly delete the previous dead pod to break the lock.
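
If the old node is confirmed dead (NotReady and not coming back), the forced deletion mentioned above looks like this. It bypasses graceful shutdown, so only use it when you are certain the old kubelet can no longer write to the disk:

Bash

# Skip the grace period so the attach/detach controller can release the volume
kubectl delete pod <old-pod-name> --grace-period=0 --force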

Phase 3: The Zonal Lock (The Smoking Gun)

Goal: Are the physics impossible?

Bash

# 1. Get the Pod's assigned node (if it made it past scheduling)
kubectl get pod <pod-name> -o wide

# 2. Get the Volume's physical location
kubectl get pv <pv-name> --show-labels | grep zone

  • Result (healthy): The Node and the PV are in the same zone (e.g., both us-east-1a).
  • Result (failure): Mismatch. The PV is in us-east-1a, but the Pod is trying to run on a Node in us-east-1b.
  • The Fix: Deadlock. The Pod cannot run on that Node. You must drain the wrong node to force rescheduling, or ensure capacity exists in the correct zone.
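
To get the Node’s side of the comparison, print its zone label next to its name and set it against the PV labels from the command above:

Bash

kubectl get node <node-name> -L topology.kubernetes.io/zone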

Phase 4: Infrastructure Availability

Goal: Does the required zone even have servers?

Bash

# Replace with the zone where your PV is stuck
kubectl get nodes -l topology.kubernetes.io/zone=us-east-1a

  • Result (healthy): Nodes are listed and status is Ready.
  • Result (failure): No resources found, or all nodes are NotReady / SchedulingDisabled.
  • The Fix: Your cluster has lost capacity in that specific zone. Check Cluster Autoscaler logs or cloud provider health dashboards for that AZ.
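
Zooming out, a per-zone head count makes capacity gaps obvious at a glance (this assumes kubectl’s default column order, with the -L zone column appended last):

Bash

kubectl get nodes -L topology.kubernetes.io/zone --no-headers \
  | awk '{print $NF, $2}' | sort | uniq -c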

Production Hardening Checklist

Don’t just fix it today. Prevent it tomorrow.

  • Use WaitForFirstConsumer: Set this in your StorageClass immediately.
  • Run ≥2 Nodes Per Zone: Never run a single-node zone for stateful workloads.
  • Avoid Strict Anti-Affinity: Don’t use requiredDuringScheduling for StatefulSets; use preferredDuringScheduling so you don’t corner yourself (see the sketch after this list).
  • Monitor Attach/Detach Latency: Alert if ContainerCreating takes >5 minutes.
  • Prefer Replication over Relocation: For HA, run a Primary/Replica DB setup (like Postgres Patroni) so you fail over to a running pod instead of waiting for a moving pod.
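
For the anti-affinity item above, here is a minimal sketch of the “soft” form for a StatefulSet pod template. The app: postgres selector is an assumption; use whatever labels your replicas actually carry. It writes a fragment you merge under spec.template.spec:

Bash

cat <<'EOF' > soft-zone-anti-affinity.yaml
# Merge under spec.template.spec in your StatefulSet
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: postgres                    # assumption: label carried by your replicas
          topologyKey: topology.kubernetes.io/zone
EOF

Because it is “preferred”, the scheduler will still co-locate replicas in a single zone rather than leave them Pending when the other zones are full.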

Summary: Respect the Physics

Stateless apps? They’re like water—they flow wherever there’s room. Stateful apps? Anchors. Once they drop, that’s where they stay.

  1. Use WaitForFirstConsumer to prevent “Day 1” fragmentation.
  2. Ensure your Auto Scaling Groups cover all zones, so there’s always a node available wherever your data lives.
  3. And please, never treat a database Pod like a web server.

This concludes the Rack2Cloud Diagnostic Series. Now that your Identity, Compute, Network, and Storage are solid, you’re ready to scale.

Next, we’ll zoom out and talk strategy: How do you survive Day 2 operations without drowning in YAML?



About The Architect

R.M.

Senior Solutions Architect with 25+ years of experience in HCI, cloud strategy, and data resilience. As the lead behind Rack2Cloud, I focus on lab-verified guidance for complex enterprise transitions. View Credentials →

Editorial Integrity & Security Protocol

This technical deep-dive adheres to the Rack2Cloud Deterministic Integrity Standard. All benchmarks and security audits are derived from zero-trust validation protocols within our isolated lab environments. No vendor influence.

Last Validated: Feb 2026   |   Status: Production Verified
Affiliate Disclosure

This architectural deep-dive contains affiliate links to hardware and software tools validated in our lab. If you make a purchase through these links, we may earn a commission at no additional cost to you. This support allows us to maintain our independent testing environment and continue producing ad-free strategic research. See our Full Policy.
