
The 2-Node Trap: Why Your Proxmox “HA” Will Fail When You Need It Most (and How to Fix It)

Editorial Integrity Verified

This technical deep-dive has passed the Rack2Cloud 3-Stage Vetting Process: Lab-Validated, Peer-Challenged, and Document-Anchored. No vendor marketing influence. See our Editorial Guidelines.

LAST VALIDATED: Jan 2026 | TARGET STACK: Proxmox VE 8.x (Corosync 3) | STATUS: Production Verified
// ARCHITECTURAL MEMO: PART OF THE HCI REFACTORING LAB

I built my first Proxmox cluster on a Friday night. Two beefy nodes. Shared storage. HA enabled. I shut the laptop feeling smug—I had just replaced a six-figure VMware stack with two commodity servers and some Linux magic.

Saturday morning, a power blip hit the rack. Both nodes came back online. No VMs came back.

Three hours later, after console spelunking, log archaeology, and a crash course in Corosync internals, I learned the most expensive free lesson of my career: Two nodes ≠ High Availability. They equal a mathematically guaranteed outage when one fails.

This isn’t just a “homelab” problem. I’ve seen this exact failure pattern in 2-node retail edge clusters, “temporary” production clusters that somehow became permanent, and even 6-node enterprise clusters deployed in isolated pairs.

Let’s fix it—properly.

Key Takeaways

  • The Math Trap: Corosync requires >50% of votes. With 2 nodes, a lone survivor holds exactly 50%. No majority = cluster freeze.
  • The Failure Mode: When Node 1 dies, Node 2 cannot prove it holds a majority, so it fences itself to prevent data corruption.
  • The 15-Minute Fix: A QDevice (external witness) restores the math: 2 + 1 = 3.
  • Enterprise Reality: QDevice is for edge cases. True enterprise HA starts at 3 nodes with hardware fencing (STONITH).

Isometric schematic of a Proxmox cluster showing a split-brain scenario.

The Lie: “Two Proxmox Nodes = vSphere HA”

I hear this constantly from clients migrating off Broadcom: “We only need two servers. VMware let us do it, so Proxmox should too.”

Here is the cold reality: VMware didn’t eliminate quorum physics—it just hid them behind expensive proprietary locking, vCenter arbitration, and license-enforced constraints. Proxmox doesn’t hide physics. It respects them.

Corosync (the engine driving Proxmox HA) requires a strict majority of votes before it will run workloads. With two nodes, you have 2 votes total. If one node fails, the survivor has 1 vote.

One vote out of two is exactly 50%, and 50% is not a majority.

There is no tie-breaker. So the system does the only safe thing: it fences everything to prevent Split-Brain. That’s not a bug. That’s integrity.

The Quorum Physics Trap (Math That Bites)

Here is the exact math that kills your cluster on a Saturday morning:

| Scenario    | Node A | Node B | Total Votes | Required for Quorum (>50%) | Result     |
|-------------|--------|--------|-------------|----------------------------|------------|
| Healthy     | 1      | 1      | 2           | 2                          | ✅ Running |
| Node B Dies | 1      | 0      | 1           | 2                          | FENCE ALL  |
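
The rule generalizes to any node count: a strict majority is floor(total/2) + 1. A trivial back-of-the-envelope helper (the quorum_needed function below is mine, not a Proxmox tool) makes the trap obvious:

Bash

# Hypothetical helper: strict majority = floor(total_votes / 2) + 1
quorum_needed() {
  local total=$1
  echo $(( total / 2 + 1 ))
}

quorum_needed 2   # -> 2 : a lone survivor with 1 vote stays blocked
quorum_needed 3   # -> 2 : one vote can vanish and the cluster keeps running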

What You Actually See

When this happens, your logs will scream at you.

Bash

pvecm status
# Votequorum information:
# Expected votes: 2
# Total votes:    1
# Quorum:         2 Activity blocked   <-- the survivor refuses to act

Symptoms:

  1. All HA-managed VMs stuck in a fence or stopped state.
  2. The surviving node is healthy, accessible via SSH, but refuses to start workloads.
  3. Your “HA cluster” is now mathematically forbidden from working. Welcome to Saturday-morning hell.
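
For the raw Corosync view behind the Proxmox wrapper, corosync-quorumtool reports the same counters (it ships with the corosync package, so it is already on your nodes):

Bash

# Lower-level view of the same vote math
corosync-quorumtool -s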

The 2-Node Death Spiral (Step-by-Step)

I once watched payroll batch processing halt at 2:00 AM because a single PSU died in a 2-node cluster.

  1. State: Both nodes healthy. (2/2 votes).
  2. Event: Node B power supply fails. Heartbeat lost.
  3. Reaction: Node A asks, “Did Node B crash, or did the network partition?”
  4. Calculation: Node A checks votes. It has 1. It needs 2.
  5. Decision: “No majority. I cannot guarantee I am the only writer. To save the filesystem, I must stop.”
  6. Outcome: Node A fences itself. The entire data center is dark.

The CIO’s takeaway from that night? “Our HA failed when we needed it most.” That sentence should terrify you.
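
If you ever need to reconstruct that chain of events on a surviving node, the stock Proxmox VE services leave a clear trail in the journal:

Bash

# Corosync membership changes and vote counts
journalctl -u corosync --since "1 hour ago"

# HA manager decisions (CRM) and local resource manager (LRM) actions
journalctl -u pve-ha-crm -u pve-ha-lrm --since "1 hour ago"

# Current HA resource states (started / stopped / fence / error)
ha-manager status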


The Physics Fix: Add a QDevice (Witness Vote)

The fix isn’t complicated. It’s mathematical. You need an External Witness—a third vote that lives outside the blast radius of your two nodes.

The Correct Math:

2 Nodes (1 vote each) + 1 QDevice (1 vote) = 3 Total Votes

If Node B dies: Node A + QDevice = 2 Votes.

2 votes meets the required majority of 2 (out of 3).

Result: Quorum achieved. VMs auto-start on Node A.

Hardware Options (Pick Your Poison)

| Option         | Cost   | Reliability | Use Case                                         |
|----------------|--------|-------------|--------------------------------------------------|
| Raspberry Pi 4 | ~$35   | High        | The homelab king. Ethernet-connected, low power. |
| AWS t3.nano    | ~$5/mo | Very High   | Excellent for remote/edge clusters.              |
| Spare Linux VM | $0     | Medium      | Good if it lives on a separate hypervisor.       |
| Cluster Node   | N/A    | FAIL        | Never put the witness on the cluster itself.     |
Network flow chart contrasting a 2-node failure vs. a QDevice recovery.

Deploying a QDevice (Lab-Validated)

Prerequisite: Ensure you have SSH access to both Proxmox nodes and to your target QDevice host (Debian/Ubuntu).

1. On the QDevice Host (Pi/VM):

Bash

# Install the external vote daemon (only qnetd is needed on the witness)
apt update && apt install corosync-qnetd -y
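
It's worth confirming the witness daemon is actually up before touching the cluster. The corosync-qnetd-tool utility ships with the qnetd package:

Bash

# Confirm the witness daemon is running
systemctl status corosync-qnetd --no-pager

# After step 2 completes, list the clusters registered with this witness
corosync-qnetd-tool -l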

2. On the Proxmox Cluster Nodes:

Bash

# Install the QDevice client on EVERY cluster node
apt install corosync-qdevice -y

# From any ONE node, register the witness (IP of your Pi/VM)
# Note: this step copies certificates over SSH, so root SSH to the witness must work
pvecm qdevice setup 192.168.1.100

3. Verify The Math:

Bash

pvecm status
# Look for:
# Votequorum information:
# Expected votes: 3
# Total votes:    3
# Quorum:         2
# Flags:          Quorate Qdevice   <-- SUCCESS

Time: 15 minutes. Cost: <$50. Outcome: Your cluster no longer self-fences when a single node fails.
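
For an ongoing tripwire rather than a one-off check, a crude grep of the pvecm flags line is enough for most setups (a rough sketch; wire the output into whatever alerting you already run):

Bash

# Healthy output contains "Flags: Quorate Qdevice"; alert if quorum disappears
pvecm status | grep -q "Quorate" || echo "WARNING: cluster has lost quorum"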


When 2+1 Still Fails (Production Reality Check)

If you are running a Homelab or a small SMB stack, stop reading here. You are safe.

But if you are an Enterprise Architect, we need to talk about STONITH and scale.

A QDevice fixes quorum physics, but it does not fix capacity or storage physics. I've seen 2+1 clusters survive the network partition only to die immediately because the single surviving node couldn't handle the RAM/CPU pressure of 200 restarting VMs.

Enterprise Constraints:

  • Ceph Storage: Ceph needs its own quorum. A 2-node Proxmox + QDevice setup cannot run Ceph safely. You need at least 3 full nodes for Ceph monitors (see the quick check after this list).
  • Rack Awareness: If you have 5 nodes, but 3 are in Rack A and 2 in Rack B… losing Rack A kills your quorum (2/5).
  • Fencing (STONITH): In high-stakes environments, software quorum isn't enough. You need hardware fencing (IPMI/PDU) to physically cut power to a rogue node before recovering its workloads elsewhere.
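
If you run hyper-converged Ceph, you can watch its separate monitor quorum with the standard Ceph CLI (this assumes Ceph is already deployed, e.g. via pveceph):

Bash

# How many monitors exist and which are currently in quorum
ceph mon stat

# Full quorum detail, including the elected leader
ceph quorum_status -f json-pretty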

Architecture Archetypes (Pick Your Scale)

Archetype A: The “Homelab Escape Hatch”

  • Structure: 2 Beefy Nodes + Raspberry Pi QDevice.
  • Pros: Cheap ($100 HA fix), survives single node failure.
  • Cons: No live migration (unless shared storage), not payroll-safe.
  • Verdict: Perfect for learning or non-critical edge.

Archetype B: SMB Production Ready

  • Structure: 3 Identical Nodes (e.g., 16c/128GB each).
  • Pros: Live migration, no QDevice dependency, true N+1 redundancy.
  • Verdict: The minimum viable product for business.

Archetype C: Enterprise Standard

  • Structure: 5+ Nodes + Ceph + Redundant Corosync Links.
  • Pros: Survives drive failures, node failures, and even rack failures (if architected correctly).
  • Verdict: Required for 100+ VMs or mission-critical SLAs.

Validation Tests (Prove Your HA Works)

If you haven't tested failure, you don't have HA; you have hope. And hope is not a clustering strategy. Run all three tests below and watch the cluster with the commands sketched after the list.

  1. The Pull-the-Plug Test: Physically yank the power from Node 2. Does Node 1 take over?
    • Without QDevice: NO.
    • With QDevice: YES.
  2. The Network Slice: Unplug the LAN cables from Node 2.
    • pvecm status on Node 1 should show 2/3.
  3. The QDevice Death: Stop the corosync-qnetd service on your Pi.
    • Cluster should remain Green (2/3 votes active).
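
A minimal command sketch for running the tests above (run the first two on the node you expect to survive):

Bash

# Watch the vote count live while you pull cables and plugs
watch -n 2 pvecm status

# Confirm HA resources actually restarted on the survivor
ha-manager status

# Test 3: kill the witness vote from the Pi/VM
systemctl stop corosync-qnetd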

Verdict: Fix It Before Disaster Teaches You

The 2-node trap kills more Proxmox clusters than hardware failure ever will.

  • pvecm status = 1/2 → You are in the trap.
  • pvecm status = 2/3 → Escape hatch deployed.
  • 3+ Nodes → Production reality.

Your Weekend Project:

  1. Run pvecm status right now. Confirm your doom.
  2. If you see 1/2, order a Raspberry Pi or spin up a t3.nano.
  3. Run the validation tests. Prove the physics work.
  4. Before you scale further: Use our HCI Migration Advisor to audit your cluster health and zombie snapshots.

Two servers ≠ HA. Two servers + QDevice = survival. Three servers = reality.

I wish someone had told me this in Week 1. Now you know.

Additional Resources

// NEXT HOP IN QUEUE
You've secured your cluster quorum with a QDevice, but there is a hidden killer: network latency. If your Corosync heartbeats start lagging, your cluster will drift. Learn how to tune the Corosync stack to stay stable under pressure (COMING SOON). Continue your journey to: Proxmox ZFS Physics: Why It Isn't “Free vSphere”, or return to Mission Control_
R.M. - Senior Technical Solutions Architect
About The Architect

R.M.

Senior Solutions Architect with 25+ years of experience in HCI, cloud strategy, and data resilience. As the lead behind Rack2Cloud, I focus on lab-verified guidance for complex enterprise transitions. View Credentials →

Affiliate Disclosure

This architectural deep-dive contains affiliate links to hardware and software tools validated in our lab. If you make a purchase through these links, we may earn a commission at no additional cost to you. This support allows us to maintain our independent testing environment and continue producing ad-free strategic research. See our Full Policy.
