Why Your DNS Failover Didn’t Actually Fail Over

The failover was declared at 02:14. The runbook was followed. DNS records updated. Health checks passing on secondary. The on-call engineer closed the incident bridge call at 02:31 with a single line in the ticket: failover complete. At 02:32, a monitoring alert fired. Traffic was still hitting the dead primary.
The DNS record had changed in seconds. The traffic moved 18 minutes later. Only one of those numbers mattered.
This is DNS failover testing failure in its most common form: not a misconfiguration, not a vendor bug, not a missed step. Every layer in the stack behaved exactly as designed. The system still failed operationally. It belongs to the same class of failure documented in data protection architecture, where the protection plane reports success and the recovery plane produces nothing useful.

What the Runbook Said Was Done
The runbook covered the right things. TTL had been pre-reduced to 60 seconds during the maintenance window two weeks prior. The health check interval on the secondary was 30 seconds. The DNS record update propagated to the authoritative nameservers within 90 seconds of execution. By every documented metric in the disaster recovery and failover playbook, the failover was complete.
The team was not wrong to close the bridge. They were wrong about what “complete” meant.
DNS failover is treated as a discrete event — you change the record, propagation happens, traffic moves. The mental model is a switch: off, then on. The operational reality is a drainage problem. Traffic does not move when the record changes. Traffic moves when every active path that was routing to the old record has expired its cached state and re-resolved. Those are different events, separated by an amount of time that no runbook entry captures.
The Declaration Gap is the period between when the failover is declared complete and when traffic has actually moved. It maps directly to RTA — Recovery Time Actual — the measured gap between when recovery is declared and when the system is genuinely operational again. In this case, the Declaration Gap was 18 minutes. In environments with more caching layers, it can be significantly longer.
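The arithmetic behind that gap is blunt: it is bounded by the slowest independent cache timer in the path, not by the one you tuned. A minimal sketch of that bound, with illustrative drain values — only the 60-second TTL comes from this incident; the CDN and resolver figures are assumptions chosen for illustration:

```python
# Illustrative drain timers per caching layer, in seconds.
# Only the 60s DNS TTL is from this incident; the other values
# are assumptions -- measure your own environment.
drain_timers = {
    "dns_ttl": 60,                # authoritative record TTL
    "cdn_origin_cache": 900,      # CDN's independent origin-resolution cache
    "enterprise_resolver": 1080,  # corporate recursive resolver cache
}

# The Declaration Gap is bounded by the slowest layer to drain,
# not by the DNS TTL that was pre-reduced.
binding_layer = max(drain_timers, key=drain_timers.get)
gap_seconds = drain_timers[binding_layer]
print(f"binding constraint: {binding_layer}, ~{gap_seconds / 60:.0f} min")
# -> binding constraint: enterprise_resolver, ~18 min
```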
The Four Layers That Each Did Their Job
This is the part worth understanding precisely. The failure was not caused by any single layer behaving incorrectly. It was caused by four layers each behaving correctly — and nobody having modelled what that looked like in combination.
01 — DNS TTL
TTL had been pre-reduced to 60 seconds — a deliberate preparation step. Resolvers that re-queried after expiry got the new record immediately. TTL did its job. The problem is that TTL is a floor, not a ceiling. Resolvers are not required to honour TTL exactly, and under load many cache longer than the declared value. The 60-second TTL reduced the blast radius. It did not eliminate it.
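Measuring that floor is straightforward: poll the recursive resolver your clients actually use and timestamp when it starts serving the new record, rather than trusting the declared TTL. A minimal sketch using dnspython; the resolver IP, hostname, and addresses are placeholders:

```python
# Sketch: observe when a given recursive resolver actually starts
# serving the new record, versus when the TTL says it should.
# Requires dnspython (pip install dnspython). The resolver IP,
# hostname, and NEW_IP are placeholders -- substitute your own.
import time
import dns.resolver

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["10.0.0.53"]   # the recursive resolver under test

NEW_IP = "203.0.113.20"                # the secondary's address after failover

start = time.monotonic()
while True:
    answer = resolver.resolve("app.example.com", "A")
    addresses = {rr.address for rr in answer}
    if NEW_IP in addresses:
        print(f"resolver drained after {time.monotonic() - start:.0f}s")
        break
    print(f"still cached (claimed TTL remaining: {answer.rrset.ttl}s)")
    time.sleep(10)
```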
02 — HEALTH CHECK LAG
The health check confirmed the secondary was healthy before the failover was declared. That check passed. What it did not model was the transition period — the window between the primary being declared dead and all traffic paths having drained away from it. Health checks measure endpoint state. They do not measure traffic distribution state. Passing health checks on secondary does not mean traffic has moved to secondary.
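The two checks answer different questions, and only one of them closes a failover. A minimal sketch of the distinction — the URL and the request counts are placeholders for whatever health endpoint and traffic metrics the environment exposes:

```python
# Sketch: endpoint state vs traffic distribution state.
# The health URL is a placeholder; wire the counters to your
# load balancer or APM metrics.
import urllib.request

def secondary_is_healthy(url: str = "https://secondary.example.com/healthz") -> bool:
    # Endpoint state: passes the moment the secondary CAN serve.
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.status == 200

def traffic_has_moved(requests_primary: int, requests_secondary: int,
                      threshold: float = 0.95) -> bool:
    # Traffic distribution state: passes only once the drain is done.
    total = requests_primary + requests_secondary
    return total > 0 and requests_secondary / total >= threshold

# A failover is not complete when the first function returns True.
# It is complete when the second one does.
```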
03 — CDN ORIGIN CACHE
The CDN layer sitting in front of the application had its own origin resolution cache with a TTL independent of the DNS TTL. When the DNS record changed, the CDN did not immediately re-resolve the origin. It served from its cached origin record for the remainder of its own TTL window. Traffic transiting the CDN continued to reach the old origin until the CDN’s internal cache expired — a separately timed event that nobody had factored into the RTO calculation.
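One way to make that separately timed event observable is to have each origin stamp its responses and poll through the edge until the stamp changes. A sketch under stated assumptions: the "X-Origin-Id" header is something you add at your origins, not a CDN feature, and revalidation behaviour varies by provider:

```python
# Sketch: detect which origin a CDN-transited request actually
# reached. Assumes each origin stamps responses with an
# "X-Origin-Id" header -- an assumption of this sketch, not a
# CDN feature; add it at your origins.
import time
import urllib.request

EDGE_URL = "https://www.example.com/"  # request path through the CDN

def origin_behind_cdn() -> str:
    # Ask the edge to revalidate with its origin, so the response
    # reflects which origin the CDN currently resolves to.
    req = urllib.request.Request(EDGE_URL, headers={"Cache-Control": "no-cache"})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.headers.get("X-Origin-Id", "unknown")

start = time.monotonic()
while origin_behind_cdn() != "secondary":
    time.sleep(15)
print(f"CDN origin cache drained after {time.monotonic() - start:.0f}s")
```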
04 — CLIENT-SIDE RESOLVER PERSISTENCE
Enterprise clients behind corporate recursive resolvers, browsers with internal DNS caches, and mobile devices with persistent resolver state all maintained their own cached records independently of what the authoritative nameserver was serving. When the record changed, these clients did not immediately re-resolve. They continued routing to the cached record until their own TTL — or their own cache logic — expired it. Every one of these clients honoured its own caching logic correctly. The system still failed operationally.
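The unevenness of that population is easy to sample: query several recursive resolvers for the same record and compare what each is serving right now. A sketch using dnspython; the public resolver addresses are real services, the hostname and NEW_IP are placeholders:

```python
# Sketch: sample how unevenly the resolver population drains by
# asking several recursive resolvers for the same record.
# Requires dnspython. Hostname and NEW_IP are placeholders.
import dns.resolver

RESOLVERS = {"Google": "8.8.8.8", "Cloudflare": "1.1.1.1", "Quad9": "9.9.9.9"}
NEW_IP = "203.0.113.20"

for name, ip in RESOLVERS.items():
    r = dns.resolver.Resolver(configure=False)
    r.nameservers = [ip]
    served = {rr.address for rr in r.resolve("app.example.com", "A")}
    state = "drained" if NEW_IP in served else f"still cached ({served})"
    print(f"{name}: {state}")
```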
This is the same structural failure class as the retry storm — every component executes its designed behaviour correctly, and the combination of correct individual behaviours produces a system-level failure nobody modelled. The failure is not in any layer. It is in the assumption that layers coordinate.

What DNS Failover Testing Actually Needs to Measure
Most DNS failover testing validates the wrong thing. A test that confirms the DNS record updated and the health check passed has validated the protection plane. It has not validated the recovery plane: whether traffic actually moved, when it moved, and what the distribution looked like during the transition window. This is precisely the gap documented in restore path design, applied to a different layer: success on the write side proves nothing about the read side.
This failure mode has a close relative in Azure Private Endpoint DNS resolution — environments where DNS behaves correctly at every individual hop while producing the wrong outcome at the traffic layer. The mechanism differs; the diagnostic principle is identical.
DIAGNOSTIC QUESTION
“How do you know traffic moved — and how long after declaration did you check?”
A test that surfaces the Declaration Gap needs to measure traffic distribution, not DNS state. It needs to run active traffic against the production path — including CDN-transited requests and enterprise resolver-cached clients. It needs to timestamp when the DNS record change executes and when traffic distribution on the secondary crosses a defined threshold. The gap between those two timestamps is the real RTO contribution from DNS failover. That number belongs in the RTO model, not a TTL assumption.
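A minimal sketch of that harness, assuming a `fraction_on_secondary()` function wired to whatever source of truth the environment has (load balancer metrics, access logs, APM); it is a placeholder, not a real API:

```python
# Sketch of the measurement the test actually needs: timestamp the
# record change, then poll traffic distribution until the secondary
# crosses a defined threshold. The gap between the two timestamps
# is the real RTO contribution from DNS failover.
import time

THRESHOLD = 0.95       # the "traffic has moved" definition
POLL_INTERVAL = 30     # seconds

def fraction_on_secondary() -> float:
    """Placeholder: return the share of live requests hitting the secondary."""
    raise NotImplementedError("wire this to your traffic metrics")

def measure_declaration_gap(execute_dns_change) -> float:
    t_change = time.monotonic()
    execute_dns_change()   # the moment the runbook calls "complete"
    while fraction_on_secondary() < THRESHOLD:
        time.sleep(POLL_INTERVAL)
    t_moved = time.monotonic()
    # This number, not the TTL, belongs in the RTO model.
    return t_moved - t_change
```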
Pre-reducing TTL before a planned failover is necessary but not sufficient. The CDN cache TTL needs its own pre-reduction step — most CDN providers allow origin cache TTL configuration independently of DNS TTL. Ignoring this makes the CDN the binding constraint on traffic movement regardless of how aggressively the DNS TTL was tuned.
Monitoring during the failover window needs to watch traffic distribution at the application layer, not DNS propagation at the nameserver layer. The question the on-call engineer needs to answer in real time is not “has the record changed” — it is “what percentage of traffic is hitting the secondary right now.” The incident recovery process depends on being able to answer that question with data, not inference.
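Answering it with data can be as simple as aggregating the load balancer's access log during the window. A sketch under assumptions: the log format and backend names are invented for illustration; adapt the parsing to your own log schema:

```python
# Sketch: answer "what percentage of traffic is hitting the
# secondary right now" from a load balancer access log. The log
# layout and backend names are assumptions -- adapt to your own.
from collections import Counter

SECONDARY_BACKENDS = {"app-secondary-1", "app-secondary-2"}

def secondary_share(log_lines) -> float:
    hits = Counter()
    for line in log_lines:
        backend = line.split()[-1]  # assumed: backend name is the last field
        hits["secondary" if backend in SECONDARY_BACKENDS else "primary"] += 1
    total = sum(hits.values())
    return hits["secondary"] / total if total else 0.0

# During the failover window, tail the recent log and report:
# print(f"{secondary_share(open('/var/log/lb/access.log')):.0%} on secondary")
```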
The Transferable Principle
DNS failover is not complete when the record changes. It is complete when traffic distribution changes.
That distinction rewrites the RTO model for any architecture that depends on DNS-based failover. The RTO contribution from a DNS failover event is not the TTL value. It is the time required for all active traffic paths to drain their cached state and re-resolve to the new record. These drainage events happen independently, on different timers, with no coordination signal between them.
The TTL is a floor on drain time, not a ceiling on it. The actual Declaration Gap is determined by whichever caching layer takes longest to drain. In most environments that layer is either the CDN origin cache or the enterprise recursive resolver, neither of which is controlled by the TTL set on the DNS record.
Testing needs to validate this explicitly — not as a one-time exercise, but on the same cadence as the RTO it is supposed to guarantee. An architecture with a 15-minute RTO commitment that has never measured its Declaration Gap does not have a 15-minute RTO. It has a 15-minute aspiration and an unknown operational reality.
The record changed in seconds. The traffic moved 18 minutes later. Only one of those numbers mattered.
Architect’s Verdict
DNS failover testing that validates record propagation is not failover testing. It is nameserver testing. The operational requirement is different — traffic must move, not just resolve differently — and the gap between those two events is determined by caching layers that operate on independent timers with no coordination between them.
The Declaration Gap exists in every DNS-based failover architecture. TTL behaves correctly. Health checks behave correctly. CDN caches behave correctly. Client-side resolvers behave correctly. The failure is architectural — nobody modelled what the combination of correct individual behaviours produces at the traffic distribution layer during a failover transition.
RTO commitments anchored to DNS TTL values rather than measured Declaration Gap data are not commitments. They are assumptions that have never been tested against the actual drainage physics of the environment they are supposed to protect.