Skip to main content

HA/DR Testing & Troubleshooting β€” Failover Drills, RPO/RTO Verification, Common Failure Modes

The April 24, 2026 DP-300 outline added two explicit Domain 5 skills:

"Test a high availability and disaster recovery strategy"
"Troubleshoot a high availability and disaster recovery strategy"

A documented HA/DR design is worthless if it has never been tested. This page covers the test plans, T-SQL & CLI verification, RPO/RTO measurement, and the common failure modes you must be ready to triage.

Why this matters​

TruthImplication
Untested DR = no DRCustomers regularly find that promoted secondaries fail the first real test (missing logins, jobs, linked servers, NSG rules).
RPO/RTO are targets, not guaranteesReal numbers depend on workload, network, replica lag, and failover trigger time.
Failover β‰  failbackA successful failover is half the test β€” failback to the primary region is the other half.
App tier failover is part of DRConnection strings, listener IPs, traffic-manager records, and Azure Front Door origins must all flip.
🎯 Exam Focus

The exam likes to test the distinction between planned failover (no data loss, manual) and forced/unplanned failover (potential data loss, used when primary is down). Know the CLI/T-SQL verbs for each: set-primary (planned) vs force-failover-allow-data-loss (unplanned). :::

Part 1 β€” Build a test plan​

Drill catalog (run quarterly)​

DrillTriggerExpected RPOExpected RTOValidates
Planned regional failover of Auto-Failover Groupaz sql failover-group set-primary030–60 sRead/write listener cuts over; app tier reconnects
Forced failover simulating primary outage--allow-data-loss switchup to 5 s for FOG; 0 for sync AG30–60 s for FOG; ~10 s for sync AGAcceptable data loss, monitoring alarms fire
AG failover (SQL VM)ALTER AVAILABILITY GROUP ag1 FAILOVER0 (sync)secondsListener, ILB probe, AG state, witness/quorum
MI restore from PITRaz sql midb restoredepends on backup timeminutes–hoursBackup chain integrity, restore permissions
Cross-region geo-restoreaz sql db restore --source-database-id <geo-backup-id>up to 1 h (RA-GRS replication lag)minutes–hoursGeo-redundant backup recoverability
App-tier failover onlyDNS / Front Door / Traffic Manager changen/aseconds–minutesConnection-string strategy works
Full BCDR fire-drillCombine FOG failover + app-tier failover + run smoke testsper designper designEnd-to-end runbook validates

Test plan template (fill out before every drill)​

Drill name:
Date / time window:
In-scope resources (servers, DBs, FOGs, AGs):
Trigger command:
Expected RTO / RPO:
Pre-drill state capture (queries / screenshots):
Smoke tests after failover:
Rollback / failback steps:
Comms plan (who to notify, status page updates):
Sign-off criteria:

Part 2 β€” Failover commands​

Auto-Failover Group (Azure SQL DB / MI)​

# PLANNED failover β€” geo-secondary becomes primary, no data loss
az sql failover-group set-primary \
-g rg-secondary -s ehab-sqlsrv-secondary \
-n fog-prod

# FORCED failover β€” accept potential data loss (use when primary is down)
az sql failover-group set-primary \
-g rg-secondary -s ehab-sqlsrv-secondary \
-n fog-prod --allow-data-loss

PowerShell equivalent:

# Planned
Switch-AzSqlDatabaseFailoverGroup -ResourceGroupName rg-secondary `
-ServerName ehab-sqlsrv-secondary -FailoverGroupName fog-prod

# Forced
Switch-AzSqlDatabaseFailoverGroup -ResourceGroupName rg-secondary `
-ServerName ehab-sqlsrv-secondary -FailoverGroupName fog-prod -AllowDataLoss

Always On AG on SQL Server VMs​

-- Planned (synchronous replica only) β€” connect to SECONDARY
ALTER AVAILABILITY GROUP [ag1] FAILOVER;

-- Forced (potential data loss) β€” connect to SECONDARY
ALTER AVAILABILITY GROUP [ag1] FORCE_FAILOVER_ALLOW_DATA_LOSS;

MI failover (refresh underlying VM)​

# User-initiated MI failover β€” useful for testing app reconnection logic
az sql mi failover -g $RG -n ehab-mi-prod --replica-type Primary

Part 3 β€” Verify state with T-SQL & DMVs​

Failover Group / Geo-Replication state​

-- Run on the PRIMARY logical server
SELECT
link_guid,
partner_server,
partner_database,
replication_state_desc, -- CATCH_UP / SEEDING / SUSPENDED
role_desc, -- PRIMARY / SECONDARY
secondary_allow_connections_desc,
last_replication,
replication_lag_sec
FROM sys.dm_geo_replication_link_status;

replication_state_desc = 'CATCH_UP' and a low replication_lag_sec are what you want before triggering a planned failover.

Always On AG state​

SELECT
ar.replica_server_name,
drs.database_id,
DB_NAME(drs.database_id) AS db_name,
drs.synchronization_state_desc, -- SYNCHRONIZED / SYNCHRONIZING / NOT SYNCHRONIZING
drs.synchronization_health_desc, -- HEALTHY / PARTIALLY_HEALTHY / NOT_HEALTHY
drs.is_primary_replica,
drs.last_hardened_lsn,
drs.log_send_queue_size, -- KB pending to ship
drs.redo_queue_size -- KB pending to redo on secondary
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar
ON drs.replica_id = ar.replica_id
ORDER BY ar.replica_server_name;
-- Listener / cluster health
SELECT * FROM sys.dm_hadr_availability_group_states;
SELECT * FROM sys.dm_hadr_cluster_members;
SELECT * FROM sys.availability_group_listeners;

Backup history (verify chain before geo-restore)​

SELECT TOP 20
database_name, backup_start_date, backup_finish_date, type,
DATEDIFF(SECOND, backup_start_date, backup_finish_date) AS duration_sec,
backup_size / 1024.0 / 1024 / 1024 AS size_gb
FROM msdb.dbo.backupset
WHERE database_name = 'salesprod'
ORDER BY backup_start_date DESC;

Part 4 β€” Measure RPO and RTO​

RPO (data loss window)​

For Auto-Failover Groups:

-- On the primary BEFORE the drill, write a sentinel row with current UTC timestamp
INSERT INTO dbo.DR_HeartBeat (utc, note) VALUES (SYSUTCDATETIME(), 'pre-failover');
GO

-- After forced failover, on the NEW primary, find the highest sentinel that survived
SELECT TOP 1 utc, note FROM dbo.DR_HeartBeat ORDER BY utc DESC;

-- RPO = (UTC of last write on old primary) - (UTC of last sentinel that arrived)

For AG sync replicas, RPO is by definition zero β€” the secondary acknowledges before the commit returns.

RTO (downtime window)​

RTO = (timestamp app reconnects successfully) - (timestamp failover triggered)

Capture both timestamps in your test runbook. Azure Monitor metric replication_link_status (failover groups) and AG DMVs flip immediately, but the app's RTO can be longer due to connection-pool retries and DNS TTLs.

Azure Monitor metrics to watch​

ResourceMetricWhat it tells you
Failover Groupreplication_lag_secReal-time RPO ceiling
Failover Groupreplication_link_statusUP / DEGRADED / DOWN
SQL DBdtu_consumption_percent / cpu_percent post-failoverWorkload behavior on new primary
MIstorage_percent, cpu_percentCapacity validation after failover
Application Insightsrequest count, failed requestsApp-tier RTO from a user perspective

Part 5 β€” Troubleshooting playbook​

Failover group: "primary still primary after set-primary"​

CauseFix
Used the primary server in the command instead of the secondaryRe-run with -s <secondary-server> (this is the most common mistake)
The geo-secondary is SUSPENDEDResume with ALTER DATABASE ... SET PARTNER RESUME (sync) or recreate the FOG
Permissions: caller lacks SQL DB Contributor on the secondary RGGrant role; retry

Forced failover: "data loss higher than expected"​

CauseFix
replication_lag_sec was high before the drill (workload hot, link saturated)Increase compute; check network throughput; consider Premium β†’ Business Critical for sync
Bulk operation in flight (BULK INSERT, index rebuild)Large operations re-ship after failover; schedule maintenance windows

AG: replica NOT SYNCHRONIZING / NOT HEALTHY​

SymptomDiagnosisFix
synchronization_state_desc = 'NOT SYNCHRONIZING'Network outage or endpoint downCheck the database mirroring endpoint TCP 5022 NSG/firewall; restart endpoint
redo_queue_size keeps growingSecondary CPU/IO saturatedRight-size secondary VM; verify storage SLO matches primary
Listener returns wrong IP after failoverILB probe misconfigured (probe port, RegisterAllProvidersIP)Set probe port; configure RegisterAllProvidersIP=0 + HostRecordTTL=300
App connects but reads stale dataApp reading from a read-only routing list with old dataVerify READ_ONLY_ROUTING_URL and ApplicationIntent=ReadOnly clients

MI failover group quota errors​

"Cannot enable Failover Group: subscription quota for vCores in the secondary region is exhausted"

Fix: file a quota increase via az quota for Microsoft.Sql/managedInstance in the target region. MI requires the same SKU and capacity in both regions.

Split-brain (rare but catastrophic)​

A split-brain happens when an outage isolates the primary's region from the rest of the world, the secondary is force-failed-over, then the original primary recovers and tries to keep accepting writes.

  • Failover Groups: protected β€” the FOG fences the old primary. After connectivity returns, the old primary is demoted automatically. No action needed beyond confirming.
  • AG on SQL VMs: requires a quorum model (Cloud Witness) and REQUIRED_SYNCHRONIZED_SECONDARIES_TO_COMMIT >= 1 to prevent both replicas accepting writes.

Listener not reachable after failover​

1. Confirm the listener IP via ILB rules (az network lb show)
2. Confirm the AG cluster registered the IP (cluster.exe /status)
3. Confirm the ILB health probe port is open in the OS firewall
4. Confirm DNS TTL for the listener record is short (<= 60 s)
5. From client: Test-NetConnection <listener> -Port 1433

Application doesn't reconnect​

1. Connection string must use the FOG read/write listener: <fog-name>.database.windows.net  (NOT the server name)
2. Connection retry policy: 5 attempts, 5s delay, exponential backoff
3. Connection-pool reset on transient failures (SqlConnection.ClearAllPools)
4. .NET client config: `Connection Lifetime=60` to recycle stale pooled connections post-failover
5. For Always On AGs, set `MultiSubnetFailover=True` in the connection string

Part 6 β€” Hybrid scenario (on-prem AG β†’ Azure VM AG)​

A common DR pattern: an on-prem 2-node sync AG replicates asynchronously to a third replica running on an Azure VM in another region. Drill steps:

Hybrid AG failover drill
πŸ“‘
1. Verify async replica health
sys.dm_hadr_database_replica_states
redo_queue_size acceptable
log_send_queue_size acceptable
🚦
2. Force-failover to Azure replica
ALTER AG ... FORCE_FAILOVER_ALLOW_DATA_LOSS
Capture RPO via heartbeat table
πŸ”
3. Resume data movement
ALTER DATABASE ... SET HADR RESUME on old primary
πŸ§ͺ
4. App smoke tests
Connection string points to Azure listener
Critical transactions complete
Latency baselines captured
↩️
5. Failback
Re-establish sync mode on the original primaries
Planned failover back
Document the elapsed RTO
🎯 Exam Focus

For forced failover with data loss, always capture the LSN of the last applied log record on the new primary (sys.dm_hadr_database_replica_states.last_hardened_lsn) and the LSN of the last shipped transaction on the old primary. Subtracting them gives you the exact number of transactions lost. :::

🏒 Real-World DBA Note

Oracle DBA parallel:

  • Failover Group planned failover β‰ˆ Oracle Data Guard switchover (zero data loss).
  • Failover Group forced failover β‰ˆ Oracle Data Guard failover (potential data loss).
  • AG synchronization_health_desc β‰ˆ Data Guard transport lag + apply lag health checks via V$DATAGUARD_STATS.
  • Cloud Witness for SQL AG β‰ˆ Oracle Observer for Data Guard Fast-Start Failover. :::

Anti-Patterns​

  • "Test failover by killing the primary in prod." Use planned failover (az sql failover-group set-primary) β€” same code path, no data loss. Force-failure tests belong in pre-prod.
  • "We don't need scheduled DR drills β€” the secondary is healthy." Healthy != tested. App connection strings, DNS, secrets, and on-call runbooks all rot. Test failover quarterly at minimum.
  • "After failover, leave the old primary as the new secondary." That's the right move β€” BUT confirm data sync caught up, reconfigure backups (LTR runs on the new primary now), and update monitoring/alert scopes.
  • "Forced failover is fine if we have backups." It causes data loss equal to unsynchronized log. Always check last_hardened_lsn delta first; fall back to backups only if the delta is unacceptable.
  • "Listener works β†’ app fine after failover." App may cache the old IP; connection string with MultiSubnetFailover=True is required for AG. For FOG, app must use the read-write listener, not the server name.
  • "Run DR drill once a year on a Friday afternoon." Drill during business hours with the team available. Fridays = nobody to debug Monday.
⚠️ Watch Out

Failover Group read-write listener stays the same after failover; read-only listener gets reassigned. Apps reading reports may break if hardcoded to a specific server name. Always use the listener.


Migration Between HA/DR Postures​

From β†’ ToPathCost
Single primary β†’ Active geo-replicationAdd geo-secondary; manual failoverAdds secondary compute cost; manual RTO
Active geo-replication β†’ Failover GroupWrap in FOG; gain auto-failover policySame compute; gain orchestration
FOG manual failover β†’ FOG auto-failover (1h grace)Enable auto-failover policyAuto-recovery; risk of split-brain on flapping
AG with file-share witness β†’ Cloud WitnessReconfigure cluster quorumRemoves file server dependency; tiny storage cost
Single AG β†’ Distributed AG (cross-region)Add forwarder + secondary AGCross-region resilience; complex topology
Quarterly manual DR drill β†’ Monthly automated drillRunbook + scripted failover + verificationAutomation effort; gain confidence
Email-only DR notification β†’ Action Group + ITSMWire AG β†’ ServiceNowOne-time wiring; auditable evidence
Reactive troubleshooting β†’ Synthetic transaction monitoringApp Insights availability test through listenerDetects DNS/listener issues before users notice

Most expensive moves: AG β†’ Distributed AG (cross-region complexity) and shifting to auto-failover policy (operational risk if not tested).


Real Scenarios​

  1. Quarterly DR drill on FOG β†’ Planned failover during business hours, validate app connectivity, fail back, document runbook updates. Driver: muscle memory. Trade-off: ~5 min app blip during failover.
  2. Region-down event β€” forced failover required β†’ Capture last_hardened_lsn from secondary + last LSN known on primary β†’ forced failover with --allow-data-loss β†’ reconcile delta from app journals. Driver: BCP triggered. Trade-off: documented data loss window.
  3. Listener returns wrong IP after subnet move β†’ Update DNS TTL + flush client DNS cache + verify with Test-NetConnection. Driver: DNS staleness. Trade-off: TTL planning before any subnet change.
  4. AG replica stuck NOT SYNCHRONIZING β†’ Check log send queue + redo queue, check network, suspend/resume data movement, last resort = remove + re-seed. Driver: targeted recovery. Trade-off: re-seed of TB-scale DBs is hours.
  5. App connection string still hardcoded to original server name post-failover β†’ Switch to FOG read-write listener; add MultiSubnetFailover=True for AG. Driver: catch in pre-prod drill. Trade-off: app team coordination.

Flashcards​

What is the difference between az sql failover-group set-primary with and without --allow-data-loss?
Click to reveal answer
Without the flag = planned failover (zero data loss; both replicas must be in sync). With the flag = forced failover, used when the primary is down or unreachable; potential data loss equal to the unsynchronized log.
1 / 6

Practice quiz​

Q1/5
0 correct
You issued `az sql failover-group set-primary` against the secondary server, but `replication_link_status` still shows the original primary as primary 5 minutes later. What is the MOST likely cause?

Sources & further reading​

Last verified against MS Learn: April 28, 2026.