HA/DR Testing & Troubleshooting β Failover Drills, RPO/RTO Verification, Common Failure Modes
The April 24, 2026 DP-300 outline added two explicit Domain 5 skills:
"Test a high availability and disaster recovery strategy"
"Troubleshoot a high availability and disaster recovery strategy"
A documented HA/DR design is worthless if it has never been tested. This page covers the test plans, T-SQL & CLI verification, RPO/RTO measurement, and the common failure modes you must be ready to triage.
Why this mattersβ
| Truth | Implication |
|---|---|
| Untested DR = no DR | Customers regularly find that promoted secondaries fail the first real test (missing logins, jobs, linked servers, NSG rules). |
| RPO/RTO are targets, not guarantees | Real numbers depend on workload, network, replica lag, and failover trigger time. |
| Failover β failback | A successful failover is half the test β failback to the primary region is the other half. |
| App tier failover is part of DR | Connection strings, listener IPs, traffic-manager records, and Azure Front Door origins must all flip. |
The exam likes to test the distinction between planned failover (no data loss, manual) and forced/unplanned failover (potential data loss, used when primary is down). Know the CLI/T-SQL verbs for each: set-primary (planned) vs force-failover-allow-data-loss (unplanned).
:::
Part 1 β Build a test planβ
Drill catalog (run quarterly)β
| Drill | Trigger | Expected RPO | Expected RTO | Validates |
|---|---|---|---|---|
| Planned regional failover of Auto-Failover Group | az sql failover-group set-primary | 0 | 30β60 s | Read/write listener cuts over; app tier reconnects |
| Forced failover simulating primary outage | --allow-data-loss switch | up to 5 s for FOG; 0 for sync AG | 30β60 s for FOG; ~10 s for sync AG | Acceptable data loss, monitoring alarms fire |
| AG failover (SQL VM) | ALTER AVAILABILITY GROUP ag1 FAILOVER | 0 (sync) | seconds | Listener, ILB probe, AG state, witness/quorum |
| MI restore from PITR | az sql midb restore | depends on backup time | minutesβhours | Backup chain integrity, restore permissions |
| Cross-region geo-restore | az sql db restore --source-database-id <geo-backup-id> | up to 1 h (RA-GRS replication lag) | minutesβhours | Geo-redundant backup recoverability |
| App-tier failover only | DNS / Front Door / Traffic Manager change | n/a | secondsβminutes | Connection-string strategy works |
| Full BCDR fire-drill | Combine FOG failover + app-tier failover + run smoke tests | per design | per design | End-to-end runbook validates |
Test plan template (fill out before every drill)β
Drill name:
Date / time window:
In-scope resources (servers, DBs, FOGs, AGs):
Trigger command:
Expected RTO / RPO:
Pre-drill state capture (queries / screenshots):
Smoke tests after failover:
Rollback / failback steps:
Comms plan (who to notify, status page updates):
Sign-off criteria:
Part 2 β Failover commandsβ
Auto-Failover Group (Azure SQL DB / MI)β
# PLANNED failover β geo-secondary becomes primary, no data loss
az sql failover-group set-primary \
-g rg-secondary -s ehab-sqlsrv-secondary \
-n fog-prod
# FORCED failover β accept potential data loss (use when primary is down)
az sql failover-group set-primary \
-g rg-secondary -s ehab-sqlsrv-secondary \
-n fog-prod --allow-data-loss
PowerShell equivalent:
# Planned
Switch-AzSqlDatabaseFailoverGroup -ResourceGroupName rg-secondary `
-ServerName ehab-sqlsrv-secondary -FailoverGroupName fog-prod
# Forced
Switch-AzSqlDatabaseFailoverGroup -ResourceGroupName rg-secondary `
-ServerName ehab-sqlsrv-secondary -FailoverGroupName fog-prod -AllowDataLoss
Always On AG on SQL Server VMsβ
-- Planned (synchronous replica only) β connect to SECONDARY
ALTER AVAILABILITY GROUP [ag1] FAILOVER;
-- Forced (potential data loss) β connect to SECONDARY
ALTER AVAILABILITY GROUP [ag1] FORCE_FAILOVER_ALLOW_DATA_LOSS;
MI failover (refresh underlying VM)β
# User-initiated MI failover β useful for testing app reconnection logic
az sql mi failover -g $RG -n ehab-mi-prod --replica-type Primary
Part 3 β Verify state with T-SQL & DMVsβ
Failover Group / Geo-Replication stateβ
-- Run on the PRIMARY logical server
SELECT
link_guid,
partner_server,
partner_database,
replication_state_desc, -- CATCH_UP / SEEDING / SUSPENDED
role_desc, -- PRIMARY / SECONDARY
secondary_allow_connections_desc,
last_replication,
replication_lag_sec
FROM sys.dm_geo_replication_link_status;
replication_state_desc = 'CATCH_UP' and a low replication_lag_sec are what you want before triggering a planned failover.
Always On AG stateβ
SELECT
ar.replica_server_name,
drs.database_id,
DB_NAME(drs.database_id) AS db_name,
drs.synchronization_state_desc, -- SYNCHRONIZED / SYNCHRONIZING / NOT SYNCHRONIZING
drs.synchronization_health_desc, -- HEALTHY / PARTIALLY_HEALTHY / NOT_HEALTHY
drs.is_primary_replica,
drs.last_hardened_lsn,
drs.log_send_queue_size, -- KB pending to ship
drs.redo_queue_size -- KB pending to redo on secondary
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar
ON drs.replica_id = ar.replica_id
ORDER BY ar.replica_server_name;
-- Listener / cluster health
SELECT * FROM sys.dm_hadr_availability_group_states;
SELECT * FROM sys.dm_hadr_cluster_members;
SELECT * FROM sys.availability_group_listeners;
Backup history (verify chain before geo-restore)β
SELECT TOP 20
database_name, backup_start_date, backup_finish_date, type,
DATEDIFF(SECOND, backup_start_date, backup_finish_date) AS duration_sec,
backup_size / 1024.0 / 1024 / 1024 AS size_gb
FROM msdb.dbo.backupset
WHERE database_name = 'salesprod'
ORDER BY backup_start_date DESC;
Part 4 β Measure RPO and RTOβ
RPO (data loss window)β
For Auto-Failover Groups:
-- On the primary BEFORE the drill, write a sentinel row with current UTC timestamp
INSERT INTO dbo.DR_HeartBeat (utc, note) VALUES (SYSUTCDATETIME(), 'pre-failover');
GO
-- After forced failover, on the NEW primary, find the highest sentinel that survived
SELECT TOP 1 utc, note FROM dbo.DR_HeartBeat ORDER BY utc DESC;
-- RPO = (UTC of last write on old primary) - (UTC of last sentinel that arrived)
For AG sync replicas, RPO is by definition zero β the secondary acknowledges before the commit returns.
RTO (downtime window)β
RTO = (timestamp app reconnects successfully) - (timestamp failover triggered)
Capture both timestamps in your test runbook. Azure Monitor metric replication_link_status (failover groups) and AG DMVs flip immediately, but the app's RTO can be longer due to connection-pool retries and DNS TTLs.
Azure Monitor metrics to watchβ
| Resource | Metric | What it tells you |
|---|---|---|
| Failover Group | replication_lag_sec | Real-time RPO ceiling |
| Failover Group | replication_link_status | UP / DEGRADED / DOWN |
| SQL DB | dtu_consumption_percent / cpu_percent post-failover | Workload behavior on new primary |
| MI | storage_percent, cpu_percent | Capacity validation after failover |
| Application Insights | request count, failed requests | App-tier RTO from a user perspective |
Part 5 β Troubleshooting playbookβ
Failover group: "primary still primary after set-primary"β
| Cause | Fix |
|---|---|
| Used the primary server in the command instead of the secondary | Re-run with -s <secondary-server> (this is the most common mistake) |
The geo-secondary is SUSPENDED | Resume with ALTER DATABASE ... SET PARTNER RESUME (sync) or recreate the FOG |
Permissions: caller lacks SQL DB Contributor on the secondary RG | Grant role; retry |
Forced failover: "data loss higher than expected"β
| Cause | Fix |
|---|---|
replication_lag_sec was high before the drill (workload hot, link saturated) | Increase compute; check network throughput; consider Premium β Business Critical for sync |
| Bulk operation in flight (BULK INSERT, index rebuild) | Large operations re-ship after failover; schedule maintenance windows |
AG: replica NOT SYNCHRONIZING / NOT HEALTHYβ
| Symptom | Diagnosis | Fix |
|---|---|---|
synchronization_state_desc = 'NOT SYNCHRONIZING' | Network outage or endpoint down | Check the database mirroring endpoint TCP 5022 NSG/firewall; restart endpoint |
redo_queue_size keeps growing | Secondary CPU/IO saturated | Right-size secondary VM; verify storage SLO matches primary |
| Listener returns wrong IP after failover | ILB probe misconfigured (probe port, RegisterAllProvidersIP) | Set probe port; configure RegisterAllProvidersIP=0 + HostRecordTTL=300 |
| App connects but reads stale data | App reading from a read-only routing list with old data | Verify READ_ONLY_ROUTING_URL and ApplicationIntent=ReadOnly clients |
MI failover group quota errorsβ
"Cannot enable Failover Group: subscription quota for vCores in the secondary region is exhausted"
Fix: file a quota increase via az quota for Microsoft.Sql/managedInstance in the target region. MI requires the same SKU and capacity in both regions.
Split-brain (rare but catastrophic)β
A split-brain happens when an outage isolates the primary's region from the rest of the world, the secondary is force-failed-over, then the original primary recovers and tries to keep accepting writes.
- Failover Groups: protected β the FOG fences the old primary. After connectivity returns, the old primary is demoted automatically. No action needed beyond confirming.
- AG on SQL VMs: requires a quorum model (Cloud Witness) and
REQUIRED_SYNCHRONIZED_SECONDARIES_TO_COMMIT >= 1to prevent both replicas accepting writes.
Listener not reachable after failoverβ
1. Confirm the listener IP via ILB rules (az network lb show)
2. Confirm the AG cluster registered the IP (cluster.exe /status)
3. Confirm the ILB health probe port is open in the OS firewall
4. Confirm DNS TTL for the listener record is short (<= 60 s)
5. From client: Test-NetConnection <listener> -Port 1433
Application doesn't reconnectβ
1. Connection string must use the FOG read/write listener: <fog-name>.database.windows.net (NOT the server name)
2. Connection retry policy: 5 attempts, 5s delay, exponential backoff
3. Connection-pool reset on transient failures (SqlConnection.ClearAllPools)
4. .NET client config: `Connection Lifetime=60` to recycle stale pooled connections post-failover
5. For Always On AGs, set `MultiSubnetFailover=True` in the connection string
Part 6 β Hybrid scenario (on-prem AG β Azure VM AG)β
A common DR pattern: an on-prem 2-node sync AG replicates asynchronously to a third replica running on an Azure VM in another region. Drill steps:
For forced failover with data loss, always capture the LSN of the last applied log record on the new primary (sys.dm_hadr_database_replica_states.last_hardened_lsn) and the LSN of the last shipped transaction on the old primary. Subtracting them gives you the exact number of transactions lost.
:::
Oracle DBA parallel:
- Failover Group planned failover β Oracle Data Guard switchover (zero data loss).
- Failover Group forced failover β Oracle Data Guard failover (potential data loss).
- AG
synchronization_health_descβ Data Guardtransport lag+apply laghealth checks viaV$DATAGUARD_STATS. - Cloud Witness for SQL AG β Oracle Observer for Data Guard Fast-Start Failover. :::
Anti-Patternsβ
- "Test failover by killing the primary in prod." Use planned failover (
az sql failover-group set-primary) β same code path, no data loss. Force-failure tests belong in pre-prod. - "We don't need scheduled DR drills β the secondary is healthy." Healthy != tested. App connection strings, DNS, secrets, and on-call runbooks all rot. Test failover quarterly at minimum.
- "After failover, leave the old primary as the new secondary." That's the right move β BUT confirm data sync caught up, reconfigure backups (LTR runs on the new primary now), and update monitoring/alert scopes.
- "Forced failover is fine if we have backups." It causes data loss equal to unsynchronized log. Always check
last_hardened_lsndelta first; fall back to backups only if the delta is unacceptable. - "Listener works β app fine after failover." App may cache the old IP; connection string with
MultiSubnetFailover=Trueis required for AG. For FOG, app must use the read-write listener, not the server name. - "Run DR drill once a year on a Friday afternoon." Drill during business hours with the team available. Fridays = nobody to debug Monday.
Failover Group read-write listener stays the same after failover; read-only listener gets reassigned. Apps reading reports may break if hardcoded to a specific server name. Always use the listener.
Migration Between HA/DR Posturesβ
| From β To | Path | Cost |
|---|---|---|
| Single primary β Active geo-replication | Add geo-secondary; manual failover | Adds secondary compute cost; manual RTO |
| Active geo-replication β Failover Group | Wrap in FOG; gain auto-failover policy | Same compute; gain orchestration |
| FOG manual failover β FOG auto-failover (1h grace) | Enable auto-failover policy | Auto-recovery; risk of split-brain on flapping |
| AG with file-share witness β Cloud Witness | Reconfigure cluster quorum | Removes file server dependency; tiny storage cost |
| Single AG β Distributed AG (cross-region) | Add forwarder + secondary AG | Cross-region resilience; complex topology |
| Quarterly manual DR drill β Monthly automated drill | Runbook + scripted failover + verification | Automation effort; gain confidence |
| Email-only DR notification β Action Group + ITSM | Wire AG β ServiceNow | One-time wiring; auditable evidence |
| Reactive troubleshooting β Synthetic transaction monitoring | App Insights availability test through listener | Detects DNS/listener issues before users notice |
Most expensive moves: AG β Distributed AG (cross-region complexity) and shifting to auto-failover policy (operational risk if not tested).
Real Scenariosβ
- Quarterly DR drill on FOG β Planned failover during business hours, validate app connectivity, fail back, document runbook updates. Driver: muscle memory. Trade-off: ~5 min app blip during failover.
- Region-down event β forced failover required β Capture
last_hardened_lsnfrom secondary + last LSN known on primary β forced failover with--allow-data-lossβ reconcile delta from app journals. Driver: BCP triggered. Trade-off: documented data loss window. - Listener returns wrong IP after subnet move β Update DNS TTL + flush client DNS cache + verify with
Test-NetConnection. Driver: DNS staleness. Trade-off: TTL planning before any subnet change. - AG replica stuck NOT SYNCHRONIZING β Check log send queue + redo queue, check network, suspend/resume data movement, last resort = remove + re-seed. Driver: targeted recovery. Trade-off: re-seed of TB-scale DBs is hours.
- App connection string still hardcoded to original server name post-failover β Switch to FOG read-write listener; add
MultiSubnetFailover=Truefor AG. Driver: catch in pre-prod drill. Trade-off: app team coordination.
Flashcardsβ
Practice quizβ
Sources & further readingβ
- Use auto-failover groups to enable transparent and coordinated failover
- Configure and manage failover for SQL Managed Instance
- Active geo-replication for Azure SQL Database
- Always On Availability Groups: Monitoring
- Troubleshoot Always On Availability Groups failover
- Application connection retry logic for transient errors
- DP-300 study guide (April 24, 2026)
Last verified against MS Learn: April 28, 2026.