HA/DR Testing & Troubleshooting — Failover Drills, RPO/RTO Verification, Common Failure Modes

The April 24, 2026 DP-300 outline added two explicit Domain 5 skills:

"Test a high availability and disaster recovery strategy"
"Troubleshoot a high availability and disaster recovery strategy"

A documented HA/DR design is worthless if it has never been tested. This page covers the test plans, T-SQL & CLI verification, RPO/RTO measurement, and the common failure modes you must be ready to triage.

Why this matters

Truth	Implication
Untested DR = no DR	Customers regularly find that promoted secondaries fail the first real test (missing logins, jobs, linked servers, NSG rules).
RPO/RTO are targets, not guarantees	Real numbers depend on workload, network, replica lag, and failover trigger time.
Failover ≠ failback	A successful failover is half the test — failback to the primary region is the other half.
App tier failover is part of DR	Connection strings, listener IPs, traffic-manager records, and Azure Front Door origins must all flip.

🎯 Exam Focus

The exam likes to test the distinction between planned failover (no data loss, manual) and forced/unplanned failover (potential data loss, used when primary is down). Know the CLI/T-SQL verbs for each: set-primary (planned) vs force-failover-allow-data-loss (unplanned). :::

Part 1 — Build a test plan

Drill catalog (run quarterly)

Drill	Trigger	Expected RPO	Expected RTO	Validates
Planned regional failover of Auto-Failover Group	`az sql failover-group set-primary`	0	30–60 s	Read/write listener cuts over; app tier reconnects
Forced failover simulating primary outage	`--allow-data-loss` switch	up to 5 s for FOG; 0 for sync AG	30–60 s for FOG; ~10 s for sync AG	Acceptable data loss, monitoring alarms fire
AG failover (SQL VM)	`ALTER AVAILABILITY GROUP ag1 FAILOVER`	0 (sync)	seconds	Listener, ILB probe, AG state, witness/quorum
MI restore from PITR	`az sql midb restore`	depends on backup time	minutes–hours	Backup chain integrity, restore permissions
Cross-region geo-restore	`az sql db restore --source-database-id <geo-backup-id>`	up to 1 h (RA-GRS replication lag)	minutes–hours	Geo-redundant backup recoverability
App-tier failover only	DNS / Front Door / Traffic Manager change	n/a	seconds–minutes	Connection-string strategy works
Full BCDR fire-drill	Combine FOG failover + app-tier failover + run smoke tests	per design	per design	End-to-end runbook validates

Test plan template (fill out before every drill)

Drill name:
Date / time window:
In-scope resources (servers, DBs, FOGs, AGs):
Trigger command:
Expected RTO / RPO:
Pre-drill state capture (queries / screenshots):
Smoke tests after failover:
Rollback / failback steps:
Comms plan (who to notify, status page updates):
Sign-off criteria:

Part 2 — Failover commands

Auto-Failover Group (Azure SQL DB / MI)

# PLANNED failover — geo-secondary becomes primary, no data loss
az sql failover-group set-primary \
  -g rg-secondary -s ehab-sqlsrv-secondary \
  -n fog-prod

# FORCED failover — accept potential data loss (use when primary is down)
az sql failover-group set-primary \
  -g rg-secondary -s ehab-sqlsrv-secondary \
  -n fog-prod --allow-data-loss

PowerShell equivalent:

# Planned
Switch-AzSqlDatabaseFailoverGroup -ResourceGroupName rg-secondary `
  -ServerName ehab-sqlsrv-secondary -FailoverGroupName fog-prod

# Forced
Switch-AzSqlDatabaseFailoverGroup -ResourceGroupName rg-secondary `
  -ServerName ehab-sqlsrv-secondary -FailoverGroupName fog-prod -AllowDataLoss

Always On AG on SQL Server VMs

-- Planned (synchronous replica only) — connect to SECONDARY
ALTER AVAILABILITY GROUP [ag1] FAILOVER;

-- Forced (potential data loss) — connect to SECONDARY
ALTER AVAILABILITY GROUP [ag1] FORCE_FAILOVER_ALLOW_DATA_LOSS;

MI failover (refresh underlying VM)

# User-initiated MI failover — useful for testing app reconnection logic
az sql mi failover -g $RG -n ehab-mi-prod --replica-type Primary

Part 3 — Verify state with T-SQL & DMVs

Failover Group / Geo-Replication state

-- Run on the PRIMARY logical server
SELECT
    link_guid,
    partner_server,
    partner_database,
    replication_state_desc,        -- CATCH_UP / SEEDING / SUSPENDED
    role_desc,                     -- PRIMARY / SECONDARY
    secondary_allow_connections_desc,
    last_replication,
    replication_lag_sec
FROM sys.dm_geo_replication_link_status;

replication_state_desc = 'CATCH_UP' and a low replication_lag_sec are what you want before triggering a planned failover.

Always On AG state

SELECT
    ar.replica_server_name,
    drs.database_id,
    DB_NAME(drs.database_id) AS db_name,
    drs.synchronization_state_desc,         -- SYNCHRONIZED / SYNCHRONIZING / NOT SYNCHRONIZING
    drs.synchronization_health_desc,        -- HEALTHY / PARTIALLY_HEALTHY / NOT_HEALTHY
    drs.is_primary_replica,
    drs.last_hardened_lsn,
    drs.log_send_queue_size,                -- KB pending to ship
    drs.redo_queue_size                     -- KB pending to redo on secondary
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar
    ON drs.replica_id = ar.replica_id
ORDER BY ar.replica_server_name;

-- Listener / cluster health
SELECT * FROM sys.dm_hadr_availability_group_states;
SELECT * FROM sys.dm_hadr_cluster_members;
SELECT * FROM sys.availability_group_listeners;

Backup history (verify chain before geo-restore)

SELECT TOP 20
    database_name, backup_start_date, backup_finish_date, type,
    DATEDIFF(SECOND, backup_start_date, backup_finish_date) AS duration_sec,
    backup_size / 1024.0 / 1024 / 1024 AS size_gb
FROM msdb.dbo.backupset
WHERE database_name = 'salesprod'
ORDER BY backup_start_date DESC;

Part 4 — Measure RPO and RTO

RPO (data loss window)

For Auto-Failover Groups:

-- On the primary BEFORE the drill, write a sentinel row with current UTC timestamp
INSERT INTO dbo.DR_HeartBeat (utc, note) VALUES (SYSUTCDATETIME(), 'pre-failover');
GO

-- After forced failover, on the NEW primary, find the highest sentinel that survived
SELECT TOP 1 utc, note FROM dbo.DR_HeartBeat ORDER BY utc DESC;

-- RPO = (UTC of last write on old primary) - (UTC of last sentinel that arrived)

For AG sync replicas, RPO is by definition zero — the secondary acknowledges before the commit returns.

RTO (downtime window)

RTO = (timestamp app reconnects successfully) - (timestamp failover triggered)

Capture both timestamps in your test runbook. Azure Monitor metric replication_link_status (failover groups) and AG DMVs flip immediately, but the app's RTO can be longer due to connection-pool retries and DNS TTLs.

Azure Monitor metrics to watch

Resource	Metric	What it tells you
Failover Group	`replication_lag_sec`	Real-time RPO ceiling
Failover Group	`replication_link_status`	UP / DEGRADED / DOWN
SQL DB	`dtu_consumption_percent` / `cpu_percent` post-failover	Workload behavior on new primary
MI	`storage_percent`, `cpu_percent`	Capacity validation after failover
Application Insights	request count, failed requests	App-tier RTO from a user perspective

Part 5 — Troubleshooting playbook

Failover group: "primary still primary after `set-primary`"

Cause	Fix
Used the primary server in the command instead of the secondary	Re-run with `-s <secondary-server>` (this is the most common mistake)
The geo-secondary is `SUSPENDED`	Resume with `ALTER DATABASE ... SET PARTNER RESUME` (sync) or recreate the FOG
Permissions: caller lacks `SQL DB Contributor` on the secondary RG	Grant role; retry

Forced failover: "data loss higher than expected"

Cause	Fix
`replication_lag_sec` was high before the drill (workload hot, link saturated)	Increase compute; check network throughput; consider Premium → Business Critical for sync
Bulk operation in flight (BULK INSERT, index rebuild)	Large operations re-ship after failover; schedule maintenance windows

AG: replica `NOT SYNCHRONIZING` / `NOT HEALTHY`

Symptom	Diagnosis	Fix
`synchronization_state_desc = 'NOT SYNCHRONIZING'`	Network outage or endpoint down	Check the database mirroring endpoint TCP 5022 NSG/firewall; restart endpoint
`redo_queue_size` keeps growing	Secondary CPU/IO saturated	Right-size secondary VM; verify storage SLO matches primary
Listener returns wrong IP after failover	ILB probe misconfigured (probe port, RegisterAllProvidersIP)	Set probe port; configure RegisterAllProvidersIP=0 + HostRecordTTL=300
App connects but reads stale data	App reading from a read-only routing list with old data	Verify `READ_ONLY_ROUTING_URL` and ApplicationIntent=ReadOnly clients

MI failover group quota errors

"Cannot enable Failover Group: subscription quota for vCores in the secondary region is exhausted"

Fix: file a quota increase via az quota for Microsoft.Sql/managedInstance in the target region. MI requires the same SKU and capacity in both regions.

Split-brain (rare but catastrophic)

A split-brain happens when an outage isolates the primary's region from the rest of the world, the secondary is force-failed-over, then the original primary recovers and tries to keep accepting writes.

Failover Groups: protected — the FOG fences the old primary. After connectivity returns, the old primary is demoted automatically. No action needed beyond confirming.
AG on SQL VMs: requires a quorum model (Cloud Witness) and REQUIRED_SYNCHRONIZED_SECONDARIES_TO_COMMIT >= 1 to prevent both replicas accepting writes.

Listener not reachable after failover

Confirm the listener IP via ILB rules (az network lb show)
Confirm the AG cluster registered the IP (cluster.exe /status)
Confirm the ILB health probe port is open in the OS firewall
Confirm DNS TTL for the listener record is short (<= 60 s)
From client: Test-NetConnection <listener> -Port 1433

Application doesn't reconnect

Connection string must use the FOG read/write listener: <fog-name>.database.windows.net  (NOT the server name)
Connection retry policy: 5 attempts, 5s delay, exponential backoff
Connection-pool reset on transient failures (SqlConnection.ClearAllPools)
.NET client config: `Connection Lifetime=60` to recycle stale pooled connections post-failover
For Always On AGs, set `MultiSubnetFailover=True` in the connection string

Part 6 — Hybrid scenario (on-prem AG → Azure VM AG)

A common DR pattern: an on-prem 2-node sync AG replicates asynchronously to a third replica running on an Azure VM in another region. Drill steps:

Hybrid AG failover drill

📡

1. Verify async replica health

sys.dm_hadr_database_replica_states

redo_queue_size acceptable

log_send_queue_size acceptable

🚦

2. Force-failover to Azure replica

ALTER AG ... FORCE_FAILOVER_ALLOW_DATA_LOSS

Capture RPO via heartbeat table

🔁

3. Resume data movement

ALTER DATABASE ... SET HADR RESUME on old primary

🧪

4. App smoke tests

Connection string points to Azure listener

Critical transactions complete

Latency baselines captured

↩️

5. Failback

Re-establish sync mode on the original primaries

Planned failover back

Document the elapsed RTO

🎯 Exam Focus

For forced failover with data loss, always capture the LSN of the last applied log record on the new primary (sys.dm_hadr_database_replica_states.last_hardened_lsn) and the LSN of the last shipped transaction on the old primary. Subtracting them gives you the exact number of transactions lost. :::

🏢 Real-World DBA Note

Oracle DBA parallel:

Failover Group planned failover ≈ Oracle Data Guard switchover (zero data loss).
Failover Group forced failover ≈ Oracle Data Guard failover (potential data loss).
AG synchronization_health_desc ≈ Data Guard transport lag + apply lag health checks via V$DATAGUARD_STATS.
Cloud Witness for SQL AG ≈ Oracle Observer for Data Guard Fast-Start Failover. :::

Anti-Patterns

"Test failover by killing the primary in prod." Use planned failover (az sql failover-group set-primary) — same code path, no data loss. Force-failure tests belong in pre-prod.
"We don't need scheduled DR drills — the secondary is healthy." Healthy != tested. App connection strings, DNS, secrets, and on-call runbooks all rot. Test failover quarterly at minimum.
"After failover, leave the old primary as the new secondary." That's the right move — BUT confirm data sync caught up, reconfigure backups (LTR runs on the new primary now), and update monitoring/alert scopes.
"Forced failover is fine if we have backups." It causes data loss equal to unsynchronized log. Always check last_hardened_lsn delta first; fall back to backups only if the delta is unacceptable.
"Listener works → app fine after failover." App may cache the old IP; connection string with MultiSubnetFailover=True is required for AG. For FOG, app must use the read-write listener, not the server name.
"Run DR drill once a year on a Friday afternoon." Drill during business hours with the team available. Fridays = nobody to debug Monday.

⚠️ Watch Out

Failover Group read-write listener stays the same after failover; read-only listener gets reassigned. Apps reading reports may break if hardcoded to a specific server name. Always use the listener.

Migration Between HA/DR Postures

From → To	Path	Cost
Single primary → Active geo-replication	Add geo-secondary; manual failover	Adds secondary compute cost; manual RTO
Active geo-replication → Failover Group	Wrap in FOG; gain auto-failover policy	Same compute; gain orchestration
FOG manual failover → FOG auto-failover (1h grace)	Enable auto-failover policy	Auto-recovery; risk of split-brain on flapping
AG with file-share witness → Cloud Witness	Reconfigure cluster quorum	Removes file server dependency; tiny storage cost
Single AG → Distributed AG (cross-region)	Add forwarder + secondary AG	Cross-region resilience; complex topology
Quarterly manual DR drill → Monthly automated drill	Runbook + scripted failover + verification	Automation effort; gain confidence
Email-only DR notification → Action Group + ITSM	Wire AG → ServiceNow	One-time wiring; auditable evidence
Reactive troubleshooting → Synthetic transaction monitoring	App Insights availability test through listener	Detects DNS/listener issues before users notice

Most expensive moves: AG → Distributed AG (cross-region complexity) and shifting to auto-failover policy (operational risk if not tested).

Real Scenarios

Quarterly DR drill on FOG → Planned failover during business hours, validate app connectivity, fail back, document runbook updates. Driver: muscle memory. Trade-off: ~5 min app blip during failover.
Region-down event — forced failover required → Capture last_hardened_lsn from secondary + last LSN known on primary → forced failover with --allow-data-loss → reconcile delta from app journals. Driver: BCP triggered. Trade-off: documented data loss window.
Listener returns wrong IP after subnet move → Update DNS TTL + flush client DNS cache + verify with Test-NetConnection. Driver: DNS staleness. Trade-off: TTL planning before any subnet change.
AG replica stuck NOT SYNCHRONIZING → Check log send queue + redo queue, check network, suspend/resume data movement, last resort = remove + re-seed. Driver: targeted recovery. Trade-off: re-seed of TB-scale DBs is hours.
App connection string still hardcoded to original server name post-failover → Switch to FOG read-write listener; add MultiSubnetFailover=True for AG. Driver: catch in pre-prod drill. Trade-off: app team coordination.

Flashcards

What is the difference between az sql failover-group set-primary with and without --allow-data-loss?

Click to reveal answer

Without the flag = planned failover (zero data loss; both replicas must be in sync). With the flag = forced failover, used when the primary is down or unreachable; potential data loss equal to the unsynchronized log.

1 / 6

Practice quiz

Q1/5

0 correct

You issued `az sql failover-group set-primary` against the secondary server, but `replication_link_status` still shows the original primary as primary 5 minutes later. What is the MOST likely cause?

Sources & further reading

Last verified against MS Learn: April 28, 2026.

Why this matters​

Part 1 — Build a test plan​

Drill catalog (run quarterly)​

Test plan template (fill out before every drill)​

Part 2 — Failover commands​

Auto-Failover Group (Azure SQL DB / MI)​

Always On AG on SQL Server VMs​

MI failover (refresh underlying VM)​

Part 3 — Verify state with T-SQL & DMVs​

Failover Group / Geo-Replication state​

Always On AG state​

Backup history (verify chain before geo-restore)​

Part 4 — Measure RPO and RTO​

RPO (data loss window)​

RTO (downtime window)​

Azure Monitor metrics to watch​

Part 5 — Troubleshooting playbook​

Failover group: "primary still primary after set-primary"​

Forced failover: "data loss higher than expected"​

AG: replica NOT SYNCHRONIZING / NOT HEALTHY​

MI failover group quota errors​

Split-brain (rare but catastrophic)​

Listener not reachable after failover​

Application doesn't reconnect​

Part 6 — Hybrid scenario (on-prem AG → Azure VM AG)​

Anti-Patterns​

Migration Between HA/DR Postures​

Real Scenarios​

Flashcards​

Practice quiz​

Sources & further reading​