Lab: Cross-Region AlwaysOn Availability Groups
Extend the single-region AlwaysOn AG into a multi-region setup — deploy a 3rd SQL replica in a different Azure region with async commit, configure distributed AG or multi-subnet AG, and understand the trade-offs between same-region sync commit and cross-region async commit.
DP-300 Domain: 5 — Plan and Configure HA/DR (20-25%)
Real customer scenario: your primary AG runs in West Europe with sync commit and automatic failover. A 3rd replica in North Europe provides async DR. If the entire West Europe region goes down, you can manually failover to North Europe with minimal data loss (RPO < 5 seconds under normal load).
Why Cross-Region?
Architecture
What Gets Deployed
| Component | Resource | Region | Purpose |
|---|---|---|---|
| Primary AG | sqlvm-ha1 + sqlvm-ha2 | West Europe | Sync commit pair with automatic failover |
| DR Replica | sqlvm-dr1 (10.1.1.10) | North Europe | Async commit for region-level DR |
| VNET Peering | Global peering | Cross-region | Connects the two VNETs |
| DR VNET | vnet-sql-dr (10.1.0.0/16) | North Europe | Network for DR replica |
| NSG | nsg-sql-dr | North Europe | Firewall for DR subnet |
Prerequisites
- Completed the AlwaysOn HA Setup lab — the primary 2-node AG must be running
- Azure subscription with quota for 1 additional
Standard_D2ads_v6VM - Existing domain
sqlha.localrunning ondc-vm
Lab Steps
Step 1: Deploy DR Infrastructure
$drResourceGroup = "rg-sql-ha" # Same RG or separate — your choice
$drLocation = "northeurope"
$primaryVnet = "vnet-sql-ha"
$primaryRG = "rg-sql-ha"
# Create DR VNET in North Europe
az network vnet create `
--resource-group $drResourceGroup `
--name vnet-sql-dr `
--location $drLocation `
--address-prefixes 10.1.0.0/16 `
--subnet-name dr-subnet `
--subnet-prefixes 10.1.1.0/24
# Create NSG for DR subnet
az network nsg create `
--resource-group $drResourceGroup `
--name nsg-sql-dr `
--location $drLocation
# Allow SQL, HADR mirroring, RDP, and WSFC
az network nsg rule create --resource-group $drResourceGroup --nsg-name nsg-sql-dr `
--name AllowSQL --priority 100 --direction Inbound --access Allow `
--protocol Tcp --destination-port-range 1433 --source-address-prefix VirtualNetwork
az network nsg rule create --resource-group $drResourceGroup --nsg-name nsg-sql-dr `
--name AllowHADR --priority 200 --direction Inbound --access Allow `
--protocol Tcp --destination-port-range 5022 --source-address-prefix VirtualNetwork
az network nsg rule create --resource-group $drResourceGroup --nsg-name nsg-sql-dr `
--name AllowRDP --priority 300 --direction Inbound --access Allow `
--protocol Tcp --destination-port-range 3389 --source-address-prefix '*'
az network nsg rule create --resource-group $drResourceGroup --nsg-name nsg-sql-dr `
--name AllowWSFC --priority 400 --direction Inbound --access Allow `
--protocol '*' --destination-port-ranges 135 5985 49152-65535 --source-address-prefix VirtualNetwork
# Associate NSG with DR subnet
az network vnet subnet update `
--resource-group $drResourceGroup `
--vnet-name vnet-sql-dr `
--name dr-subnet `
--network-security-group nsg-sql-dr
# Deploy DR SQL VM
az vm create `
--resource-group $drResourceGroup `
--name sqlvm-dr1 `
--image MicrosoftSQLServer:sql2022-ws2022:enterprise-gen2:latest `
--size Standard_D2ads_v6 `
--admin-username azureadmin `
--admin-password '<YOUR_PASSWORD>' `
--vnet-name vnet-sql-dr `
--subnet dr-subnet `
--private-ip-address 10.1.1.10 `
--public-ip-sku Standard `
--nsg nsg-sql-dr `
--location $drLocation
Step 2: Configure VNET Peering
# Get VNET resource IDs
$vnet1Id = az network vnet show --resource-group $primaryRG --name $primaryVnet --query id -o tsv
$vnet2Id = az network vnet show --resource-group $drResourceGroup --name vnet-sql-dr --query id -o tsv
# Create peering: Primary → DR
az network vnet peering create `
--resource-group $primaryRG `
--name peer-primary-to-dr `
--vnet-name $primaryVnet `
--remote-vnet $vnet2Id `
--allow-vnet-access `
--allow-forwarded-traffic
# Create peering: DR → Primary
az network vnet peering create `
--resource-group $drResourceGroup `
--name peer-dr-to-primary `
--vnet-name vnet-sql-dr `
--remote-vnet $vnet1Id `
--allow-vnet-access `
--allow-forwarded-traffic
# Verify peering status
az network vnet peering list --resource-group $primaryRG --vnet-name $primaryVnet -o table
VNET peering must be created in both directions. The peering status should show Connected on both sides. Traffic won't flow until both sides are Connected.
Step 3: Configure DNS & Domain Join DR Replica
# Update DR VNET DNS to point to the primary DC
az network vnet update `
--resource-group $drResourceGroup `
--name vnet-sql-dr `
--dns-servers 10.0.1.4
# Restart DR VM to pick up new DNS
az vm restart --resource-group $drResourceGroup --name sqlvm-dr1
RDP into sqlvm-dr1 and run:
# Verify DNS resolution to domain controller
nslookup sqlha.local
# Should resolve to 10.0.1.4
# Domain join
Add-Computer -DomainName "sqlha.local" -Credential (Get-Credential SQLHA\azureadmin) -Restart
After reboot, verify:
# Verify domain membership
(Get-WmiObject Win32_ComputerSystem).Domain
# Expected: sqlha.local
Step 4: Install Failover Clustering & Add to WSFC
RDP into sqlvm-dr1:
# Install Failover Clustering feature
Install-WindowsFeature Failover-Clustering -IncludeManagementTools
# Restart
Restart-Computer -Force
After reboot, from any existing cluster node (e.g., sqlvm-ha1):
# Add DR node to existing cluster
Add-ClusterNode -Name "sqlvm-dr1" -Cluster "sqlha-cluster"
# Verify — should show 3 nodes
Get-ClusterNode | Format-Table Name, State, NodeWeight
# Verify quorum — with 3 nodes, File Share Witness gives 4 votes (odd)
Get-ClusterQuorum | Format-List
With 3 nodes + 1 File Share Witness = 4 votes. Majority quorum requires 3/4 votes, so you can lose 1 node. If you add a 4th replica later, quorum math changes. Always verify with Get-ClusterQuorum.
Step 5: Add Async Commit Replica to AG
On the current PRIMARY (connect via SSMS):
-- Step 1: Enable Always On on the DR replica first
-- (RDP into sqlvm-dr1, run in elevated PowerShell)
-- Enable-SqlAlwaysOn -ServerInstance "sqlvm-dr1" -Force
-- Step 2: On PRIMARY — add the async replica
ALTER AVAILABILITY GROUP [sqlha-ag]
ADD REPLICA ON N'sqlvm-dr1'
WITH (
ENDPOINT_URL = N'TCP://sqlvm-dr1.sqlha.local:5022',
AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT, -- ← KEY DIFFERENCE
FAILOVER_MODE = MANUAL, -- ← Must be manual for async
SEEDING_MODE = AUTOMATIC, -- Automatic initial seeding
SECONDARY_ROLE (
ALLOW_CONNECTIONS = READ_ONLY, -- Read-only routing target
READ_ONLY_ROUTING_URL = N'TCP://sqlvm-dr1.sqlha.local:1433'
)
);
On sqlvm-dr1:
-- Join the replica to the AG
ALTER AVAILABILITY GROUP [sqlha-ag] JOIN;
-- Grant AG permission for automatic seeding
ALTER AVAILABILITY GROUP [sqlha-ag] GRANT CREATE ANY DATABASE;
Verify the replica is synchronizing:
-- On PRIMARY — check all replicas
SELECT
r.replica_server_name,
r.availability_mode_desc,
r.failover_mode_desc,
rs.role_desc,
rs.synchronization_health_desc,
rs.connected_state_desc
FROM sys.availability_replicas r
JOIN sys.dm_hadr_availability_replica_states rs
ON r.replica_id = rs.replica_id
ORDER BY r.replica_server_name;
Expected output:
| replica_server_name | availability_mode_desc | failover_mode_desc | role_desc | sync_health |
|---|---|---|---|---|
| sqlvm-ha1 | SYNCHRONOUS_COMMIT | AUTOMATIC | PRIMARY | HEALTHY |
| sqlvm-ha2 | SYNCHRONOUS_COMMIT | AUTOMATIC | SECONDARY | HEALTHY |
| sqlvm-dr1 | ASYNCHRONOUS_COMMIT | MANUAL | SECONDARY | HEALTHY |
Step 6: Test Cross-Region Failover
Measure the replication lag first (on PRIMARY):
-- Check send/redo queue sizes for each replica
SELECT
ag.name AS ag_name,
r.replica_server_name,
rs.role_desc,
drs.database_id,
drs.synchronization_state_desc,
drs.log_send_queue_size, -- KB pending send
drs.redo_queue_size, -- KB pending redo on secondary
drs.log_send_rate, -- KB/sec being sent
drs.redo_rate, -- KB/sec being applied
drs.last_commit_time,
DATEDIFF(SECOND, drs.last_commit_time, GETUTCDATE()) AS lag_seconds
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas r ON drs.replica_id = r.replica_id
JOIN sys.availability_groups ag ON r.group_id = ag.group_id
ORDER BY r.replica_server_name;
Manual failover to DR (forced, with potential data loss):
-- On sqlvm-dr1 — FORCE failover (accepts potential data loss)
ALTER AVAILABILITY GROUP [sqlha-ag] FORCE_FAILOVER_ALLOW_DATA_LOSS;
FORCE_FAILOVER_ALLOW_DATA_LOSS is the only way to failover to an async replica. Any transactions not yet replicated are lost. After forced failover, the old primary will be in RESOLVING state and needs to be manually resumed.
After testing, failback to primary region:
-- On the old primary (sqlvm-ha1) — resume after forced failover
ALTER DATABASE [YourDB] SET HADR RESUME;
-- On sqlvm-dr1 (current primary) — failover back to original primary
-- First, wait for sync to catch up, then:
ALTER AVAILABILITY GROUP [sqlha-ag] FAILOVER; -- Only works if target is sync commit
-- OR if sqlvm-ha1 is still async, force again:
ALTER AVAILABILITY GROUP [sqlha-ag] FORCE_FAILOVER_ALLOW_DATA_LOSS;
Step 7: Compare Sync vs Async Performance
Run this workload test on the PRIMARY to see the commit latency difference:
-- Create test table
CREATE TABLE dbo.LatencyTest (
Id INT IDENTITY PRIMARY KEY,
Payload NVARCHAR(200),
InsertedAt DATETIME2 DEFAULT SYSUTCDATETIME()
);
-- Measure insert throughput (single-row inserts to show commit latency)
DECLARE @start DATETIME2 = SYSUTCDATETIME();
DECLARE @i INT = 0;
WHILE @i < 1000
BEGIN
INSERT INTO dbo.LatencyTest (Payload)
VALUES (REPLICATE('X', 200));
SET @i += 1;
END
SELECT DATEDIFF(MILLISECOND, @start, SYSUTCDATETIME()) AS total_ms,
DATEDIFF(MILLISECOND, @start, SYSUTCDATETIME()) / 1000.0 AS avg_ms_per_insert;
Compare with the replica lag:
-- Real-time view of replication latency
SELECT
r.replica_server_name,
r.availability_mode_desc,
drs.log_send_queue_size AS send_queue_kb,
drs.redo_queue_size AS redo_queue_kb,
CASE
WHEN drs.last_commit_time IS NULL THEN 'N/A'
ELSE CAST(DATEDIFF(MILLISECOND, drs.last_commit_time, GETUTCDATE()) AS VARCHAR) + ' ms'
END AS replication_lag
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas r ON drs.replica_id = r.replica_id
WHERE drs.is_local = 0;
Key Learnings: Same Region vs Cross Region
| Aspect | Same Region (Sync) | Cross Region (Async) |
|---|---|---|
| Commit mode | Synchronous | Asynchronous |
| Failover mode | Automatic | Manual only |
| RPO | 0 (zero data loss) | > 0 (depends on lag) |
| RTO | ~10-30 seconds | Manual action + ~30 sec |
| Network latency | ~1ms | ~15-30ms (EU regions) |
| Impact on commits | Slight — must wait for remote harden | None — async fire and forget |
| Use case | HA within a region | DR across regions |
Common Failures
| Symptom | Likely Cause | Fix |
|---|---|---|
| Cross-region replica stuck SYNCHRONIZING but high latency | WAN bandwidth bottleneck on log block ship | Confirm async commit; size ExpressRoute / VPN; monitor log_send_queue_size |
| Forced failover lost transactions | Async secondary was behind by N MB of log | Document RPO at this lag level; alert when log_send_queue_size_kb > threshold |
| AG won't fail back: "database not in valid state" | Old primary was force-failed; needs RESUME | ALTER DATABASE [DB] SET HADR RESUME on old primary |
| Listener returns wrong IP after region failover | DNS TTL too long; client cached old IP | Lower TTL on listener DNS to 30s; add MultiSubnetFailover=True to connection string |
| Cluster split-brain after WAN partition | No witness or witness in same partition | Place Cloud Witness in a 3rd region |
Async commit + automatic failover = forbidden combo. SQL won't let you set both. Auto-failover requires sync commit, which requires acceptable WAN latency (< 10ms ideally). Cross-region is almost always async + manual failover.
Variations to Try
- Add a second async replica in a 3rd region for additional DR redundancy
- Convert to distributed AG to chain two AGs across regions (no single-cluster span)
- Use Read-only routing so cross-region reporting workloads land on the DR replica
- Place backups on async secondary to keep them out of the primary region
- Test forced failover quarterly, document RPO/RTO observed
Production Application
- DR for tier-1 OLTP when PaaS Failover Group isn't viable (legacy SQL features, CLR, agent jobs).
- Regulated workload requiring cross-region recovery without leaving IaaS (custom encryption, third-party agents).
- Active-passive cross-region — keep DR cold-warm via async; pay only for VM compute, not duplicated PaaS S-tier.
- Anti-pattern: don't pick this for sub-second cross-region failover — use a distributed system (Cosmos DB, etc.) at the data tier instead.
Clean Up
# Remove the DR replica from AG (on current PRIMARY)
# ALTER AVAILABILITY GROUP [sqlha-ag] REMOVE REPLICA ON N'sqlvm-dr1';
# Remove DR node from cluster
# Remove-ClusterNode -Name "sqlvm-dr1" -Force
# Delete DR VM
az vm delete --resource-group rg-sql-ha --name sqlvm-dr1 --yes
# Remove peering
az network vnet peering delete --resource-group rg-sql-ha --name peer-primary-to-dr --vnet-name vnet-sql-ha
az network vnet peering delete --resource-group rg-sql-ha --name peer-dr-to-primary --vnet-name vnet-sql-dr
# Delete DR VNET and NSG
az network vnet delete --resource-group rg-sql-ha --name vnet-sql-dr
az network nsg delete --resource-group rg-sql-ha --name nsg-sql-dr
What's Next?
| Direction | Lab |
|---|---|
| Try managed HA instead of VMs | SQL MI Flavors & Replicas |
| Compare PaaS DR options | Geo-Replication & Failover |
| Try all migration paths | Complete Migration Scenarios |