Skip to main content

Lab: Cross-Region AlwaysOn Availability Groups

Extend the single-region AlwaysOn AG into a multi-region setup — deploy a 3rd SQL replica in a different Azure region with async commit, configure distributed AG or multi-subnet AG, and understand the trade-offs between same-region sync commit and cross-region async commit.

DP-300 Domain: 5 — Plan and Configure HA/DR (20-25%)

🏢 Real-World DBA Note

Real customer scenario: your primary AG runs in West Europe with sync commit and automatic failover. A 3rd replica in North Europe provides async DR. If the entire West Europe region goes down, you can manually failover to North Europe with minimal data loss (RPO < 5 seconds under normal load).

Why Cross-Region?

Same-Region vs Cross-Region AG
🏠
Same-Region AG (built in previous lab)
Sync commit, automatic failover
Zero data loss (RPO = 0)
Protects: VM/rack/fault domain failures
Does NOT protect: region-wide outages
🌍
Cross-Region AG (this lab)
Async commit to remote replica
Near-zero data loss (RPO < 5s)
Protects: entire region failures
Manual failover with potential data loss

Architecture

Cross-region AlwaysOn AG lab: West Europe sync pair (sqlvm-ha1 primary + sqlvm-ha2 sync secondary) with North Europe async DR replica via global VNET peering

What Gets Deployed

ComponentResourceRegionPurpose
Primary AGsqlvm-ha1 + sqlvm-ha2West EuropeSync commit pair with automatic failover
DR Replicasqlvm-dr1 (10.1.1.10)North EuropeAsync commit for region-level DR
VNET PeeringGlobal peeringCross-regionConnects the two VNETs
DR VNETvnet-sql-dr (10.1.0.0/16)North EuropeNetwork for DR replica
NSGnsg-sql-drNorth EuropeFirewall for DR subnet

Prerequisites

  • Completed the AlwaysOn HA Setup lab — the primary 2-node AG must be running
  • Azure subscription with quota for 1 additional Standard_D2ads_v6 VM
  • Existing domain sqlha.local running on dc-vm

Lab Steps

Deploy DR Infrastructure
Create VNET in North Europe, deploy SQL VM, configure VNET peering
Domain-Join DR Replica
Configure DNS to use dc-vm, join sqlvm-dr1 to sqlha.local domain
Install Clustering & Add to WSFC
Install failover clustering feature, add node to existing cluster
Add Async Replica to AG
Add sqlvm-dr1 as async commit replica to the existing AG
Test Cross-Region Failover
Force manual failover to DR, measure data loss, failback to primary
Compare Sync vs Async Performance
Measure commit latency, replication lag, and transaction throughput

Step 1: Deploy DR Infrastructure

$drResourceGroup = "rg-sql-ha"       # Same RG or separate — your choice
$drLocation = "northeurope"
$primaryVnet = "vnet-sql-ha"
$primaryRG = "rg-sql-ha"

# Create DR VNET in North Europe
az network vnet create `
--resource-group $drResourceGroup `
--name vnet-sql-dr `
--location $drLocation `
--address-prefixes 10.1.0.0/16 `
--subnet-name dr-subnet `
--subnet-prefixes 10.1.1.0/24

# Create NSG for DR subnet
az network nsg create `
--resource-group $drResourceGroup `
--name nsg-sql-dr `
--location $drLocation

# Allow SQL, HADR mirroring, RDP, and WSFC
az network nsg rule create --resource-group $drResourceGroup --nsg-name nsg-sql-dr `
--name AllowSQL --priority 100 --direction Inbound --access Allow `
--protocol Tcp --destination-port-range 1433 --source-address-prefix VirtualNetwork
az network nsg rule create --resource-group $drResourceGroup --nsg-name nsg-sql-dr `
--name AllowHADR --priority 200 --direction Inbound --access Allow `
--protocol Tcp --destination-port-range 5022 --source-address-prefix VirtualNetwork
az network nsg rule create --resource-group $drResourceGroup --nsg-name nsg-sql-dr `
--name AllowRDP --priority 300 --direction Inbound --access Allow `
--protocol Tcp --destination-port-range 3389 --source-address-prefix '*'
az network nsg rule create --resource-group $drResourceGroup --nsg-name nsg-sql-dr `
--name AllowWSFC --priority 400 --direction Inbound --access Allow `
--protocol '*' --destination-port-ranges 135 5985 49152-65535 --source-address-prefix VirtualNetwork

# Associate NSG with DR subnet
az network vnet subnet update `
--resource-group $drResourceGroup `
--vnet-name vnet-sql-dr `
--name dr-subnet `
--network-security-group nsg-sql-dr

# Deploy DR SQL VM
az vm create `
--resource-group $drResourceGroup `
--name sqlvm-dr1 `
--image MicrosoftSQLServer:sql2022-ws2022:enterprise-gen2:latest `
--size Standard_D2ads_v6 `
--admin-username azureadmin `
--admin-password '<YOUR_PASSWORD>' `
--vnet-name vnet-sql-dr `
--subnet dr-subnet `
--private-ip-address 10.1.1.10 `
--public-ip-sku Standard `
--nsg nsg-sql-dr `
--location $drLocation

Step 2: Configure VNET Peering

# Get VNET resource IDs
$vnet1Id = az network vnet show --resource-group $primaryRG --name $primaryVnet --query id -o tsv
$vnet2Id = az network vnet show --resource-group $drResourceGroup --name vnet-sql-dr --query id -o tsv

# Create peering: Primary → DR
az network vnet peering create `
--resource-group $primaryRG `
--name peer-primary-to-dr `
--vnet-name $primaryVnet `
--remote-vnet $vnet2Id `
--allow-vnet-access `
--allow-forwarded-traffic

# Create peering: DR → Primary
az network vnet peering create `
--resource-group $drResourceGroup `
--name peer-dr-to-primary `
--vnet-name vnet-sql-dr `
--remote-vnet $vnet1Id `
--allow-vnet-access `
--allow-forwarded-traffic

# Verify peering status
az network vnet peering list --resource-group $primaryRG --vnet-name $primaryVnet -o table
⚠️ Watch Out

VNET peering must be created in both directions. The peering status should show Connected on both sides. Traffic won't flow until both sides are Connected.

Step 3: Configure DNS & Domain Join DR Replica

# Update DR VNET DNS to point to the primary DC
az network vnet update `
--resource-group $drResourceGroup `
--name vnet-sql-dr `
--dns-servers 10.0.1.4

# Restart DR VM to pick up new DNS
az vm restart --resource-group $drResourceGroup --name sqlvm-dr1

RDP into sqlvm-dr1 and run:

# Verify DNS resolution to domain controller
nslookup sqlha.local
# Should resolve to 10.0.1.4

# Domain join
Add-Computer -DomainName "sqlha.local" -Credential (Get-Credential SQLHA\azureadmin) -Restart

After reboot, verify:

# Verify domain membership
(Get-WmiObject Win32_ComputerSystem).Domain
# Expected: sqlha.local

Step 4: Install Failover Clustering & Add to WSFC

RDP into sqlvm-dr1:

# Install Failover Clustering feature
Install-WindowsFeature Failover-Clustering -IncludeManagementTools

# Restart
Restart-Computer -Force

After reboot, from any existing cluster node (e.g., sqlvm-ha1):

# Add DR node to existing cluster
Add-ClusterNode -Name "sqlvm-dr1" -Cluster "sqlha-cluster"

# Verify — should show 3 nodes
Get-ClusterNode | Format-Table Name, State, NodeWeight

# Verify quorum — with 3 nodes, File Share Witness gives 4 votes (odd)
Get-ClusterQuorum | Format-List
🎯 Exam Focus

With 3 nodes + 1 File Share Witness = 4 votes. Majority quorum requires 3/4 votes, so you can lose 1 node. If you add a 4th replica later, quorum math changes. Always verify with Get-ClusterQuorum.

Step 5: Add Async Commit Replica to AG

On the current PRIMARY (connect via SSMS):

-- Step 1: Enable Always On on the DR replica first
-- (RDP into sqlvm-dr1, run in elevated PowerShell)
-- Enable-SqlAlwaysOn -ServerInstance "sqlvm-dr1" -Force

-- Step 2: On PRIMARY — add the async replica
ALTER AVAILABILITY GROUP [sqlha-ag]
ADD REPLICA ON N'sqlvm-dr1'
WITH (
ENDPOINT_URL = N'TCP://sqlvm-dr1.sqlha.local:5022',
AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT, -- ← KEY DIFFERENCE
FAILOVER_MODE = MANUAL, -- ← Must be manual for async
SEEDING_MODE = AUTOMATIC, -- Automatic initial seeding
SECONDARY_ROLE (
ALLOW_CONNECTIONS = READ_ONLY, -- Read-only routing target
READ_ONLY_ROUTING_URL = N'TCP://sqlvm-dr1.sqlha.local:1433'
)
);

On sqlvm-dr1:

-- Join the replica to the AG
ALTER AVAILABILITY GROUP [sqlha-ag] JOIN;

-- Grant AG permission for automatic seeding
ALTER AVAILABILITY GROUP [sqlha-ag] GRANT CREATE ANY DATABASE;

Verify the replica is synchronizing:

-- On PRIMARY — check all replicas
SELECT
r.replica_server_name,
r.availability_mode_desc,
r.failover_mode_desc,
rs.role_desc,
rs.synchronization_health_desc,
rs.connected_state_desc
FROM sys.availability_replicas r
JOIN sys.dm_hadr_availability_replica_states rs
ON r.replica_id = rs.replica_id
ORDER BY r.replica_server_name;

Expected output:

replica_server_nameavailability_mode_descfailover_mode_descrole_descsync_health
sqlvm-ha1SYNCHRONOUS_COMMITAUTOMATICPRIMARYHEALTHY
sqlvm-ha2SYNCHRONOUS_COMMITAUTOMATICSECONDARYHEALTHY
sqlvm-dr1ASYNCHRONOUS_COMMITMANUALSECONDARYHEALTHY

Step 6: Test Cross-Region Failover

Measure the replication lag first (on PRIMARY):

-- Check send/redo queue sizes for each replica
SELECT
ag.name AS ag_name,
r.replica_server_name,
rs.role_desc,
drs.database_id,
drs.synchronization_state_desc,
drs.log_send_queue_size, -- KB pending send
drs.redo_queue_size, -- KB pending redo on secondary
drs.log_send_rate, -- KB/sec being sent
drs.redo_rate, -- KB/sec being applied
drs.last_commit_time,
DATEDIFF(SECOND, drs.last_commit_time, GETUTCDATE()) AS lag_seconds
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas r ON drs.replica_id = r.replica_id
JOIN sys.availability_groups ag ON r.group_id = ag.group_id
ORDER BY r.replica_server_name;

Manual failover to DR (forced, with potential data loss):

-- On sqlvm-dr1 — FORCE failover (accepts potential data loss)
ALTER AVAILABILITY GROUP [sqlha-ag] FORCE_FAILOVER_ALLOW_DATA_LOSS;
⚠️ Watch Out

FORCE_FAILOVER_ALLOW_DATA_LOSS is the only way to failover to an async replica. Any transactions not yet replicated are lost. After forced failover, the old primary will be in RESOLVING state and needs to be manually resumed.

After testing, failback to primary region:

-- On the old primary (sqlvm-ha1) — resume after forced failover
ALTER DATABASE [YourDB] SET HADR RESUME;

-- On sqlvm-dr1 (current primary) — failover back to original primary
-- First, wait for sync to catch up, then:
ALTER AVAILABILITY GROUP [sqlha-ag] FAILOVER; -- Only works if target is sync commit

-- OR if sqlvm-ha1 is still async, force again:
ALTER AVAILABILITY GROUP [sqlha-ag] FORCE_FAILOVER_ALLOW_DATA_LOSS;

Step 7: Compare Sync vs Async Performance

Run this workload test on the PRIMARY to see the commit latency difference:

-- Create test table
CREATE TABLE dbo.LatencyTest (
Id INT IDENTITY PRIMARY KEY,
Payload NVARCHAR(200),
InsertedAt DATETIME2 DEFAULT SYSUTCDATETIME()
);

-- Measure insert throughput (single-row inserts to show commit latency)
DECLARE @start DATETIME2 = SYSUTCDATETIME();
DECLARE @i INT = 0;
WHILE @i < 1000
BEGIN
INSERT INTO dbo.LatencyTest (Payload)
VALUES (REPLICATE('X', 200));
SET @i += 1;
END
SELECT DATEDIFF(MILLISECOND, @start, SYSUTCDATETIME()) AS total_ms,
DATEDIFF(MILLISECOND, @start, SYSUTCDATETIME()) / 1000.0 AS avg_ms_per_insert;

Compare with the replica lag:

-- Real-time view of replication latency
SELECT
r.replica_server_name,
r.availability_mode_desc,
drs.log_send_queue_size AS send_queue_kb,
drs.redo_queue_size AS redo_queue_kb,
CASE
WHEN drs.last_commit_time IS NULL THEN 'N/A'
ELSE CAST(DATEDIFF(MILLISECOND, drs.last_commit_time, GETUTCDATE()) AS VARCHAR) + ' ms'
END AS replication_lag
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas r ON drs.replica_id = r.replica_id
WHERE drs.is_local = 0;

Key Learnings: Same Region vs Cross Region

AspectSame Region (Sync)Cross Region (Async)
Commit modeSynchronousAsynchronous
Failover modeAutomaticManual only
RPO0 (zero data loss)> 0 (depends on lag)
RTO~10-30 secondsManual action + ~30 sec
Network latency~1ms~15-30ms (EU regions)
Impact on commitsSlight — must wait for remote hardenNone — async fire and forget
Use caseHA within a regionDR across regions
Can an async commit replica use automatic failover?
Click to reveal answer
No. Async replicas can only use MANUAL failover (with FORCE_FAILOVER_ALLOW_DATA_LOSS). Automatic failover requires synchronous commit because the system must guarantee zero data loss.
1 / 3

Common Failures

SymptomLikely CauseFix
Cross-region replica stuck SYNCHRONIZING but high latencyWAN bandwidth bottleneck on log block shipConfirm async commit; size ExpressRoute / VPN; monitor log_send_queue_size
Forced failover lost transactionsAsync secondary was behind by N MB of logDocument RPO at this lag level; alert when log_send_queue_size_kb > threshold
AG won't fail back: "database not in valid state"Old primary was force-failed; needs RESUMEALTER DATABASE [DB] SET HADR RESUME on old primary
Listener returns wrong IP after region failoverDNS TTL too long; client cached old IPLower TTL on listener DNS to 30s; add MultiSubnetFailover=True to connection string
Cluster split-brain after WAN partitionNo witness or witness in same partitionPlace Cloud Witness in a 3rd region
⚠️ Watch Out

Async commit + automatic failover = forbidden combo. SQL won't let you set both. Auto-failover requires sync commit, which requires acceptable WAN latency (< 10ms ideally). Cross-region is almost always async + manual failover.

Variations to Try

  • Add a second async replica in a 3rd region for additional DR redundancy
  • Convert to distributed AG to chain two AGs across regions (no single-cluster span)
  • Use Read-only routing so cross-region reporting workloads land on the DR replica
  • Place backups on async secondary to keep them out of the primary region
  • Test forced failover quarterly, document RPO/RTO observed

Production Application

  • DR for tier-1 OLTP when PaaS Failover Group isn't viable (legacy SQL features, CLR, agent jobs).
  • Regulated workload requiring cross-region recovery without leaving IaaS (custom encryption, third-party agents).
  • Active-passive cross-region — keep DR cold-warm via async; pay only for VM compute, not duplicated PaaS S-tier.
  • Anti-pattern: don't pick this for sub-second cross-region failover — use a distributed system (Cosmos DB, etc.) at the data tier instead.

Clean Up

# Remove the DR replica from AG (on current PRIMARY)
# ALTER AVAILABILITY GROUP [sqlha-ag] REMOVE REPLICA ON N'sqlvm-dr1';

# Remove DR node from cluster
# Remove-ClusterNode -Name "sqlvm-dr1" -Force

# Delete DR VM
az vm delete --resource-group rg-sql-ha --name sqlvm-dr1 --yes

# Remove peering
az network vnet peering delete --resource-group rg-sql-ha --name peer-primary-to-dr --vnet-name vnet-sql-ha
az network vnet peering delete --resource-group rg-sql-ha --name peer-dr-to-primary --vnet-name vnet-sql-dr

# Delete DR VNET and NSG
az network vnet delete --resource-group rg-sql-ha --name vnet-sql-dr
az network nsg delete --resource-group rg-sql-ha --name nsg-sql-dr

What's Next?

DirectionLab
Try managed HA instead of VMsSQL MI Flavors & Replicas
Compare PaaS DR optionsGeo-Replication & Failover
Try all migration pathsComplete Migration Scenarios