Lab: Active Geo-Replication & Auto-Failover Groups
Deploy an Azure SQL Database with active geo-replication across two regions, configure an auto-failover group with a grace period, trigger failover, and measure real-world RPO/RTO.
DP-300 Domain: 5 — Plan and Configure HA/DR (20-25%)
Architecture
Geo-Replication vs Failover Groups
What Gets Deployed
| Component | Resource | Region |
|---|---|---|
| Primary SQL Server | sql-geo-primary-<random> | West Europe |
| Secondary SQL Server | sql-geo-secondary-<random> | North Europe |
| Database | GeoLabDB (S1 tier) | Both (replicated) |
| Failover Group | fg-geolab-<random> | Cross-region |
Lab Steps
Step 1: Deploy Primary Database
$resourceGroup = "rg-geo-lab"
$primaryLocation = "westeurope"
$secondaryLocation = "northeurope"
$suffix = Get-Random -Maximum 9999
$primaryServer = "sql-geo-primary-$suffix"
$secondaryServer = "sql-geo-secondary-$suffix"
$failoverGroup = "fg-geolab-$suffix"
$adminPassword = "<YOUR_STRONG_PASSWORD>"
# Create resource group
az group create --name $resourceGroup --location $primaryLocation
# Create primary SQL server
az sql server create `
--resource-group $resourceGroup `
--name $primaryServer `
--admin-user sqladmin `
--admin-password $adminPassword `
--location $primaryLocation
# Allow Azure services (for cross-region replication)
az sql server firewall-rule create `
--resource-group $resourceGroup `
--server $primaryServer `
--name AllowAzureServices `
--start-ip-address 0.0.0.0 `
--end-ip-address 0.0.0.0
# Allow your IP
$myIp = (Invoke-RestMethod -Uri https://api.ipify.org)
az sql server firewall-rule create `
--resource-group $resourceGroup `
--server $primaryServer `
--name AllowMyIP `
--start-ip-address $myIp `
--end-ip-address $myIp
# Create database with sample data
az sql db create `
--resource-group $resourceGroup `
--server $primaryServer `
--name GeoLabDB `
--service-objective S1 `
--sample-name AdventureWorksLT
Step 2: Create Geo-Replica
# Create secondary SQL server in another region
az sql server create `
--resource-group $resourceGroup `
--name $secondaryServer `
--admin-user sqladmin `
--admin-password $adminPassword `
--location $secondaryLocation
# Allow Azure services on secondary
az sql server firewall-rule create `
--resource-group $resourceGroup `
--server $secondaryServer `
--name AllowAzureServices `
--start-ip-address 0.0.0.0 `
--end-ip-address 0.0.0.0
# Allow your IP on secondary
az sql server firewall-rule create `
--resource-group $resourceGroup `
--server $secondaryServer `
--name AllowMyIP `
--start-ip-address $myIp `
--end-ip-address $myIp
# Create active geo-replication link
az sql db replica create `
--resource-group $resourceGroup `
--server $primaryServer `
--name GeoLabDB `
--partner-server $secondaryServer `
--partner-resource-group $resourceGroup `
--secondary-type Geo
Verify replication is active:
az sql db replica list-links `
--resource-group $resourceGroup `
--server $primaryServer `
--name GeoLabDB `
--query "[].{Role:role, State:replicationState, Partner:partnerServer}" `
-o table
Expected:
Role State Partner
-------- ----------- ----------------------
Primary CATCH_UP sql-geo-secondary-1234
CATCH_UP means the secondary is still syncing. Wait until it shows SEEDING → CATCH_UP → the replica is usable once it shows up in the portal as "Online". For a small S1 database, this takes 1-3 minutes.
Step 3: Configure Auto-Failover Group
# Create failover group
az sql failover-group create `
--resource-group $resourceGroup `
--server $primaryServer `
--name $failoverGroup `
--partner-server $secondaryServer `
--partner-resource-group $resourceGroup `
--failover-policy Automatic `
--grace-period 60
# Check failover group status
az sql failover-group show `
--resource-group $resourceGroup `
--server $primaryServer `
--name $failoverGroup `
--query "{Name:name, Role:replicationRole, GracePeriod:readWriteEndpoint.failoverPolicy}" `
-o table
Connection strings (use these in applications):
| Endpoint | DNS Name | Purpose |
|---|---|---|
| Read-Write | fg-geolab-<suffix>.database.windows.net | Routes to current primary |
| Read-Only | fg-geolab-<suffix>.secondary.database.windows.net | Routes to secondary replica |
Step 4: Test Manual Failover
# Insert a timestamp row on primary (for RPO measurement)
$timestamp = Get-Date -Format "yyyy-MM-dd HH:mm:ss"
Write-Host "Pre-failover timestamp: $timestamp"
# Trigger planned failover (no data loss)
az sql failover-group set-primary `
--resource-group $resourceGroup `
--server $secondaryServer `
--name $failoverGroup
# Verify roles swapped
az sql failover-group show `
--resource-group $resourceGroup `
--server $secondaryServer `
--name $failoverGroup `
--query "{Name:name, Role:replicationRole}" `
-o table
Expected: The secondary server is now Primary.
Step 5: Test Read-Only Routing
-- Connect to the read-only endpoint:
-- fg-geolab-<suffix>.secondary.database.windows.net
-- This query should work (reads are allowed)
SELECT TOP 10 CustomerID, CompanyName
FROM SalesLT.Customer
ORDER BY CustomerID;
-- This query should FAIL (writes are blocked on read-only replica)
INSERT INTO SalesLT.Customer (FirstName, LastName, CompanyName)
VALUES ('Test', 'User', 'Failover Test Inc.');
-- Error: Failed to update database because the database is read-only.
Use read-only routing to offload reporting queries from the primary. In customer scenarios, BI dashboards connect to the secondary endpoint, keeping the primary available for transactional workloads.
Step 6: Fail Back to Original Primary
# Fail back to original primary
az sql failover-group set-primary `
--resource-group $resourceGroup `
--server $primaryServer `
--name $failoverGroup
# Verify we're back
az sql failover-group show `
--resource-group $resourceGroup `
--server $primaryServer `
--name $failoverGroup `
--query "{Primary:replicationRole, GracePeriod:readWriteEndpoint.failoverWithDataLossGracePeriod}" `
-o table
Measuring RPO and RTO
| Metric | Planned Failover | Forced Failover |
|---|---|---|
| RPO | 0 (no data loss) | Depends on replication lag (typically < 5 sec) |
| RTO | ~30-60 seconds | Grace period + recovery (60 min default → configurable) |
| DNS TTL | 30 seconds | Application reconnects to same listener URL |
-- Check replication lag (run on primary)
SELECT
ag.partner_server,
ag.replication_state_desc,
ag.last_replication,
DATEDIFF(SECOND, ag.last_replication, GETUTCDATE()) AS lag_seconds
FROM sys.dm_geo_replication_link_status ag;
Verification Checklist
| Check | Expected | Command |
|---|---|---|
| Failover group exists | fg-geolab-* | az sql failover-group list ... |
| Read-write routes to primary | Queries succeed | Connect to .database.windows.net |
| Read-only routes to secondary | Reads succeed, writes fail | Connect to .secondary.database.windows.net |
| Planned failover completes | Roles swap, 0 data loss | az sql failover-group set-primary |
| Failback completes | Original roles restored | Same command, original server |
Common Failures
| Symptom | Likely Cause | Fix |
|---|---|---|
| Failover Group create fails: "server not empty on secondary region" | DB with same name exists on target server | Drop or rename target DB; FOG requires empty target |
| Geo-secondary stuck CATCH_UP forever | Initial seed still copying (large DB) | Wait — ~150 GB/hr typical; monitor sys.dm_geo_replication_link_status |
| App connects to read-only listener and gets read-write errors | Connection string uses RW listener for read traffic | Split: use <fog>.database.windows.net (RW) and <fog>.secondary.database.windows.net (RO) |
| Auto-failover never triggers despite outage | Grace period (default 1h) not yet elapsed | Tune --failover-policy Automatic --grace-period-with-data-loss-hours 1 |
| Forced failover loses transactions | Async replication lag at moment of failover | Track lag via replication_lag_sec; tune SLA |
Failover Group requires same service tier on both servers. Mixing General Purpose primary + Hyperscale secondary will reject FOG creation. Match tiers, then upgrade in lockstep.
Variations to Try
- Test planned failover vs forced failover — observe difference in
replication_lag_sec - Add a 3rd geo-secondary via active geo-replication (FOG supports up to 4 secondaries since 2024)
- Wire Action Group + alert on
replication_lag_sec > 60s - Drop the auto-failover policy temporarily and use manual failover only for compliance scenarios
- Pair FOG with LTR geo-redundant for layered DR + retention
Production Application
- PaaS SaaS app, multi-region active-passive, RTO < 60 sec → FOG with auto-failover, grace period 1h.
- Compliance-bound: explicit operator-driven failover only → FOG with manual failover policy.
- Multi-tenant SaaS, per-tenant database → FOG covers all DBs on the server in one flip.
- Anti-pattern: don't use FOG when you need sub-second RPO across regions — SQL is async cross-region. Use app-level dual-write or eventual consistency designs.
Clean Up
az group delete --name rg-geo-lab --yes --no-wait