
High Availability & Disaster Recovery

The HA/DR Architecture Map

  • 🏢 Local HA (Built-in): GP = Service Fabric + remote storage; BC = 4-node Always On AG with local SSD. Automatic.
  • 🌍 Cross-Region DR: Failover Groups (auto failover + single endpoint) or Active Geo-Replication (manual, up to 4 replicas).
  • 💾 Backup & Restore: PITR (7-35 days), LTR (up to 10 years), Geo-restore from GRS backup.
🏢 Real-World DBA Note

Oracle DBA parallel: Local HA = RAC. Cross-Region DR = Data Guard. The big difference: in Azure, local HA is built into the service tier, so you don't configure it. You configure only DR (Failover Groups or Geo-Replication).

Backup & Restore — The Safety Net

For a comprehensive deep dive into backup, restore, PITR, and LTR, see the dedicated page: Backup, Restore, PITR & LTR.

Automated Backups (Zero Configuration)

| Backup Type | Frequency | You Configure? |
|---|---|---|
| Full | Weekly | ❌ Automatic |
| Differential | Every 12-24 hours | ❌ Automatic |
| Transaction log | Every 5-10 minutes | ❌ Automatic |

| Setting | Range | Default |
|---|---|---|
| Short-term retention (PITR) | 7-35 days | 7 days |
| Long-term retention (LTR) | Up to 10 years | Not configured |
| Backup storage redundancy | LRS / ZRS / GRS / RA-GRS | GRS |
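
The three backup types combine into a restore chain: the latest full backup, the latest differential taken after it, then the log backups up to the target time. The service builds this chain for you; the sketch below only illustrates the idea. The function name `restore_chain` and the schedule (one full, 12-hourly differentials, 10-minute logs) are assumptions made up for the example.

```python
from datetime import datetime, timedelta


def restore_chain(target, fulls, diffs, logs):
    """Pick the backups needed to reach `target`: one full, at most one
    differential, then every log backup up to the one covering `target`."""
    usable_fulls = [f for f in fulls if f <= target]
    if not usable_fulls:
        raise ValueError("no full backup at or before the target time")
    full = max(usable_fulls)

    # Only a differential taken after the chosen full (and before target) helps.
    usable_diffs = [d for d in diffs if full < d <= target]
    diff = max(usable_diffs) if usable_diffs else None

    start = diff if diff is not None else full
    needed_logs = []
    for log in sorted(l for l in logs if l > start):
        needed_logs.append(log)
        if log >= target:  # this log backup contains the target moment
            break
    return full, diff, needed_logs


# Hypothetical schedule: one weekly full, 12-hourly differentials, 10-minute logs.
t0 = datetime(2024, 1, 7)
fulls = [t0]
diffs = [t0 + timedelta(hours=12 * i) for i in range(1, 4)]
logs = [t0 + timedelta(minutes=10 * i) for i in range(1, 300)]

full, diff, needed = restore_chain(
    t0 + timedelta(hours=38, minutes=25), fulls, diffs, logs
)
print(full, diff, len(needed))
```

The shorter the gap between log backups, the less work is lost if the primary disappears between two of them, which is why the log frequency drives the RPO of a backup-based restore.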
🎯 Exam Focus

Backup frequencies are exam favorites: Full = weekly, Diff = 12-24h, Log = 5-10 min. Point-in-time restore creates a new database — it NEVER overwrites the existing one. This catches many exam takers.
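
A minimal sketch of the "PITR creates a new database" rule, assuming a hypothetical helper `plan_pitr` and an illustrative naming pattern (neither is part of any Azure SDK). It checks that the requested time falls inside the retention window and always returns a new database name instead of the source name.

```python
from datetime import datetime, timedelta, timezone


def plan_pitr(source_db, target_time, now, retention_days=7):
    """Validate a point-in-time restore request and name the NEW database."""
    earliest = now - timedelta(days=retention_days)
    if not earliest <= target_time <= now:
        raise ValueError(
            f"target {target_time.isoformat()} is outside the "
            f"{retention_days}-day retention window"
        )
    # PITR never overwrites the source, so the plan always carries a new name.
    # The naming pattern is an illustrative convention, not a service rule.
    new_name = f"{source_db}_restored_{target_time:%Y%m%dT%H%M}"
    return {
        "source": source_db,
        "new_database": new_name,
        "restore_point": target_time,
    }


now = datetime(2024, 5, 10, 12, 0, tzinfo=timezone.utc)
plan = plan_pitr("sales", now - timedelta(hours=3), now)
print(plan["new_database"])
```

After the restore completes, swapping the application over to the restored copy (for example by renaming databases) is a separate, deliberate step.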

Restore Options

| Scenario | Method | RPO |
|---|---|---|
| "Restore to 3 hours ago" | PITR (point-in-time restore) | Seconds (within retention) |
| "Restore deleted database" | Deleted database restore | To deletion time |
| "Keep backup for 5 years" | Long-Term Retention (LTR) | Weekly/monthly/yearly snapshots |
| "Restore to different region" | Geo-restore from GRS backup | Up to 1 hour lag |

Backup Storage Redundancy — Architecture

  • 💾 LRS: 3 copies in one datacenter. Lowest cost. No cross-region protection.
  • 🏢 ZRS: 3 copies across availability zones. Protects against zone failures.
  • 🌍 GRS: 6 copies: 3 local + 3 in paired region. Default. Cross-region protection.
  • 🔄 RA-GRS: GRS + read access to secondary region copies. Highest availability.
⚠️ Watch Out

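Backup storage redundancy is easiest to get right at creation time. Some offerings (Hyperscale, for example) accept the setting only at creation. Where a later change is allowed, it applies only to backups taken after the change, so existing backups keep their old redundancy. Choose wisely based on compliance requirements.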
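
The four options can be captured as a small lookup. This is a sketch under stated assumptions: the property flags mirror only what the descriptions above say, the cheapest-first ordering is an assumption, and `choose_redundancy` is a hypothetical helper, not an Azure API.

```python
# Properties as described above; cheapest-first order is an assumption.
REDUNDANCY = [
    ("LRS",    {"copies": 3, "zone": False, "region": False, "read_secondary": False}),
    ("ZRS",    {"copies": 3, "zone": True,  "region": False, "read_secondary": False}),
    ("GRS",    {"copies": 6, "zone": False, "region": True,  "read_secondary": False}),
    ("RA-GRS", {"copies": 6, "zone": False, "region": True,  "read_secondary": True}),
]


def choose_redundancy(zone=False, region=False, read_secondary=False):
    """Return the first listed option that meets every requested property,
    or None when none of the four options covers the combination."""
    wanted = {"zone": zone, "region": region, "read_secondary": read_secondary}
    for name, props in REDUNDANCY:
        if all(props[key] for key, required in wanted.items() if required):
            return name
    return None


print(choose_redundancy(region=True))
print(choose_redundancy(zone=True, region=True))
```

A `None` result means the requirement needs something beyond the four options listed here, which is worth discovering before the database is created.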

Active Geo-Replication

Active Geo-Replication Architecture

Asynchronous replication to up to 4 readable secondaries in any Azure region.

| Aspect | Detail |
|---|---|
| Max secondaries | 4 (any region) |
| Replication | Asynchronous |
| Readable? | ✅ Yes (offload reporting) |
| Auto failover? | ❌ No (manual only) |
| Endpoint | Each secondary has its own connection string |
| Scope | Per-database |

When to use: You need more than 1 secondary, or you want readable replicas in multiple regions for read locality.

For a detailed side-by-side comparison, see Failover Groups vs Geo-Replication.

Auto-Failover Groups

Automatic failover with a single connection endpoint that doesn't change.

Architecture

Auto-Failover Groups Architecture
  • ✉️ Single Endpoint: fog-name.database.windows.net never changes on failover. Apps need no connection string update.
  • Automatic Failover: Grace period (default 60 min) before auto-failover. Manual failover also available.
  • 📚 Read-Only Endpoint: fog-name.secondary.database.windows.net routes to a readable secondary for reporting.

Connection endpoints (they NEVER change):

  • Read-Write: fog-name.database.windows.net
  • Read-Only: fog-name.secondary.database.windows.net
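
As an illustration of why the application never needs a connection string change, the sketch below derives both listener hostnames from the failover group name alone. The helper names and the database name `appdb` are hypothetical, authentication settings are deliberately left out, and the ODBC keywords shown (`Encrypt`, `ApplicationIntent`) are the ones commonly used with the Microsoft ODBC driver; check them against your own driver version.

```python
def fog_endpoints(fog_name, dns_suffix="database.windows.net"):
    """Build the two listener hostnames for a failover group."""
    return {
        "read_write": f"{fog_name}.{dns_suffix}",
        "read_only": f"{fog_name}.secondary.{dns_suffix}",
    }


def odbc_connection_string(host, database, read_only=False):
    """Assemble an ODBC-style connection string (no credentials included)."""
    parts = [
        "Driver={ODBC Driver 18 for SQL Server}",
        f"Server=tcp:{host},1433",
        f"Database={database}",
        "Encrypt=yes",
    ]
    if read_only:
        # Declares read-only intent for reporting connections.
        parts.append("ApplicationIntent=ReadOnly")
    return ";".join(parts) + ";"


endpoints = fog_endpoints("fog-name")
print(odbc_connection_string(endpoints["read_write"], "appdb"))
print(odbc_connection_string(endpoints["read_only"], "appdb", read_only=True))
```

Because both hostnames depend only on the group name, the same configuration keeps working whichever region currently hosts the primary.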
🎯 Exam Focus

Failover Groups vs Geo-Replication — the key exam differences:

  1. Failover Groups = auto failover + single endpoint. Geo-Rep = manual failover + per-replica endpoint.
  2. Failover Groups can include multiple databases. Geo-Rep is per-database.
  3. Failover Groups have a grace period before auto-failover triggers (default 60 min).
  4. For MI: Failover Groups replicate ALL user databases (all or nothing).

Comparison Table

| Feature | Active Geo-Rep | Failover Group | Always On AG (VM) |
|---|---|---|---|
| Scope | Azure SQL DB | Azure SQL DB + MI | SQL Server on VM |
| Max replicas | 4 | 1 partner | Up to 9 |
| Auto failover | ❌ | ✅ | ✅ (within cluster) |
| Single endpoint | ❌ | ✅ | ✅ (Listener) |
| Cross-region | ✅ | ✅ | ✅ |
| Read-only routing | | ✅ | ✅ |
| Grace period | N/A | Configurable (default 60 min) | N/A (cluster quorum) |

RTO and RPO — The Business Conversation

| Solution | RPO (data loss) | RTO (downtime) |
|---|---|---|
| PITR (backup restore) | 5-10 min | Hours (depends on DB size) |
| Geo-Restore (GRS backup) | Up to 1 hour | Hours |
| Active Geo-Replication | < 5 seconds | 30 seconds (manual trigger) |
| Failover Groups (auto) | < 5 seconds | Grace period + ~30 seconds |
| BC tier read replica | 0 (synchronous) | < 10 seconds |
| Always On AG (sync) | 0 (synchronous) | < 30 seconds |
🏢 Real-World DBA Note

Cost vs. RTO/RPO tradeoff: Automated backups (free) give you hours of RTO. Geo-Replication/Failover Groups (extra cost for secondary) give you seconds of RPO and ~30s RTO. Business Critical tier gives you near-zero downtime for local failures. Choose based on business requirements, not maximum capability.
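
The business conversation can be sketched as a filter over the table. The RPO figures are the table's upper bounds converted to seconds; the "survives region loss" flags are assumptions drawn from the rest of this page (PITR and BC replicas stay in one region, and the synchronous AG is assumed to be local). `candidates` and `fog_auto_rto_seconds` are hypothetical helper names.

```python
# (solution, worst-case RPO in seconds, RTO as quoted, survives region loss)
SOLUTIONS = [
    ("PITR (backup restore)", 600, "Hours (depends on DB size)", False),
    ("Geo-Restore (GRS backup)", 3600, "Hours", True),
    ("Active Geo-Replication", 5, "30 seconds (manual trigger)", True),
    ("Failover Groups (auto)", 5, "Grace period + ~30 seconds", True),
    ("BC tier read replica", 0, "< 10 seconds", False),
    ("Always On AG (sync)", 0, "< 30 seconds", False),  # assumed single-region
]


def candidates(max_rpo_seconds, survive_region_loss=False):
    """List the solutions whose RPO bound fits the business requirement."""
    return [
        name
        for name, rpo, _rto, cross_region in SOLUTIONS
        if rpo <= max_rpo_seconds and (cross_region or not survive_region_loss)
    ]


def fog_auto_rto_seconds(grace_period_minutes=60, failover_seconds=30):
    """Worst-case automatic failover time: grace period plus the failover."""
    return grace_period_minutes * 60 + failover_seconds


print(candidates(max_rpo_seconds=60, survive_region_loss=True))
print(fog_auto_rto_seconds())
```

With the default 60-minute grace period, an unattended Failover Group takes about an hour to act, so a tighter RTO depends on someone triggering the failover manually.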


Anti-Patterns

  • "PITR is our DR." PITR restores into the same region. A regional outage takes the backups with it (unless GRS / geo-restore is on). PITR is recovery-from-mistake, not recovery-from-region-loss.
  • "Geo-Replication = HA." Geo-Rep is async (RPO < 5 s but non-zero) and manual failover only. If the app has to fail over by itself, you need Failover Groups, not raw Geo-Rep.
  • "Set the grace period to 0 minutes for fastest failover." Grace period = 0 means flapping triggers cross-region failover. The recommended floor is 60 minutes for production. Faster RTO is what manual failover is for.
  • "BC tier replicas are a DR strategy." BC's 4 local replicas are zone HA, all in one region. They do not survive a region-wide outage. Pair BC with a Failover Group for DR.
  • "Use Always On AG on Azure VM the same way as on-prem." Azure-specific: you need Azure Load Balancer (or DNN listener), Standard SKU LB rules, floating IP, and a witness in a 3rd zone or Cloud Witness in Storage. Skipping any of these breaks failover silently.
  • "Test DR by restoring backups in a runbook." Real DR drill = trigger a forced failover on a Failover Group during business hours. If you only do planned failovers, you've never actually tested the unplanned path.
⚠️ Watch Out

Failover Group grace period is set per failover group, not per database. All databases in the FOG fail over together after the grace period elapses. You cannot fail over one DB while keeping the others on the primary.
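
Some of the anti-patterns above can be checked mechanically. The sketch below is a hypothetical lint, not an Azure tool: the dictionary keys (`needs_region_dr`, `failover_group`, `grace_period_minutes`, and so on) are invented for the example, and it covers only three of the listed mistakes.

```python
GEO_BACKUP = {"GRS", "RA-GRS"}


def lint_posture(posture):
    """Return a list of warnings for a described HA/DR posture."""
    findings = []
    fog = posture.get("failover_group", False)
    geo_rep = posture.get("geo_replication", False)
    geo_backups = posture.get("backup_redundancy") in GEO_BACKUP

    # "PITR is our DR" / "BC replicas are a DR strategy"
    if posture.get("needs_region_dr") and not (fog or geo_rep or geo_backups):
        findings.append(
            "no cross-region copy: PITR and local replicas stay in one region"
        )
    # "Geo-Replication = HA"
    if posture.get("needs_auto_failover") and not fog:
        findings.append(
            "auto failover needs a Failover Group; Geo-Replication is manual"
        )
    # "Set the grace period to 0 minutes"
    if fog and posture.get("grace_period_minutes", 60) < 60:
        findings.append(
            "grace period under 60 minutes risks failover on transient blips"
        )
    return findings


print(lint_posture({"needs_region_dr": True, "tier": "BC", "backup_redundancy": "LRS"}))
```

A check like this fits in a deployment review, but it cannot replace the forced-failover drill described in the last anti-pattern.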


Migration Between HA/DR Postures

| From → To | Path | Cost |
|---|---|---|
| PITR-only → + Geo-Restore | Switch backup storage to GRS | Free; takes effect on next backup cycle |
| Geo-Restore → Active Geo-Replication | Provision secondary; seed copy | Doubles compute cost; seeding online |
| Active Geo-Rep → Failover Group | Wrap existing replica into a FOG | Online; no data move; gain auto-failover endpoint |
| GP → BC (zone HA) | Online scale within vCore | ~5–10 s disconnect at swap; 4 local replicas appear |
| Single region BC → BC + FOG | Add secondary in paired region | Doubles compute; one-time seed |
| VM with no AG → AG with sync replica | Add 2nd VM, configure WSFC + AG + LB | Manual; cluster build is the work, not the DB seed |
| Premium DTU read-scale → BC vCore replica | Reprovision to vCore BC | Brief disconnect; explicit replica endpoint becomes available |

Most upgrades are additive (provision a secondary, scale up a tier) — cheap and online. The only expensive step is VM AG because the work is cluster + LB, not the database itself.
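
Because the upgrades are additive, the table rows can be read as edges in a small graph. The sketch below models only the first five rows, with shortened posture labels chosen for the example, and uses a breadth-first search to list the steps between two postures. Downgrades are not modeled.

```python
from collections import deque

# Each key can be upgraded directly to the postures in its list.
UPGRADES = {
    "PITR-only": ["Geo-Restore"],
    "Geo-Restore": ["Active Geo-Rep"],
    "Active Geo-Rep": ["Failover Group"],
    "GP": ["BC"],
    "BC": ["BC + FOG"],
}


def upgrade_path(start, goal):
    """Breadth-first search for the shortest chain of upgrade steps."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in UPGRADES.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no modeled route between the two postures


print(upgrade_path("PITR-only", "Failover Group"))
```

Each hop in the returned list corresponds to one row of the table, so the per-step cost column can be read off in order.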


Real Scenarios

  1. SaaS with 99.99 % SLA, single region OK: GP + zone-redundant config. Driver: SLA met without DR cost. Trade-off: regional outage = downtime, accepted contractually.
  2. Banking app, 99.995 % + cross-region failover: BC + Failover Group, paired region. Driver: zero local RPO + auto-failover endpoint. Trade-off: 2× compute cost; reads from secondary forbidden by compliance.
  3. Reporting workload that can tolerate stale data: GP + Active Geo-Replication, secondary used for reads. Driver: offload reads, no FOG needed because failover is manual. Trade-off: app must know about two endpoints.
  4. SQL VM running ISV software, no PaaS option: Always On AG, 2 sync nodes in zones + 1 async DR node in paired region, ALB DNN listener. Driver: ISV certification. Trade-off: customer owns the cluster + LB + witness lifecycle.
  5. Dev/test, recoverability matters but cost doesn't: PITR with 7-day retention, no Geo-Restore. Driver: recover from accidental DROP TABLE. Trade-off: regional outage = restore from scratch in another region from a separate copy.

Flashcards

Q: How often are transaction log backups taken in Azure SQL DB?
A: Every 5-10 minutes, automatically.

Quiz

A database needs automatic cross-region failover with a connection string that doesn't change. What should you use?