Lab: SQL Server Always On Availability Groups on Azure VMs
Complete, production-ready guide — deploy a 2-node SQL Server 2022 Always On AG on Azure VMs from scratch, with automatic failover, zero data loss, Internal Load Balancer, and a real-time failover testing web app.
DP-300 Domain: 5 — Plan and Configure HA/DR (20-25%)
Architecture
What Gets Deployed
| Component | Resource | Purpose |
|---|---|---|
| VNET | vnet-sql-ha (10.0.0.0/16) | Network isolation |
| NSG | nsg-sql-ha | Firewall rules for RDP, SQL, HADR |
| Domain Controller | dc-vm (10.0.1.4) | AD DS + DNS + File Share Witness |
| SQL VM 1 | sqlvm-ha1 (10.0.1.10) | SQL Server 2022 Enterprise replica |
| SQL VM 2 | sqlvm-ha2 (10.0.1.11) | SQL Server 2022 Enterprise replica |
| Load Balancer | ilb-sqlha (10.0.1.30) | Routes SQL traffic to AG primary |
| Flask App | dc-vm:8080 | Real-time failover testing dashboard |
Lab Steps (7 Parts)
Step 1: Deploy Infrastructure
# Variables
$resourceGroup = "rg-sql-ha"
$location = "westeurope"
# Create resource group
az group create --name $resourceGroup --location $location
# Deploy infrastructure (2 SQL VMs + networking)
az deployment group create `
--resource-group $resourceGroup `
--template-file sql-ha-vms.bicep `
--parameters location=$location adminUsername=azureadmin adminPassword='<YOUR_PASSWORD>'
# Deploy Domain Controller VM
az vm create `
--resource-group $resourceGroup `
--name dc-vm `
--image Win2022Datacenter `
--size Standard_D2ads_v6 `
--admin-username azureadmin `
--admin-password '<YOUR_PASSWORD>' `
--vnet-name vnet-sql-ha `
--subnet sql-subnet `
--private-ip-address 10.0.1.4 `
--public-ip-sku Standard `
--nsg nsg-sql-ha
Step 2: Install Active Directory
RDP into dc-vm and run:
# Install AD DS
Install-WindowsFeature AD-Domain-Services -IncludeManagementTools
# Promote to Domain Controller
Install-ADDSForest `
-DomainName "sqlha.local" `
-DomainNetBIOSName "SQLHA" `
-InstallDns `
-SafeModeAdministratorPassword (ConvertTo-SecureString "<DSRM_PASSWORD>" -AsPlainText -Force) `
-Force
Step 3: Configure DNS & Domain Join
# Update VNET DNS to point to DC
az network vnet update `
--resource-group rg-sql-ha `
--name vnet-sql-ha `
--dns-servers 10.0.1.4
# Restart SQL VMs to pick up new DNS
az vm restart --resource-group rg-sql-ha --name sqlvm-ha1
az vm restart --resource-group rg-sql-ha --name sqlvm-ha2
On each SQL VM:
# Join domain
Add-Computer -DomainName "sqlha.local" -Credential (Get-Credential SQLHA\azureadmin) -Restart
Step 4: Install Failover Clustering
On both SQL VMs:
Install-WindowsFeature Failover-Clustering -IncludeManagementTools
Step 5: Create Cluster
From either SQL VM:
# Validate cluster prerequisites
Test-Cluster -Node sqlvm-ha1,sqlvm-ha2
# Create cluster (no shared storage in Azure)
New-Cluster -Name sqlha-cluster `
-Node sqlvm-ha1,sqlvm-ha2 `
-StaticAddress 10.0.1.20 `
-NoStorage
Always use -NoStorage. Azure VMs don't share storage for traditional FCI. Always On AG uses database-level replication instead of shared disks.
Step 6: Configure Quorum (File Share Witness)
On dc-vm:
# Create witness share
New-Item -Path C:\ClusterWitness -ItemType Directory
New-SmbShare -Name ClusterWitness -Path C:\ClusterWitness -FullAccess "SQLHA\sqlha-cluster$"
From a SQL VM:
Set-ClusterQuorum -FileShareWitness "\\dc-vm\ClusterWitness"
Step 7: Enable Always On & Create AG
# Enable AG on both SQL instances
Enable-SqlAlwaysOn -ServerInstance "sqlvm-ha1" -Force
Enable-SqlAlwaysOn -ServerInstance "sqlvm-ha2" -Force
# Create endpoints, AG, and listener via T-SQL (run on primary)
-- Create database
CREATE DATABASE TestHADB;
ALTER DATABASE TestHADB SET RECOVERY FULL;
BACKUP DATABASE TestHADB TO DISK = 'NUL';
-- Create AG endpoint on both nodes (run on each)
CREATE ENDPOINT [Hadr_endpoint]
STATE = STARTED
AS TCP (LISTENER_PORT = 5022)
FOR DATABASE_MIRRORING (ROLE = ALL);
-- Create Availability Group (run on primary)
CREATE AVAILABILITY GROUP [sqlha-ag]
WITH (AUTOMATED_BACKUP_PREFERENCE = PRIMARY)
FOR DATABASE [TestHADB]
REPLICA ON
N'sqlvm-ha1' WITH (
ENDPOINT_URL = N'TCP://sqlvm-ha1.sqlha.local:5022',
FAILOVER_MODE = AUTOMATIC,
AVAILABILITY_MODE = SYNCHRONOUS_COMMIT),
N'sqlvm-ha2' WITH (
ENDPOINT_URL = N'TCP://sqlvm-ha2.sqlha.local:5022',
FAILOVER_MODE = AUTOMATIC,
AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);
-- Join secondary (run on secondary)
ALTER AVAILABILITY GROUP [sqlha-ag] JOIN;
ALTER DATABASE [TestHADB] SET HADR AVAILABILITY GROUP = [sqlha-ag];
Verification
# Check AG health
SELECT
ag.name AS ag_name,
ars.role_desc,
ars.synchronization_health_desc,
ar.replica_server_name
FROM sys.dm_hadr_availability_replica_states ars
JOIN sys.availability_replicas ar ON ars.replica_id = ar.replica_id
JOIN sys.availability_groups ag ON ar.group_id = ag.group_id;
Expected:
| ag_name | role_desc | synchronization_health | replica_server |
|---|---|---|---|
| sqlha-ag | PRIMARY | HEALTHY | sqlvm-ha1 |
| sqlha-ag | SECONDARY | HEALTHY | sqlvm-ha2 |
Common Failures
| Symptom | Likely Cause | Fix |
|---|---|---|
New-Cluster fails: "node not reachable" | NSG blocks cluster ports (3343, 5022, 59999) | Open intra-subnet RDP/cluster ports between VMs |
| Listener IP unreachable from app | ILB health probe (TCP 59999) not responding on primary | Verify probe rule on SQL with EnableAlwaysOn + correct floating IP |
| AG stuck in RESOLVING after failover | Quorum lost; only 1 node sees cluster | Force quorum on surviving node, then re-add others |
ALTER AVAILABILITY GROUP ... JOIN fails on secondary | Backup/restore chain broken or LSN mismatch | Re-seed: full backup + log backup + restore WITH NORECOVERY |
| Failover takes >30 sec | Sync replica is unhealthy or async-only | Confirm sync replica is SYNCHRONIZED before relying on auto-failover |
SQL Server service account must be domain account, not LocalSystem. AG uses Kerberos for inter-node auth. LocalSystem cannot register SPNs against the cluster name.
Variations to Try
- Add a 3rd async replica in another region → see Cross-Region AlwaysOn
- Switch to distributed AG (forwarder pattern) for cross-AG replication
- Replace ILB with DNN listener (SQL 2019+) — no LB needed
- Add read-only routing to offload reports to the secondary
- Configure backups on secondary (
BACKUP_PRIORITY) to offload primary IO
Production Application
- Tier-1 OLTP, single region, RPO 0 / RTO < 30 sec → this exact pattern (2 sync replicas, 1 file-share or cloud witness, ILB listener, automatic failover).
- Mission-critical SQL on IaaS where PaaS isn't an option (CLR, cross-DB transactions, agent jobs) — AG is the IaaS HA story.
- Compliance requires self-managed encryption keys + HA — AG + EKM provider.
- Anti-pattern: don't use this for DR-only across regions — add an async replica or use a separate FOG/distributed AG instead.
Clean Up
az group delete --name rg-sql-ha --yes --no-wait
Source code & full step-by-step guide: The complete project with Bicep templates, PowerShell scripts, draw.io architecture diagrams, and a Flask failover-testing app is available at github.com/ehabmostafa_microsoft/AlwaysOn-HA-Setup.