
SQL Server on VM — Always On Availability Groups (Deep Dive)

This page covers the complete architecture of Always On AG on Azure VMs — every component, why it's needed, and how to configure it.

Architecture Overview

Always On AG Architecture

🏰 Active Directory + WSFC: the domain controller provides AD DS + DNS; WSFC provides health monitoring and automatic failover.
🗄️ AG + Replicas: primary + secondary replicas with sync/async commit; readable secondaries for reporting offload.
⚖️ ILB + Cloud Witness: the Internal Load Balancer routes to the primary via a health probe; Cloud Witness provides the quorum tiebreaker.

Component Breakdown

1. Active Directory Domain Services (AD DS)

Why needed: Windows Server Failover Clustering (WSFC) requires all nodes to be domain-joined. The cluster uses AD for authentication, computer objects, and the Cluster Name Object (CNO).

Domain Controller VM: provides AD DS + DNS
DNS: critical for name resolution between cluster nodes
Domain Join: both SQL VMs must be joined to the same domain
Service Accounts: the SQL Server service runs under a domain account
# Install AD DS on the Domain Controller VM
Install-WindowsFeature AD-Domain-Services -IncludeManagementTools

# Promote to domain controller
Install-ADDSForest `
-DomainName "sqlha.local" `
-DomainNetBIOSName "SQLHA" `
-InstallDns `
-SafeModeAdministratorPassword (ConvertTo-SecureString "<STRONG_PASSWORD>" -AsPlainText -Force) `
-Force
🏢 Real-World DBA Note

In production, deploy two domain controllers in an Availability Set or across Availability Zones for AD redundancy. A single DC is a single point of failure for your entire AG.

2. Windows Server Failover Clustering (WSFC)

Why needed: Always On AG is built on top of WSFC. The cluster provides:

  • Health monitoring — detects node failures
  • Automatic failover — triggers AG failover when a node goes down
  • Quorum voting — determines which nodes form the majority
# Install Failover Clustering feature on BOTH SQL VMs
Install-WindowsFeature Failover-Clustering -IncludeManagementTools

# Validate the cluster (run from either node)
Test-Cluster -Node SQL-VM1,SQL-VM2

# Create the cluster
New-Cluster -Name "YOURCLUSTER" `
-Node SQL-VM1,SQL-VM2 `
-StaticAddress 10.0.0.10 `
-NoStorage
⚠️ Watch Out

Always use -NoStorage when creating the cluster. Azure VMs don't share storage, and Azure doesn't support shared disks for traditional failover cluster instances (FCI) without Storage Spaces Direct or Azure Shared Disks.

3. Quorum & Cloud Witness

Why needed: Quorum prevents split-brain scenarios. It decides which partition of nodes can keep running if there's a network partition.

Cloud Witness: recommended for Azure; uses a blob in an Azure Storage Account
File Share Witness: on-prem or hybrid setups with a file server
Disk Witness: traditional on-prem clusters with shared storage
No Witness: only with an odd number of nodes (not recommended)
# Configure Cloud Witness (recommended for Azure)
Set-ClusterQuorum -CloudWitness `
-AccountName "yourstorageaccount" `
-AccessKey "your-storage-key"

How Cloud Witness works:

  1. Each node sends heartbeats to the cluster
  2. Cloud Witness acts as the tie-breaking vote
  3. If the primary goes down, the secondary + Cloud Witness = majority → failover proceeds
  4. Without a witness in a 2-node cluster, neither node has majority → cluster goes offline
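The voting arithmetic behind steps 3 and 4 can be sketched in a few lines of shell (the vote counts are illustrative, not an API):

```shell
# Quorum math for a 2-node cluster plus Cloud Witness (illustrative only)
node_votes=2          # SQL-VM1 + SQL-VM2, one vote each
witness_votes=1       # Cloud Witness blob
total_votes=$((node_votes + witness_votes))
majority=$((total_votes / 2 + 1))            # 3 votes total -> majority is 2

# Scenario: the primary node is lost
surviving_votes=$((1 + witness_votes))       # secondary + witness = 2
if [ "$surviving_votes" -ge "$majority" ]; then
  echo "quorum held: failover proceeds"
else
  echo "quorum lost: cluster goes offline"
fi

# Same scenario without a witness: 1 surviving vote vs a majority of 2
no_witness_majority=$((2 / 2 + 1))           # 2 votes total -> majority is 2
[ 1 -ge "$no_witness_majority" ] || echo "no witness: cluster goes offline"
```

With the witness, the surviving node plus the blob's vote reach majority; without it, a lone survivor never can.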
🎯 Exam Focus

For DP-300, remember: A 2-node cluster without a witness cannot survive a single node failure because neither remaining node has majority. Cloud Witness is the recommended quorum witness in Azure — it's region-independent and doesn't need a separate VM.

4. Azure Internal Load Balancer (ILB)

This is the most critical Azure-specific component, and the one that differs most from an on-premises setup.

Why is a Load Balancer Needed?

On-premises, the AG Listener's virtual IP (VIP) is registered in DNS, and the active node claims it on the local network via gratuitous ARP broadcast. In Azure this mechanism doesn't work: the virtual network ignores gratuitous ARP, so a VM can't claim an arbitrary IP.

The Azure ILB solves this by:

  1. Holding the Listener's virtual IP as its frontend IP
  2. Using a health probe on a custom port to detect which VM is the current primary
  3. Routing all traffic to only the active primary node
ILB Health Probe Flow

1. 📡 LB probes both VMs: a TCP probe on port 59999 every 5 seconds, sent to both SQL VMs in the backend pool.
2. ✅ Primary responds: only the current primary's cluster IP resource answers with a SYN-ACK on port 59999.
3. 🔀 Traffic routed: all port 1433 traffic → the healthy (primary) node; Floating IP (Direct Server Return) is enabled.
4. 🔄 Failover detection: on AG failover the cluster moves the IP resource; the LB detects it in ~10 seconds (2 failed probes) and reroutes.
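The "~10 seconds" figure falls directly out of the probe settings (a 5-second interval, 2 consecutive failures), as this small shell check shows:

```shell
# Failure-detection time = probe interval x consecutive failed probes
probe_interval=5      # seconds, matches --interval 5 on the LB probe
probe_threshold=2     # failed probes before a node is marked down (--threshold 2)
detection_time=$((probe_interval * probe_threshold))
echo "LB reroutes after ~${detection_time}s"
```

Tightening the interval or threshold shortens detection at the cost of more probe traffic and a higher risk of flapping.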

Load Balancer Configuration

# Create the internal load balancer
az network lb create \
--resource-group rg-sqlha \
--name ilb-sqlha \
--sku Standard \
--vnet-name vnet-sqlha \
--subnet subnet-sql \
--frontend-ip-name fe-ag-listener \
--private-ip-address 10.0.0.100 \
--backend-pool-name bp-sqlha

# Add both SQL VMs to the backend pool
az network nic ip-config address-pool add \
--address-pool bp-sqlha \
--lb-name ilb-sqlha \
--resource-group rg-sqlha \
--nic-name sql-vm1-nic \
--ip-config-name ipconfig1

az network nic ip-config address-pool add \
--address-pool bp-sqlha \
--lb-name ilb-sqlha \
--resource-group rg-sqlha \
--nic-name sql-vm2-nic \
--ip-config-name ipconfig1

# Create the health probe (THIS IS THE KEY!)
az network lb probe create \
--resource-group rg-sqlha \
--lb-name ilb-sqlha \
--name probe-ag \
--protocol tcp \
--port 59999 \
--interval 5 \
--threshold 2

# Create the load balancing rule
az network lb rule create \
--resource-group rg-sqlha \
--lb-name ilb-sqlha \
--name rule-ag-1433 \
--protocol tcp \
--frontend-port 1433 \
--backend-port 1433 \
--frontend-ip-name fe-ag-listener \
--backend-pool-name bp-sqlha \
--probe-name probe-ag \
--idle-timeout 30 \
--enable-floating-ip true
⚠️ Watch Out

--enable-floating-ip true (Direct Server Return) is mandatory for AG Listeners. Without it, the load balancer NATs the traffic and SQL Server won't accept the connection because the destination IP doesn't match the Listener IP configured on the NIC.
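To confirm the setting after deployment, a query along these lines should work (resource names match the examples above; verify against your own deployment):

```shell
# Check that floating IP is enabled on the AG load-balancing rule
az network lb rule show \
  --resource-group rg-sqlha \
  --lb-name ilb-sqlha \
  --name rule-ag-1433 \
  --query enableFloatingIp
```

This should return `true` if the rule was created as shown earlier.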

Health Probe — How the LB Knows Which Node is Primary

The LB sends a TCP probe to port 59999 on both VMs. Only the current primary responds to this probe. This is configured via a cluster IP resource linked to the AG Listener:

# Run on the active cluster node after creating the AG
# This configures the probe port on the AG Listener cluster resource

$ClusterNetworkName = "Cluster Network 1"
$IPResourceName = "IP Address 10.0.0.100" # The AG listener IP resource name
$ILBIP = "10.0.0.100" # Load balancer frontend IP
$ProbePort = 59999

Import-Module FailoverClusters

Get-ClusterResource $IPResourceName | Set-ClusterParameter `
-Multiple @{
"Address" = "$ILBIP";
"ProbePort" = $ProbePort;
"SubnetMask"= "255.255.255.255";
"Network" = "$ClusterNetworkName";
"EnableDhcp"= 0
}

# Restart the cluster resource for changes to take effect
Stop-ClusterResource $IPResourceName
Start-ClusterResource $IPResourceName

How it works end-to-end:

1. The LB sends a TCP SYN to port 59999 on both VMs every 5 seconds
2. Only the primary replica's cluster IP resource responds (SYN-ACK)
3. The LB marks the responding node as "healthy"
4. All traffic on port 1433 → forwarded to the healthy (primary) node
5. On failover, the cluster moves the IP resource → the new primary responds to the probe
6. The LB detects the change within ~10 seconds (2 failed probes) and reroutes
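A quick way to see step 2 in action is to test the probe port yourself. This sketch uses bash's /dev/tcp redirection; PROBE_IP is a placeholder, so point it at each node from inside the VNet:

```shell
# Hypothetical probe check: does this node answer on the probe port?
PROBE_IP=127.0.0.1    # placeholder; use a SQL VM's address in the VNet
PROBE_PORT=59999
if timeout 1 bash -c "exec 3<>/dev/tcp/${PROBE_IP}/${PROBE_PORT}" 2>/dev/null; then
  echo "probe port open: this node holds the listener IP (primary)"
else
  echo "probe port closed: this node is not the primary"
fi
```

Run it against both nodes: only the current primary should report the port open, mirroring what the LB's health probe sees.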
🎯 Exam Focus

Key exam points about the Load Balancer:

  1. It replaces the gratuitous ARP mechanism that doesn't work in Azure
  2. Floating IP (Direct Server Return) must be enabled
  3. The health probe port (59999) is configured on the cluster IP resource, not in SQL Server
  4. Standard SKU LB is required (not Basic) for Availability Zone support

5. Availability Set vs Availability Zones

Criterion | Availability Set | Availability Zones
Protection | Hardware/rack failures | Entire datacenter failures
SLA | 99.95% | 99.99%
Latency | Sub-millisecond | ~1-2 ms between zones
LB SKU | Basic or Standard | Standard only
Sync commit? | ✅ Yes (low latency) | ✅ Yes (still low enough)
# Deploy VMs across Availability Zones
az vm create \
--name SQL-VM1 \
--resource-group rg-sqlha \
--zone 1 \
--image MicrosoftSQLServer:sql2022-ws2022:enterprise-gen2:latest \
--size Standard_E4ds_v5 \
--admin-username azureuser \
--admin-password '<YOUR_SECURE_PASSWORD>'

az vm create \
--name SQL-VM2 \
--resource-group rg-sqlha \
--zone 2 \
--image MicrosoftSQLServer:sql2022-ws2022:enterprise-gen2:latest \
--size Standard_E4ds_v5 \
--admin-username azureuser \
--admin-password '<YOUR_SECURE_PASSWORD>'

6. SQL Server Always On Configuration

-- Step 1: Enable Always On on both instances (requires restart)
-- Run via SQL Server Configuration Manager or PowerShell:
-- Enable-SqlAlwaysOn -ServerInstance SQL-VM1 -Force

-- Step 2: Create the database mirroring endpoint (BOTH nodes)
CREATE ENDPOINT [Hadr_endpoint]
STATE = STARTED
AS TCP (LISTENER_PORT = 5022)
FOR DATABASE_MIRRORING (ROLE = ALL);

-- Step 3: Grant connect permission to the service account
GRANT CONNECT ON ENDPOINT::[Hadr_endpoint]
TO [SQLHA\SqlService];

-- Step 4: Create the Availability Group (PRIMARY only)
CREATE AVAILABILITY GROUP [AG-Production]
WITH (
AUTOMATED_BACKUP_PREFERENCE = SECONDARY,
DB_FAILOVER = ON,
DTC_SUPPORT = NONE
)
FOR DATABASE [ProductionDB], [ReportingDB]
REPLICA ON
N'SQL-VM1' WITH (
ENDPOINT_URL = N'TCP://SQL-VM1.sqlha.local:5022',
AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
FAILOVER_MODE = AUTOMATIC,
SEEDING_MODE = AUTOMATIC,
SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY)
),
N'SQL-VM2' WITH (
ENDPOINT_URL = N'TCP://SQL-VM2.sqlha.local:5022',
AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
FAILOVER_MODE = AUTOMATIC,
SEEDING_MODE = AUTOMATIC,
SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY)
);

-- Step 5: Join secondary (run on SQL-VM2)
ALTER AVAILABILITY GROUP [AG-Production] JOIN;
ALTER AVAILABILITY GROUP [AG-Production]
GRANT CREATE ANY DATABASE; -- For automatic seeding

-- Step 6: Create the Listener
ALTER AVAILABILITY GROUP [AG-Production]
ADD LISTENER N'ag-listener' (
WITH IP ((N'10.0.0.100', N'255.255.255.255')),
PORT = 1433
);
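After Step 6, it's worth confirming replica health before moving on. Assuming sqlcmd is installed (server and AG names are the ones used in the examples above), a DMV query like this shows the role and synchronization state of each replica:

```shell
# Hypothetical post-deployment health check via sqlcmd
sqlcmd -S ag-listener -Q "
SELECT ag.name AS ag_name,
       ar.replica_server_name,
       ars.role_desc,
       ars.synchronization_health_desc
FROM sys.dm_hadr_availability_replica_states ars
JOIN sys.availability_replicas ar ON ars.replica_id = ar.replica_id
JOIN sys.availability_groups ag ON ars.group_id = ag.group_id;"
```

Expect one PRIMARY and one SECONDARY row, both with synchronization health HEALTHY, before testing failover.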

7. Read-Only Routing

Route read-intent connections to the secondary replica automatically:

-- Configure read-only routing URLs
ALTER AVAILABILITY GROUP [AG-Production]
MODIFY REPLICA ON N'SQL-VM1' WITH (
PRIMARY_ROLE (ALLOW_CONNECTIONS = READ_WRITE),
SECONDARY_ROLE (
ALLOW_CONNECTIONS = READ_ONLY,
READ_ONLY_ROUTING_URL = N'TCP://SQL-VM1.sqlha.local:1433'
)
);

ALTER AVAILABILITY GROUP [AG-Production]
MODIFY REPLICA ON N'SQL-VM2' WITH (
PRIMARY_ROLE (ALLOW_CONNECTIONS = READ_WRITE),
SECONDARY_ROLE (
ALLOW_CONNECTIONS = READ_ONLY,
READ_ONLY_ROUTING_URL = N'TCP://SQL-VM2.sqlha.local:1433'
)
);

-- Set read-only routing list
ALTER AVAILABILITY GROUP [AG-Production]
MODIFY REPLICA ON N'SQL-VM1' WITH (
PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = (N'SQL-VM2'))
);

ALTER AVAILABILITY GROUP [AG-Production]
MODIFY REPLICA ON N'SQL-VM2' WITH (
PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = (N'SQL-VM1'))
);

Connection string for read-only routing:

Server=ag-listener;Database=ProductionDB;
ApplicationIntent=ReadOnly;MultiSubnetFailover=True
🏢 Real-World DBA Note

Always include MultiSubnetFailover=True in connection strings when using AG Listeners in Azure. This enables parallel connection attempts to all IPs, reducing failover reconnection time from ~30 seconds to ~1-2 seconds.

Complete Deployment Checklist

1. Deploy VNet + Subnet: ensure enough IP space for VMs + LB + Listener
2. Deploy Domain Controller: AD DS + DNS
3. Deploy SQL VMs: same subnet, Availability Set or Zones
4. Domain-join SQL VMs: configure DNS to point to the DC
5. Install Failover Clustering: on both SQL VMs
6. Create WSFC Cluster: -NoStorage, static IP
7. Configure Cloud Witness: Storage Account for quorum
8. Enable Always On: SQL Server Configuration Manager → restart
9. Create DB Mirroring Endpoints: port 5022 on both nodes
10. Create Availability Group: sync commit + auto failover
11. Create AG Listener: virtual IP matching the LB frontend
12. Deploy Internal Load Balancer: frontend IP = Listener IP
13. Configure Health Probe: port 59999, TCP, linked to the cluster IP resource
14. Configure LB Rule: port 1433, floating IP enabled
15. Configure Read-Only Routing: optional but recommended
16. Test Failover: manual failover + verify the probe works
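For the failover test in step 16, the whole cycle can be driven from the command line; a sketch, assuming sqlcmd and the names used throughout this page:

```shell
# Hypothetical manual failover test: run FAILOVER on the current secondary
# (requires synchronous commit and a healthy, synchronized replica)
sqlcmd -S SQL-VM2 -Q "ALTER AVAILABILITY GROUP [AG-Production] FAILOVER;"

# Then connect through the listener and confirm it reaches the new primary
sqlcmd -S ag-listener -Q "SELECT @@SERVERNAME AS current_primary;"
```

If the second query returns SQL-VM2 within a few seconds, the cluster IP resource moved and the LB probe picked up the new primary as expected.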

Network Ports Reference

Port | Protocol | Purpose
1433 | TCP | SQL Server client connections
5022 | TCP | Database mirroring endpoint (AG sync)
59999 | TCP | Load balancer health probe
135 | TCP | RPC / cluster communication
3343 | UDP | Cluster heartbeat
445 | TCP | SMB (if using a file share witness)

Flashcards

Why is an Azure Load Balancer required for AG Listener on Azure VMs?
Azure VMs cannot use gratuitous ARP to claim virtual IPs. The ILB holds the Listener's IP and uses a health probe to route traffic only to the current primary replica.

Quiz

Why can't an AG Listener use a virtual IP directly in Azure without a Load Balancer?