Skip to main content

Alerts, Notifications & Job Troubleshooting

Automation without monitoring is a liability. This page covers how to set up alerts, handle notifications, and troubleshoot job failures across all Azure SQL platforms.

Azure Monitor Alerts for Databases

Azure Monitor is the primary alerting service for all Azure SQL platforms:

Alert Types

Alert TypeTriggerLatencyCost
Metric AlertNumeric threshold (CPU > 80%)~1 minLow
Log AlertKQL query result (error count > 5)1-15 minPer query
Activity Log AlertControl plane event (failover, scale)Near real-timeFree
Smart DetectionML anomaly (from App Insights)MinutesIncluded

Critical Alerts Every DBA Should Configure

AlertMetric/SignalThresholdWhy
CPU Criticalcpu_percent> 90% for 10 minQuery/index problem or need to scale
Storage Fullstorage_percent> 85%Database will go read-only if full
Deadlocksdeadlock> 0Application concurrency issue
Failed Connectionsconnection_failed> 10/minAuth issue or connection pool exhaustion
DTU/IO Bottleneckdtu_consumption_percent> 95% for 15 minNeed to scale or optimize queries
Log Rate Hitlog_write_percent100% for 5 minWrite throttling — scale or batch writes
Workers Exhaustedworkers_percent> 80%Connection/query concurrency too high
Failover OccurredActivity LogAny failover eventInvestigate root cause
LTR Backup FailedActivity LogBackup failureCompliance risk
🎯 Exam Focus

DP-300 distinction: Metric alerts are best for real-time thresholds (CPU, DTU). Log alerts are best for complex conditions using KQL queries. Activity log alerts detect control-plane events (failover, configuration change). Know when to use which.

Action Groups — What Happens When an Alert Fires

An Action Group defines what actions to take when an alert triggers:

Action TypeUse Case
Email/SMSDBA notification
Azure FunctionAuto-remediation (scale up, kill sessions)
Logic AppWorkflow (Teams notification, ticket creation)
WebhookExternal system integration (PagerDuty, Slack)
Automation RunbookPowerShell auto-remediation
ITSMServiceNow/Cherwell ticket creation
🏢 Real-World DBA Note

Best practice pattern: Alert → Action Group → Logic App → Teams channel + Azure Automation Runbook for auto-remediation. Example: CPU > 95% for 15 min → auto-scale from 4 to 8 vCores → notify team in Teams → scale back down after 2 hours if CPU normalizes.

SQL Agent Job Troubleshooting

Common Failure Reasons

SymptomLikely CauseFix
Job failed — login failedCredential expired or permissions removedUpdate proxy/credential, check DB permissions
Job failed — timeoutLong-running query or blockingOptimize query, check for locks, increase timeout
Job failed — database not accessibleDB offline, failover, or renamedCheck AG status, failover group health
Job ran but did nothingWrong database context or IF conditionVerify USE [database], check step logic
Job succeeded but data wrongWrong schedule or overlapping executionCheck schedule, add WITH (NOLOCK) awareness
Job not startingSQL Agent service stopped or disabledVerify Agent is running, check startup type

Where to Look

SourceWhat You Find
Job History (SSMS)Per-step success/failure with error messages
msdb.dbo.sysjobhistoryProgrammatic access to job history
SQL Agent Error LogAgent-level errors (connectivity, service issues)
SQL Error LogDatabase-level errors during job execution
Windows Event LogOS-level errors affecting Agent service
Azure Monitor (MI)SQL Agent job metrics and logs

Enabling Extended Job History

By default, SQL Agent keeps limited history. Increase it:

SettingDefaultRecommended
Max job history rows1,00010,000+
Max rows per job1001,000
⚠️ Watch Out

msdb can grow large if you don't manage job history. Create a maintenance job to purge old history: EXEC msdb.dbo.sp_purge_jobhistory @oldest_date = '<30 days ago>'. Schedule this monthly.

Elastic Job Troubleshooting

IssueCauseResolution
Job stuck in "Waiting for retry"Target DB unreachableCheck firewall, connectivity, credential
Job failed on some targetsDifferent schema or permissionsVerify T-SQL compatibility across all targets
Job not running on scheduleJob agent stopped or quota exceededCheck agent health in portal
"Could not connect"Credential doesn't exist on target DBCreate the user/credential on each target

Azure Automation Runbook Troubleshooting

IssueCauseResolution
"Forbidden" errorManaged Identity missing RBAC roleGrant required role (e.g., SQL DB Contributor)
Module not foundAz module not importedImport required Az modules in Automation Account
Runbook times outDefault timeout is 3 hoursOptimize script or split into multiple runbooks
Runbook succeeds but no effectWrong subscription/resource contextSet Set-AzContext -SubscriptionId explicitly

Monitoring Automation Health

Dashboard Checklist

What to MonitorWhere
SQL Agent job statusSSMS Job Activity Monitor / msdb queries
Elastic Job executionJob database jobs.job_executions view
Automation Runbook statusAzure Portal → Automation Account → Jobs
Alert rule healthAzure Monitor → Alerts → Summary
Action Group deliveryAzure Monitor → Service Health

Anti-Patterns

  • "Alert on every metric." Alert fatigue is the #1 reason on-call ignores alerts. Pick 3–5 SLOs per service (latency, error rate, saturation) and alert on burn-rate against them.
  • "Email-only Action Group." Email lands in inbox. On-call needs paging — wire to PagerDuty / Teams / SMS via Action Group → ITSM/webhook.
  • "Set static threshold for CPU = 80 %." Static thresholds drift as workloads change. Dynamic thresholds (smart detection) auto-baseline and alert on anomaly.
  • "One Action Group for everything." When a noisy alert fires in test, prod on-call gets paged. Separate Action Groups by environment + service.
  • "Use Activity Log alert for Microsoft.Sql/servers/databases/write to detect schema change." Activity Log is control plane. Schema changes are data plane (T-SQL DDL). Use SQL Audit → Log Analytics for that.
  • "Disable failed jobs alert because the job sometimes fails legitimately." Add a retry + grace period to the job, or alert on > N failures in M minutes. Don't disable.
⚠️ Watch Out

Action Group SMS / voice notifications have country-specific rate limits + costs. A storm of 100 alerts in 10 minutes can blow your SMS budget. Use Action Group → PagerDuty/Teams instead of SMS for high-volume scenarios.


Migration Between Alerting Models

From → ToPathCost
SQL Agent sp_notify_operator (email) → Azure Monitor metric alert + Action GroupWire metric sql_agent_job_failed → AGOne-time; gains routing + retry
Static threshold → Dynamic thresholdEdit alert rule, switch to "Dynamic"Free; auto-baselined
Email-only AG → Email + ITSM connectorAdd ITSM action; map fieldsConnector setup; integrates with ServiceNow / Jira
Separate alert per metric → Multi-resource + grouped alert rulesUse scope = subscription / RG; enable groupingCuts rule sprawl; harder to route per-DB
Manual ad-hoc alerts → Azure Policy + diagnostic settingsPolicy enforces diag settings; alert rules deploy via BicepDrift prevention; IaC discipline
Storage-only audit → LA + Sentinel detection rulesAdd LA destination + Sentinel ruleAdds Sentinel cost; gains correlation
Per-DB log alert → Cross-resource KQL alertUse union / workspace() across LA workspacesOne rule for fleet; auth setup needed
Email → webhook → PagerDutyReplace email action with webhook to PagerDutyProper paging; cost = PagerDuty subscription

Most expensive move: Sentinel adoption for SQL alerts (orchestration tier).


Real Scenarios

  1. MI + 80 DBs, want SLO-based alerting3 metric alerts (CPU, Storage, Connections) at MI scope + 2 log alerts (errors, deadlocks) on LA + 1 ITSM Action Group. Driver: 5 alerts cover 80 % of incidents.
  2. 24/7 SaaS, on-call rotationAll alerts → Action Group webhook → PagerDuty (rotation) + Teams channel for visibility. Driver: paging on real incidents, signal in chat for situational awareness.
  3. Multi-region failoverActivity Log alert on Microsoft.Sql/.../failover/action. Driver: instant notify when FOG initiates. Trade-off: AG must point to the surviving region.
  4. Cost guardrail on Azure SQL DB DTU modelMetric alert on dtu_consumption_percent > 90 % for 30 min. Driver: detect before throttle. Trade-off: noisy on bursty workloads — use dynamic threshold.
  5. DBA wants to monitor tempdb saturation across SQL VM fleetWorkbook on Azure Monitor for VMs counter \SQLServer:Databases(tempdb)\Data File(s) Size (KB) → alert at 80 % of provisioned. Driver: prevent tempdb-out-of-space outages.

Flashcards

What are the 3 types of Azure Monitor alerts?
Click to reveal answer
1) Metric alerts (numeric thresholds, ~1 min), 2) Log alerts (KQL queries, 1-15 min), 3) Activity log alerts (control-plane events, near real-time).
1 / 5

Quiz

Q1/5
0 correct
You want to be alerted when your Azure SQL Database CPU stays above 90% for 10 minutes. What type of alert should you create?