Skip to main content

Alerts, Notifications & Job Troubleshooting

Automation without monitoring is a liability. This page covers how to set up alerts, handle notifications, and troubleshoot job failures across all Azure SQL platforms.

Azure Monitor Alerts for Databases

Azure Monitor is the primary alerting service for all Azure SQL platforms:

Alert Types

Alert TypeTriggerLatencyCost
Metric AlertNumeric threshold (CPU > 80%)~1 minLow
Log AlertKQL query result (error count > 5)1-15 minPer query
Activity Log AlertControl plane event (failover, scale)Near real-timeFree
Smart DetectionML anomaly (from App Insights)MinutesIncluded

Critical Alerts Every DBA Should Configure

AlertMetric/SignalThresholdWhy
CPU Criticalcpu_percent> 90% for 10 minQuery/index problem or need to scale
Storage Fullstorage_percent> 85%Database will go read-only if full
Deadlocksdeadlock> 0Application concurrency issue
Failed Connectionsconnection_failed> 10/minAuth issue or connection pool exhaustion
DTU/IO Bottleneckdtu_consumption_percent> 95% for 15 minNeed to scale or optimize queries
Log Rate Hitlog_write_percent100% for 5 minWrite throttling — scale or batch writes
Workers Exhaustedworkers_percent> 80%Connection/query concurrency too high
Failover OccurredActivity LogAny failover eventInvestigate root cause
LTR Backup FailedActivity LogBackup failureCompliance risk
🎯 Exam Focus

DP-300 distinction: Metric alerts are best for real-time thresholds (CPU, DTU). Log alerts are best for complex conditions using KQL queries. Activity log alerts detect control-plane events (failover, configuration change). Know when to use which.

Action Groups — What Happens When an Alert Fires

An Action Group defines what actions to take when an alert triggers:

Action TypeUse Case
Email/SMSDBA notification
Azure FunctionAuto-remediation (scale up, kill sessions)
Logic AppWorkflow (Teams notification, ticket creation)
WebhookExternal system integration (PagerDuty, Slack)
Automation RunbookPowerShell auto-remediation
ITSMServiceNow/Cherwell ticket creation
🏢 Real-World DBA Note

Best practice pattern: Alert → Action Group → Logic App → Teams channel + Azure Automation Runbook for auto-remediation. Example: CPU > 95% for 15 min → auto-scale from 4 to 8 vCores → notify team in Teams → scale back down after 2 hours if CPU normalizes.

SQL Agent Job Troubleshooting

Common Failure Reasons

SymptomLikely CauseFix
Job failed — login failedCredential expired or permissions removedUpdate proxy/credential, check DB permissions
Job failed — timeoutLong-running query or blockingOptimize query, check for locks, increase timeout
Job failed — database not accessibleDB offline, failover, or renamedCheck AG status, failover group health
Job ran but did nothingWrong database context or IF conditionVerify USE [database], check step logic
Job succeeded but data wrongWrong schedule or overlapping executionCheck schedule, add WITH (NOLOCK) awareness
Job not startingSQL Agent service stopped or disabledVerify Agent is running, check startup type

Where to Look

SourceWhat You Find
Job History (SSMS)Per-step success/failure with error messages
msdb.dbo.sysjobhistoryProgrammatic access to job history
SQL Agent Error LogAgent-level errors (connectivity, service issues)
SQL Error LogDatabase-level errors during job execution
Windows Event LogOS-level errors affecting Agent service
Azure Monitor (MI)SQL Agent job metrics and logs

Enabling Extended Job History

By default, SQL Agent keeps limited history. Increase it:

SettingDefaultRecommended
Max job history rows1,00010,000+
Max rows per job1001,000
⚠️ Watch Out

msdb can grow large if you don't manage job history. Create a maintenance job to purge old history: EXEC msdb.dbo.sp_purge_jobhistory @oldest_date = '<30 days ago>'. Schedule this monthly.

Elastic Job Troubleshooting

IssueCauseResolution
Job stuck in "Waiting for retry"Target DB unreachableCheck firewall, connectivity, credential
Job failed on some targetsDifferent schema or permissionsVerify T-SQL compatibility across all targets
Job not running on scheduleJob agent stopped or quota exceededCheck agent health in portal
"Could not connect"Credential doesn't exist on target DBCreate the user/credential on each target

Azure Automation Runbook Troubleshooting

IssueCauseResolution
"Forbidden" errorManaged Identity missing RBAC roleGrant required role (e.g., SQL DB Contributor)
Module not foundAz module not importedImport required Az modules in Automation Account
Runbook times outDefault timeout is 3 hoursOptimize script or split into multiple runbooks
Runbook succeeds but no effectWrong subscription/resource contextSet Set-AzContext -SubscriptionId explicitly

Monitoring Automation Health

Dashboard Checklist

What to MonitorWhere
SQL Agent job statusSSMS Job Activity Monitor / msdb queries
Elastic Job executionJob database jobs.job_executions view
Automation Runbook statusAzure Portal → Automation Account → Jobs
Alert rule healthAzure Monitor → Alerts → Summary
Action Group deliveryAzure Monitor → Service Health

Flashcards

What are the 3 types of Azure Monitor alerts?
Click to reveal answer
1) Metric alerts (numeric thresholds, ~1 min), 2) Log alerts (KQL queries, 1-15 min), 3) Activity log alerts (control-plane events, near real-time).
1 / 5

Quiz

Q1/5
0 correct
You want to be alerted when your Azure SQL Database CPU stays above 90% for 10 minutes. What type of alert should you create?