Alerts, Notifications & Job Troubleshooting
Automation without monitoring is a liability. This page covers how to set up alerts, handle notifications, and troubleshoot job failures across all Azure SQL platforms.
Azure Monitor Alerts for Databases
Azure Monitor is the primary alerting service for all Azure SQL platforms:
Alert Types
| Alert Type | Trigger | Latency | Cost |
|---|---|---|---|
| Metric Alert | Numeric threshold (CPU > 80%) | ~1 min | Low |
| Log Alert | KQL query result (error count > 5) | 1-15 min | Per query |
| Activity Log Alert | Control plane event (failover, scale) | Near real-time | Free |
| Smart Detection | ML anomaly (from App Insights) | Minutes | Included |
Critical Alerts Every DBA Should Configure
| Alert | Metric/Signal | Threshold | Why |
|---|---|---|---|
| CPU Critical | cpu_percent | > 90% for 10 min | Query/index problem or need to scale |
| Storage Full | storage_percent | > 85% | Database will go read-only if full |
| Deadlocks | deadlock | > 0 | Application concurrency issue |
| Failed Connections | connection_failed | > 10/min | Auth issue or connection pool exhaustion |
| DTU/IO Bottleneck | dtu_consumption_percent | > 95% for 15 min | Need to scale or optimize queries |
| Log Rate Hit | log_write_percent | 100% for 5 min | Write throttling — scale or batch writes |
| Workers Exhausted | workers_percent | > 80% | Connection/query concurrency too high |
| Failover Occurred | Activity Log | Any failover event | Investigate root cause |
| LTR Backup Failed | Activity Log | Backup failure | Compliance risk |
DP-300 distinction: Metric alerts are best for real-time thresholds (CPU, DTU). Log alerts are best for complex conditions using KQL queries. Activity log alerts detect control-plane events (failover, configuration change). Know when to use which.
Action Groups — What Happens When an Alert Fires
An Action Group defines what actions to take when an alert triggers:
| Action Type | Use Case |
|---|---|
| Email/SMS | DBA notification |
| Azure Function | Auto-remediation (scale up, kill sessions) |
| Logic App | Workflow (Teams notification, ticket creation) |
| Webhook | External system integration (PagerDuty, Slack) |
| Automation Runbook | PowerShell auto-remediation |
| ITSM | ServiceNow/Cherwell ticket creation |
Best practice pattern: Alert → Action Group → Logic App → Teams channel + Azure Automation Runbook for auto-remediation. Example: CPU > 95% for 15 min → auto-scale from 4 to 8 vCores → notify team in Teams → scale back down after 2 hours if CPU normalizes.
SQL Agent Job Troubleshooting
Common Failure Reasons
| Symptom | Likely Cause | Fix |
|---|---|---|
| Job failed — login failed | Credential expired or permissions removed | Update proxy/credential, check DB permissions |
| Job failed — timeout | Long-running query or blocking | Optimize query, check for locks, increase timeout |
| Job failed — database not accessible | DB offline, failover, or renamed | Check AG status, failover group health |
| Job ran but did nothing | Wrong database context or IF condition | Verify USE [database], check step logic |
| Job succeeded but data wrong | Wrong schedule or overlapping execution | Check schedule, add WITH (NOLOCK) awareness |
| Job not starting | SQL Agent service stopped or disabled | Verify Agent is running, check startup type |
Where to Look
| Source | What You Find |
|---|---|
| Job History (SSMS) | Per-step success/failure with error messages |
msdb.dbo.sysjobhistory | Programmatic access to job history |
| SQL Agent Error Log | Agent-level errors (connectivity, service issues) |
| SQL Error Log | Database-level errors during job execution |
| Windows Event Log | OS-level errors affecting Agent service |
| Azure Monitor (MI) | SQL Agent job metrics and logs |
Enabling Extended Job History
By default, SQL Agent keeps limited history. Increase it:
| Setting | Default | Recommended |
|---|---|---|
| Max job history rows | 1,000 | 10,000+ |
| Max rows per job | 100 | 1,000 |
msdb can grow large if you don't manage job history. Create a maintenance job to purge old history: EXEC msdb.dbo.sp_purge_jobhistory @oldest_date = '<30 days ago>'. Schedule this monthly.
Elastic Job Troubleshooting
| Issue | Cause | Resolution |
|---|---|---|
| Job stuck in "Waiting for retry" | Target DB unreachable | Check firewall, connectivity, credential |
| Job failed on some targets | Different schema or permissions | Verify T-SQL compatibility across all targets |
| Job not running on schedule | Job agent stopped or quota exceeded | Check agent health in portal |
| "Could not connect" | Credential doesn't exist on target DB | Create the user/credential on each target |
Azure Automation Runbook Troubleshooting
| Issue | Cause | Resolution |
|---|---|---|
| "Forbidden" error | Managed Identity missing RBAC role | Grant required role (e.g., SQL DB Contributor) |
| Module not found | Az module not imported | Import required Az modules in Automation Account |
| Runbook times out | Default timeout is 3 hours | Optimize script or split into multiple runbooks |
| Runbook succeeds but no effect | Wrong subscription/resource context | Set Set-AzContext -SubscriptionId explicitly |
Monitoring Automation Health
Dashboard Checklist
| What to Monitor | Where |
|---|---|
| SQL Agent job status | SSMS Job Activity Monitor / msdb queries |
| Elastic Job execution | Job database jobs.job_executions view |
| Automation Runbook status | Azure Portal → Automation Account → Jobs |
| Alert rule health | Azure Monitor → Alerts → Summary |
| Action Group delivery | Azure Monitor → Service Health |