Alerts, Notifications & Job Troubleshooting
Automation without monitoring is a liability. This page covers how to set up alerts, handle notifications, and troubleshoot job failures across all Azure SQL platforms.
Azure Monitor Alerts for Databases
Azure Monitor is the primary alerting service for all Azure SQL platforms:
Alert Types
| Alert Type | Trigger | Latency | Cost |
|---|---|---|---|
| Metric Alert | Numeric threshold (CPU > 80%) | ~1 min | Low |
| Log Alert | KQL query result (error count > 5) | 1-15 min | Per query |
| Activity Log Alert | Control plane event (failover, scale) | Near real-time | Free |
| Smart Detection | ML anomaly (from App Insights) | Minutes | Included |
Critical Alerts Every DBA Should Configure
| Alert | Metric/Signal | Threshold | Why |
|---|---|---|---|
| CPU Critical | cpu_percent | > 90% for 10 min | Query/index problem or need to scale |
| Storage Full | storage_percent | > 85% | Database will go read-only if full |
| Deadlocks | deadlock | > 0 | Application concurrency issue |
| Failed Connections | connection_failed | > 10/min | Auth issue or connection pool exhaustion |
| DTU/IO Bottleneck | dtu_consumption_percent | > 95% for 15 min | Need to scale or optimize queries |
| Log Rate Hit | log_write_percent | 100% for 5 min | Write throttling — scale or batch writes |
| Workers Exhausted | workers_percent | > 80% | Connection/query concurrency too high |
| Failover Occurred | Activity Log | Any failover event | Investigate root cause |
| LTR Backup Failed | Activity Log | Backup failure | Compliance risk |
DP-300 distinction: Metric alerts are best for real-time thresholds (CPU, DTU). Log alerts are best for complex conditions using KQL queries. Activity log alerts detect control-plane events (failover, configuration change). Know when to use which.
Action Groups — What Happens When an Alert Fires
An Action Group defines what actions to take when an alert triggers:
| Action Type | Use Case |
|---|---|
| Email/SMS | DBA notification |
| Azure Function | Auto-remediation (scale up, kill sessions) |
| Logic App | Workflow (Teams notification, ticket creation) |
| Webhook | External system integration (PagerDuty, Slack) |
| Automation Runbook | PowerShell auto-remediation |
| ITSM | ServiceNow/Cherwell ticket creation |
Best practice pattern: Alert → Action Group → Logic App → Teams channel + Azure Automation Runbook for auto-remediation. Example: CPU > 95% for 15 min → auto-scale from 4 to 8 vCores → notify team in Teams → scale back down after 2 hours if CPU normalizes.
SQL Agent Job Troubleshooting
Common Failure Reasons
| Symptom | Likely Cause | Fix |
|---|---|---|
| Job failed — login failed | Credential expired or permissions removed | Update proxy/credential, check DB permissions |
| Job failed — timeout | Long-running query or blocking | Optimize query, check for locks, increase timeout |
| Job failed — database not accessible | DB offline, failover, or renamed | Check AG status, failover group health |
| Job ran but did nothing | Wrong database context or IF condition | Verify USE [database], check step logic |
| Job succeeded but data wrong | Wrong schedule or overlapping execution | Check schedule, add WITH (NOLOCK) awareness |
| Job not starting | SQL Agent service stopped or disabled | Verify Agent is running, check startup type |
Where to Look
| Source | What You Find |
|---|---|
| Job History (SSMS) | Per-step success/failure with error messages |
msdb.dbo.sysjobhistory | Programmatic access to job history |
| SQL Agent Error Log | Agent-level errors (connectivity, service issues) |
| SQL Error Log | Database-level errors during job execution |
| Windows Event Log | OS-level errors affecting Agent service |
| Azure Monitor (MI) | SQL Agent job metrics and logs |
Enabling Extended Job History
By default, SQL Agent keeps limited history. Increase it:
| Setting | Default | Recommended |
|---|---|---|
| Max job history rows | 1,000 | 10,000+ |
| Max rows per job | 100 | 1,000 |
msdb can grow large if you don't manage job history. Create a maintenance job to purge old history: EXEC msdb.dbo.sp_purge_jobhistory @oldest_date = '<30 days ago>'. Schedule this monthly.
Elastic Job Troubleshooting
| Issue | Cause | Resolution |
|---|---|---|
| Job stuck in "Waiting for retry" | Target DB unreachable | Check firewall, connectivity, credential |
| Job failed on some targets | Different schema or permissions | Verify T-SQL compatibility across all targets |
| Job not running on schedule | Job agent stopped or quota exceeded | Check agent health in portal |
| "Could not connect" | Credential doesn't exist on target DB | Create the user/credential on each target |
Azure Automation Runbook Troubleshooting
| Issue | Cause | Resolution |
|---|---|---|
| "Forbidden" error | Managed Identity missing RBAC role | Grant required role (e.g., SQL DB Contributor) |
| Module not found | Az module not imported | Import required Az modules in Automation Account |
| Runbook times out | Default timeout is 3 hours | Optimize script or split into multiple runbooks |
| Runbook succeeds but no effect | Wrong subscription/resource context | Set Set-AzContext -SubscriptionId explicitly |
Monitoring Automation Health
Dashboard Checklist
| What to Monitor | Where |
|---|---|
| SQL Agent job status | SSMS Job Activity Monitor / msdb queries |
| Elastic Job execution | Job database jobs.job_executions view |
| Automation Runbook status | Azure Portal → Automation Account → Jobs |
| Alert rule health | Azure Monitor → Alerts → Summary |
| Action Group delivery | Azure Monitor → Service Health |
Anti-Patterns
- "Alert on every metric." Alert fatigue is the #1 reason on-call ignores alerts. Pick 3–5 SLOs per service (latency, error rate, saturation) and alert on burn-rate against them.
- "Email-only Action Group." Email lands in inbox. On-call needs paging — wire to PagerDuty / Teams / SMS via Action Group → ITSM/webhook.
- "Set static threshold for CPU = 80 %." Static thresholds drift as workloads change. Dynamic thresholds (smart detection) auto-baseline and alert on anomaly.
- "One Action Group for everything." When a noisy alert fires in test, prod on-call gets paged. Separate Action Groups by environment + service.
- "Use Activity Log alert for
Microsoft.Sql/servers/databases/writeto detect schema change." Activity Log is control plane. Schema changes are data plane (T-SQL DDL). Use SQL Audit → Log Analytics for that. - "Disable failed jobs alert because the job sometimes fails legitimately." Add a retry + grace period to the job, or alert on
> N failures in M minutes. Don't disable.
Action Group SMS / voice notifications have country-specific rate limits + costs. A storm of 100 alerts in 10 minutes can blow your SMS budget. Use Action Group → PagerDuty/Teams instead of SMS for high-volume scenarios.
Migration Between Alerting Models
| From → To | Path | Cost |
|---|---|---|
SQL Agent sp_notify_operator (email) → Azure Monitor metric alert + Action Group | Wire metric sql_agent_job_failed → AG | One-time; gains routing + retry |
| Static threshold → Dynamic threshold | Edit alert rule, switch to "Dynamic" | Free; auto-baselined |
| Email-only AG → Email + ITSM connector | Add ITSM action; map fields | Connector setup; integrates with ServiceNow / Jira |
| Separate alert per metric → Multi-resource + grouped alert rules | Use scope = subscription / RG; enable grouping | Cuts rule sprawl; harder to route per-DB |
| Manual ad-hoc alerts → Azure Policy + diagnostic settings | Policy enforces diag settings; alert rules deploy via Bicep | Drift prevention; IaC discipline |
| Storage-only audit → LA + Sentinel detection rules | Add LA destination + Sentinel rule | Adds Sentinel cost; gains correlation |
| Per-DB log alert → Cross-resource KQL alert | Use union / workspace() across LA workspaces | One rule for fleet; auth setup needed |
| Email → webhook → PagerDuty | Replace email action with webhook to PagerDuty | Proper paging; cost = PagerDuty subscription |
Most expensive move: Sentinel adoption for SQL alerts (orchestration tier).
Real Scenarios
- MI + 80 DBs, want SLO-based alerting → 3 metric alerts (CPU, Storage, Connections) at MI scope + 2 log alerts (errors, deadlocks) on LA + 1 ITSM Action Group. Driver: 5 alerts cover 80 % of incidents.
- 24/7 SaaS, on-call rotation → All alerts → Action Group webhook → PagerDuty (rotation) + Teams channel for visibility. Driver: paging on real incidents, signal in chat for situational awareness.
- Multi-region failover → Activity Log alert on
Microsoft.Sql/.../failover/action. Driver: instant notify when FOG initiates. Trade-off: AG must point to the surviving region. - Cost guardrail on Azure SQL DB DTU model → Metric alert on
dtu_consumption_percent > 90 % for 30 min. Driver: detect before throttle. Trade-off: noisy on bursty workloads — use dynamic threshold. - DBA wants to monitor
tempdbsaturation across SQL VM fleet → Workbook on Azure Monitor for VMs counter\SQLServer:Databases(tempdb)\Data File(s) Size (KB)→ alert at 80 % of provisioned. Driver: prevent tempdb-out-of-space outages.