Alerts, Notifications & Job Troubleshooting

Automation without monitoring is a liability. This page covers how to set up alerts, handle notifications, and troubleshoot job failures across all Azure SQL platforms.

Azure Monitor Alerts for Databases

Azure Monitor is the primary alerting service for all Azure SQL platforms:

Alert Types

Alert Type	Trigger	Latency	Cost
Metric Alert	Numeric threshold (CPU > 80%)	~1 min	Low
Log Alert	KQL query result (error count > 5)	1-15 min	Per query
Activity Log Alert	Control plane event (failover, scale)	Near real-time	Free
Smart Detection	ML anomaly (from App Insights)	Minutes	Included

Critical Alerts Every DBA Should Configure

Alert	Metric/Signal	Threshold	Why
CPU Critical	`cpu_percent`	> 90% for 10 min	Query/index problem or need to scale
Storage Full	`storage_percent`	> 85%	Database will go read-only if full
Deadlocks	`deadlock`	> 0	Application concurrency issue
Failed Connections	`connection_failed`	> 10/min	Auth issue or connection pool exhaustion
DTU/IO Bottleneck	`dtu_consumption_percent`	> 95% for 15 min	Need to scale or optimize queries
Log Rate Hit	`log_write_percent`	100% for 5 min	Write throttling — scale or batch writes
Workers Exhausted	`workers_percent`	> 80%	Connection/query concurrency too high
Failover Occurred	Activity Log	Any failover event	Investigate root cause
LTR Backup Failed	Activity Log	Backup failure	Compliance risk

🎯 Exam Focus

DP-300 distinction: Metric alerts are best for real-time thresholds (CPU, DTU). Log alerts are best for complex conditions using KQL queries. Activity log alerts detect control-plane events (failover, configuration change). Know when to use which.

Action Groups — What Happens When an Alert Fires

An Action Group defines what actions to take when an alert triggers:

Action Type	Use Case
Email/SMS	DBA notification
Azure Function	Auto-remediation (scale up, kill sessions)
Logic App	Workflow (Teams notification, ticket creation)
Webhook	External system integration (PagerDuty, Slack)
Automation Runbook	PowerShell auto-remediation
ITSM	ServiceNow/Cherwell ticket creation

🏢 Real-World DBA Note

Best practice pattern: Alert → Action Group → Logic App → Teams channel + Azure Automation Runbook for auto-remediation. Example: CPU > 95% for 15 min → auto-scale from 4 to 8 vCores → notify team in Teams → scale back down after 2 hours if CPU normalizes.

SQL Agent Job Troubleshooting

Common Failure Reasons

Symptom	Likely Cause	Fix
Job failed — login failed	Credential expired or permissions removed	Update proxy/credential, check DB permissions
Job failed — timeout	Long-running query or blocking	Optimize query, check for locks, increase timeout
Job failed — database not accessible	DB offline, failover, or renamed	Check AG status, failover group health
Job ran but did nothing	Wrong database context or IF condition	Verify `USE [database]`, check step logic
Job succeeded but data wrong	Wrong schedule or overlapping execution	Check schedule, add `WITH (NOLOCK)` awareness
Job not starting	SQL Agent service stopped or disabled	Verify Agent is running, check startup type

Where to Look

Source	What You Find
Job History (SSMS)	Per-step success/failure with error messages
`msdb.dbo.sysjobhistory`	Programmatic access to job history
SQL Agent Error Log	Agent-level errors (connectivity, service issues)
SQL Error Log	Database-level errors during job execution
Windows Event Log	OS-level errors affecting Agent service
Azure Monitor (MI)	SQL Agent job metrics and logs

Enabling Extended Job History

By default, SQL Agent keeps limited history. Increase it:

Setting	Default	Recommended
Max job history rows	1,000	10,000+
Max rows per job	100	1,000

⚠️ Watch Out

msdb can grow large if you don't manage job history. Create a maintenance job to purge old history: EXEC msdb.dbo.sp_purge_jobhistory @oldest_date = '<30 days ago>'. Schedule this monthly.

Elastic Job Troubleshooting

Issue	Cause	Resolution
Job stuck in "Waiting for retry"	Target DB unreachable	Check firewall, connectivity, credential
Job failed on some targets	Different schema or permissions	Verify T-SQL compatibility across all targets
Job not running on schedule	Job agent stopped or quota exceeded	Check agent health in portal
"Could not connect"	Credential doesn't exist on target DB	Create the user/credential on each target

Azure Automation Runbook Troubleshooting

Issue	Cause	Resolution
"Forbidden" error	Managed Identity missing RBAC role	Grant required role (e.g., SQL DB Contributor)
Module not found	Az module not imported	Import required Az modules in Automation Account
Runbook times out	Default timeout is 3 hours	Optimize script or split into multiple runbooks
Runbook succeeds but no effect	Wrong subscription/resource context	Set `Set-AzContext -SubscriptionId` explicitly

Monitoring Automation Health

Dashboard Checklist

What to Monitor	Where
SQL Agent job status	SSMS Job Activity Monitor / msdb queries
Elastic Job execution	Job database `jobs.job_executions` view
Automation Runbook status	Azure Portal → Automation Account → Jobs
Alert rule health	Azure Monitor → Alerts → Summary
Action Group delivery	Azure Monitor → Service Health

Anti-Patterns

"Alert on every metric." Alert fatigue is the #1 reason on-call ignores alerts. Pick 3–5 SLOs per service (latency, error rate, saturation) and alert on burn-rate against them.
"Email-only Action Group." Email lands in inbox. On-call needs paging — wire to PagerDuty / Teams / SMS via Action Group → ITSM/webhook.
"Set static threshold for CPU = 80 %." Static thresholds drift as workloads change. Dynamic thresholds (smart detection) auto-baseline and alert on anomaly.
"One Action Group for everything." When a noisy alert fires in test, prod on-call gets paged. Separate Action Groups by environment + service.
"Use Activity Log alert for Microsoft.Sql/servers/databases/write to detect schema change." Activity Log is control plane. Schema changes are data plane (T-SQL DDL). Use SQL Audit → Log Analytics for that.
"Disable failed jobs alert because the job sometimes fails legitimately." Add a retry + grace period to the job, or alert on > N failures in M minutes. Don't disable.

⚠️ Watch Out

Action Group SMS / voice notifications have country-specific rate limits + costs. A storm of 100 alerts in 10 minutes can blow your SMS budget. Use Action Group → PagerDuty/Teams instead of SMS for high-volume scenarios.

Migration Between Alerting Models

From → To	Path	Cost
SQL Agent `sp_notify_operator` (email) → Azure Monitor metric alert + Action Group	Wire metric `sql_agent_job_failed` → AG	One-time; gains routing + retry
Static threshold → Dynamic threshold	Edit alert rule, switch to "Dynamic"	Free; auto-baselined
Email-only AG → Email + ITSM connector	Add ITSM action; map fields	Connector setup; integrates with ServiceNow / Jira
Separate alert per metric → Multi-resource + grouped alert rules	Use scope = subscription / RG; enable grouping	Cuts rule sprawl; harder to route per-DB
Manual ad-hoc alerts → Azure Policy + diagnostic settings	Policy enforces diag settings; alert rules deploy via Bicep	Drift prevention; IaC discipline
Storage-only audit → LA + Sentinel detection rules	Add LA destination + Sentinel rule	Adds Sentinel cost; gains correlation
Per-DB log alert → Cross-resource KQL alert	Use `union` / `workspace()` across LA workspaces	One rule for fleet; auth setup needed
Email → webhook → PagerDuty	Replace email action with webhook to PagerDuty	Proper paging; cost = PagerDuty subscription

Most expensive move: Sentinel adoption for SQL alerts (orchestration tier).

Real Scenarios

MI + 80 DBs, want SLO-based alerting → 3 metric alerts (CPU, Storage, Connections) at MI scope + 2 log alerts (errors, deadlocks) on LA + 1 ITSM Action Group. Driver: 5 alerts cover 80 % of incidents.
24/7 SaaS, on-call rotation → All alerts → Action Group webhook → PagerDuty (rotation) + Teams channel for visibility. Driver: paging on real incidents, signal in chat for situational awareness.
Multi-region failover → Activity Log alert on Microsoft.Sql/.../failover/action. Driver: instant notify when FOG initiates. Trade-off: AG must point to the surviving region.
Cost guardrail on Azure SQL DB DTU model → Metric alert on dtu_consumption_percent > 90 % for 30 min. Driver: detect before throttle. Trade-off: noisy on bursty workloads — use dynamic threshold.
DBA wants to monitor tempdb saturation across SQL VM fleet → Workbook on Azure Monitor for VMs counter \SQLServer:Databases(tempdb)\Data File(s) Size (KB) → alert at 80 % of provisioned. Driver: prevent tempdb-out-of-space outages.

Flashcards

What are the 3 types of Azure Monitor alerts?

Click to reveal answer

1) Metric alerts (numeric thresholds, ~1 min), 2) Log alerts (KQL queries, 1-15 min), 3) Activity log alerts (control-plane events, near real-time).

1 / 5

Quiz

Q1/5

0 correct

You want to be alerted when your Azure SQL Database CPU stays above 90% for 10 minutes. What type of alert should you create?

Azure Monitor Alerts for Databases​

Alert Types​

Critical Alerts Every DBA Should Configure​

Action Groups — What Happens When an Alert Fires​

SQL Agent Job Troubleshooting​

Common Failure Reasons​

Where to Look​

Enabling Extended Job History​

Elastic Job Troubleshooting​

Azure Automation Runbook Troubleshooting​

Monitoring Automation Health​

Dashboard Checklist​

Anti-Patterns​

Migration Between Alerting Models​

Real Scenarios​

Flashcards​

Quiz​