Monitoring Tools Matrix
Too many monitoring tools, not sure which to use? This page cuts through the confusion — one place to understand what each tool does, when to use it, and how they relate.
The Matrix
| Tool | Best For | SQL DB | MI | VM | Complexity | Exam Weight |
|---|---|---|---|---|---|---|
| Query Store | Query plan history, regression detection | ✅ | ✅ | ✅ | Low | 🎯 High |
| DMVs | Real-time diagnostics (waits, sessions, index stats) | ✅ | ✅ | ✅ | Medium | 🎯 High |
| Azure Monitor Metrics | Platform-level health (CPU%, DTU%, IO%) | ✅ | ✅ | Partial | Low | 🎯 High |
| Azure Monitor Logs (KQL) | Deep log analysis, custom queries | ✅ | ✅ | ✅ | Medium | 📖 Medium |
| Extended Events | Lightweight targeted tracing | ✅ | ✅ | ✅ | High | 📖 Medium |
| Intelligent Insights | ML-based anomaly detection | ✅ | ✅ | ❌ | Low | 📖 Medium |
| Query Performance Insight | Portal-based top query view | ✅ | ❌ | ❌ | Low | 📖 Low |
| SQL Insights (Azure Monitor) | Centralized fleet monitoring | ✅ | ✅ | ✅ | Medium | 📚 Low |
| Database Watcher | Real-time Azure SQL estate monitoring | ✅ | ✅ | ❌ | Low | 📚 New |
| SQL Assessment | Best practice configuration checks | ❌ | ❌ | ✅ | Low | 📖 Medium |
| Activity Log | Control-plane events (scale, failover, config) | ✅ | ✅ | ✅ | Low | 🎯 Medium |
DP-300 focuses on: Query Store (plan regression, forced plans), DMVs (wait stats, missing indexes), Azure Monitor Metrics (alerting), and Activity Log (failover detection). Know these 4 well.
"If You Suspect X, Start With Y"
| Symptom | Start With | Then Go Deeper |
|---|---|---|
| Slow queries | Query Store → Regressed Queries view | DMVs: sys.dm_exec_query_stats |
| High CPU | Azure Monitor: cpu_percent metric | DMVs: sys.dm_exec_requests → find top CPU queries |
| Blocking / deadlocks | DMVs: sys.dm_exec_requests + sys.dm_tran_locks | Extended Events: blocked_process_report |
| I/O bottleneck | Azure Monitor: physical_data_read_percent | DMVs: sys.dm_io_virtual_file_stats |
| Missing indexes | DMVs: sys.dm_db_missing_index_* | Query Store: find queries that would benefit |
| Plan regression | Query Store → "Regressed Queries" | Automatic Tuning: FORCE_LAST_GOOD_PLAN |
| Connection failures | Azure Monitor: connection_failed metric | Activity Log: check for failover events |
| Throttling (log rate) | Azure Monitor: log_write_percent = 100% | Scale up vCores or move to BC tier |
| Unexpected failover | Activity Log: filter by failover events | Azure Monitor: check resource health |
| Memory pressure | DMVs: sys.dm_os_memory_clerks | Query Store: find memory-heavy queries |
| tempdb contention | DMVs: sys.dm_exec_requests (wait type = PAGELATCH) | Add tempdb files (VM only) |
| Index fragmentation | DMVs: sys.dm_db_index_physical_stats | Schedule rebuild/reorganize maintenance |
Query Store — Quick Decision Guide
Query Store is your flight recorder — it captures every query's plan and performance over time.
| Question | Answer |
|---|---|
| Is it enabled by default? | ✅ Yes on Azure SQL DB. ❌ Manual on VM. ✅ Yes on MI. |
| Where is data stored? | In the user database (not system DB) |
| Can I force a specific plan? | ✅ Yes — like Oracle SQL Profiles |
| Does it detect regressions? | ✅ Yes — "Regressed Queries" view |
| Does it auto-fix regressions? | ✅ With FORCE_LAST_GOOD_PLAN (Automatic Tuning) |
| Max retention? | Configurable (default 30 days, size-based cleanup) |
When to Use Query Store
| Scenario | Use Query Store? |
|---|---|
| "Why did this query slow down last week?" | ✅ |
| "Which queries use the most CPU?" | ✅ |
| "Did a plan change cause regression?" | ✅ |
| "What's happening right now?" | ❌ Use DMVs |
| "What wait types are most common?" | ✅ (wait stats per query in QS) |
DMV Quick Reference
DMVs give you real-time, in-the-moment diagnostics. They don't store history — for history, use Query Store.
Essential DMVs for DP-300
| DMV | What It Shows | Use When |
|---|---|---|
sys.dm_exec_requests | Currently running queries + waits | "What's running right now?" |
sys.dm_exec_sessions | Active sessions and connection info | "Who's connected?" |
sys.dm_exec_query_stats | Aggregated query performance (CPU, reads, duration) | "Top resource consumers" |
sys.dm_os_wait_stats | Server-wide wait statistics | "What is SQL Server waiting on?" |
sys.dm_db_missing_index_* | Missing index suggestions | "Which indexes should I create?" |
sys.dm_db_index_physical_stats | Index fragmentation levels | "Which indexes need maintenance?" |
sys.dm_io_virtual_file_stats | File-level I/O statistics | "Is storage the bottleneck?" |
sys.dm_tran_locks | Current lock information | "What's causing blocking?" |
sys.dm_exec_sql_text() | SQL text for a given handle | Used with other DMVs via CROSS APPLY |
sys.dm_exec_query_plan() | Execution plan XML | Used with other DMVs via CROSS APPLY |
Key Wait Types to Know
| Wait Type | Meaning | Common Fix |
|---|---|---|
CXPACKET / CXCONSUMER | Parallelism waits | Tune MAXDOP, Cost Threshold |
PAGEIOLATCH_* | Waiting for I/O (data from disk) | Missing indexes, scale storage |
LCK_M_* | Lock waits (blocking) | Optimize queries, shorter transactions |
WRITELOG | Log write waits | Scale storage, move log to faster disk |
SOS_SCHEDULER_YIELD | CPU pressure | Scale up, optimize queries |
RESOURCE_SEMAPHORE | Memory grant waits | Reduce query memory needs, scale up |
PAGELATCH_* | tempdb contention | Add tempdb files |
Extended Events — When DMVs Aren't Enough
Extended Events (XEvents) are the lightweight tracing engine. Use when you need to capture specific events over time.
| Use Case | XEvent to Use |
|---|---|
| Capture deadlock graphs | xml_deadlock_report |
| Track long-running queries | sql_batch_completed with duration filter |
| Monitor blocking | blocked_process_report (set threshold first) |
| Audit login failures | error_reported with error 18456 |
| Track query timeouts | attention event |
| Monitor file growth | database_file_size_change |
Extended Events vs SQL Trace/Profiler: SQL Trace and SQL Server Profiler are deprecated. Extended Events is the replacement. The exam expects you to know XEvents, not Profiler. XEvents have lower overhead and support Azure SQL DB/MI.
Azure-Native Monitoring Tools
Azure Monitor Metrics
- What: Platform-level numeric time-series (CPU%, DTU%, IO%, sessions)
- Latency: ~1 minute
- Retention: 93 days
- Cost: Free (platform metrics)
- Best for: Real-time health dashboards and metric alerts
Azure Monitor Logs (Log Analytics)
- What: Structured event records queryable with KQL
- Latency: 2-5 minutes
- Retention: 30-730 days
- Cost: Per GB ingested
- Best for: Deep analysis, audit queries, custom dashboards
Intelligent Insights
- What: ML-based performance anomaly detection
- Works with: Azure SQL DB and MI
- Best for: Automatic detection of regression, blocking, resource limits
- Output: JSON insight in diagnostic logs
Query Performance Insight
- What: Portal blade showing top queries by CPU/duration/execution count
- Works with: Azure SQL DB only
- Powered by: Query Store data
- Best for: Quick visual "what's consuming resources?"
Database Watcher (Preview)
- What: Near-real-time monitoring for Azure SQL estate
- Works with: Azure SQL DB and MI
- Best for: Fleet monitoring, dashboards, data collection to Azure Data Explorer
- Key benefit: Low-latency SQL-level monitoring without agents
SQL Assessment (VM only)
- What: Configuration best-practice checker
- Works with: SQL on VM (via SQL IaaS Agent)
- Best for: Validating VM storage, MAXDOP, tempdb, memory settings
- Output: Recommendations in Azure Portal
Common Troubleshooting Workflow
Step 1: Detect
| Signal | Tool |
|---|---|
| High CPU alert fires | Azure Monitor → cpu_percent |
| User reports "application is slow" | Azure Monitor → check all metrics |
| Scheduled job failed | SQL Agent history or Azure Monitor alerts |
Step 2: Diagnose
| Finding | Next Action |
|---|---|
| CPU > 90% sustained | DMVs → sys.dm_exec_requests → find top CPU queries |
| One specific query is slow | Query Store → check plan history → compare plans |
| Many queries suddenly slow | Query Store → Regressed Queries → check for plan changes |
| Blocking detected | DMVs → sys.dm_exec_requests (blocking_session_id) |
| I/O > 90% | DMVs → sys.dm_io_virtual_file_stats → check latency |
Step 3: Fix
| Root Cause | Fix |
|---|---|
| Bad query plan | Force good plan in Query Store |
| Missing index | Create index from sys.dm_db_missing_index_details |
| Resource limit hit | Scale up (more vCores, BC tier, or Hyperscale) |
| Blocking chain | Optimize the blocking query, shorten transactions |
| Parameter sniffing | Use OPTIMIZE FOR, RECOMPILE, or PSP (compat 160) |
Tool Relationships
| Tool | Feeds Into | Powered By |
|---|---|---|
| Query Store | Query Performance Insight, Automatic Tuning | SQL engine internal capture |
| DMVs | Custom scripts, monitoring solutions | SQL engine real-time state |
| Azure Monitor Metrics | Metric Alerts, Dashboards, Workbooks | Platform telemetry |
| Azure Monitor Logs | Log Alerts, KQL queries, Workbooks | Diagnostic Settings export |
| Extended Events | Custom analysis, deadlock review | SQL engine event framework |
| Intelligent Insights | Azure Monitor Logs | ML analysis of Query Store + DMV data |
| Database Watcher | Azure Data Explorer dashboards | Direct SQL data collection |
Anti-Patterns
- "Use Extended Events for everything because it's lightweight." XE is lightweight per event — a session capturing every
sql_statement_completedon a busy OLTP server still produces gigabytes/hour and saturates the file target. Scope XE narrowly (filter bydatabase_id,duration > N ms). - "Query Store is free, turn it on with defaults." Defaults capture every ad-hoc query and fill
MAX_STORAGE_SIZE_MB(default 1 GB) in days on a busy server. Then QS goes read-only and you stop collecting new data. SetQUERY_CAPTURE_MODE = AUTO(skips trivial), bump storage on busy systems, monitorsys.database_query_store_options. - "DMVs and Query Store overlap, pick one." They answer different questions: DMVs = right now, QS = history. You need both.
- "Database Watcher replaces Query Store." No — Database Watcher collects data including from QS and ships it to ADX for cross-server analysis. QS still has to be on at each source DB.
- "Azure Monitor Metrics will tell me which query is slow." Metrics are server-level (DTU %, CPU %, log IO). They tell you something is slow, not which query. Drill into Query Store / QPI for the query identity.
- "Just enable Diagnostic Settings to Log Analytics for everything." Per-GB ingest pricing. Sending all DMV/QS data to LA on 50 servers = surprise bill. Send what you'll query (errors, blocks, deadlocks) and keep deep diagnostics local.
Query Store survives failover but not restore-to-different-server unless you copy sys.query_store_* tables. A RESTORE of a backup from another server brings the QS data with the database. PITR to a new DB also preserves QS. But manual schema-only deploys lose all QS history.
Migration Between Tools
| From → To | Path | Cost |
|---|---|---|
| SQL Profiler / Trace → Extended Events | Convert trace to XE session via sp_SQLskills_ConvertTraceToExtendedEvents or rewrite | One-time rewrite; gain lightweight capture |
| Custom DMV scripts → Query Store | Enable QS, replace per-query DMV polling with sys.query_store_* views | Online; QS stores history DMVs throw away |
| Query Store → Database Watcher | Provision DW + ADX cluster; enable per-DB collection | Adds ADX cost; gain cross-server analysis |
| Diagnostic Settings to Storage → Log Analytics | Re-route Diagnostic Setting | Higher per-GB cost; gain KQL + alerts |
| Diagnostic Settings to Log Analytics → Event Hub | Add Event Hub destination | For SIEM/3rd-party export; doesn't replace LA |
| Manual XE analysis → Intelligent Insights | Enable Diagnostic Setting SQLInsights to LA | ML auto-detects regressions; doesn't replace deep dive |
| On-prem Performance Monitor → Azure Monitor Metrics | Use platform metrics + Workbooks | Free for basic metrics; Insights tier adds cost |
| Custom dashboards (Grafana) → Workbooks | Re-author in Workbook JSON | Native to portal; can still pair with Grafana for cross-cloud |
The expensive moves are adding Database Watcher (ADX cluster) and shipping everything to Log Analytics (per-GB ingest).
Real Scenarios
- Single Azure SQL DB, intermittent slowdowns → Query Store + Query Performance Insight + Automatic Tuning (FORCE_LAST_GOOD_PLAN). Driver: built-in, free, addresses regressions automatically. Trade-off: bound to one DB — no cross-server view.
- Fleet of 200 MIs across regions → Database Watcher → ADX → Workbook. Driver: cross-server query patterns visible. Trade-off: ADX cluster ~$300/mo minimum + per-DB collection cost.
- Random deadlocks once a week → XE session capturing
xml_deadlock_reportonly, ring buffer target. Driver: targeted capture, near-zero overhead. Trade-off: ring buffer means losing data on restart — alert on the event to grab it. - Compliance requires 90-day audit trail of all logins → Auditing to Log Analytics + retention policy. Driver: KQL queryable, alerting native. Trade-off: LA ingest cost — limit to logon/logoff events, not data DML.
- DBA wants real-time dashboard for ops floor → Azure Monitor Workbook with metrics + DMV-derived metrics via SQL Insights. Driver: refresh ≤ 1 min, no extra agent. Trade-off: Insights tier pricing per monitored target.