Advanced Monitoring & Optimization
The Complete Monitoring Stackβ
Extended Events (XEvents)β
The modern, lightweight replacement for SQL Trace/Profiler. Think of it as Oracle's event tracing (10046, 10053) but better.
Architectureβ
Key Predefined Sessionsβ
| Session | What It Captures | Enabled By Default |
|---|---|---|
| system_health | Errors, deadlocks, memory issues | β Yes |
| AlwaysOn_health | AG state changes, failover events | β (if AG configured) |
Common XEvent Use Casesβ
| Scenario | Events to Capture |
|---|---|
| Slow queries | sql_statement_completed (filter: duration > 5s) |
| Deadlocks | xml_deadlock_report (or check system_health) |
| Login failures | error_reported (error_number = 18456) |
| Plan changes | query_post_compilation_showplan |
| Wait analysis | wait_info, wait_completed |
Extended Events replaces SQL Profiler (Profiler is deprecated). XEvents uses less CPU, supports filtering at the source (predicates), and works on Azure SQL DB, MI, and VMs. system_health session is always running β check it first for deadlocks and errors.
Oracle DBA parallel: Extended Events = Oracle's ALTER SESSION SET EVENTS + DBMS_MONITOR + AWR event tracing, but unified in one framework. The system_health session is like Oracle's incident capture β always running, catches critical events automatically.
Create an XEvent session β enable orderβ
Duration in XEvents is microseconds, not milliseconds. Filtering WHERE duration > 5000 captures everything over 5 ms (way too noisy). Use 5000000 for 5 seconds. Also: on Azure SQL DB the only file target is Azure Blob storage β a SAS-secured container, not a local path.
Database Watcher (Preview)β
Near real-time monitoring dashboard that doesn't require manual setup of diagnostic pipelines.
| Feature | Details |
|---|---|
| Data store | Azure Data Explorer (Kusto) |
| Query language | KQL |
| Latency | Near real-time (~seconds) |
| Supported | Azure SQL DB, MI |
| Dashboards | Pre-built Grafana dashboards |
| Key metrics | Sessions, waits, query stats, resource usage, blocking |
Database Watcher is Microsoft's newest monitoring solution β it streams data to Azure Data Explorer for KQL analysis. If the exam mentions "near real-time monitoring with KQL dashboards" β Database Watcher.
SQL Insights (Azure Monitor)β
Deeper telemetry than basic Azure Monitor metrics. Uses a monitoring VM to collect DMV data.
Resource Governor (VM & MI)β
Isolates workloads by limiting CPU, memory, and I/O per workload group.
Architecture components:
- Resource Pool β defines CPU/memory/I/O limits
- Workload Group β maps to a resource pool with additional settings (request limits)
- Classifier Function β routes incoming sessions to workload groups based on rules
Resource Governor enable orderβ
ALTER RESOURCE GOVERNOR RECONFIGURE is the step everyone forgets. Pools, groups, and classifier all exist but no traffic is being classified until you reconfigure. Verify with sys.dm_resource_governor_workload_groups after running a workload.
Resource Governor is available on SQL Server on VM and MI (not Azure SQL DB). If the exam says "prevent reporting queries from consuming all CPU" β Resource Governor. For Azure SQL DB, use read replicas (BC tier) or Named Replicas (Hyperscale) to isolate workloads instead.
Oracle DBA parallel: Resource Governor = Oracle Resource Manager. Resource Pools = Consumer Groups. Classifier Function = Resource Manager plan rules. Same concept: prevent one workload from starving another.
Intelligent Query Processing (IQP)β
A family of automatic query optimization features. No code changes needed β just upgrade compatibility level.
Key exam-tested IQP features:
| Feature | Problem It Solves | Compat Level |
|---|---|---|
| Memory Grant Feedback | Queries request too much/too little memory (spills to disk) | 150 |
| Adaptive Joins | Optimizer picks wrong join type at compile time | 150 |
| Batch Mode on Rowstore | Analytics queries slow on OLTP tables | 150 |
| Parameter Sensitive Plan | Parameter sniffing β bad plan for some values | 160 |
| DOP Feedback | Over/under-parallelism wastes resources | 160 |
IQP requires the right compatibility level (150 = SQL 2019, 160 = SQL 2022). If the exam asks "how to enable adaptive joins" β set compatibility level to 150+. No query changes needed β it's automatic.
Statistics Maintenanceβ
Statistics tell the query optimizer about data distribution. Stale statistics β bad plans β poor performance.
| Setting | Behavior | Best For |
|---|---|---|
| AUTO_UPDATE_STATISTICS | Updates when ~20% of rows change | Most workloads (default: ON) |
| AUTO_UPDATE_STATISTICS_ASYNC | Updates in background (doesn't block queries) | OLTP (avoids query compilation delays) |
| Manual UPDATE STATISTICS | On-demand, full scan or sample % | After bulk loads, partition switches |
After large data loads: always run UPDATE STATISTICS manually. Auto-update won't trigger until the 20% threshold. For partitioned tables, update stats on specific partitions with UPDATE STATISTICS ... WITH RESAMPLE ON PARTITIONS(N).
Execution Plans β What to Look Forβ
| Red Flag in Plan | What It Means | Fix |
|---|---|---|
| Table/Index Scan (on large table) | Missing index or unSARGable WHERE | Create covering index |
| Key Lookup (high cost) | Index doesn't include needed columns | Add INCLUDE columns |
| Sort (with warning β οΈ) | Memory spill to tempdb | Add index matching ORDER BY, or increase memory |
| Parallelism (CXPACKET waits) | Skewed distribution across threads | Update stats, limit MAXDOP |
| Thick arrows (many rows) | Cardinality estimate wrong | Update stats, rewrite query |
| Nested Loops (on large sets) | Should be Hash Match | Update stats, hint if needed |
Oracle DBA parallel: Execution plans in SQL Server are like Oracle explain plans. Key difference: SQL Server shows actual rows vs estimated in actual plans (like Oracle's V$SQL_PLAN_STATISTICS_ALL). Look for big gaps between estimated and actual β that means stale stats.
Anti-Patternsβ
- "Use SQL Profiler in production." Profiler is heavy (per-event handshake) and deprecated. Extended Events is the supported, low-overhead path. The exam treats this as a wrong answer.
- "Capture every event in an XEvent session." XE is lightweight only when targeted. A session capturing
sql_statement_completedfor all DBs writes GBs/hour and degrades the very thing you're measuring. Filter with predicates (DB, login, duration > N). - "
system_healthis enough β we don't need a custom session."system_healthcovers errors, deadlocks, memory issues but not statement-level capture. Add a custom XE session for query tuning workloads. - "Read execution plans from SSMS only." Plans live in Query Store for Azure SQL DB/MI β query
sys.query_store_planto get the plan history without re-running. - "Estimated plan = actual plan." They differ when stats are stale. The Estimated plan uses optimizer guesses; Actual fills in measured rows. Big gap = stats / parameter sniffing problem.
- "Save XE captures to ring buffer in prod." Ring buffer is in-memory and per-instance β lost on restart, capped in size. Persist to Event File target for production tuning runs.
XE Event File on Azure SQL DB writes to Azure Storage β not local disk. Configure a SAS URL pointing to a storage account in the same region as the DB. Cross-region writes silently fail.
Migration Between Monitoring Approachesβ
| From β To | Path | Cost |
|---|---|---|
| SQL Profiler β Extended Events | Convert via Profiler "Export" β "XE Session" | Lower overhead; modern tooling |
| Manual DMV polling β Query Store | Enable QS at DB level | Free, automatic plan + runtime capture |
| QS only β QS + Database Watcher | Add Watcher; keep QS as source | Watcher gives cross-server view; ADX cost |
system_health β custom XE session | Create session, target Event File on Azure Storage | Targeted capture; storage cost |
| Local XE files β Storage + Log Analytics | Pipeline XE β Storage β LA via Logic App | KQL across XE data; LA ingest cost |
| Plan capture via SSMS only β Plan Cache + QS | DMV-based plan history + QS forced plans | Survives plan eviction; QS forces plans across recompiles |
| Workload trace via Profiler β Distributed Replay | XE capture + replay on test box | Reproduces concurrency; Replay infra to set up |
Most expensive moves: piping XE β Storage β LA (custom pipeline) and standing up Distributed Replay infrastructure.
Real Scenariosβ
- Performance regression after a release β Query Store βTop Resource Consumersβ + force previous plan. Driver: roll back the plan without code change. Trade-off: forced plan has cost overhead vs natural recompile.
- Intermittent deadlocks in prod β
system_healthdeadlock graph + custom XE session forlock_acquiredfiltered to the affected DB. Driver: cheap detection + targeted deep-dive. Trade-off: detailed XE has measurable overhead β enable for a tuning window only. - ISV asks for repro on customer issue β XE capture on customer DB for 1 hour + Distributed Replay against test box. Driver: real concurrency. Trade-off: Replay infra setup cost.
- Estate-wide query baseline β Database Watcher + ADX Workbook trending p95 / p99 per DB. Driver: cross-server view. Trade-off: ADX cost + initial setup.
- Dev wants to understand a slow query β
SET STATISTICS IO, TIME ON+ Actual Execution Plan in SSMS + Query Store wait stats. Driver: free, in-the-moment. Trade-off: dev DB only β don't run on prod ad-hoc.