Advanced Monitoring & Optimization

The Complete Monitoring Stack

Monitoring → Diagnosis → Action

🔍

1. Real-Time Monitoring

DMVs (live sessions, waits)

Extended Events (lightweight tracing)

Database Watcher (near real-time dashboards)

📊

2. Historical Analysis

Query Store (plan history, regressions)

SQL Insights (Azure Monitor deep telemetry)

Intelligent Insights (ML anomaly detection)

🔧

3. Performance Optimization

Automatic Tuning (force plans, auto-index)

IQP (adaptive joins, memory feedback)

Resource Governor (workload isolation)

Statistics Maintenance
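
As a rough illustration of the real-time layer, a DMV query along the following lines lists currently executing requests with their waits and blockers. It is a minimal sketch rather than a full monitoring script; the column aliases are my own, and it assumes the caller has permission to view server or database state.

```sql
-- Live view of executing requests: who is running, what they wait on, who blocks them.
SELECT
    r.session_id,
    r.status,
    r.wait_type,
    r.wait_time           AS wait_time_ms,
    r.blocking_session_id,
    r.cpu_time            AS cpu_time_ms,
    r.total_elapsed_time  AS elapsed_ms,
    t.text                AS batch_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID          -- skip this monitoring query itself
ORDER BY r.total_elapsed_time DESC;
```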

Extended Events (XEvents)

The modern, lightweight replacement for SQL Trace/Profiler. The closest Oracle analogue is event tracing (10046, 10053), but XEvents adds source-side filtering and pluggable storage targets in a single framework.

Architecture

🎯

Events

What to capture

sql_statement_completed

xml_deadlock_report

error_reported

🔍

Predicates

Filters (when to capture)

duration > 5000000 (5 s; XEvents durations are in microseconds)

error_number = 18456

📎

Actions

Extra data to collect

sql_text

plan_handle

session_id

💾

Targets

Where to store

ring_buffer

event_file

histogram

Key Predefined Sessions

Session	What It Captures	Enabled By Default
system_health	Errors, deadlocks, memory issues	✅ Yes
AlwaysOn_health	AG state changes, failover events	✅ (if AG configured)

Common XEvent Use Cases

Scenario	Events to Capture
Slow queries	`sql_statement_completed` (filter: duration > 5s)
Deadlocks	`xml_deadlock_report` (or check system_health)
Login failures	`error_reported` (error_number = 18456)
Plan changes	`query_post_compilation_showplan`
Wait analysis	`wait_info`, `wait_completed`
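
For the deadlock row, the graphs already captured by system_health can usually be read back without creating a new session. The query below is a sketch for SQL Server on a VM or Managed Instance; it assumes the default system_health file target is in use and that a bare file pattern resolves against the instance log directory (on older versions, pass the full path instead).

```sql
-- Pull recent deadlock graphs out of the system_health event files.
SELECT
    x.event_xml.value('(event/@timestamp)[1]', 'datetime2')                        AS deadlock_time_utc,
    x.event_xml.query('(event/data[@name="xml_report"]/value/deadlock)[1]')        AS deadlock_graph
FROM
(
    SELECT CAST(event_data AS xml) AS event_xml
    FROM sys.fn_xe_file_target_read_file(N'system_health*.xel', NULL, NULL, NULL)
    WHERE object_name = N'xml_deadlock_report'   -- filter before the XML cast
) AS x
ORDER BY deadlock_time_utc DESC;
```

Clicking the deadlock_graph cell in SSMS opens the XML, which can be saved with an .xdl extension to view it graphically.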

🎯 Exam Focus

Extended Events replaces SQL Profiler (Profiler is deprecated). XEvents uses less CPU, supports filtering at the source (predicates), and works on Azure SQL DB, MI, and VMs; on Azure SQL DB, sessions are database-scoped (ON DATABASE) rather than server-scoped. On SQL Server and MI the system_health session is always running, so check it first for deadlocks and errors.

🏢 Real-World DBA Note

Oracle DBA parallel: Extended Events = Oracle's ALTER SESSION SET EVENTS + DBMS_MONITOR + AWR event tracing, but unified in one framework. The system_health session is like Oracle's incident capture — always running, catches critical events automatically.

Create an XEvent session — enable order

XEvent session lifecycle

CREATE EVENT SESSION

Name the session

Pick events (e.g. sql_statement_completed)

Add predicates (filters)

WHERE duration > 5000000 (microseconds!)

Filter at source = lowest overhead

Add actions

Extra columns: sql_text, plan_handle, session_id

Cost is per event — add only what you need

Add target

ring_buffer (memory) for live tail

event_file (.xel) for long capture

Azure SQL DB → file target on Blob storage

Start + query

ALTER EVENT SESSION ... STATE = START

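Read event_file output with sys.fn_xe_file_target_read_file; read ring_buffer output from the target_data column of sys.dm_xe_session_targets (sys.dm_xe_database_session_targets on Azure SQL DB)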

Common trap

Duration in XEvents is measured in microseconds, not milliseconds. Filtering WHERE duration > 5000 captures everything slower than 5 ms, which is far too noisy. Use 5000000 for 5 seconds. Also, on Azure SQL DB the only file target is Azure Blob storage: a SAS-secured container, not a local path.
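
The lifecycle above can be sketched as a single script. This is an illustrative example for SQL Server on a VM, not a tuned production session: the session name slow_statements and the D:\XE path are hypothetical, and on Managed Instance or Azure SQL DB the file target would have to point at Blob storage instead.

```sql
-- 1. Create: event + predicate + actions + target (server-scoped session).
CREATE EVENT SESSION [slow_statements] ON SERVER
ADD EVENT sqlserver.sql_statement_completed
(
    ACTION (sqlserver.sql_text, sqlserver.plan_handle, sqlserver.session_id)
    WHERE (duration > 5000000)            -- microseconds: 5 seconds
)
ADD TARGET package0.event_file
(
    SET filename = N'D:\XE\slow_statements.xel',   -- hypothetical local path
        max_file_size = (100),                     -- MB per file
        max_rollover_files = (4)
)
WITH (MAX_DISPATCH_LATENCY = 30 SECONDS, STARTUP_STATE = OFF);
GO

-- 2. Start the session.
ALTER EVENT SESSION [slow_statements] ON SERVER STATE = START;
GO

-- 3. Read captured events back and shred the XML.
SELECT
    x.event_xml.value('(event/@timestamp)[1]', 'datetime2')                           AS event_time_utc,
    x.event_xml.value('(event/data[@name="duration"]/value)[1]', 'bigint') / 1000     AS duration_ms,
    x.event_xml.value('(event/action[@name="sql_text"]/value)[1]', 'nvarchar(max)')   AS sql_text
FROM
(
    SELECT CAST(event_data AS xml) AS event_xml
    FROM sys.fn_xe_file_target_read_file(N'D:\XE\slow_statements*.xel', NULL, NULL, NULL)
) AS x
ORDER BY duration_ms DESC;
GO

-- 4. Stop and drop when the tuning window ends.
ALTER EVENT SESSION [slow_statements] ON SERVER STATE = STOP;
DROP EVENT SESSION [slow_statements] ON SERVER;
```

Dividing by 1000 in the read query converts the microsecond duration to milliseconds, which is the same unit conversion the predicate has to respect in the other direction.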

Database Watcher (Preview)

Near real-time monitoring dashboard that doesn't require manual setup of diagnostic pipelines.

Feature	Details
Data store	Azure Data Explorer (Kusto)
Query language	KQL
Latency	Near real-time (~seconds)
Supported	Azure SQL DB, MI
Dashboards	Pre-built Grafana dashboards
Key metrics	Sessions, waits, query stats, resource usage, blocking

🎯 Exam Focus

Database Watcher is Microsoft's newest monitoring solution — it streams data to Azure Data Explorer for KQL analysis. If the exam mentions "near real-time monitoring with KQL dashboards" → Database Watcher.

SQL Insights (Azure Monitor)

Deeper telemetry than basic Azure Monitor metrics. Uses a monitoring VM to collect DMV data.

🗄️

Azure SQL DB/MI/VM

Source database

DMV queries collected

📡

Monitoring VM (Telegraf)

Collects DMV data

Pushes to Log Analytics

📊

Log Analytics

Pre-built Workbooks

Top queries, waits, sessions

Azure Monitor Alerts

Resource Governor (VM & MI)

Isolates workloads by capping CPU, memory, and I/O per resource pool; workload groups map incoming sessions onto those pools.

Resource Governor Architecture

🏊

Resource Pool: OLTP

CPU: 60%, Memory: 60% — Production workload

📊

Resource Pool: Reporting

CPU: 30%, Memory: 30% — BI queries

⚙️

Resource Pool: Default

CPU: 10%, Memory: 10% — Admin tasks

Architecture components:

Resource Pool — defines CPU/memory/I/O limits
Workload Group — maps to a resource pool with additional settings (request limits)
Classifier Function — routes incoming sessions to workload groups based on rules

Resource Governor enable order

Configure Resource Governor

CREATE RESOURCE POOL

MIN/MAX_CPU_PERCENT, memory, IO limits

One pool per workload tier

CREATE WORKLOAD GROUP

USING <pool>

Set MAX_DOP, request_max_memory_grant, etc.

CREATE classifier function

In master DB, schema-bound

Returns workload group name based on APP_NAME, login, etc.

Bind classifier

ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fn_classifier)

RECONFIGURE

ALTER RESOURCE GOVERNOR RECONFIGURE

Without this, nothing takes effect

Common ordering trap

ALTER RESOURCE GOVERNOR RECONFIGURE is the step everyone forgets. Pools, groups, and classifier all exist but no traffic is being classified until you reconfigure. Verify with sys.dm_resource_governor_workload_groups after running a workload.
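
The enable order above can be sketched end to end as follows. Treat it as a minimal example under stated assumptions: the pool and group names, the 30% limits, and the APP_NAME pattern 'ReportServer%' are all hypothetical, and the function name dbo.fn_classifier simply follows the binding statement shown earlier.

```sql
USE master;
GO

-- 1. Resource pool: MAX_CPU_PERCENT applies under contention;
--    CAP_CPU_PERCENT would be a hard ceiling even on an idle server.
CREATE RESOURCE POOL ReportingPool
WITH (MIN_CPU_PERCENT = 0, MAX_CPU_PERCENT = 30, MAX_MEMORY_PERCENT = 30);
GO

-- 2. Workload group mapped to the pool.
CREATE WORKLOAD GROUP ReportingGroup
WITH (MAX_DOP = 4, REQUEST_MAX_MEMORY_GRANT_PERCENT = 25)
USING ReportingPool;
GO

-- 3. Classifier function: lives in master, schema-bound, returns a group name.
CREATE FUNCTION dbo.fn_classifier()
RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @group sysname = N'default';
    IF APP_NAME() LIKE N'ReportServer%'      -- hypothetical routing rule
        SET @group = N'ReportingGroup';
    RETURN @group;
END;
GO

-- 4. Bind the classifier, then 5. apply the configuration.
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fn_classifier);
GO
ALTER RESOURCE GOVERNOR RECONFIGURE;
GO

-- Verify: which workload group did each user session land in?
SELECT s.session_id, s.program_name, g.name AS workload_group
FROM sys.dm_exec_sessions AS s
JOIN sys.dm_resource_governor_workload_groups AS g
    ON g.group_id = s.group_id
WHERE s.is_user_process = 1;
```

Classification happens at login, so sessions that were already connected keep their old group until they reconnect.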

🎯 Exam Focus

Resource Governor is available on SQL Server on VM and MI (not Azure SQL DB). If the exam says "prevent reporting queries from consuming all CPU" → Resource Governor. For Azure SQL DB, use read replicas (BC tier) or Named Replicas (Hyperscale) to isolate workloads instead.

🏢 Real-World DBA Note

Oracle DBA parallel: Resource Governor = Oracle Resource Manager. Workload groups ≈ consumer groups, resource pools ≈ the resource plan directives that allocate CPU to those groups, and the classifier function ≈ consumer group mapping rules. Same concept: prevent one workload from starving another.

Intelligent Query Processing (IQP)

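A family of automatic query optimization features. Most need no code changes: raise the database compatibility level and the optimizer applies them. APPROX_COUNT_DISTINCT is the exception, because you call it explicitly in the query.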

IQP Features (Compatibility Level 150+)

🧠

Memory Grant Feedback

Auto-adjust memory grants over multiple executions

🔀

Adaptive Joins

Hash or Nested Loop decided at runtime based on actual rows

⚡

Batch Mode on Rowstore

Analytics speed on OLTP tables without columnstore

📐

Approximate Count

APPROX_COUNT_DISTINCT — fast, ~2% error

📋

Table Variable Deferred

Accurate cardinality for table variables

📌

Optimized Plan Forcing

Faster forced plan compilation via Query Store

🎯

Parameter Sensitive Plan

Multiple plans for parameter sniffing (Compat 160)

🔄

DOP Feedback

Auto-adjust parallelism degree (Compat 160)

📈

CE Feedback

Auto-adjust cardinality estimates (Compat 160)

Key exam-tested IQP features:

Feature	Problem It Solves	Compat Level
Memory Grant Feedback	Queries request too much/too little memory (spills to disk)	150 (row mode; batch mode from 140)
Adaptive Joins	Optimizer picks wrong join type at compile time	150 for rowstore (batch mode over columnstore from 140)
Batch Mode on Rowstore	Analytics queries slow on OLTP tables	150
Parameter Sensitive Plan	Parameter sniffing → bad plan for some values	160
DOP Feedback	Over/under-parallelism wastes resources	160

🎯 Exam Focus

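IQP requires the right compatibility level (150 = SQL 2019, 160 = SQL 2022). If the exam asks "how to enable adaptive joins" → set compatibility level to 150+. Adaptive joins first shipped at level 140 for batch mode over columnstore; level 150 adds Batch Mode on Rowstore, which extends them to ordinary rowstore tables. No query changes are needed; the optimizer applies the feature automatically.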
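
A short sketch of checking and raising the compatibility level follows. It assumes you are connected to the target database and have tested the workload at the new level first; the commented-out scoped configuration is optional and shown only as one way to switch a single feature off if it regresses a query.

```sql
-- Current level for this database (150 = SQL Server 2019, 160 = SQL Server 2022).
SELECT name, compatibility_level
FROM sys.databases
WHERE name = DB_NAME();

-- Raise the level; IQP features tied to it become eligible automatically.
ALTER DATABASE CURRENT SET COMPATIBILITY_LEVEL = 150;

-- Optional: disable one feature for this database without lowering the level.
-- ALTER DATABASE SCOPED CONFIGURATION SET BATCH_MODE_ADAPTIVE_JOINS = OFF;

-- Review database-scoped settings that affect the optimizer.
SELECT name, value
FROM sys.database_scoped_configurations
ORDER BY name;
```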

Statistics Maintenance

Statistics tell the query optimizer about data distribution. Stale statistics → bad plans → poor performance.

Setting	Behavior	Best For
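AUTO_UPDATE_STATISTICS	Updates once modifications cross a threshold: roughly 20% of rows (plus 500) on small tables, and a lower dynamic threshold of about SQRT(1000 × rows) on large tables at compatibility level 130+	Most workloads (default: ON)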
AUTO_UPDATE_STATISTICS_ASYNC	Updates in background (doesn't block queries)	OLTP (avoids query compilation delays)
Manual UPDATE STATISTICS	On-demand, full scan or sample %	After bulk loads, partition switches

🎯 Exam Focus

After large data loads, run UPDATE STATISTICS manually rather than waiting for auto-update, which fires only after the modification threshold is crossed and a query next uses the statistic. For partitioned tables with incremental statistics, refresh specific partitions with UPDATE STATISTICS ... WITH RESAMPLE ON PARTITIONS(N).
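
As a hedged illustration of these options, the script below checks how stale a table's statistics are and then refreshes them in a few different ways. The table dbo.Orders, the statistic IX_Orders_OrderDate, and partition number 3 are hypothetical names chosen for the example.

```sql
-- How many modifications has each statistic absorbed since its last update?
SELECT
    s.name                  AS stats_name,
    sp.last_updated,
    sp.rows,
    sp.rows_sampled,
    sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID(N'dbo.Orders');

-- After a bulk load: refresh every statistic on the table with a full scan.
UPDATE STATISTICS dbo.Orders WITH FULLSCAN;

-- Cheaper alternative on very large tables: sample a percentage of rows.
UPDATE STATISTICS dbo.Orders WITH SAMPLE 25 PERCENT;

-- After a partition switch: refresh only the affected partition.
-- This works only if the statistic was created WITH INCREMENTAL = ON.
UPDATE STATISTICS dbo.Orders (IX_Orders_OrderDate)
WITH RESAMPLE ON PARTITIONS (3);

-- OLTP: let auto-update run in the background instead of blocking compilation.
ALTER DATABASE CURRENT SET AUTO_UPDATE_STATISTICS_ASYNC ON;
```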

Execution Plans — What to Look For

Red Flag in Plan	What It Means	Fix
Table/Index Scan (on large table)	Missing index or unSARGable WHERE	Create covering index
Key Lookup (high cost)	Index doesn't include needed columns	Add INCLUDE columns
Sort (with warning ⚠️)	Memory spill to tempdb	Add index matching ORDER BY, or increase memory
Parallelism (CXPACKET waits)	Skewed distribution across threads	Update stats, limit MAXDOP
Thick arrows (many rows)	Large row counts flowing between operators; a cardinality problem when actual rows far exceed the estimate	Update stats, rewrite query
Nested Loops (on large sets)	Should be Hash Match	Update stats, hint if needed
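
Two of the fixes in the table can be sketched in T-SQL. The example assumes a hypothetical dbo.Orders table with CustomerId, OrderDate, and TotalDue columns; the index name is likewise illustrative.

```sql
-- Key Lookup fix: cover the query by carrying the extra columns in the index leaf.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
ON dbo.Orders (CustomerId)
INCLUDE (OrderDate, TotalDue);

-- Scan caused by an unSARGable predicate: wrapping the column in a function
-- (for example WHERE YEAR(OrderDate) = 2024) prevents an index seek.
-- SARGable rewrite using a half-open date range on the bare column:
SELECT OrderId, OrderDate, TotalDue
FROM dbo.Orders
WHERE OrderDate >= '20240101'
  AND OrderDate <  '20250101';
```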

✅

Good Plan

Index Seek → Nested Loop (small inner) → SELECT

Fast, minimal I/O, targeted reads

❌

Bad Plan

Table Scan → Hash Match (high memory) → Sort ⚠️ (spill) → SELECT

Slow, excessive I/O, memory spills

🏢 Real-World DBA Note

Oracle DBA parallel: Execution plans in SQL Server are like Oracle explain plans. Key difference: SQL Server shows actual rows vs estimated in actual plans (like Oracle's V$SQL_PLAN_STATISTICS_ALL). Look for big gaps between estimated and actual rows; they usually point to stale stats or parameter sniffing.

Anti-Patterns

"Use SQL Profiler in production." Profiler is heavy (per-event handshake) and deprecated. Extended Events is the supported, low-overhead path. The exam treats this as a wrong answer.
"Capture every event in an XEvent session." XE is lightweight only when targeted. A session capturing sql_statement_completed for all DBs writes GBs/hour and degrades the very thing you're measuring. Filter with predicates (DB, login, duration > N).
"system_health is enough — we don't need a custom session." system_health covers errors, deadlocks, memory issues but not statement-level capture. Add a custom XE session for query tuning workloads.
"Read execution plans from SSMS only." Plans also live in Query Store on Azure SQL DB/MI: query sys.query_store_plan to get the plan history without re-running the statement.
"Estimated plan = actual plan." The estimated plan shows only the optimizer's estimates; the actual plan adds measured row counts and runtime statistics. A big gap between estimated and actual rows points to stale stats or a parameter-sniffing problem.
"Save XE captures to ring buffer in prod." Ring buffer is in-memory and per-instance, so it is lost on restart and capped in size. Persist to the Event File target for production tuning runs.

⚠️ Watch Out

XE Event File on Azure SQL DB writes to Azure Storage — not local disk. Configure a SAS URL pointing to a storage account in the same region as the DB. Cross-region writes silently fail.
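
A database-scoped session with a Blob storage target might look roughly like the sketch below. The angle-bracket values (<account>, <container>, <sas-token>, <strong-password>) are placeholders to replace, the session name is hypothetical, and the SAS token is assumed to grant read, write, and list permissions on the container.

```sql
-- Run inside the user database on Azure SQL DB.

-- A database master key is needed before a scoped credential can be created.
IF NOT EXISTS (SELECT 1 FROM sys.symmetric_keys WHERE symmetric_key_id = 101)
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = N'<strong-password>';
GO

-- The credential name must be the container URL; the secret is the SAS token
-- without a leading question mark.
CREATE DATABASE SCOPED CREDENTIAL [https://<account>.blob.core.windows.net/<container>]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET   = '<sas-token>';
GO

-- Database-scoped session writing .xel blobs to the container.
CREATE EVENT SESSION [slow_statements_db] ON DATABASE
ADD EVENT sqlserver.sql_statement_completed
(
    ACTION (sqlserver.sql_text, sqlserver.session_id)
    WHERE (duration > 5000000)            -- microseconds: 5 seconds
)
ADD TARGET package0.event_file
(
    SET filename = N'https://<account>.blob.core.windows.net/<container>/slow_statements_db.xel'
);
GO

ALTER EVENT SESSION [slow_statements_db] ON DATABASE STATE = START;
```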

Migration Between Monitoring Approaches

From → To	Path	Cost
SQL Profiler → Extended Events	Script the trace definition, then map its events and columns to XE equivalents (sys.trace_xe_event_map, sys.trace_xe_action_map)	Lower overhead; modern tooling
Manual DMV polling → Query Store	Enable QS at DB level	Free, automatic plan + runtime capture
QS only → QS + Database Watcher	Add Watcher; keep QS as source	Watcher gives cross-server view; ADX cost
`system_health` → custom XE session	Create session, target Event File on Azure Storage	Targeted capture; storage cost
Local XE files → Storage + Log Analytics	Pipeline XE → Storage → LA via Logic App	KQL across XE data; LA ingest cost
Plan capture via SSMS only → Plan Cache + QS	DMV-based plan history + QS forced plans	Survives plan eviction; QS forces plans across recompiles
Workload trace via Profiler → Distributed Replay	XE capture + replay on test box	Reproduces concurrency; Replay infra to set up

Most expensive moves: piping XE → Storage → LA (custom pipeline) and standing up Distributed Replay infrastructure.
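
For the Query Store rows in the table above, the move is mostly a matter of switching it on and querying its catalog views. The following is a sketch only: query_id 42 and plan_id 7 are hypothetical values that you would look up in your own database first.

```sql
-- Enable Query Store on the current database (already on by default in Azure SQL DB).
ALTER DATABASE CURRENT SET QUERY_STORE = ON (OPERATION_MODE = READ_WRITE);

-- Plan history for one query: one row per plan with weighted average duration.
SELECT
    p.plan_id,
    p.is_forced_plan,
    SUM(rs.count_executions)                                   AS executions,
    SUM(rs.avg_duration * rs.count_executions)
        / NULLIF(SUM(rs.count_executions), 0) / 1000.0         AS avg_duration_ms,  -- stored in microseconds
    MAX(rs.last_execution_time)                                AS last_execution_time
FROM sys.query_store_plan AS p
JOIN sys.query_store_runtime_stats AS rs
    ON rs.plan_id = p.plan_id
WHERE p.query_id = 42                                          -- hypothetical query_id
GROUP BY p.plan_id, p.is_forced_plan
ORDER BY last_execution_time DESC;

-- Force the plan that performed well before the regression.
EXEC sys.sp_query_store_force_plan @query_id = 42, @plan_id = 7;

-- Undo later, once the root cause is fixed.
-- EXEC sys.sp_query_store_unforce_plan @query_id = 42, @plan_id = 7;
```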

Real Scenarios

Performance regression after a release → Query Store ‘Top Resource Consumers’ + force previous plan. Driver: roll back the plan without code change. Trade-off: forced plan has cost overhead vs natural recompile.
Intermittent deadlocks in prod → system_health deadlock graph + custom XE session for lock_acquired filtered to the affected DB. Driver: cheap detection + targeted deep-dive. Trade-off: detailed XE has measurable overhead — enable for a tuning window only.
ISV asks for repro on customer issue → XE capture on customer DB for 1 hour + Distributed Replay against test box. Driver: real concurrency. Trade-off: Replay infra setup cost.
Estate-wide query baseline → Database Watcher + ADX Workbook trending p95 / p99 per DB. Driver: cross-server view. Trade-off: ADX cost + initial setup.
Dev wants to understand a slow query → SET STATISTICS IO, TIME ON + Actual Execution Plan in SSMS + Query Store wait stats. Driver: free, in-the-moment. Trade-off: dev DB only — don't run on prod ad-hoc.
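
For the last scenario, a typical in-the-moment session in SSMS might look like this. The query and the dbo.Orders table are hypothetical stand-ins for whatever statement is being investigated; enable "Include Actual Execution Plan" in SSMS before running it.

```sql
-- Report logical/physical reads per table and CPU/elapsed time per statement.
SET STATISTICS IO, TIME ON;

SELECT TOP (50) o.OrderId, o.OrderDate
FROM dbo.Orders AS o
WHERE o.CustomerId = 42
ORDER BY o.OrderDate DESC;

SET STATISTICS IO, TIME OFF;
```

The Messages tab then shows reads and timings, which can be compared before and after an index or query change.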

Flashcards

What are Extended Events and why use them?

Lightweight event tracing system replacing SQL Profiler. Uses Events + Predicates (filters) + Actions (extra data) + Targets (storage). Lower overhead than Profiler.

Quiz

You need to isolate reporting queries so they don't consume more than 30% CPU on a SQL Managed Instance. What feature do you use?
