Skip to main content

Advanced Monitoring & Optimization

The Complete Monitoring Stack​

Monitoring β†’ Diagnosis β†’ Action
πŸ”
1. Real-Time Monitoring
DMVs (live sessions, waits)
Extended Events (lightweight tracing)
Database Watcher (near real-time dashboards)
πŸ“Š
2. Historical Analysis
Query Store (plan history, regressions)
SQL Insights (Azure Monitor deep telemetry)
Intelligent Insights (ML anomaly detection)
πŸ”§
3. Performance Optimization
Automatic Tuning (force plans, auto-index)
IQP (adaptive joins, memory feedback)
Resource Governor (workload isolation)
Statistics Maintenance

Extended Events (XEvents)​

The modern, lightweight replacement for SQL Trace/Profiler. Think of it as Oracle's event tracing (10046, 10053) but better.

Architecture​

🎯
Events
What to capture
sql_statement_completed
xml_deadlock_report
error_reported
πŸ”
Predicates
Filters (when to capture)
duration > 5000ms
error_number = 18456
πŸ“Ž
Actions
Extra data to collect
sql_text
plan_handle
session_id
πŸ’Ύ
Targets
Where to store
ring_buffer
event_file
histogram

Key Predefined Sessions​

SessionWhat It CapturesEnabled By Default
system_healthErrors, deadlocks, memory issuesβœ… Yes
AlwaysOn_healthAG state changes, failover eventsβœ… (if AG configured)

Common XEvent Use Cases​

ScenarioEvents to Capture
Slow queriessql_statement_completed (filter: duration > 5s)
Deadlocksxml_deadlock_report (or check system_health)
Login failureserror_reported (error_number = 18456)
Plan changesquery_post_compilation_showplan
Wait analysiswait_info, wait_completed
🎯 Exam Focus

Extended Events replaces SQL Profiler (Profiler is deprecated). XEvents uses less CPU, supports filtering at the source (predicates), and works on Azure SQL DB, MI, and VMs. system_health session is always running β€” check it first for deadlocks and errors.

🏒 Real-World DBA Note

Oracle DBA parallel: Extended Events = Oracle's ALTER SESSION SET EVENTS + DBMS_MONITOR + AWR event tracing, but unified in one framework. The system_health session is like Oracle's incident capture β€” always running, catches critical events automatically.

Create an XEvent session β€” enable order​

XEvent session lifecycle
1
CREATE EVENT SESSION
Name the session
Pick events (e.g. sql_statement_completed)
2
Add predicates (filters)
WHERE duration > 5000000 (microseconds!)
Filter at source = lowest overhead
3
Add actions
Extra columns: sql_text, plan_handle, session_id
Cost is per event β€” add only what you need
4
Add target
ring_buffer (memory) for live tail
event_file (.xel) for long capture
Azure SQL DB β†’ file target on Blob storage
5
Start + query
ALTER EVENT SESSION ... STATE = START
Read with sys.fn_xe_file_target_read_file or DMV sys.dm_xe_session_targets
Common ordering trap

Duration in XEvents is microseconds, not milliseconds. Filtering WHERE duration > 5000 captures everything over 5 ms (way too noisy). Use 5000000 for 5 seconds. Also: on Azure SQL DB the only file target is Azure Blob storage β€” a SAS-secured container, not a local path.

Database Watcher (Preview)​

Near real-time monitoring dashboard that doesn't require manual setup of diagnostic pipelines.

FeatureDetails
Data storeAzure Data Explorer (Kusto)
Query languageKQL
LatencyNear real-time (~seconds)
SupportedAzure SQL DB, MI
DashboardsPre-built Grafana dashboards
Key metricsSessions, waits, query stats, resource usage, blocking
🎯 Exam Focus

Database Watcher is Microsoft's newest monitoring solution β€” it streams data to Azure Data Explorer for KQL analysis. If the exam mentions "near real-time monitoring with KQL dashboards" β†’ Database Watcher.

SQL Insights (Azure Monitor)​

Deeper telemetry than basic Azure Monitor metrics. Uses a monitoring VM to collect DMV data.

πŸ—„οΈ
Azure SQL DB/MI/VM
Source database
DMV queries collected
πŸ“‘
Monitoring VM (Telegraf)
Collects DMV data
Pushes to Log Analytics
πŸ“Š
Log Analytics
Pre-built Workbooks
Top queries, waits, sessions
Azure Monitor Alerts

Resource Governor (VM & MI)​

Isolates workloads by limiting CPU, memory, and I/O per workload group.

Resource Governor Architecture
🏊
Resource Pool: OLTP
CPU: 60%, Memory: 60% β€” Production workload
πŸ“Š
Resource Pool: Reporting
CPU: 30%, Memory: 30% β€” BI queries
βš™οΈ
Resource Pool: Default
CPU: 10%, Memory: 10% β€” Admin tasks

Architecture components:

  1. Resource Pool β€” defines CPU/memory/I/O limits
  2. Workload Group β€” maps to a resource pool with additional settings (request limits)
  3. Classifier Function β€” routes incoming sessions to workload groups based on rules

Resource Governor enable order​

Configure Resource Governor
1
CREATE RESOURCE POOL
MIN/MAX_CPU_PERCENT, memory, IO limits
One pool per workload tier
2
CREATE WORKLOAD GROUP
USING <pool>
Set MAX_DOP, request_max_memory_grant, etc.
3
CREATE classifier function
In master DB, schema-bound
Returns workload group name based on APP_NAME, login, etc.
4
Bind classifier
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fn_classifier)
5
RECONFIGURE
ALTER RESOURCE GOVERNOR RECONFIGURE
Without this, nothing takes effect
Common ordering trap

ALTER RESOURCE GOVERNOR RECONFIGURE is the step everyone forgets. Pools, groups, and classifier all exist but no traffic is being classified until you reconfigure. Verify with sys.dm_resource_governor_workload_groups after running a workload.

🎯 Exam Focus

Resource Governor is available on SQL Server on VM and MI (not Azure SQL DB). If the exam says "prevent reporting queries from consuming all CPU" β†’ Resource Governor. For Azure SQL DB, use read replicas (BC tier) or Named Replicas (Hyperscale) to isolate workloads instead.

🏒 Real-World DBA Note

Oracle DBA parallel: Resource Governor = Oracle Resource Manager. Resource Pools = Consumer Groups. Classifier Function = Resource Manager plan rules. Same concept: prevent one workload from starving another.

Intelligent Query Processing (IQP)​

A family of automatic query optimization features. No code changes needed β€” just upgrade compatibility level.

IQP Features (Compatibility Level 150+)
🧠
Memory Grant Feedback
Auto-adjust memory grants over multiple executions
πŸ”€
Adaptive Joins
Hash or Nested Loop decided at runtime based on actual rows
⚑
Batch Mode on Rowstore
Analytics speed on OLTP tables without columnstore
πŸ“
Approximate Count
APPROX_COUNT_DISTINCT β€” fast, ~2% error
πŸ“‹
Table Variable Deferred
Accurate cardinality for table variables
πŸ“Œ
Optimized Plan Forcing
Faster forced plan compilation via Query Store
🎯
Parameter Sensitive Plan
Multiple plans for parameter sniffing (Compat 160)
πŸ”„
DOP Feedback
Auto-adjust parallelism degree (Compat 160)
πŸ“ˆ
CE Feedback
Auto-adjust cardinality estimates (Compat 160)

Key exam-tested IQP features:

FeatureProblem It SolvesCompat Level
Memory Grant FeedbackQueries request too much/too little memory (spills to disk)150
Adaptive JoinsOptimizer picks wrong join type at compile time150
Batch Mode on RowstoreAnalytics queries slow on OLTP tables150
Parameter Sensitive PlanParameter sniffing β†’ bad plan for some values160
DOP FeedbackOver/under-parallelism wastes resources160
🎯 Exam Focus

IQP requires the right compatibility level (150 = SQL 2019, 160 = SQL 2022). If the exam asks "how to enable adaptive joins" β†’ set compatibility level to 150+. No query changes needed β€” it's automatic.

Statistics Maintenance​

Statistics tell the query optimizer about data distribution. Stale statistics β†’ bad plans β†’ poor performance.

SettingBehaviorBest For
AUTO_UPDATE_STATISTICSUpdates when ~20% of rows changeMost workloads (default: ON)
AUTO_UPDATE_STATISTICS_ASYNCUpdates in background (doesn't block queries)OLTP (avoids query compilation delays)
Manual UPDATE STATISTICSOn-demand, full scan or sample %After bulk loads, partition switches
🎯 Exam Focus

After large data loads: always run UPDATE STATISTICS manually. Auto-update won't trigger until the 20% threshold. For partitioned tables, update stats on specific partitions with UPDATE STATISTICS ... WITH RESAMPLE ON PARTITIONS(N).

Execution Plans β€” What to Look For​

Red Flag in PlanWhat It MeansFix
Table/Index Scan (on large table)Missing index or unSARGable WHERECreate covering index
Key Lookup (high cost)Index doesn't include needed columnsAdd INCLUDE columns
Sort (with warning ⚠️)Memory spill to tempdbAdd index matching ORDER BY, or increase memory
Parallelism (CXPACKET waits)Skewed distribution across threadsUpdate stats, limit MAXDOP
Thick arrows (many rows)Cardinality estimate wrongUpdate stats, rewrite query
Nested Loops (on large sets)Should be Hash MatchUpdate stats, hint if needed
βœ…
Good Plan
Index Seek β†’ Nested Loop (small inner) β†’ SELECT
Fast, minimal I/O, targeted reads
❌
Bad Plan
Table Scan β†’ Hash Match (high memory) β†’ Sort ⚠️ (spill) β†’ SELECT
Slow, excessive I/O, memory spills
🏒 Real-World DBA Note

Oracle DBA parallel: Execution plans in SQL Server are like Oracle explain plans. Key difference: SQL Server shows actual rows vs estimated in actual plans (like Oracle's V$SQL_PLAN_STATISTICS_ALL). Look for big gaps between estimated and actual β€” that means stale stats.


Anti-Patterns​

  • "Use SQL Profiler in production." Profiler is heavy (per-event handshake) and deprecated. Extended Events is the supported, low-overhead path. The exam treats this as a wrong answer.
  • "Capture every event in an XEvent session." XE is lightweight only when targeted. A session capturing sql_statement_completed for all DBs writes GBs/hour and degrades the very thing you're measuring. Filter with predicates (DB, login, duration > N).
  • "system_health is enough β€” we don't need a custom session." system_health covers errors, deadlocks, memory issues but not statement-level capture. Add a custom XE session for query tuning workloads.
  • "Read execution plans from SSMS only." Plans live in Query Store for Azure SQL DB/MI β€” query sys.query_store_plan to get the plan history without re-running.
  • "Estimated plan = actual plan." They differ when stats are stale. The Estimated plan uses optimizer guesses; Actual fills in measured rows. Big gap = stats / parameter sniffing problem.
  • "Save XE captures to ring buffer in prod." Ring buffer is in-memory and per-instance β€” lost on restart, capped in size. Persist to Event File target for production tuning runs.
⚠️ Watch Out

XE Event File on Azure SQL DB writes to Azure Storage β€” not local disk. Configure a SAS URL pointing to a storage account in the same region as the DB. Cross-region writes silently fail.


Migration Between Monitoring Approaches​

From β†’ ToPathCost
SQL Profiler β†’ Extended EventsConvert via Profiler "Export" β†’ "XE Session"Lower overhead; modern tooling
Manual DMV polling β†’ Query StoreEnable QS at DB levelFree, automatic plan + runtime capture
QS only β†’ QS + Database WatcherAdd Watcher; keep QS as sourceWatcher gives cross-server view; ADX cost
system_health β†’ custom XE sessionCreate session, target Event File on Azure StorageTargeted capture; storage cost
Local XE files β†’ Storage + Log AnalyticsPipeline XE β†’ Storage β†’ LA via Logic AppKQL across XE data; LA ingest cost
Plan capture via SSMS only β†’ Plan Cache + QSDMV-based plan history + QS forced plansSurvives plan eviction; QS forces plans across recompiles
Workload trace via Profiler β†’ Distributed ReplayXE capture + replay on test boxReproduces concurrency; Replay infra to set up

Most expensive moves: piping XE β†’ Storage β†’ LA (custom pipeline) and standing up Distributed Replay infrastructure.


Real Scenarios​

  1. Performance regression after a release β†’ Query Store β€˜Top Resource Consumers’ + force previous plan. Driver: roll back the plan without code change. Trade-off: forced plan has cost overhead vs natural recompile.
  2. Intermittent deadlocks in prod β†’ system_health deadlock graph + custom XE session for lock_acquired filtered to the affected DB. Driver: cheap detection + targeted deep-dive. Trade-off: detailed XE has measurable overhead β€” enable for a tuning window only.
  3. ISV asks for repro on customer issue β†’ XE capture on customer DB for 1 hour + Distributed Replay against test box. Driver: real concurrency. Trade-off: Replay infra setup cost.
  4. Estate-wide query baseline β†’ Database Watcher + ADX Workbook trending p95 / p99 per DB. Driver: cross-server view. Trade-off: ADX cost + initial setup.
  5. Dev wants to understand a slow query β†’ SET STATISTICS IO, TIME ON + Actual Execution Plan in SSMS + Query Store wait stats. Driver: free, in-the-moment. Trade-off: dev DB only β€” don't run on prod ad-hoc.

Flashcards​

What are Extended Events and why use them?
Click to reveal answer
Lightweight event tracing system replacing SQL Profiler. Uses Events + Predicates (filters) + Actions (extra data) + Targets (storage). Lower overhead than Profiler.
1 / 8

Quiz​

Q1/4
0 correct
You need to isolate reporting queries so they don't consume more than 30% CPU on a SQL Managed Instance. What feature do you use?