SQL Server expert troubleshooting guide

This page is a field guide for engineers who troubleshoot SQL Server under pressure. It goes below the surface: execution plans, optimizer behavior, SQLOS, DMVs, blocking chains, TempDB allocation, and the scripts you can run in a lab before you need them in production.

Use the scripts in a non-production database such as SqlTroubleshootingLab. Several examples intentionally create slow queries, blocking, or TempDB pressure.

🧠 Architecture map

This guide follows the same path a request follows inside the engine: client call, compilation, execution, waits, storage, TempDB, and diagnostic evidence. Use the diagram as a triage map during incidents rather than treating each topic as isolated theory.

Under Sev-A pressure, keep the mental flow simple:

Client -> compile: confirm whether the workload is sending stable, parameterized calls or ad-hoc text with mismatched data types.
Compile -> execute: check whether the selected plan matches the data shape, parameter value, statistics, and available indexes.
Execute -> waits: classify whether the request is burning CPU or waiting on locks, I/O, memory grants, workers, TempDB, or the client.
Waits -> resources: map the wait to the resource before changing anything. LCK_M_* means concurrency, PAGEIOLATCH_* means physical reads, RESOURCE_SEMAPHORE means memory grant pressure, and PAGELATCH_* on TempDB allocation pages means latch contention.
Resources -> diagnostics: capture DMV and Query Store evidence before cache clears, restarts, kills, or configuration changes remove the trail.

Component	When things go wrong	Check first	Common root causes	Quick remediation
Parser / algebrizer	Compile errors, sudden failures after deployment, permissions errors	Error text, deployed SQL, object names, data types	Renamed objects, invalid schema binding, implicit conversion from client parameters	Fix code or parameter type; avoid emergency server changes
Optimizer	High CPU, scans, wrong join type, unstable duration	Actual plan, Query Store, statistics date, estimated vs actual rows	Parameter sniffing, stale stats, skewed data, non-SARGable predicates	Force last good plan if needed; update stats or rewrite predicate after evidence
Execution plan	Memory spills, excessive DOP, repeated lookups	Plan warnings, memory grant, DOP, key lookup count	Bad cardinality, missing covering index, row goal, implicit conversion	Add targeted index, fix estimate, tune query-level DOP or recompile strategy
SQLOS	High signal waits, runnable queue, THREADPOOL	`sys.dm_os_schedulers`, `sys.dm_os_wait_stats`, active workers	CPU saturation, worker starvation, connection storm, blocking fan-out	Stop runaway work, reduce concurrency, fix pooling, scale only as mitigation
Buffer pool / storage	`PAGEIOLATCH_*`, high physical reads, file latency	`sys.dm_io_virtual_file_stats`, logical reads, plan access methods	Scans, cold cache, undersized memory, slow data files	Reduce reads first; then tune storage or memory
Lock manager	Timeouts, blocked sessions, deadlocks	`sys.dm_exec_requests`, `sys.dm_tran_locks`, deadlock graph	Long transactions, missing indexes, lock escalation, isolation mismatch	Find lead blocker; shorten transactions; batch writes; consider RCSI after testing
TempDB	Spills, TempDB full, `PAGELATCH_*`, version store growth	`sys.dm_db_file_space_usage`, session/task space usage, snapshot transactions	Sort/hash spills, long RCSI readers, too few or uneven files	Tune spilling query; end long reader; pre-size equal files; fix allocation contention
DMVs / Query Store / XEvents	Missing history, unclear root cause, competing narratives	Live DMVs for now, Query Store for history, XEvents for deadlocks	Evidence lost after restart, plan eviction, incomplete snapshot	Capture first, then act; persist snapshots for intermittent incidents

⚠️ Watch Out

During a live incident, do not start with DBCC FREEPROCCACHE, SQL Server restart, blanket index creation, or NOLOCK changes. Those actions can remove evidence, cause wider regressions, or trade correctness for speed. Capture evidence first, restore service second, and perform root-cause tuning after the blast radius is contained.

Incident triage path

Start with the symptom

Capture the production symptom

Record the impacted app, database, and time window

Decide current vs historical

Current: use active DMVs, waits, blockers

Historical: use Query Store and wait trends

Classify the dominant signal

CPU, I/O, locks, or memory

Avoid tuning before the bottleneck is named

Follow the matching evidence

Plans and estimates for CPU

Reads and latency for I/O

Blocking chains for locks

Grants and spills for memory

Fix root cause, then validate

Apply the smallest safe mitigation

Recheck waits, duration, reads, and plan shape

🔀 Oracle → Azure

Oracle parallel: Treat SQL Server Query Store plus DMVs as the closest day-to-day equivalent of AWR, ASH, V$SQL, and V$SESSION_WAIT. The mapping works for workload history and live diagnosis, but SQL Server exposes different wait names, plan operators, and locking semantics.

🧠 Execution plans

Use this section to read plans as troubleshooting evidence: estimates, operators, warnings, memory grants, and plan regressions.

1. Concept (simple explanation)

An execution plan is SQL Server's recipe for running a query. It shows which tables SQL Server reads, which indexes it uses, how rows are joined, where data is sorted, and where memory is needed.

Read plans as evidence, not artwork. The important question is not "does the plan look complex?" The important question is: "Where did SQL Server estimate wrong, read too much, spill to disk, or choose an operator that does not match the data?"

Common operators:

Operator	Simple meaning	Performance signal
Index seek	Navigate directly to a range in a B-tree	Usually good, but can still read many rows
Index/table scan	Read a full structure	Fine for large ranges; bad for selective predicates
Key lookup	Fetch missing columns from clustered index	Expensive when repeated many times
Hash match	Build/probe hash table for joins or aggregates	Good for large sets; can spill if memory grant is low
Nested loops	For each outer row, search inner input	Good for small outer input; terrible when estimates are low
Sort	Order rows	Needs memory; spills to TempDB if grant is too small

2. How it works internally

SQL Server turns a query into a plan through four main stages:

Compilation to execution flow

Parse SQL text

Check syntax

Tokenize T-SQL

Build the initial parse tree

Bind names and types

Resolve objects and columns

Check permissions

Validate data types

Estimate cardinality

Read statistics histograms

Estimate row counts

Model data distribution

Search plan alternatives

Compare join orders

Choose access methods

Evaluate serial vs parallel

Pick and cache the plan

Select lowest estimated cost

Create memory grant

Reuse from cache when safe

Execute with workers

Run iterator operators

Request rows from child operators

Track waits, CPU, reads, and spills

Internally:

The parser checks syntax and builds a tree.
The algebrizer resolves tables, columns, data types, ownership chains, and permissions.
The optimizer uses statistics histograms, density vectors, constraints, indexes, and row goals to estimate row counts.
The optimizer evaluates physical choices: join order, join type, access method, serial vs parallel plan, and memory grants.
The execution engine runs the selected plan using the iterator model. Each operator asks its child operator for rows.
The plan may be cached so the next execution avoids compile CPU.

The optimizer is cost-based, but the cost is only as good as the estimates. Bad statistics, implicit conversions, parameter sniffing, table variables, skewed data, and non-SARGable predicates can all push the optimizer toward the wrong plan.

When things go wrong, the usual chain is: bad cardinality -> bad join/access choice -> wrong memory grant or DOP -> high CPU, spills, or excessive reads. For example, a predicate estimated at 10 rows but returning 10 million rows can choose nested loops plus key lookups, burn CPU, and overload storage even though every individual operator looks valid.

Prioritize plan investigation this way:

First: compare estimated rows vs actual rows at the first large divergence.
Next: inspect warnings for spills, implicit conversions, missing join predicates, and excessive grants.
Then: decide whether the problem is compile-time metadata, runtime parameter sensitivity, or missing physical access path.
Later: consider hints. Hints can stabilize service, but they also freeze assumptions.

3. Real-world problem scenario

Contoso has an order lookup procedure that usually returns one customer. After a weekend data load, the API latency jumps from 80 ms to 8 seconds. CPU rises, reads spike, and the plan shows an index scan with a hash join. The root cause is an implicit conversion: the application sends nvarchar while the indexed column is varchar, so SQL Server converts the column and cannot seek efficiently.

4. How to diagnose

Capture the actual execution plan, not only the estimated plan.
Compare estimated vs actual rows on each operator.
Look for warnings: spills, missing indexes, implicit conversions, residual predicates, and excessive memory grants.
Check whether the plan is new by using Query Store.
Check whether the query is parameter-sensitive: fast for one parameter, slow for another.

Use Query Store early when the question is "what changed?". It tells you whether the query regressed because the plan changed, the same plan slowed down, or the runtime volume changed. Use live DMVs when the question is "what is hurting us right now?".

Do not assume a scan is bad or a seek is good. A seek with millions of key lookups can be worse than a controlled scan. The incident signal is total work: CPU, reads, elapsed time, waits, and memory spills.

Diagnostic query for cached plans with implicit conversions:

🔍 Diagnostic query

SELECT TOP (25)
    DB_NAME(st.dbid) AS database_name,
    qs.execution_count,
    qs.total_worker_time / 1000 AS total_cpu_ms,
    qs.total_logical_reads,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
            WHEN -1 THEN DATALENGTH(st.text)
            ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1) AS statement_text,
    qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
WHERE CAST(qp.query_plan AS nvarchar(max)) LIKE N'%CONVERT_IMPLICIT%'
ORDER BY qs.total_worker_time DESC;

Query Store plan regression check:

🔍 Diagnostic query

SELECT TOP (20)
    qsq.query_id,
    qsp.plan_id,
    rs.count_executions,
    rs.avg_duration / 1000.0 AS avg_duration_ms,
    rs.avg_cpu_time / 1000.0 AS avg_cpu_ms,
    rs.avg_logical_io_reads,
    qt.query_sql_text
FROM sys.query_store_runtime_stats AS rs
JOIN sys.query_store_plan AS qsp
    ON rs.plan_id = qsp.plan_id
JOIN sys.query_store_query AS qsq
    ON qsp.query_id = qsq.query_id
JOIN sys.query_store_query_text AS qt
    ON qsq.query_text_id = qt.query_text_id
ORDER BY rs.avg_duration DESC;

5. Fix / optimisation

If estimates are wrong, update statistics or improve predicate shape.
If a key lookup runs thousands of times, add included columns or rewrite the query.
If an implicit conversion blocks seeks, align parameter and column data types.
If parameter sniffing causes plan instability, use Query Store plan forcing, OPTION (RECOMPILE), OPTIMIZE FOR, or SQL Server 2022 Parameter Sensitive Plan optimization.
If a sort or hash spills, reduce rows, add an index that provides order, or investigate memory grants.

The fix works because it gives the optimizer better metadata or a cheaper physical access path.

6. Example scripts

Reproduce a non-SARGable predicate:

🛠 Setup script

USE tempdb;
GO

DROP TABLE IF EXISTS dbo.PlanLabOrders;
GO

CREATE TABLE dbo.PlanLabOrders
(
    OrderId int IDENTITY(1,1) NOT NULL CONSTRAINT PK_PlanLabOrders PRIMARY KEY,
    CustomerCode varchar(20) NOT NULL,
    OrderDate date NOT NULL,
    Amount money NOT NULL
);

CREATE INDEX IX_PlanLabOrders_OrderDate ON dbo.PlanLabOrders(OrderDate);

WITH n AS
(
    SELECT TOP (200000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
    FROM sys.all_objects AS a
    CROSS JOIN sys.all_objects AS b
)
INSERT dbo.PlanLabOrders(CustomerCode, OrderDate, Amount)
SELECT
    CONCAT('C', RIGHT(CONCAT('00000', rn % 5000), 5)),
    DATEADD(day, -(rn % 1500), CONVERT(date, SYSDATETIME())),
    (rn % 1000) + 1
FROM n;
GO

SET STATISTICS IO, TIME ON;

-- Bad: function on indexed column prevents a direct range seek.
SELECT COUNT(*) AS orders_this_year
FROM dbo.PlanLabOrders
WHERE YEAR(OrderDate) = YEAR(SYSDATETIME());

-- Better: range predicate is SARGable.
DECLARE @start date = DATEFROMPARTS(YEAR(SYSDATETIME()), 1, 1);
DECLARE @end date = DATEADD(year, 1, @start);

SELECT COUNT(*) AS orders_this_year
FROM dbo.PlanLabOrders
WHERE OrderDate >= @start
  AND OrderDate < @end;

Diagnose row estimate error in the actual plan:

🔍 Diagnostic query

SET STATISTICS XML ON;
GO
SELECT COUNT(*) AS orders_this_year
FROM dbo.PlanLabOrders
WHERE YEAR(OrderDate) = YEAR(SYSDATETIME());
GO
SET STATISTICS XML OFF;

Fix with a computed column when code cannot change immediately:

🚀 Optimization query

ALTER TABLE dbo.PlanLabOrders
ADD OrderYear AS YEAR(OrderDate) PERSISTED;
GO

CREATE INDEX IX_PlanLabOrders_OrderYear ON dbo.PlanLabOrders(OrderYear);
GO

SELECT COUNT(*) AS orders_this_year
FROM dbo.PlanLabOrders
WHERE OrderYear = YEAR(SYSDATETIME());

7. Key insights (expert notes)

Estimated cost is not elapsed time. It is an optimizer model.
A seek is not always good. A seek that reads 80 percent of a table can be worse than a scan.
The most useful plan comparison is estimated rows vs actual rows.
Missing index hints are per-query suggestions. Consolidate them before creating indexes.
Warnings on operators often explain the incident faster than the visible operator names.

🧠 Performance tuning

Use this section to move from symptom to bottleneck before choosing an index, query rewrite, isolation change, or temporary mitigation.

1. Concept (simple explanation)

Performance tuning is the discipline of reducing wasted work. SQL Server runs faster when it reads fewer pages, compiles less often, waits less, uses memory correctly, and avoids unnecessary contention.

Tuning is not only indexing. It includes query shape, schema design, statistics, isolation levels, TempDB, plan stability, memory grants, parallelism, and application call patterns.

2. How it works internally

Most performance incidents land in one of four buckets:

Performance bottleneck map

CPU

CPU pressure

Bad plans and excessive parallelism

High signal waits or runnable tasks

Check operators, estimates, DOP, and Query Store

I/O

I/O pressure

Scans, missing indexes, or cold cache

High logical reads or file latency

Check access methods and file stats

LCK

Concurrency waits

Long transactions or isolation mismatch

Blocking chains, deadlocks, or lock escalation

Find the head blocker first

MEM

Memory pressure

Sort/hash spills or plan cache pressure

Memory grants, TempDB usage, RESOURCE_SEMAPHORE

Fix estimates before adding capacity

Inside SQL Server, a request uses a worker on a scheduler. When it cannot continue, it records a wait. Good tuning starts by identifying where time is going:

CPU time means the worker is executing.
Wait time means the worker is stalled on something.
Logical reads show buffer pool work.
Physical reads show storage access.
Writes and log flushes show transaction log pressure.
Spills show TempDB used because memory was insufficient.

Production tuning is triage before surgery. During a Sev-A, separate restore service actions from permanent fix actions:

Restore service: stop or throttle the runaway query, force a known-good plan, reduce concurrency, add temporary capacity, or kill a confirmed blocker after rollback risk is understood.
Permanent fix: correct cardinality, index design, query shape, transaction scope, application call pattern, or isolation model.

Common failure patterns:

Parameter sniffing: one cached plan is good for small customers and disastrous for large customers.
Memory grant pressure: large sorts/hashes wait on RESOURCE_SEMAPHORE or spill because estimates are wrong.
Parallelism skew: high DOP uses many workers but one branch does most of the work; watch CXPACKET, CXCONSUMER, and actual row distribution.
Lock escalation: many row/key locks become a table lock, expanding the blocking impact.
TempDB contention: the real query issue is a spill, but the visible symptom is TempDB I/O or latch pressure.

3. Real-world problem scenario

Fabrikam runs a reporting query every 15 minutes. It used to finish in 30 seconds. It now runs for 12 minutes and blocks OLTP writes. The query joins five tables, scans a 90 GB order table, spills a hash aggregate to TempDB, and runs with degree of parallelism 8. The issue is a combination of stale statistics, a missing covering index, and the report running under read committed against hot transactional rows.

4. How to diagnose

Confirm whether the problem is current or historical.
For current issues, check active requests, waits, blockers, CPU, reads, writes, and grants.
For historical issues, use Query Store and wait stats.
Compare before/after plans if the issue is a regression.
Validate whether scaling masks the problem or fixes the real bottleneck.

Check in this order when the bridge is crowded:

Is SQL Server actually the bottleneck, or is the app waiting on network/client processing?
Which sessions are consuming CPU, reads, writes, memory grants, or blocking others right now?
Is there one dominant query/session, or is the whole workload degraded?
Did the plan or runtime profile change compared with the last known-good window?
What is the safest mitigation that restores service without destroying evidence?

Current expensive requests:

🔍 Diagnostic query

SELECT
    r.session_id,
    r.status,
    r.command,
    r.cpu_time AS cpu_ms,
    r.total_elapsed_time AS elapsed_ms,
    r.logical_reads,
    r.reads,
    r.writes,
    r.wait_type,
    r.wait_time AS wait_ms,
    r.blocking_session_id,
    DB_NAME(r.database_id) AS database_name,
    SUBSTRING(t.text, (r.statement_start_offset / 2) + 1,
        ((CASE r.statement_end_offset
            WHEN -1 THEN DATALENGTH(t.text)
            ELSE r.statement_end_offset
          END - r.statement_start_offset) / 2) + 1) AS running_statement
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID
ORDER BY r.cpu_time DESC;

🔍 Diagnostic query

SELECT TOP (25)
    qs.execution_count,
    qs.total_logical_reads,
    qs.total_logical_reads / NULLIF(qs.execution_count, 0) AS avg_logical_reads,
    qs.total_worker_time / 1000 AS total_cpu_ms,
    qs.total_elapsed_time / 1000 AS total_duration_ms,
    DB_NAME(st.dbid) AS database_name,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
            WHEN -1 THEN DATALENGTH(st.text)
            ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_logical_reads DESC;

Wait profile since last restart:

🔍 Diagnostic query

SELECT TOP (30)
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms,
    wait_time_ms - signal_wait_time_ms AS resource_wait_ms,
    100.0 * wait_time_ms / SUM(wait_time_ms) OVER () AS pct_of_waits
FROM sys.dm_os_wait_stats
WHERE wait_type NOT LIKE 'SLEEP%'
  AND wait_type NOT IN
  (
    'BROKER_TASK_STOP', 'BROKER_TO_FLUSH', 'BROKER_EVENTHANDLER',
    'SQLTRACE_BUFFER_FLUSH', 'CLR_AUTO_EVENT', 'CLR_MANUAL_EVENT',
    'LAZYWRITER_SLEEP', 'CHECKPOINT_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH',
    'XE_TIMER_EVENT', 'XE_DISPATCHER_WAIT', 'LOGMGR_QUEUE'
  )
ORDER BY wait_time_ms DESC;

5. Fix / optimisation

High logical reads: add or redesign indexes, rewrite predicates, reduce selected columns.
High CPU: fix bad plans, scalar UDFs, excessive parallelism, implicit conversions, and row-by-row logic.
I/O waits: reduce reads first, then check storage latency and file layout.
Blocking: shorten transactions, add useful indexes, change isolation to row versioning when appropriate.
Spills: improve cardinality, indexes, memory grant feedback, or query shape.
Plan regression: use Query Store to force a known-good plan while fixing root cause.

Do not tune from a single metric. High CPU with low reads points differently than high CPU with massive reads. WRITELOG points to commit/log throughput. ASYNC_NETWORK_IO often means the database is waiting for the client to consume rows, not that SQL Server needs more CPU.

6. Example scripts

Reproduce a key lookup problem:

🛠 Setup script

USE tempdb;
GO

DROP TABLE IF EXISTS dbo.TuningLabCustomerOrders;
GO

CREATE TABLE dbo.TuningLabCustomerOrders
(
    OrderId int IDENTITY(1,1) NOT NULL CONSTRAINT PK_TuningLabCustomerOrders PRIMARY KEY,
    CustomerId int NOT NULL,
    OrderDate datetime2(0) NOT NULL,
    StatusCode char(1) NOT NULL,
    Amount money NOT NULL,
    Notes varchar(200) NULL
);

CREATE INDEX IX_TuningLabCustomerOrders_CustomerId
ON dbo.TuningLabCustomerOrders(CustomerId);

WITH n AS
(
    SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
    FROM sys.all_objects AS a
    CROSS JOIN sys.all_objects AS b
)
INSERT dbo.TuningLabCustomerOrders(CustomerId, OrderDate, StatusCode, Amount, Notes)
SELECT rn % 1000,
       DATEADD(minute, -rn, SYSDATETIME()),
       CASE WHEN rn % 10 = 0 THEN 'C' ELSE 'O' END,
       rn % 2500,
       REPLICATE('x', 50)
FROM n;
GO

SET STATISTICS IO, TIME ON;

SELECT OrderId, CustomerId, OrderDate, StatusCode, Amount
FROM dbo.TuningLabCustomerOrders
WHERE CustomerId = 42;

Diagnose missing coverage:

🔍 Diagnostic query

SELECT
    migs.user_seeks,
    migs.avg_total_user_cost,
    migs.avg_user_impact,
    mid.statement AS table_name,
    mid.equality_columns,
    mid.inequality_columns,
    mid.included_columns
FROM sys.dm_db_missing_index_group_stats AS migs
JOIN sys.dm_db_missing_index_groups AS mig
    ON migs.group_handle = mig.index_group_handle
JOIN sys.dm_db_missing_index_details AS mid
    ON mig.index_handle = mid.index_handle
WHERE mid.database_id = DB_ID()
ORDER BY migs.avg_user_impact DESC;

Fix with a covering index:

🚀 Optimization query

CREATE INDEX IX_TuningLabCustomerOrders_CustomerId_Covering
ON dbo.TuningLabCustomerOrders(CustomerId)
INCLUDE (OrderDate, StatusCode, Amount);
GO

SELECT OrderId, CustomerId, OrderDate, StatusCode, Amount
FROM dbo.TuningLabCustomerOrders
WHERE CustomerId = 42;

7. Key insights (expert notes)

Tune by bottleneck, not by habit.
SELECT * is a performance design decision. It widens indexes, memory grants, network usage, and lookups.
Query Store is often safer than clearing plan cache during an incident.
Avoid changing five things at once. You will not know which fix worked.
Scaling up can be a valid mitigation, but it should not be the only diagnosis.

🧠 SQL Server internals

Use this section to connect waits and resource symptoms back to SQLOS, memory, storage, logging, locking, and TempDB internals.

1. Concept (simple explanation)

SQL Server internals explain what the engine is doing when your query runs: scheduling workers, managing memory, reading 8 KB pages, writing log records, taking locks, creating row versions, and using TempDB for temporary work.

When you understand internals, wait stats become meaningful. PAGEIOLATCH_SH is not just a code. It means a worker is waiting for a data page to be read from storage into the buffer pool.

2. How it works internally

Core pieces:

Component	What it does	What you monitor
SQLOS	Scheduling, workers, memory, waits	`sys.dm_os_schedulers`, wait stats
Buffer pool	Caches 8 KB data pages	PLE, physical reads, buffer descriptors
Plan cache	Stores compiled plans	cache size, single-use plans, compile CPU
Transaction log	Guarantees durability with write-ahead logging	log flush waits, VLFs, log reuse waits
Lock manager	Coordinates concurrent access	locks, blocking chains, deadlocks
TempDB	Shared workspace for temp objects and spills	file contention, version store, internal objects

Incident lens for these components:

SQLOS matters because every running request needs workers and schedulers. If runnable tasks pile up, adding indexes will not immediately clear CPU starvation.
Buffer pool matters because logical reads are memory work and physical reads are storage work. A plan that reads 50 million pages can saturate either path.
Plan cache matters because the wrong reused plan can make a good procedure fail intermittently.
Transaction log matters because commits wait for log hardening. Slow log writes become application latency.
Lock manager matters because one long transaction can make a healthy server look unavailable.
TempDB matters because spills, versions, and temporary objects share a finite workspace across the whole instance.

Data access path:

Page read and log-write path

Worker requests a page

The query operator asks for an 8 KB data page

The worker checks the buffer pool first

Serve a logical read

If the page is cached, SQL Server returns it from memory

This is still counted as a logical read

Issue a physical read

If the page is missing, SQL Server reads it from the data file

The visible wait is often PAGEIOLATCH_*

Cache and return the page

The page enters the buffer pool

Future requests can avoid physical I/O

Harden log records first

For changes, the log record is written before the dirty page flushes

Slow log writes show up as WRITELOG latency

Write-ahead logging means the log record must be hardened before the changed data page is written to the data file. That is why slow log storage can hurt write-heavy systems even when data files are fast.

3. Real-world problem scenario

Tailwind Traders sees periodic stalls during checkout. CPU is low, but application threads wait. SQL Server wait stats show WRITELOG, and the transaction log file has thousands of small virtual log files after years of tiny autogrowth increments. The actual issue is log throughput and poor log file growth history, not missing indexes.

4. How to diagnose

Check wait stats first to identify the broad subsystem.
Check scheduler pressure for runnable queue buildup.
Check memory clerks for unusual memory consumers.
Check database file latency.
Check log reuse waits and VLF count for transaction log health.

When the system is unstable, take a quick baseline snapshot before changing anything: active requests, wait stats delta, file latency, scheduler runnable queue, memory grants, blockers, and top Query Store regressions. A five-minute evidence window is often enough to prevent a wrong fix.

Scheduler health:

🔍 Diagnostic query

SELECT
    scheduler_id,
    cpu_id,
    status,
    is_online,
    current_tasks_count,
    runnable_tasks_count,
    current_workers_count,
    active_workers_count,
    work_queue_count
FROM sys.dm_os_schedulers
WHERE scheduler_id < 255
ORDER BY scheduler_id;

File latency:

🔍 Diagnostic query

SELECT
    DB_NAME(vfs.database_id) AS database_name,
    mf.type_desc,
    mf.physical_name,
    vfs.num_of_reads,
    vfs.io_stall_read_ms,
    vfs.io_stall_read_ms / NULLIF(vfs.num_of_reads, 0) AS avg_read_ms,
    vfs.num_of_writes,
    vfs.io_stall_write_ms,
    vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON vfs.database_id = mf.database_id
   AND vfs.file_id = mf.file_id
ORDER BY avg_write_ms DESC;

Memory clerk overview:

🔍 Diagnostic query

SELECT TOP (25)
    type,
    pages_kb / 1024 AS pages_mb,
    virtual_memory_committed_kb / 1024 AS virtual_committed_mb
FROM sys.dm_os_memory_clerks
ORDER BY pages_kb DESC;

5. Fix / optimisation

For WRITELOG, reduce transaction size, improve log storage latency, avoid row-by-row commits, and pre-size log files.
For SOS_SCHEDULER_YIELD, find CPU-heavy queries and excessive parallelism.
For PAGEIOLATCH_*, reduce reads first; storage tuning comes after query/index tuning.
For plan cache bloat, parameterize safely and avoid ad-hoc literal storms.
For memory pressure, set max server memory correctly and find large grants or clerks.

Avoid broad instance-level changes unless the evidence points there. Changing max server memory, MAXDOP, cost threshold, or clearing cache during a live incident can affect every workload. Prefer query-level mitigation when one query or procedure is the cause.

6. Example scripts

Reproduce plan cache bloat:

🛠 Setup script

USE tempdb;
GO

DROP TABLE IF EXISTS dbo.CacheLab;
CREATE TABLE dbo.CacheLab (Id int NOT NULL PRIMARY KEY, ValueText varchar(50) NOT NULL);

INSERT dbo.CacheLab(Id, ValueText)
SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), 'cache-test'
FROM sys.all_objects;
GO

DECLARE @i int = 1;
DECLARE @sql nvarchar(max);

WHILE @i <= 200
BEGIN
    SET @sql = N'SELECT ValueText FROM dbo.CacheLab WHERE Id = ' + CONVERT(nvarchar(20), @i) + N';';
    EXEC sys.sp_executesql @sql;
    SET @i += 1;
END;

Diagnose single-use ad-hoc plans:

🔍 Diagnostic query

SELECT TOP (25)
    cp.objtype,
    cp.usecounts,
    cp.size_in_bytes / 1024 AS size_kb,
    st.text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE cp.objtype = 'Adhoc'
  AND cp.usecounts = 1
ORDER BY cp.size_in_bytes DESC;

Fix with parameterization in application code or sp_executesql:

🚀 Optimization query

DECLARE @id int = 42;

EXEC sys.sp_executesql
    N'SELECT ValueText FROM dbo.CacheLab WHERE Id = @Id;',
    N'@Id int',
    @Id = @id;

7. Key insights (expert notes)

SQL Server waits are a time accounting system. They are not all problems.
Signal wait time points to CPU scheduling pressure. Resource wait time points to the thing being waited on.
The buffer pool caches pages, not rows.
The transaction log is sequential, but it is often the critical path for writes.
Clearing cache may hide the evidence you need. Capture first, clear only with intent.

🧠 Dynamic management views (DMVs)

Use this section to capture live SQL Server evidence before remediation changes remove the incident trail.

1. Concept (simple explanation)

DMVs are SQL Server's live inspection windows. They expose what is running, what has run, where time was spent, which locks exist, how much TempDB is used, and how the engine is behaving.

Think of DMVs as a toolbox. One DMV rarely gives the full answer. You usually join request, session, SQL text, plan, wait, lock, and database metadata.

In production, DMVs are your first live triage layer:

Start with sys.dm_exec_requests to find what is active.
Join to sys.dm_exec_sessions to identify owner, host, and application.
Use sys.dm_exec_sql_text and sys.dm_exec_query_plan to capture what is running.
Use waits, locks, grants, and TempDB usage DMVs to explain why it is slow.
Use Query Store when you need history beyond what plan cache can retain.

2. How it works internally

DMVs read engine state kept in memory. Some reset when SQL Server restarts. Some reset when a database is closed, an object is rebuilt, a plan leaves cache, or Query Store is cleaned.

That volatility matters during incidents. A restart can make dashboards look clean while the root cause remains. A plan eviction can erase the only cached plan showing the bad estimate. Capture before remediation whenever the situation allows.

Important DMV families:

Need	DMV family
Active sessions and requests	`sys.dm_exec_sessions`, `sys.dm_exec_requests`
SQL text and plans	`sys.dm_exec_sql_text`, `sys.dm_exec_query_plan`
Cached query history	`sys.dm_exec_query_stats`
Waits	`sys.dm_os_wait_stats`, `sys.dm_exec_session_wait_stats`
Locks and transactions	`sys.dm_tran_locks`, `sys.dm_tran_active_transactions`
Missing indexes	`sys.dm_db_missing_index_*`
Index usage	`sys.dm_db_index_usage_stats`
TempDB usage	`sys.dm_db_task_space_usage`, `sys.dm_db_session_space_usage`
File latency	`sys.dm_io_virtual_file_stats`

3. Real-world problem scenario

A production web API is timing out. The application team says "SQL Server is down." The server is online. DMVs show 120 sessions waiting behind one open transaction from a reporting tool. The lead blocker ran an update and never committed. Without DMVs, the incident looks like general slowness. With DMVs, it is one blocking session.

4. How to diagnose

Fast triage query:

Before running deeper analysis, answer four questions:

Who is active right now?
Who is waiting, and on what?
Who is blocking whom?
Which statement is associated with the largest CPU, elapsed time, reads, writes, or memory grant?

🔍 Diagnostic query

SELECT
    s.session_id,
    s.login_name,
    s.host_name,
    s.program_name,
    r.status,
    r.command,
    r.blocking_session_id,
    r.wait_type,
    r.wait_time AS wait_ms,
    r.cpu_time AS cpu_ms,
    r.total_elapsed_time AS elapsed_ms,
    r.logical_reads,
    r.writes,
    DB_NAME(COALESCE(r.database_id, s.database_id)) AS database_name,
    t.text AS sql_text
FROM sys.dm_exec_sessions AS s
LEFT JOIN sys.dm_exec_requests AS r
    ON s.session_id = r.session_id
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE s.is_user_process = 1
ORDER BY r.total_elapsed_time DESC, s.session_id;

Find top statements in cache:

🔍 Diagnostic query

SELECT TOP (20)
    qs.execution_count,
    qs.total_worker_time / 1000 AS total_cpu_ms,
    qs.total_elapsed_time / 1000 AS total_duration_ms,
    qs.total_logical_reads,
    qs.total_logical_writes,
    qs.max_grant_kb,
    qs.max_used_grant_kb,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
            WHEN -1 THEN DATALENGTH(st.text)
            ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;

5. Fix / optimisation

DMVs tell you where to act:

Kill or correct a lead blocker only after confirming rollback impact.
Add an index only after validating the query pattern and write overhead.
Force a plan only after identifying a good historical plan.
Adjust server settings only after proving the setting is the bottleneck.
Persist DMV snapshots if the problem is intermittent.

Do not treat DMV output as a final diagnosis without context. A missing index suggestion can be valid for one query and harmful for the write workload. A high wait count can be normal if the wait time is tiny. A sleeping session is harmless unless it holds locks or an open transaction.

6. Example scripts

Create a lightweight incident snapshot table:

🛠 Setup script

USE tempdb;
GO

DROP TABLE IF EXISTS dbo.DmvIncidentSnapshot;
GO

CREATE TABLE dbo.DmvIncidentSnapshot
(
    SnapshotTime datetime2(0) NOT NULL DEFAULT SYSDATETIME(),
    SessionId smallint NOT NULL,
    BlockingSessionId smallint NULL,
    WaitType nvarchar(60) NULL,
    CpuMs int NULL,
    ElapsedMs int NULL,
    LogicalReads bigint NULL,
    SqlText nvarchar(max) NULL
);
GO

INSERT dbo.DmvIncidentSnapshot
(
    SessionId,
    BlockingSessionId,
    WaitType,
    CpuMs,
    ElapsedMs,
    LogicalReads,
    SqlText
)
SELECT
    r.session_id,
    NULLIF(r.blocking_session_id, 0),
    r.wait_type,
    r.cpu_time,
    r.total_elapsed_time,
    r.logical_reads,
    t.text
FROM sys.dm_exec_requests AS r
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID;

SELECT *
FROM dbo.DmvIncidentSnapshot
ORDER BY SnapshotTime DESC, ElapsedMs DESC;

Diagnose index usage:

🔍 Diagnostic query

SELECT
    OBJECT_SCHEMA_NAME(i.object_id) AS schema_name,
    OBJECT_NAME(i.object_id) AS table_name,
    i.name AS index_name,
    i.type_desc,
    us.user_seeks,
    us.user_scans,
    us.user_lookups,
    us.user_updates
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS us
    ON i.object_id = us.object_id
   AND i.index_id = us.index_id
   AND us.database_id = DB_ID()
WHERE i.object_id > 100
ORDER BY us.user_updates DESC, us.user_seeks ASC;

Convert DMV evidence into a candidate index review:

🚀 Optimization query

SELECT
    'Candidate index review' AS action_type,
    mid.statement AS table_name,
    mid.equality_columns,
    mid.inequality_columns,
    mid.included_columns,
    migs.user_seeks,
    migs.avg_user_impact
FROM sys.dm_db_missing_index_group_stats AS migs
JOIN sys.dm_db_missing_index_groups AS mig
    ON migs.group_handle = mig.index_group_handle
JOIN sys.dm_db_missing_index_details AS mid
    ON mig.index_handle = mid.index_handle
WHERE mid.database_id = DB_ID()
ORDER BY migs.user_seeks DESC;

7. Key insights (expert notes)

DMVs are volatile. Capture evidence during the incident.
sys.dm_exec_query_stats only shows plans still in cache. Use Query Store for durable history.
Missing index DMVs do not suggest filtered indexes or index consolidation.
Index usage stats reset after restart and can mislead after maintenance windows.
Always join DMVs back to database, object, and query context.

🧠 Locking and blocking

Use this section to separate normal locking from harmful blocking, identify the head blocker, and choose the safest remediation.

1. Concept (simple explanation)

Locking protects data correctness. Blocking happens when one session holds a lock that another session needs. Blocking is not automatically bad. Long blocking chains are bad because they turn normal concurrency into application timeouts.

Deadlocks are different. A deadlock is a cycle where two or more sessions wait on each other, so SQL Server chooses a victim and rolls it back.

2. How it works internally

SQL Server lock manager tracks locks on resources such as rows, keys, pages, tables, allocation units, metadata, and databases.

Common lock modes:

Lock mode	Meaning	Compatible with shared reads?
S	Shared read lock	Yes, with other S locks
U	Update lock, used before converting to X	Limited
X	Exclusive write lock	No
IS/IX	Intent locks at higher levels	Depends on target
Sch-S	Schema stability	Usually yes
Sch-M	Schema modification	No

Blocking chain example:

Blocking chain anatomy

Head blocker

Open transaction updated Orders

Holds the X lock

May be active or sleeping

ROW

Locked resource

Orders row or key range

The shared resource all sessions need

Wait resource points you here

Blocked reader

SELECT Orders waits on LCK_M_S

Victim of the blocker

Fixing the reader does not clear the chain

Blocked writer

UPDATE Orders waits on LCK_M_X

May extend the queue

Check transaction age before KILL

Isolation level matters:

Read committed uses shared locks for reads by default.
Read committed snapshot isolation (RCSI) uses row versions for reads, reducing reader/writer blocking.
Snapshot isolation uses transaction-level row versioning.
Serializable protects ranges and can increase blocking.

Lock escalation is another failure pattern to watch. SQL Server may escalate many row or key locks to a table lock when lock memory pressure or lock count thresholds are reached. The query that was supposed to touch a small set can suddenly block the whole table if estimates are wrong, indexes are missing, or the transaction modifies too many rows.

3. Real-world problem scenario

A nightly ETL job updates 3 million customer rows in one transaction. During the update, the customer portal times out because read queries wait on LCK_M_S. The ETL is technically correct, but its transaction scope is too large for the OLTP workload.

4. How to diagnose

Identify the lead blocker.
Find what SQL it is running or what transaction it left open.
Check lock resources and wait types.
Determine whether killing it will cause a large rollback.
For deadlocks, capture the deadlock graph from Extended Events system_health.

Prioritize blocking like this:

Find the head blocker, not just the session with the longest wait.
Capture login, host, program, SQL text, transaction age, and row count if available.
Decide whether the blocker is making progress, idle in transaction, or stuck behind another resource.
Estimate rollback risk before issuing KILL.
After service is restored, fix the access pattern so the same blocker does not return tomorrow.

Blocking chain query:

🔍 Diagnostic query

WITH blocking AS
(
    SELECT
        r.session_id,
        r.blocking_session_id,
        r.wait_type,
        r.wait_time,
        r.wait_resource,
        r.database_id,
        r.sql_handle
    FROM sys.dm_exec_requests AS r
    WHERE r.blocking_session_id <> 0
)
SELECT
    b.session_id AS blocked_session_id,
    b.blocking_session_id,
    b.wait_type,
    b.wait_time AS wait_ms,
    b.wait_resource,
    DB_NAME(b.database_id) AS database_name,
    blocked_text.text AS blocked_sql,
    blocker_text.text AS blocker_sql
FROM blocking AS b
OUTER APPLY sys.dm_exec_sql_text(b.sql_handle) AS blocked_text
LEFT JOIN sys.dm_exec_requests AS br
    ON b.blocking_session_id = br.session_id
OUTER APPLY sys.dm_exec_sql_text(br.sql_handle) AS blocker_text
ORDER BY b.wait_time DESC;

Open transaction details:

🔍 Diagnostic query

SELECT
    s.session_id,
    s.login_name,
    s.host_name,
    s.program_name,
    at.transaction_begin_time,
    DATEDIFF(minute, at.transaction_begin_time, SYSDATETIME()) AS transaction_age_minutes,
    at.transaction_type,
    at.transaction_state
FROM sys.dm_tran_session_transactions AS st
JOIN sys.dm_tran_active_transactions AS at
    ON st.transaction_id = at.transaction_id
JOIN sys.dm_exec_sessions AS s
    ON st.session_id = s.session_id
WHERE s.is_user_process = 1
ORDER BY at.transaction_begin_time;

5. Fix / optimisation

Add indexes so updates and reads touch fewer rows.
Keep transactions short and predictable.
Batch large modifications.
Use RCSI when read/write blocking is common and the application can accept statement-level versioned reads.
Use consistent object access order to reduce deadlock risk.
Avoid user interaction while a transaction is open.

Do not use NOLOCK as the default fix. It can return dirty, duplicated, or missing rows and can still be blocked by schema modification locks. If readers should not block writers, evaluate RCSI with TempDB capacity and application correctness in mind.

Enable RCSI only after testing TempDB version store impact:

🚀 Optimization query

ALTER DATABASE YourDatabaseName
SET READ_COMMITTED_SNAPSHOT ON
WITH ROLLBACK IMMEDIATE;

6. Example scripts

Session 1: reproduce blocking:

🛠 Setup script

USE tempdb;
GO

DROP TABLE IF EXISTS dbo.BlockingLab;
CREATE TABLE dbo.BlockingLab
(
    Id int NOT NULL CONSTRAINT PK_BlockingLab PRIMARY KEY,
    StatusCode char(1) NOT NULL,
    ModifiedAt datetime2(0) NOT NULL
);

INSERT dbo.BlockingLab(Id, StatusCode, ModifiedAt)
VALUES (1, 'N', SYSDATETIME()), (2, 'N', SYSDATETIME());
GO

BEGIN TRAN;
UPDATE dbo.BlockingLab
SET StatusCode = 'P', ModifiedAt = SYSDATETIME()
WHERE Id = 1;

-- Leave this transaction open while Session 2 runs.

Session 2: blocked read:

🛠 Setup script

USE tempdb;
GO

SELECT *
FROM dbo.BlockingLab
WHERE Id = 1;

Diagnose from Session 3:

🔍 Diagnostic query

SELECT
    r.session_id,
    r.blocking_session_id,
    r.wait_type,
    r.wait_time,
    r.wait_resource,
    t.text
FROM sys.dm_exec_requests AS r
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0;

Fix from Session 1:

🚀 Optimization query

COMMIT TRAN;
-- Or ROLLBACK TRAN if this was a test or bad transaction.

Batch update pattern:

🚀 Optimization query

WHILE 1 = 1
BEGIN
    UPDATE TOP (5000) dbo.BlockingLab
    SET StatusCode = 'A', ModifiedAt = SYSDATETIME()
    WHERE StatusCode = 'N';

    IF @@ROWCOUNT = 0 BREAK;
END;

7. Key insights (expert notes)

The blocker may be sleeping. A sleeping session can still hold an open transaction.
Do not kill first and diagnose later. Capture SQL text, login, host, transaction age, and rollback risk.
NOLOCK can read uncommitted data, duplicate rows, miss rows, and fail with allocation errors. It is not a blocking strategy.
RCSI reduces reader/writer blocking but moves work to TempDB version store.
Deadlock priority can protect critical sessions, but it does not fix the access pattern.

🧠 TempDB

Use this section to identify whether TempDB pressure comes from user objects, internal spills, version store, allocation contention, or storage latency.

1. Concept (simple explanation)

TempDB is SQL Server's shared scratch database. It stores temporary tables, table variables, worktables, sort/hash spill data, online index build work, row versions, triggers, and internal objects.

When TempDB is slow or contended, unrelated workloads can all become slow because they share the same workspace.

2. How it works internally

TempDB is recreated every time SQL Server starts. It uses simple recovery and minimal logging where possible, but it still writes to disk.

Main consumers:

Consumer	Example
User objects	`#temp` tables and table variables
Internal objects	Sorts, hashes, spools, cursors
Version store	RCSI, snapshot isolation, online index operations
Metadata	Temp object creation and drop tracking

Allocation contention often appears on pages such as PFS, GAM, and SGAM. Modern SQL Server versions include improvements, but file layout and workload pattern still matter.

TempDB incidents usually fall into three categories:

Space exhaustion: TempDB fills because a query spills, a temp table grows, or version store cleanup is blocked.
Allocation contention: many sessions create/drop temp objects or allocate pages at the same time, showing PAGELATCH_* waits.
I/O pressure: sorts, hashes, spools, and version store reads/writes overload TempDB storage, showing PAGEIOLATCH_* or slow file latency.

The priority is to identify the consumer before adding space. Extra space can restore service, but it does not tell you whether the cause is one report, RCSI version retention, or systemic spill pressure.

TempDB pressure map

SORT

Sort and hash spills

Internal objects consume TempDB

Usually caused by bad estimates or missing indexes

Check memory grants and spill warnings

#TMP

#temp tables

User objects allocate TempDB pages

Wide temp tables and missing temp indexes add pressure

Check session and task space usage

VER

Row versions

RCSI, snapshot isolation, and online operations use version store

Long readers can prevent cleanup

Check active snapshot transactions

DATA

TempDB data files

Shared landing zone for all consumers

Files must be pre-sized and evenly sized

Autogrowth is a safety net, not capacity planning

LATCH

Allocation contention

PAGELATCH_* on allocation pages

Often caused by high temp object churn

Use equal files and reduce churn

I/O

Read/write latency

PAGEIOLATCH_* or slow file stats

Usually visible after spills or version-store growth

Tune the consumer before adding storage

3. Real-world problem scenario

An e-commerce database enables RCSI to reduce blocking. Blocking improves, but two days later TempDB grows from 20 GB to 300 GB. A long-running reporting transaction keeps old row versions alive, so cleanup cannot remove them. The symptom is TempDB full; the root cause is a long snapshot/RCSI reader.

4. How to diagnose

Check TempDB file sizes and free space.
Separate user object, internal object, and version store usage.
Identify sessions consuming TempDB.
Check waits: PAGELATCH_* suggests allocation contention; PAGEIOLATCH_* suggests physical I/O.
Check active snapshot transactions when version store grows.

Check in this order:

Is TempDB full, slow, or contended?
Is usage dominated by user objects, internal objects, or version store?
Which session or request is allocating the most pages?
Are long snapshot/RCSI transactions preventing version cleanup?
Are files equally sized and growing predictably?

TempDB file usage:

🔍 Diagnostic query

USE tempdb;
GO

SELECT
    name,
    type_desc,
    size / 128.0 AS size_mb,
    FILEPROPERTY(name, 'SpaceUsed') / 128.0 AS used_mb,
    (size - FILEPROPERTY(name, 'SpaceUsed')) / 128.0 AS free_mb
FROM sys.database_files;

TempDB space by session:

🔍 Diagnostic query

SELECT TOP (25)
    s.session_id,
    s.login_name,
    s.host_name,
    s.program_name,
    (su.user_objects_alloc_page_count - su.user_objects_dealloc_page_count) * 8 / 1024 AS user_objects_mb,
    (su.internal_objects_alloc_page_count - su.internal_objects_dealloc_page_count) * 8 / 1024 AS internal_objects_mb,
    r.status,
    r.wait_type,
    t.text AS sql_text
FROM sys.dm_db_session_space_usage AS su
JOIN sys.dm_exec_sessions AS s
    ON su.session_id = s.session_id
LEFT JOIN sys.dm_exec_requests AS r
    ON su.session_id = r.session_id
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE s.is_user_process = 1
ORDER BY (su.user_objects_alloc_page_count + su.internal_objects_alloc_page_count) DESC;

Version store usage:

🔍 Diagnostic query

SELECT
    SUM(version_store_reserved_page_count) * 8 / 1024 AS version_store_mb,
    SUM(user_object_reserved_page_count) * 8 / 1024 AS user_objects_mb,
    SUM(internal_object_reserved_page_count) * 8 / 1024 AS internal_objects_mb,
    SUM(unallocated_extent_page_count) * 8 / 1024 AS free_space_mb
FROM sys.dm_db_file_space_usage;

Active snapshot transactions:

🔍 Diagnostic query

SELECT
    transaction_id,
    session_id,
    elapsed_time_seconds,
    is_snapshot,
    first_snapshot_sequence_num,
    max_version_chain_traversed
FROM sys.dm_tran_active_snapshot_database_transactions
ORDER BY elapsed_time_seconds DESC;

5. Fix / optimisation

User objects: reduce temp table width, index temp tables correctly, drop when done, avoid unnecessary temp materialization.
Internal objects: tune queries that sort, hash, spool, or spill; improve estimates and indexes.
Version store: find long-running snapshot/RCSI readers; shorten transactions; schedule reporting safely.
Allocation contention: use multiple equally sized TempDB data files and modern SQL Server defaults.
I/O pressure: move TempDB to faster storage and pre-size files to avoid autogrowth storms.

Do not assume the fix is always "add more TempDB files." More equal-sized files help allocation contention, but they will not fix a spilling hash join, an unbounded temp table, or a long transaction retaining row versions.

6. Example scripts

Reproduce TempDB internal object pressure with a sort:

🛠 Setup script

USE tempdb;
GO

DROP TABLE IF EXISTS dbo.TempDbSortLab;
GO

CREATE TABLE dbo.TempDbSortLab
(
    Id int IDENTITY(1,1) NOT NULL,
    GroupId int NOT NULL,
    Padding char(200) NOT NULL DEFAULT 'x'
);

WITH n AS
(
    SELECT TOP (300000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
    FROM sys.all_objects AS a
    CROSS JOIN sys.all_objects AS b
)
INSERT dbo.TempDbSortLab(GroupId)
SELECT rn % 5000
FROM n;
GO

-- This sort can request memory and may spill depending on the lab machine.
SELECT GroupId, COUNT(*) AS row_count
FROM dbo.TempDbSortLab
GROUP BY GroupId
ORDER BY COUNT(*) DESC;

Diagnose task-level TempDB usage while a query runs:

🔍 Diagnostic query

SELECT
    session_id,
    request_id,
    (internal_objects_alloc_page_count - internal_objects_dealloc_page_count) * 8 / 1024 AS internal_mb,
    (user_objects_alloc_page_count - user_objects_dealloc_page_count) * 8 / 1024 AS user_mb
FROM sys.dm_db_task_space_usage
WHERE session_id <> @@SPID
ORDER BY internal_mb DESC;

Fix with an index that supports grouping:

🚀 Optimization query

CREATE INDEX IX_TempDbSortLab_GroupId ON dbo.TempDbSortLab(GroupId);
GO

SELECT GroupId, COUNT(*) AS row_count
FROM dbo.TempDbSortLab
GROUP BY GroupId
ORDER BY COUNT(*) DESC;

Check TempDB file layout:

✅ Validation query

SELECT
    name,
    physical_name,
    size / 128.0 AS size_mb,
    growth,
    is_percent_growth
FROM tempdb.sys.database_files
WHERE type_desc = 'ROWS'
ORDER BY file_id;

7. Key insights (expert notes)

TempDB is shared. One bad report can hurt the whole instance.
Spills are query tuning problems that show up as TempDB problems.
RCSI is useful, but long readers can retain row versions and grow TempDB.
Equal file size matters. Unequal files can cause uneven allocation.
Autogrowth is a safety net, not a sizing strategy.

🧠 Troubleshooting playbook

Use this playbook when you need ordered, incident-ready checks for CPU pressure, blocking, or slow-query triage.

High CPU issue steps

Confirm CPU pressure at SQL Server level, not only VM/OS level.
- Check whether SQL Server process CPU is high and whether signal waits or runnable tasks show scheduler pressure.
- If CPU is normal but elapsed time is high, look for waits instead of tuning CPU.
Check active CPU consumers:

🔍 Diagnostic query

SELECT TOP (20)
    r.session_id,
    r.cpu_time AS cpu_ms,
    r.total_elapsed_time AS elapsed_ms,
    r.logical_reads,
    r.status,
    r.command,
    t.text AS sql_text
FROM sys.dm_exec_requests AS r
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID
ORDER BY r.cpu_time DESC;

Check cached top CPU queries and compare with Query Store history.
Inspect actual plans for scans, bad joins, scalar UDFs, implicit conversions, and excessive parallelism.
Check waits. High signal wait or many runnable tasks means CPU scheduling pressure.
Mitigate: force last good plan, stop runaway query, reduce DOP for the query, or scale temporarily.
Fix root cause: indexes, query rewrite, statistics, parameter sensitivity, or app call pattern.

Do not clear the plan cache as the first response to high CPU. If the issue is parameter sniffing, a cache clear may temporarily hide it and then let the same bad plan recompile under a different parameter. Use Query Store to identify and force a known-good plan when appropriate.

Blocking troubleshooting steps

Identify the lead blocker and transaction age.
Capture blocker SQL text, login, host, program, wait resource, and open transaction details.
Decide whether to wait, ask the owner to commit/rollback, or kill the session.
If killing, estimate rollback impact. Large updates can take time to undo.
After service restoration, fix the pattern: shorter transactions, better indexes, batching, RCSI, or consistent access order.

Treat blocking as a chain, not a list. The session with the most visible wait is often a victim, not the cause. The lead blocker may be sleeping, may have no active request, and may still hold locks through an open transaction.

Quick blocker query:

🔍 Diagnostic query

SELECT
    blocked.session_id AS blocked_session_id,
    blocked.blocking_session_id,
    blocked.wait_type,
    blocked.wait_time AS wait_ms,
    blocked.wait_resource,
    blocker_s.login_name AS blocker_login,
    blocker_s.host_name AS blocker_host,
    blocker_s.program_name AS blocker_program,
    blocker_text.text AS blocker_sql
FROM sys.dm_exec_requests AS blocked
LEFT JOIN sys.dm_exec_sessions AS blocker_s
    ON blocked.blocking_session_id = blocker_s.session_id
LEFT JOIN sys.dm_exec_requests AS blocker_r
    ON blocked.blocking_session_id = blocker_r.session_id
OUTER APPLY sys.dm_exec_sql_text(blocker_r.sql_handle) AS blocker_text
WHERE blocked.blocking_session_id <> 0
ORDER BY blocked.wait_time DESC;

Slow query checklist

Is the query slow now, or was it slow historically?
Did the plan change? Check Query Store.
Are estimated and actual rows far apart?
Are predicates SARGable?
Are parameter data types aligned with column data types?
Is there a key lookup repeated many times?
Are there spills to TempDB?
Is the query waiting on locks, I/O, memory grants, or CPU?
Are statistics stale or sampled poorly?
Does the query return more columns or rows than the application needs?
Would a filtered, covering, or composite index match the real access pattern?
Is this a parameter-sensitive procedure?

When prioritizing the checklist, start with evidence that changes the remediation path: plan changed, row estimates wrong, blocking present, or spill present. Then inspect index design and query shape. Review cosmetic query cleanup later.

Do not add every missing-index recommendation during an incident. Missing-index DMVs do not understand your full index portfolio, write overhead, filtered-index opportunities, or duplicate indexes. Convert suggestions into an index review after the service is stable.

Slow query starter script:

🔍 Diagnostic query

DECLARE @session_id smallint = 0; -- Replace with target session_id.

SELECT
    r.session_id,
    r.status,
    r.cpu_time AS cpu_ms,
    r.total_elapsed_time AS elapsed_ms,
    r.logical_reads,
    r.reads,
    r.writes,
    r.wait_type,
    r.wait_time AS wait_ms,
    r.blocking_session_id,
    qp.query_plan,
    t.text AS sql_text
FROM sys.dm_exec_requests AS r
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
OUTER APPLY sys.dm_exec_query_plan(r.plan_handle) AS qp
WHERE r.session_id = @session_id;

🧠 References

Use these Microsoft Learn pages to verify syntax, supported behavior, and engine details before applying changes in production.

🧠 Flashcards

Use these recall prompts to test the incident workflow, not just memorize DMV names.

What is the first execution-plan comparison to make during a slow-query incident?

Click to reveal answer

Compare estimated rows to actual rows. Large gaps explain bad join choices, wrong access methods, memory grant problems, and spills.

1 / 6

🧠 Architecture map​

🧠 Execution plans​

1. Concept (simple explanation)​

2. How it works internally​

3. Real-world problem scenario​

4. How to diagnose​

🔍 Diagnostic query​

🔍 Diagnostic query​

5. Fix / optimisation​

6. Example scripts​

🛠 Setup script​

🔍 Diagnostic query​

🚀 Optimization query​

7. Key insights (expert notes)​

🧠 Performance tuning​

1. Concept (simple explanation)​

2. How it works internally​

3. Real-world problem scenario​

4. How to diagnose​

🔍 Diagnostic query​

🔍 Diagnostic query​

🔍 Diagnostic query​

5. Fix / optimisation​

6. Example scripts​

🛠 Setup script​

🔍 Diagnostic query​

🚀 Optimization query​

7. Key insights (expert notes)​

🧠 SQL Server internals​

1. Concept (simple explanation)​

2. How it works internally​

3. Real-world problem scenario​

4. How to diagnose​

🔍 Diagnostic query​

🔍 Diagnostic query​

🔍 Diagnostic query​

5. Fix / optimisation​

6. Example scripts​

🛠 Setup script​

🔍 Diagnostic query​

🚀 Optimization query​

7. Key insights (expert notes)​

🧠 Dynamic management views (DMVs)​

1. Concept (simple explanation)​

2. How it works internally​

3. Real-world problem scenario​

4. How to diagnose​

🔍 Diagnostic query​

🔍 Diagnostic query​

5. Fix / optimisation​

6. Example scripts​

🛠 Setup script​

🔍 Diagnostic query​

🚀 Optimization query​

7. Key insights (expert notes)​

🧠 Locking and blocking​

1. Concept (simple explanation)​

2. How it works internally​

3. Real-world problem scenario​

4. How to diagnose​

🔍 Diagnostic query​

🔍 Diagnostic query​

5. Fix / optimisation​

🚀 Optimization query​

6. Example scripts​

🛠 Setup script​

🛠 Setup script​

🔍 Diagnostic query​

🚀 Optimization query​

🚀 Optimization query​

7. Key insights (expert notes)​

🧠 TempDB​

1. Concept (simple explanation)​

2. How it works internally​

3. Real-world problem scenario​

4. How to diagnose​

🔍 Diagnostic query​

🔍 Diagnostic query​

🔍 Diagnostic query​

🔍 Diagnostic query​

🧠 Architecture map

🧠 Execution plans

1. Concept (simple explanation)

2. How it works internally

3. Real-world problem scenario

4. How to diagnose

🔍 Diagnostic query

🔍 Diagnostic query

5. Fix / optimisation

6. Example scripts

🛠 Setup script

🔍 Diagnostic query

🚀 Optimization query

7. Key insights (expert notes)

🧠 Performance tuning

1. Concept (simple explanation)

2. How it works internally

3. Real-world problem scenario

4. How to diagnose

🔍 Diagnostic query

🔍 Diagnostic query

🔍 Diagnostic query

5. Fix / optimisation

6. Example scripts

🛠 Setup script

🔍 Diagnostic query

🚀 Optimization query

7. Key insights (expert notes)

🧠 SQL Server internals

1. Concept (simple explanation)

2. How it works internally

3. Real-world problem scenario

4. How to diagnose

🔍 Diagnostic query

🔍 Diagnostic query

🔍 Diagnostic query

5. Fix / optimisation

6. Example scripts

🛠 Setup script

🔍 Diagnostic query

🚀 Optimization query

7. Key insights (expert notes)

🧠 Dynamic management views (DMVs)

1. Concept (simple explanation)

2. How it works internally

3. Real-world problem scenario

4. How to diagnose

🔍 Diagnostic query

🔍 Diagnostic query

5. Fix / optimisation

6. Example scripts

🛠 Setup script

🔍 Diagnostic query

🚀 Optimization query

7. Key insights (expert notes)

🧠 Locking and blocking

1. Concept (simple explanation)

2. How it works internally

3. Real-world problem scenario

4. How to diagnose

🔍 Diagnostic query

🔍 Diagnostic query

5. Fix / optimisation

🚀 Optimization query

6. Example scripts

🛠 Setup script

🛠 Setup script

🔍 Diagnostic query

🚀 Optimization query

🚀 Optimization query

7. Key insights (expert notes)

🧠 TempDB

1. Concept (simple explanation)

2. How it works internally

3. Real-world problem scenario

4. How to diagnose

🔍 Diagnostic query

🔍 Diagnostic query

🔍 Diagnostic query

🔍 Diagnostic query