Skip to main content

SQL Server expert troubleshooting guide

This page is a field guide for engineers who troubleshoot SQL Server under pressure. It goes below the surface: execution plans, optimizer behavior, SQLOS, DMVs, blocking chains, TempDB allocation, and the scripts you can run in a lab before you need them in production.

Use the scripts in a non-production database such as SqlTroubleshootingLab. Several examples intentionally create slow queries, blocking, or TempDB pressure.

🧠 Architecture map

This guide follows the same path a request follows inside the engine: client call, compilation, execution, waits, storage, TempDB, and diagnostic evidence. Use the diagram as a triage map during incidents rather than treating each topic as isolated theory.

SQL Server troubleshooting architecture map

Under Sev-A pressure, keep the mental flow simple:

  • Client -> compile: confirm whether the workload is sending stable, parameterized calls or ad-hoc text with mismatched data types.
  • Compile -> execute: check whether the selected plan matches the data shape, parameter value, statistics, and available indexes.
  • Execute -> waits: classify whether the request is burning CPU or waiting on locks, I/O, memory grants, workers, TempDB, or the client.
  • Waits -> resources: map the wait to the resource before changing anything. LCK_M_* means concurrency, PAGEIOLATCH_* means physical reads, RESOURCE_SEMAPHORE means memory grant pressure, and PAGELATCH_* on TempDB allocation pages means latch contention.
  • Resources -> diagnostics: capture DMV and Query Store evidence before cache clears, restarts, kills, or configuration changes remove the trail.
ComponentWhen things go wrongCheck firstCommon root causesQuick remediation
Parser / algebrizerCompile errors, sudden failures after deployment, permissions errorsError text, deployed SQL, object names, data typesRenamed objects, invalid schema binding, implicit conversion from client parametersFix code or parameter type; avoid emergency server changes
OptimizerHigh CPU, scans, wrong join type, unstable durationActual plan, Query Store, statistics date, estimated vs actual rowsParameter sniffing, stale stats, skewed data, non-SARGable predicatesForce last good plan if needed; update stats or rewrite predicate after evidence
Execution planMemory spills, excessive DOP, repeated lookupsPlan warnings, memory grant, DOP, key lookup countBad cardinality, missing covering index, row goal, implicit conversionAdd targeted index, fix estimate, tune query-level DOP or recompile strategy
SQLOSHigh signal waits, runnable queue, THREADPOOLsys.dm_os_schedulers, sys.dm_os_wait_stats, active workersCPU saturation, worker starvation, connection storm, blocking fan-outStop runaway work, reduce concurrency, fix pooling, scale only as mitigation
Buffer pool / storagePAGEIOLATCH_*, high physical reads, file latencysys.dm_io_virtual_file_stats, logical reads, plan access methodsScans, cold cache, undersized memory, slow data filesReduce reads first; then tune storage or memory
Lock managerTimeouts, blocked sessions, deadlockssys.dm_exec_requests, sys.dm_tran_locks, deadlock graphLong transactions, missing indexes, lock escalation, isolation mismatchFind lead blocker; shorten transactions; batch writes; consider RCSI after testing
TempDBSpills, TempDB full, PAGELATCH_*, version store growthsys.dm_db_file_space_usage, session/task space usage, snapshot transactionsSort/hash spills, long RCSI readers, too few or uneven filesTune spilling query; end long reader; pre-size equal files; fix allocation contention
DMVs / Query Store / XEventsMissing history, unclear root cause, competing narrativesLive DMVs for now, Query Store for history, XEvents for deadlocksEvidence lost after restart, plan eviction, incomplete snapshotCapture first, then act; persist snapshots for intermittent incidents
⚠️ Watch Out

During a live incident, do not start with DBCC FREEPROCCACHE, SQL Server restart, blanket index creation, or NOLOCK changes. Those actions can remove evidence, cause wider regressions, or trade correctness for speed. Capture evidence first, restore service second, and perform root-cause tuning after the blast radius is contained.

Incident triage path
1
Start with the symptom
Capture the production symptom
Record the impacted app, database, and time window
2
Decide current vs historical
Current: use active DMVs, waits, blockers
Historical: use Query Store and wait trends
3
Classify the dominant signal
CPU, I/O, locks, or memory
Avoid tuning before the bottleneck is named
4
Follow the matching evidence
Plans and estimates for CPU
Reads and latency for I/O
Blocking chains for locks
Grants and spills for memory
5
Fix root cause, then validate
Apply the smallest safe mitigation
Recheck waits, duration, reads, and plan shape
🔀 Oracle → Azure

Oracle parallel: Treat SQL Server Query Store plus DMVs as the closest day-to-day equivalent of AWR, ASH, V$SQL, and V$SESSION_WAIT. The mapping works for workload history and live diagnosis, but SQL Server exposes different wait names, plan operators, and locking semantics.

🧠 Execution plans

Use this section to read plans as troubleshooting evidence: estimates, operators, warnings, memory grants, and plan regressions.


1. Concept (simple explanation)

An execution plan is SQL Server's recipe for running a query. It shows which tables SQL Server reads, which indexes it uses, how rows are joined, where data is sorted, and where memory is needed.

Read plans as evidence, not artwork. The important question is not "does the plan look complex?" The important question is: "Where did SQL Server estimate wrong, read too much, spill to disk, or choose an operator that does not match the data?"

Common operators:

OperatorSimple meaningPerformance signal
Index seekNavigate directly to a range in a B-treeUsually good, but can still read many rows
Index/table scanRead a full structureFine for large ranges; bad for selective predicates
Key lookupFetch missing columns from clustered indexExpensive when repeated many times
Hash matchBuild/probe hash table for joins or aggregatesGood for large sets; can spill if memory grant is low
Nested loopsFor each outer row, search inner inputGood for small outer input; terrible when estimates are low
SortOrder rowsNeeds memory; spills to TempDB if grant is too small

2. How it works internally

SQL Server turns a query into a plan through four main stages:

Compilation to execution flow
1
Parse SQL text
Check syntax
Tokenize T-SQL
Build the initial parse tree
2
Bind names and types
Resolve objects and columns
Check permissions
Validate data types
3
Estimate cardinality
Read statistics histograms
Estimate row counts
Model data distribution
4
Search plan alternatives
Compare join orders
Choose access methods
Evaluate serial vs parallel
5
Pick and cache the plan
Select lowest estimated cost
Create memory grant
Reuse from cache when safe
6
Execute with workers
Run iterator operators
Request rows from child operators
Track waits, CPU, reads, and spills

Internally:

  • The parser checks syntax and builds a tree.
  • The algebrizer resolves tables, columns, data types, ownership chains, and permissions.
  • The optimizer uses statistics histograms, density vectors, constraints, indexes, and row goals to estimate row counts.
  • The optimizer evaluates physical choices: join order, join type, access method, serial vs parallel plan, and memory grants.
  • The execution engine runs the selected plan using the iterator model. Each operator asks its child operator for rows.
  • The plan may be cached so the next execution avoids compile CPU.

The optimizer is cost-based, but the cost is only as good as the estimates. Bad statistics, implicit conversions, parameter sniffing, table variables, skewed data, and non-SARGable predicates can all push the optimizer toward the wrong plan.

When things go wrong, the usual chain is: bad cardinality -> bad join/access choice -> wrong memory grant or DOP -> high CPU, spills, or excessive reads. For example, a predicate estimated at 10 rows but returning 10 million rows can choose nested loops plus key lookups, burn CPU, and overload storage even though every individual operator looks valid.

Prioritize plan investigation this way:

  • First: compare estimated rows vs actual rows at the first large divergence.
  • Next: inspect warnings for spills, implicit conversions, missing join predicates, and excessive grants.
  • Then: decide whether the problem is compile-time metadata, runtime parameter sensitivity, or missing physical access path.
  • Later: consider hints. Hints can stabilize service, but they also freeze assumptions.

3. Real-world problem scenario

Contoso has an order lookup procedure that usually returns one customer. After a weekend data load, the API latency jumps from 80 ms to 8 seconds. CPU rises, reads spike, and the plan shows an index scan with a hash join. The root cause is an implicit conversion: the application sends nvarchar while the indexed column is varchar, so SQL Server converts the column and cannot seek efficiently.

4. How to diagnose

  • Capture the actual execution plan, not only the estimated plan.
  • Compare estimated vs actual rows on each operator.
  • Look for warnings: spills, missing indexes, implicit conversions, residual predicates, and excessive memory grants.
  • Check whether the plan is new by using Query Store.
  • Check whether the query is parameter-sensitive: fast for one parameter, slow for another.

Use Query Store early when the question is "what changed?". It tells you whether the query regressed because the plan changed, the same plan slowed down, or the runtime volume changed. Use live DMVs when the question is "what is hurting us right now?".

Do not assume a scan is bad or a seek is good. A seek with millions of key lookups can be worse than a controlled scan. The incident signal is total work: CPU, reads, elapsed time, waits, and memory spills.

Diagnostic query for cached plans with implicit conversions:

🔍 Diagnostic query

SELECT TOP (25)
DB_NAME(st.dbid) AS database_name,
qs.execution_count,
qs.total_worker_time / 1000 AS total_cpu_ms,
qs.total_logical_reads,
SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
((CASE qs.statement_end_offset
WHEN -1 THEN DATALENGTH(st.text)
ELSE qs.statement_end_offset
END - qs.statement_start_offset) / 2) + 1) AS statement_text,
qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
WHERE CAST(qp.query_plan AS nvarchar(max)) LIKE N'%CONVERT_IMPLICIT%'
ORDER BY qs.total_worker_time DESC;

Query Store plan regression check:

🔍 Diagnostic query

SELECT TOP (20)
qsq.query_id,
qsp.plan_id,
rs.count_executions,
rs.avg_duration / 1000.0 AS avg_duration_ms,
rs.avg_cpu_time / 1000.0 AS avg_cpu_ms,
rs.avg_logical_io_reads,
qt.query_sql_text
FROM sys.query_store_runtime_stats AS rs
JOIN sys.query_store_plan AS qsp
ON rs.plan_id = qsp.plan_id
JOIN sys.query_store_query AS qsq
ON qsp.query_id = qsq.query_id
JOIN sys.query_store_query_text AS qt
ON qsq.query_text_id = qt.query_text_id
ORDER BY rs.avg_duration DESC;

5. Fix / optimisation

  • If estimates are wrong, update statistics or improve predicate shape.
  • If a key lookup runs thousands of times, add included columns or rewrite the query.
  • If an implicit conversion blocks seeks, align parameter and column data types.
  • If parameter sniffing causes plan instability, use Query Store plan forcing, OPTION (RECOMPILE), OPTIMIZE FOR, or SQL Server 2022 Parameter Sensitive Plan optimization.
  • If a sort or hash spills, reduce rows, add an index that provides order, or investigate memory grants.

The fix works because it gives the optimizer better metadata or a cheaper physical access path.

6. Example scripts

Reproduce a non-SARGable predicate:

🛠 Setup script

USE tempdb;
GO

DROP TABLE IF EXISTS dbo.PlanLabOrders;
GO

CREATE TABLE dbo.PlanLabOrders
(
OrderId int IDENTITY(1,1) NOT NULL CONSTRAINT PK_PlanLabOrders PRIMARY KEY,
CustomerCode varchar(20) NOT NULL,
OrderDate date NOT NULL,
Amount money NOT NULL
);

CREATE INDEX IX_PlanLabOrders_OrderDate ON dbo.PlanLabOrders(OrderDate);

WITH n AS
(
SELECT TOP (200000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b
)
INSERT dbo.PlanLabOrders(CustomerCode, OrderDate, Amount)
SELECT
CONCAT('C', RIGHT(CONCAT('00000', rn % 5000), 5)),
DATEADD(day, -(rn % 1500), CONVERT(date, SYSDATETIME())),
(rn % 1000) + 1
FROM n;
GO

SET STATISTICS IO, TIME ON;

-- Bad: function on indexed column prevents a direct range seek.
SELECT COUNT(*) AS orders_this_year
FROM dbo.PlanLabOrders
WHERE YEAR(OrderDate) = YEAR(SYSDATETIME());

-- Better: range predicate is SARGable.
DECLARE @start date = DATEFROMPARTS(YEAR(SYSDATETIME()), 1, 1);
DECLARE @end date = DATEADD(year, 1, @start);

SELECT COUNT(*) AS orders_this_year
FROM dbo.PlanLabOrders
WHERE OrderDate >= @start
AND OrderDate < @end;

Diagnose row estimate error in the actual plan:

🔍 Diagnostic query

SET STATISTICS XML ON;
GO
SELECT COUNT(*) AS orders_this_year
FROM dbo.PlanLabOrders
WHERE YEAR(OrderDate) = YEAR(SYSDATETIME());
GO
SET STATISTICS XML OFF;

Fix with a computed column when code cannot change immediately:

🚀 Optimization query

ALTER TABLE dbo.PlanLabOrders
ADD OrderYear AS YEAR(OrderDate) PERSISTED;
GO

CREATE INDEX IX_PlanLabOrders_OrderYear ON dbo.PlanLabOrders(OrderYear);
GO

SELECT COUNT(*) AS orders_this_year
FROM dbo.PlanLabOrders
WHERE OrderYear = YEAR(SYSDATETIME());

7. Key insights (expert notes)

  • Estimated cost is not elapsed time. It is an optimizer model.
  • A seek is not always good. A seek that reads 80 percent of a table can be worse than a scan.
  • The most useful plan comparison is estimated rows vs actual rows.
  • Missing index hints are per-query suggestions. Consolidate them before creating indexes.
  • Warnings on operators often explain the incident faster than the visible operator names.

🧠 Performance tuning

Use this section to move from symptom to bottleneck before choosing an index, query rewrite, isolation change, or temporary mitigation.


1. Concept (simple explanation)

Performance tuning is the discipline of reducing wasted work. SQL Server runs faster when it reads fewer pages, compiles less often, waits less, uses memory correctly, and avoids unnecessary contention.

Tuning is not only indexing. It includes query shape, schema design, statistics, isolation levels, TempDB, plan stability, memory grants, parallelism, and application call patterns.

2. How it works internally

Most performance incidents land in one of four buckets:

Performance bottleneck map
CPU
CPU pressure
Bad plans and excessive parallelism
High signal waits or runnable tasks
Check operators, estimates, DOP, and Query Store
I/O
I/O pressure
Scans, missing indexes, or cold cache
High logical reads or file latency
Check access methods and file stats
LCK
Concurrency waits
Long transactions or isolation mismatch
Blocking chains, deadlocks, or lock escalation
Find the head blocker first
MEM
Memory pressure
Sort/hash spills or plan cache pressure
Memory grants, TempDB usage, RESOURCE_SEMAPHORE
Fix estimates before adding capacity

Inside SQL Server, a request uses a worker on a scheduler. When it cannot continue, it records a wait. Good tuning starts by identifying where time is going:

  • CPU time means the worker is executing.
  • Wait time means the worker is stalled on something.
  • Logical reads show buffer pool work.
  • Physical reads show storage access.
  • Writes and log flushes show transaction log pressure.
  • Spills show TempDB used because memory was insufficient.

Production tuning is triage before surgery. During a Sev-A, separate restore service actions from permanent fix actions:

  • Restore service: stop or throttle the runaway query, force a known-good plan, reduce concurrency, add temporary capacity, or kill a confirmed blocker after rollback risk is understood.
  • Permanent fix: correct cardinality, index design, query shape, transaction scope, application call pattern, or isolation model.

Common failure patterns:

  • Parameter sniffing: one cached plan is good for small customers and disastrous for large customers.
  • Memory grant pressure: large sorts/hashes wait on RESOURCE_SEMAPHORE or spill because estimates are wrong.
  • Parallelism skew: high DOP uses many workers but one branch does most of the work; watch CXPACKET, CXCONSUMER, and actual row distribution.
  • Lock escalation: many row/key locks become a table lock, expanding the blocking impact.
  • TempDB contention: the real query issue is a spill, but the visible symptom is TempDB I/O or latch pressure.

3. Real-world problem scenario

Fabrikam runs a reporting query every 15 minutes. It used to finish in 30 seconds. It now runs for 12 minutes and blocks OLTP writes. The query joins five tables, scans a 90 GB order table, spills a hash aggregate to TempDB, and runs with degree of parallelism 8. The issue is a combination of stale statistics, a missing covering index, and the report running under read committed against hot transactional rows.

4. How to diagnose

  • Confirm whether the problem is current or historical.
  • For current issues, check active requests, waits, blockers, CPU, reads, writes, and grants.
  • For historical issues, use Query Store and wait stats.
  • Compare before/after plans if the issue is a regression.
  • Validate whether scaling masks the problem or fixes the real bottleneck.

Check in this order when the bridge is crowded:

  1. Is SQL Server actually the bottleneck, or is the app waiting on network/client processing?
  2. Which sessions are consuming CPU, reads, writes, memory grants, or blocking others right now?
  3. Is there one dominant query/session, or is the whole workload degraded?
  4. Did the plan or runtime profile change compared with the last known-good window?
  5. What is the safest mitigation that restores service without destroying evidence?

Current expensive requests:

🔍 Diagnostic query

SELECT
r.session_id,
r.status,
r.command,
r.cpu_time AS cpu_ms,
r.total_elapsed_time AS elapsed_ms,
r.logical_reads,
r.reads,
r.writes,
r.wait_type,
r.wait_time AS wait_ms,
r.blocking_session_id,
DB_NAME(r.database_id) AS database_name,
SUBSTRING(t.text, (r.statement_start_offset / 2) + 1,
((CASE r.statement_end_offset
WHEN -1 THEN DATALENGTH(t.text)
ELSE r.statement_end_offset
END - r.statement_start_offset) / 2) + 1) AS running_statement
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID
ORDER BY r.cpu_time DESC;

Top cached queries by reads:

🔍 Diagnostic query

SELECT TOP (25)
qs.execution_count,
qs.total_logical_reads,
qs.total_logical_reads / NULLIF(qs.execution_count, 0) AS avg_logical_reads,
qs.total_worker_time / 1000 AS total_cpu_ms,
qs.total_elapsed_time / 1000 AS total_duration_ms,
DB_NAME(st.dbid) AS database_name,
SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
((CASE qs.statement_end_offset
WHEN -1 THEN DATALENGTH(st.text)
ELSE qs.statement_end_offset
END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_logical_reads DESC;

Wait profile since last restart:

🔍 Diagnostic query

SELECT TOP (30)
wait_type,
waiting_tasks_count,
wait_time_ms,
signal_wait_time_ms,
wait_time_ms - signal_wait_time_ms AS resource_wait_ms,
100.0 * wait_time_ms / SUM(wait_time_ms) OVER () AS pct_of_waits
FROM sys.dm_os_wait_stats
WHERE wait_type NOT LIKE 'SLEEP%'
AND wait_type NOT IN
(
'BROKER_TASK_STOP', 'BROKER_TO_FLUSH', 'BROKER_EVENTHANDLER',
'SQLTRACE_BUFFER_FLUSH', 'CLR_AUTO_EVENT', 'CLR_MANUAL_EVENT',
'LAZYWRITER_SLEEP', 'CHECKPOINT_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH',
'XE_TIMER_EVENT', 'XE_DISPATCHER_WAIT', 'LOGMGR_QUEUE'
)
ORDER BY wait_time_ms DESC;

5. Fix / optimisation

  • High logical reads: add or redesign indexes, rewrite predicates, reduce selected columns.
  • High CPU: fix bad plans, scalar UDFs, excessive parallelism, implicit conversions, and row-by-row logic.
  • I/O waits: reduce reads first, then check storage latency and file layout.
  • Blocking: shorten transactions, add useful indexes, change isolation to row versioning when appropriate.
  • Spills: improve cardinality, indexes, memory grant feedback, or query shape.
  • Plan regression: use Query Store to force a known-good plan while fixing root cause.

Do not tune from a single metric. High CPU with low reads points differently than high CPU with massive reads. WRITELOG points to commit/log throughput. ASYNC_NETWORK_IO often means the database is waiting for the client to consume rows, not that SQL Server needs more CPU.

6. Example scripts

Reproduce a key lookup problem:

🛠 Setup script

USE tempdb;
GO

DROP TABLE IF EXISTS dbo.TuningLabCustomerOrders;
GO

CREATE TABLE dbo.TuningLabCustomerOrders
(
OrderId int IDENTITY(1,1) NOT NULL CONSTRAINT PK_TuningLabCustomerOrders PRIMARY KEY,
CustomerId int NOT NULL,
OrderDate datetime2(0) NOT NULL,
StatusCode char(1) NOT NULL,
Amount money NOT NULL,
Notes varchar(200) NULL
);

CREATE INDEX IX_TuningLabCustomerOrders_CustomerId
ON dbo.TuningLabCustomerOrders(CustomerId);

WITH n AS
(
SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b
)
INSERT dbo.TuningLabCustomerOrders(CustomerId, OrderDate, StatusCode, Amount, Notes)
SELECT rn % 1000,
DATEADD(minute, -rn, SYSDATETIME()),
CASE WHEN rn % 10 = 0 THEN 'C' ELSE 'O' END,
rn % 2500,
REPLICATE('x', 50)
FROM n;
GO

SET STATISTICS IO, TIME ON;

SELECT OrderId, CustomerId, OrderDate, StatusCode, Amount
FROM dbo.TuningLabCustomerOrders
WHERE CustomerId = 42;

Diagnose missing coverage:

🔍 Diagnostic query

SELECT
migs.user_seeks,
migs.avg_total_user_cost,
migs.avg_user_impact,
mid.statement AS table_name,
mid.equality_columns,
mid.inequality_columns,
mid.included_columns
FROM sys.dm_db_missing_index_group_stats AS migs
JOIN sys.dm_db_missing_index_groups AS mig
ON migs.group_handle = mig.index_group_handle
JOIN sys.dm_db_missing_index_details AS mid
ON mig.index_handle = mid.index_handle
WHERE mid.database_id = DB_ID()
ORDER BY migs.avg_user_impact DESC;

Fix with a covering index:

🚀 Optimization query

CREATE INDEX IX_TuningLabCustomerOrders_CustomerId_Covering
ON dbo.TuningLabCustomerOrders(CustomerId)
INCLUDE (OrderDate, StatusCode, Amount);
GO

SELECT OrderId, CustomerId, OrderDate, StatusCode, Amount
FROM dbo.TuningLabCustomerOrders
WHERE CustomerId = 42;

7. Key insights (expert notes)

  • Tune by bottleneck, not by habit.
  • SELECT * is a performance design decision. It widens indexes, memory grants, network usage, and lookups.
  • Query Store is often safer than clearing plan cache during an incident.
  • Avoid changing five things at once. You will not know which fix worked.
  • Scaling up can be a valid mitigation, but it should not be the only diagnosis.

🧠 SQL Server internals

Use this section to connect waits and resource symptoms back to SQLOS, memory, storage, logging, locking, and TempDB internals.


1. Concept (simple explanation)

SQL Server internals explain what the engine is doing when your query runs: scheduling workers, managing memory, reading 8 KB pages, writing log records, taking locks, creating row versions, and using TempDB for temporary work.

When you understand internals, wait stats become meaningful. PAGEIOLATCH_SH is not just a code. It means a worker is waiting for a data page to be read from storage into the buffer pool.

2. How it works internally

Core pieces:

ComponentWhat it doesWhat you monitor
SQLOSScheduling, workers, memory, waitssys.dm_os_schedulers, wait stats
Buffer poolCaches 8 KB data pagesPLE, physical reads, buffer descriptors
Plan cacheStores compiled planscache size, single-use plans, compile CPU
Transaction logGuarantees durability with write-ahead logginglog flush waits, VLFs, log reuse waits
Lock managerCoordinates concurrent accesslocks, blocking chains, deadlocks
TempDBShared workspace for temp objects and spillsfile contention, version store, internal objects

Incident lens for these components:

  • SQLOS matters because every running request needs workers and schedulers. If runnable tasks pile up, adding indexes will not immediately clear CPU starvation.
  • Buffer pool matters because logical reads are memory work and physical reads are storage work. A plan that reads 50 million pages can saturate either path.
  • Plan cache matters because the wrong reused plan can make a good procedure fail intermittently.
  • Transaction log matters because commits wait for log hardening. Slow log writes become application latency.
  • Lock manager matters because one long transaction can make a healthy server look unavailable.
  • TempDB matters because spills, versions, and temporary objects share a finite workspace across the whole instance.

Data access path:

Page read and log-write path
1
Worker requests a page
The query operator asks for an 8 KB data page
The worker checks the buffer pool first
2
Serve a logical read
If the page is cached, SQL Server returns it from memory
This is still counted as a logical read
3
Issue a physical read
If the page is missing, SQL Server reads it from the data file
The visible wait is often PAGEIOLATCH_*
4
Cache and return the page
The page enters the buffer pool
Future requests can avoid physical I/O
5
Harden log records first
For changes, the log record is written before the dirty page flushes
Slow log writes show up as WRITELOG latency

Write-ahead logging means the log record must be hardened before the changed data page is written to the data file. That is why slow log storage can hurt write-heavy systems even when data files are fast.

3. Real-world problem scenario

Tailwind Traders sees periodic stalls during checkout. CPU is low, but application threads wait. SQL Server wait stats show WRITELOG, and the transaction log file has thousands of small virtual log files after years of tiny autogrowth increments. The actual issue is log throughput and poor log file growth history, not missing indexes.

4. How to diagnose

  • Check wait stats first to identify the broad subsystem.
  • Check scheduler pressure for runnable queue buildup.
  • Check memory clerks for unusual memory consumers.
  • Check database file latency.
  • Check log reuse waits and VLF count for transaction log health.

When the system is unstable, take a quick baseline snapshot before changing anything: active requests, wait stats delta, file latency, scheduler runnable queue, memory grants, blockers, and top Query Store regressions. A five-minute evidence window is often enough to prevent a wrong fix.

Scheduler health:

🔍 Diagnostic query

SELECT
scheduler_id,
cpu_id,
status,
is_online,
current_tasks_count,
runnable_tasks_count,
current_workers_count,
active_workers_count,
work_queue_count
FROM sys.dm_os_schedulers
WHERE scheduler_id < 255
ORDER BY scheduler_id;

File latency:

🔍 Diagnostic query

SELECT
DB_NAME(vfs.database_id) AS database_name,
mf.type_desc,
mf.physical_name,
vfs.num_of_reads,
vfs.io_stall_read_ms,
vfs.io_stall_read_ms / NULLIF(vfs.num_of_reads, 0) AS avg_read_ms,
vfs.num_of_writes,
vfs.io_stall_write_ms,
vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
ON vfs.database_id = mf.database_id
AND vfs.file_id = mf.file_id
ORDER BY avg_write_ms DESC;

Memory clerk overview:

🔍 Diagnostic query

SELECT TOP (25)
type,
pages_kb / 1024 AS pages_mb,
virtual_memory_committed_kb / 1024 AS virtual_committed_mb
FROM sys.dm_os_memory_clerks
ORDER BY pages_kb DESC;

5. Fix / optimisation

  • For WRITELOG, reduce transaction size, improve log storage latency, avoid row-by-row commits, and pre-size log files.
  • For SOS_SCHEDULER_YIELD, find CPU-heavy queries and excessive parallelism.
  • For PAGEIOLATCH_*, reduce reads first; storage tuning comes after query/index tuning.
  • For plan cache bloat, parameterize safely and avoid ad-hoc literal storms.
  • For memory pressure, set max server memory correctly and find large grants or clerks.

Avoid broad instance-level changes unless the evidence points there. Changing max server memory, MAXDOP, cost threshold, or clearing cache during a live incident can affect every workload. Prefer query-level mitigation when one query or procedure is the cause.

6. Example scripts

Reproduce plan cache bloat:

🛠 Setup script

USE tempdb;
GO

DROP TABLE IF EXISTS dbo.CacheLab;
CREATE TABLE dbo.CacheLab (Id int NOT NULL PRIMARY KEY, ValueText varchar(50) NOT NULL);

INSERT dbo.CacheLab(Id, ValueText)
SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), 'cache-test'
FROM sys.all_objects;
GO

DECLARE @i int = 1;
DECLARE @sql nvarchar(max);

WHILE @i <= 200
BEGIN
SET @sql = N'SELECT ValueText FROM dbo.CacheLab WHERE Id = ' + CONVERT(nvarchar(20), @i) + N';';
EXEC sys.sp_executesql @sql;
SET @i += 1;
END;

Diagnose single-use ad-hoc plans:

🔍 Diagnostic query

SELECT TOP (25)
cp.objtype,
cp.usecounts,
cp.size_in_bytes / 1024 AS size_kb,
st.text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE cp.objtype = 'Adhoc'
AND cp.usecounts = 1
ORDER BY cp.size_in_bytes DESC;

Fix with parameterization in application code or sp_executesql:

🚀 Optimization query

DECLARE @id int = 42;

EXEC sys.sp_executesql
N'SELECT ValueText FROM dbo.CacheLab WHERE Id = @Id;',
N'@Id int',
@Id = @id;

7. Key insights (expert notes)

  • SQL Server waits are a time accounting system. They are not all problems.
  • Signal wait time points to CPU scheduling pressure. Resource wait time points to the thing being waited on.
  • The buffer pool caches pages, not rows.
  • The transaction log is sequential, but it is often the critical path for writes.
  • Clearing cache may hide the evidence you need. Capture first, clear only with intent.

🧠 Dynamic management views (DMVs)

Use this section to capture live SQL Server evidence before remediation changes remove the incident trail.


1. Concept (simple explanation)

DMVs are SQL Server's live inspection windows. They expose what is running, what has run, where time was spent, which locks exist, how much TempDB is used, and how the engine is behaving.

Think of DMVs as a toolbox. One DMV rarely gives the full answer. You usually join request, session, SQL text, plan, wait, lock, and database metadata.

In production, DMVs are your first live triage layer:

  • Start with sys.dm_exec_requests to find what is active.
  • Join to sys.dm_exec_sessions to identify owner, host, and application.
  • Use sys.dm_exec_sql_text and sys.dm_exec_query_plan to capture what is running.
  • Use waits, locks, grants, and TempDB usage DMVs to explain why it is slow.
  • Use Query Store when you need history beyond what plan cache can retain.

2. How it works internally

DMVs read engine state kept in memory. Some reset when SQL Server restarts. Some reset when a database is closed, an object is rebuilt, a plan leaves cache, or Query Store is cleaned.

That volatility matters during incidents. A restart can make dashboards look clean while the root cause remains. A plan eviction can erase the only cached plan showing the bad estimate. Capture before remediation whenever the situation allows.

Important DMV families:

NeedDMV family
Active sessions and requestssys.dm_exec_sessions, sys.dm_exec_requests
SQL text and planssys.dm_exec_sql_text, sys.dm_exec_query_plan
Cached query historysys.dm_exec_query_stats
Waitssys.dm_os_wait_stats, sys.dm_exec_session_wait_stats
Locks and transactionssys.dm_tran_locks, sys.dm_tran_active_transactions
Missing indexessys.dm_db_missing_index_*
Index usagesys.dm_db_index_usage_stats
TempDB usagesys.dm_db_task_space_usage, sys.dm_db_session_space_usage
File latencysys.dm_io_virtual_file_stats

3. Real-world problem scenario

A production web API is timing out. The application team says "SQL Server is down." The server is online. DMVs show 120 sessions waiting behind one open transaction from a reporting tool. The lead blocker ran an update and never committed. Without DMVs, the incident looks like general slowness. With DMVs, it is one blocking session.

4. How to diagnose

Fast triage query:

Before running deeper analysis, answer four questions:

  • Who is active right now?
  • Who is waiting, and on what?
  • Who is blocking whom?
  • Which statement is associated with the largest CPU, elapsed time, reads, writes, or memory grant?

🔍 Diagnostic query

SELECT
s.session_id,
s.login_name,
s.host_name,
s.program_name,
r.status,
r.command,
r.blocking_session_id,
r.wait_type,
r.wait_time AS wait_ms,
r.cpu_time AS cpu_ms,
r.total_elapsed_time AS elapsed_ms,
r.logical_reads,
r.writes,
DB_NAME(COALESCE(r.database_id, s.database_id)) AS database_name,
t.text AS sql_text
FROM sys.dm_exec_sessions AS s
LEFT JOIN sys.dm_exec_requests AS r
ON s.session_id = r.session_id
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE s.is_user_process = 1
ORDER BY r.total_elapsed_time DESC, s.session_id;

Find top statements in cache:

🔍 Diagnostic query

SELECT TOP (20)
qs.execution_count,
qs.total_worker_time / 1000 AS total_cpu_ms,
qs.total_elapsed_time / 1000 AS total_duration_ms,
qs.total_logical_reads,
qs.total_logical_writes,
qs.max_grant_kb,
qs.max_used_grant_kb,
SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
((CASE qs.statement_end_offset
WHEN -1 THEN DATALENGTH(st.text)
ELSE qs.statement_end_offset
END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;

5. Fix / optimisation

DMVs tell you where to act:

  • Kill or correct a lead blocker only after confirming rollback impact.
  • Add an index only after validating the query pattern and write overhead.
  • Force a plan only after identifying a good historical plan.
  • Adjust server settings only after proving the setting is the bottleneck.
  • Persist DMV snapshots if the problem is intermittent.

Do not treat DMV output as a final diagnosis without context. A missing index suggestion can be valid for one query and harmful for the write workload. A high wait count can be normal if the wait time is tiny. A sleeping session is harmless unless it holds locks or an open transaction.

6. Example scripts

Create a lightweight incident snapshot table:

🛠 Setup script

USE tempdb;
GO

DROP TABLE IF EXISTS dbo.DmvIncidentSnapshot;
GO

CREATE TABLE dbo.DmvIncidentSnapshot
(
SnapshotTime datetime2(0) NOT NULL DEFAULT SYSDATETIME(),
SessionId smallint NOT NULL,
BlockingSessionId smallint NULL,
WaitType nvarchar(60) NULL,
CpuMs int NULL,
ElapsedMs int NULL,
LogicalReads bigint NULL,
SqlText nvarchar(max) NULL
);
GO

INSERT dbo.DmvIncidentSnapshot
(
SessionId,
BlockingSessionId,
WaitType,
CpuMs,
ElapsedMs,
LogicalReads,
SqlText
)
SELECT
r.session_id,
NULLIF(r.blocking_session_id, 0),
r.wait_type,
r.cpu_time,
r.total_elapsed_time,
r.logical_reads,
t.text
FROM sys.dm_exec_requests AS r
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID;

SELECT *
FROM dbo.DmvIncidentSnapshot
ORDER BY SnapshotTime DESC, ElapsedMs DESC;

Diagnose index usage:

🔍 Diagnostic query

SELECT
OBJECT_SCHEMA_NAME(i.object_id) AS schema_name,
OBJECT_NAME(i.object_id) AS table_name,
i.name AS index_name,
i.type_desc,
us.user_seeks,
us.user_scans,
us.user_lookups,
us.user_updates
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS us
ON i.object_id = us.object_id
AND i.index_id = us.index_id
AND us.database_id = DB_ID()
WHERE i.object_id > 100
ORDER BY us.user_updates DESC, us.user_seeks ASC;

Convert DMV evidence into a candidate index review:

🚀 Optimization query

SELECT
'Candidate index review' AS action_type,
mid.statement AS table_name,
mid.equality_columns,
mid.inequality_columns,
mid.included_columns,
migs.user_seeks,
migs.avg_user_impact
FROM sys.dm_db_missing_index_group_stats AS migs
JOIN sys.dm_db_missing_index_groups AS mig
ON migs.group_handle = mig.index_group_handle
JOIN sys.dm_db_missing_index_details AS mid
ON mig.index_handle = mid.index_handle
WHERE mid.database_id = DB_ID()
ORDER BY migs.user_seeks DESC;

7. Key insights (expert notes)

  • DMVs are volatile. Capture evidence during the incident.
  • sys.dm_exec_query_stats only shows plans still in cache. Use Query Store for durable history.
  • Missing index DMVs do not suggest filtered indexes or index consolidation.
  • Index usage stats reset after restart and can mislead after maintenance windows.
  • Always join DMVs back to database, object, and query context.

🧠 Locking and blocking

Use this section to separate normal locking from harmful blocking, identify the head blocker, and choose the safest remediation.


1. Concept (simple explanation)

Locking protects data correctness. Blocking happens when one session holds a lock that another session needs. Blocking is not automatically bad. Long blocking chains are bad because they turn normal concurrency into application timeouts.

Deadlocks are different. A deadlock is a cycle where two or more sessions wait on each other, so SQL Server chooses a victim and rolls it back.

2. How it works internally

SQL Server lock manager tracks locks on resources such as rows, keys, pages, tables, allocation units, metadata, and databases.

Common lock modes:

Lock modeMeaningCompatible with shared reads?
SShared read lockYes, with other S locks
UUpdate lock, used before converting to XLimited
XExclusive write lockNo
IS/IXIntent locks at higher levelsDepends on target
Sch-SSchema stabilityUsually yes
Sch-MSchema modificationNo

Blocking chain example:

Blocking chain anatomy
54
Head blocker
Open transaction updated Orders
Holds the X lock
May be active or sleeping
ROW
Locked resource
Orders row or key range
The shared resource all sessions need
Wait resource points you here
61
Blocked reader
SELECT Orders waits on LCK_M_S
Victim of the blocker
Fixing the reader does not clear the chain
72
Blocked writer
UPDATE Orders waits on LCK_M_X
May extend the queue
Check transaction age before KILL

Isolation level matters:

  • Read committed uses shared locks for reads by default.
  • Read committed snapshot isolation (RCSI) uses row versions for reads, reducing reader/writer blocking.
  • Snapshot isolation uses transaction-level row versioning.
  • Serializable protects ranges and can increase blocking.

Lock escalation is another failure pattern to watch. SQL Server may escalate many row or key locks to a table lock when lock memory pressure or lock count thresholds are reached. The query that was supposed to touch a small set can suddenly block the whole table if estimates are wrong, indexes are missing, or the transaction modifies too many rows.

3. Real-world problem scenario

A nightly ETL job updates 3 million customer rows in one transaction. During the update, the customer portal times out because read queries wait on LCK_M_S. The ETL is technically correct, but its transaction scope is too large for the OLTP workload.

4. How to diagnose

  • Identify the lead blocker.
  • Find what SQL it is running or what transaction it left open.
  • Check lock resources and wait types.
  • Determine whether killing it will cause a large rollback.
  • For deadlocks, capture the deadlock graph from Extended Events system_health.

Prioritize blocking like this:

  1. Find the head blocker, not just the session with the longest wait.
  2. Capture login, host, program, SQL text, transaction age, and row count if available.
  3. Decide whether the blocker is making progress, idle in transaction, or stuck behind another resource.
  4. Estimate rollback risk before issuing KILL.
  5. After service is restored, fix the access pattern so the same blocker does not return tomorrow.

Blocking chain query:

🔍 Diagnostic query

WITH blocking AS
(
SELECT
r.session_id,
r.blocking_session_id,
r.wait_type,
r.wait_time,
r.wait_resource,
r.database_id,
r.sql_handle
FROM sys.dm_exec_requests AS r
WHERE r.blocking_session_id <> 0
)
SELECT
b.session_id AS blocked_session_id,
b.blocking_session_id,
b.wait_type,
b.wait_time AS wait_ms,
b.wait_resource,
DB_NAME(b.database_id) AS database_name,
blocked_text.text AS blocked_sql,
blocker_text.text AS blocker_sql
FROM blocking AS b
OUTER APPLY sys.dm_exec_sql_text(b.sql_handle) AS blocked_text
LEFT JOIN sys.dm_exec_requests AS br
ON b.blocking_session_id = br.session_id
OUTER APPLY sys.dm_exec_sql_text(br.sql_handle) AS blocker_text
ORDER BY b.wait_time DESC;

Open transaction details:

🔍 Diagnostic query

SELECT
s.session_id,
s.login_name,
s.host_name,
s.program_name,
at.transaction_begin_time,
DATEDIFF(minute, at.transaction_begin_time, SYSDATETIME()) AS transaction_age_minutes,
at.transaction_type,
at.transaction_state
FROM sys.dm_tran_session_transactions AS st
JOIN sys.dm_tran_active_transactions AS at
ON st.transaction_id = at.transaction_id
JOIN sys.dm_exec_sessions AS s
ON st.session_id = s.session_id
WHERE s.is_user_process = 1
ORDER BY at.transaction_begin_time;

5. Fix / optimisation

  • Add indexes so updates and reads touch fewer rows.
  • Keep transactions short and predictable.
  • Batch large modifications.
  • Use RCSI when read/write blocking is common and the application can accept statement-level versioned reads.
  • Use consistent object access order to reduce deadlock risk.
  • Avoid user interaction while a transaction is open.

Do not use NOLOCK as the default fix. It can return dirty, duplicated, or missing rows and can still be blocked by schema modification locks. If readers should not block writers, evaluate RCSI with TempDB capacity and application correctness in mind.

Enable RCSI only after testing TempDB version store impact:

🚀 Optimization query

ALTER DATABASE YourDatabaseName
SET READ_COMMITTED_SNAPSHOT ON
WITH ROLLBACK IMMEDIATE;

6. Example scripts

Session 1: reproduce blocking:

🛠 Setup script

USE tempdb;
GO

DROP TABLE IF EXISTS dbo.BlockingLab;
CREATE TABLE dbo.BlockingLab
(
Id int NOT NULL CONSTRAINT PK_BlockingLab PRIMARY KEY,
StatusCode char(1) NOT NULL,
ModifiedAt datetime2(0) NOT NULL
);

INSERT dbo.BlockingLab(Id, StatusCode, ModifiedAt)
VALUES (1, 'N', SYSDATETIME()), (2, 'N', SYSDATETIME());
GO

BEGIN TRAN;
UPDATE dbo.BlockingLab
SET StatusCode = 'P', ModifiedAt = SYSDATETIME()
WHERE Id = 1;

-- Leave this transaction open while Session 2 runs.

Session 2: blocked read:

🛠 Setup script

USE tempdb;
GO

SELECT *
FROM dbo.BlockingLab
WHERE Id = 1;

Diagnose from Session 3:

🔍 Diagnostic query

SELECT
r.session_id,
r.blocking_session_id,
r.wait_type,
r.wait_time,
r.wait_resource,
t.text
FROM sys.dm_exec_requests AS r
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0;

Fix from Session 1:

🚀 Optimization query

COMMIT TRAN;
-- Or ROLLBACK TRAN if this was a test or bad transaction.

Batch update pattern:

🚀 Optimization query

WHILE 1 = 1
BEGIN
UPDATE TOP (5000) dbo.BlockingLab
SET StatusCode = 'A', ModifiedAt = SYSDATETIME()
WHERE StatusCode = 'N';

IF @@ROWCOUNT = 0 BREAK;
END;

7. Key insights (expert notes)

  • The blocker may be sleeping. A sleeping session can still hold an open transaction.
  • Do not kill first and diagnose later. Capture SQL text, login, host, transaction age, and rollback risk.
  • NOLOCK can read uncommitted data, duplicate rows, miss rows, and fail with allocation errors. It is not a blocking strategy.
  • RCSI reduces reader/writer blocking but moves work to TempDB version store.
  • Deadlock priority can protect critical sessions, but it does not fix the access pattern.

🧠 TempDB

Use this section to identify whether TempDB pressure comes from user objects, internal spills, version store, allocation contention, or storage latency.


1. Concept (simple explanation)

TempDB is SQL Server's shared scratch database. It stores temporary tables, table variables, worktables, sort/hash spill data, online index build work, row versions, triggers, and internal objects.

When TempDB is slow or contended, unrelated workloads can all become slow because they share the same workspace.

2. How it works internally

TempDB is recreated every time SQL Server starts. It uses simple recovery and minimal logging where possible, but it still writes to disk.

Main consumers:

ConsumerExample
User objects#temp tables and table variables
Internal objectsSorts, hashes, spools, cursors
Version storeRCSI, snapshot isolation, online index operations
MetadataTemp object creation and drop tracking

Allocation contention often appears on pages such as PFS, GAM, and SGAM. Modern SQL Server versions include improvements, but file layout and workload pattern still matter.

TempDB incidents usually fall into three categories:

  • Space exhaustion: TempDB fills because a query spills, a temp table grows, or version store cleanup is blocked.
  • Allocation contention: many sessions create/drop temp objects or allocate pages at the same time, showing PAGELATCH_* waits.
  • I/O pressure: sorts, hashes, spools, and version store reads/writes overload TempDB storage, showing PAGEIOLATCH_* or slow file latency.

The priority is to identify the consumer before adding space. Extra space can restore service, but it does not tell you whether the cause is one report, RCSI version retention, or systemic spill pressure.

TempDB pressure map
SORT
Sort and hash spills
Internal objects consume TempDB
Usually caused by bad estimates or missing indexes
Check memory grants and spill warnings
#TMP
#temp tables
User objects allocate TempDB pages
Wide temp tables and missing temp indexes add pressure
Check session and task space usage
VER
Row versions
RCSI, snapshot isolation, and online operations use version store
Long readers can prevent cleanup
Check active snapshot transactions
DATA
TempDB data files
Shared landing zone for all consumers
Files must be pre-sized and evenly sized
Autogrowth is a safety net, not capacity planning
LATCH
Allocation contention
PAGELATCH_* on allocation pages
Often caused by high temp object churn
Use equal files and reduce churn
I/O
Read/write latency
PAGEIOLATCH_* or slow file stats
Usually visible after spills or version-store growth
Tune the consumer before adding storage

3. Real-world problem scenario

An e-commerce database enables RCSI to reduce blocking. Blocking improves, but two days later TempDB grows from 20 GB to 300 GB. A long-running reporting transaction keeps old row versions alive, so cleanup cannot remove them. The symptom is TempDB full; the root cause is a long snapshot/RCSI reader.

4. How to diagnose

  • Check TempDB file sizes and free space.
  • Separate user object, internal object, and version store usage.
  • Identify sessions consuming TempDB.
  • Check waits: PAGELATCH_* suggests allocation contention; PAGEIOLATCH_* suggests physical I/O.
  • Check active snapshot transactions when version store grows.

Check in this order:

  1. Is TempDB full, slow, or contended?
  2. Is usage dominated by user objects, internal objects, or version store?
  3. Which session or request is allocating the most pages?
  4. Are long snapshot/RCSI transactions preventing version cleanup?
  5. Are files equally sized and growing predictably?

TempDB file usage:

🔍 Diagnostic query

USE tempdb;
GO

SELECT
name,
type_desc,
size / 128.0 AS size_mb,
FILEPROPERTY(name, 'SpaceUsed') / 128.0 AS used_mb,
(size - FILEPROPERTY(name, 'SpaceUsed')) / 128.0 AS free_mb
FROM sys.database_files;

TempDB space by session:

🔍 Diagnostic query

SELECT TOP (25)
s.session_id,
s.login_name,
s.host_name,
s.program_name,
(su.user_objects_alloc_page_count - su.user_objects_dealloc_page_count) * 8 / 1024 AS user_objects_mb,
(su.internal_objects_alloc_page_count - su.internal_objects_dealloc_page_count) * 8 / 1024 AS internal_objects_mb,
r.status,
r.wait_type,
t.text AS sql_text
FROM sys.dm_db_session_space_usage AS su
JOIN sys.dm_exec_sessions AS s
ON su.session_id = s.session_id
LEFT JOIN sys.dm_exec_requests AS r
ON su.session_id = r.session_id
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE s.is_user_process = 1
ORDER BY (su.user_objects_alloc_page_count + su.internal_objects_alloc_page_count) DESC;

Version store usage:

🔍 Diagnostic query

SELECT
SUM(version_store_reserved_page_count) * 8 / 1024 AS version_store_mb,
SUM(user_object_reserved_page_count) * 8 / 1024 AS user_objects_mb,
SUM(internal_object_reserved_page_count) * 8 / 1024 AS internal_objects_mb,
SUM(unallocated_extent_page_count) * 8 / 1024 AS free_space_mb
FROM sys.dm_db_file_space_usage;

Active snapshot transactions:

🔍 Diagnostic query

SELECT
transaction_id,
session_id,
elapsed_time_seconds,
is_snapshot,
first_snapshot_sequence_num,
max_version_chain_traversed
FROM sys.dm_tran_active_snapshot_database_transactions
ORDER BY elapsed_time_seconds DESC;

5. Fix / optimisation

  • User objects: reduce temp table width, index temp tables correctly, drop when done, avoid unnecessary temp materialization.
  • Internal objects: tune queries that sort, hash, spool, or spill; improve estimates and indexes.
  • Version store: find long-running snapshot/RCSI readers; shorten transactions; schedule reporting safely.
  • Allocation contention: use multiple equally sized TempDB data files and modern SQL Server defaults.
  • I/O pressure: move TempDB to faster storage and pre-size files to avoid autogrowth storms.

Do not assume the fix is always "add more TempDB files." More equal-sized files help allocation contention, but they will not fix a spilling hash join, an unbounded temp table, or a long transaction retaining row versions.

6. Example scripts

Reproduce TempDB internal object pressure with a sort:

🛠 Setup script

USE tempdb;
GO

DROP TABLE IF EXISTS dbo.TempDbSortLab;
GO

CREATE TABLE dbo.TempDbSortLab
(
Id int IDENTITY(1,1) NOT NULL,
GroupId int NOT NULL,
Padding char(200) NOT NULL DEFAULT 'x'
);

WITH n AS
(
SELECT TOP (300000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b
)
INSERT dbo.TempDbSortLab(GroupId)
SELECT rn % 5000
FROM n;
GO

-- This sort can request memory and may spill depending on the lab machine.
SELECT GroupId, COUNT(*) AS row_count
FROM dbo.TempDbSortLab
GROUP BY GroupId
ORDER BY COUNT(*) DESC;

Diagnose task-level TempDB usage while a query runs:

🔍 Diagnostic query

SELECT
session_id,
request_id,
(internal_objects_alloc_page_count - internal_objects_dealloc_page_count) * 8 / 1024 AS internal_mb,
(user_objects_alloc_page_count - user_objects_dealloc_page_count) * 8 / 1024 AS user_mb
FROM sys.dm_db_task_space_usage
WHERE session_id <> @@SPID
ORDER BY internal_mb DESC;

Fix with an index that supports grouping:

🚀 Optimization query

CREATE INDEX IX_TempDbSortLab_GroupId ON dbo.TempDbSortLab(GroupId);
GO

SELECT GroupId, COUNT(*) AS row_count
FROM dbo.TempDbSortLab
GROUP BY GroupId
ORDER BY COUNT(*) DESC;

Check TempDB file layout:

✅ Validation query

SELECT
name,
physical_name,
size / 128.0 AS size_mb,
growth,
is_percent_growth
FROM tempdb.sys.database_files
WHERE type_desc = 'ROWS'
ORDER BY file_id;

7. Key insights (expert notes)

  • TempDB is shared. One bad report can hurt the whole instance.
  • Spills are query tuning problems that show up as TempDB problems.
  • RCSI is useful, but long readers can retain row versions and grow TempDB.
  • Equal file size matters. Unequal files can cause uneven allocation.
  • Autogrowth is a safety net, not a sizing strategy.

🧠 Troubleshooting playbook

Use this playbook when you need ordered, incident-ready checks for CPU pressure, blocking, or slow-query triage.


High CPU issue steps

  1. Confirm CPU pressure at SQL Server level, not only VM/OS level.
    • Check whether SQL Server process CPU is high and whether signal waits or runnable tasks show scheduler pressure.
    • If CPU is normal but elapsed time is high, look for waits instead of tuning CPU.
  2. Check active CPU consumers:

🔍 Diagnostic query

SELECT TOP (20)
r.session_id,
r.cpu_time AS cpu_ms,
r.total_elapsed_time AS elapsed_ms,
r.logical_reads,
r.status,
r.command,
t.text AS sql_text
FROM sys.dm_exec_requests AS r
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID
ORDER BY r.cpu_time DESC;
  1. Check cached top CPU queries and compare with Query Store history.
  2. Inspect actual plans for scans, bad joins, scalar UDFs, implicit conversions, and excessive parallelism.
  3. Check waits. High signal wait or many runnable tasks means CPU scheduling pressure.
  4. Mitigate: force last good plan, stop runaway query, reduce DOP for the query, or scale temporarily.
  5. Fix root cause: indexes, query rewrite, statistics, parameter sensitivity, or app call pattern.

Do not clear the plan cache as the first response to high CPU. If the issue is parameter sniffing, a cache clear may temporarily hide it and then let the same bad plan recompile under a different parameter. Use Query Store to identify and force a known-good plan when appropriate.

Blocking troubleshooting steps

  1. Identify the lead blocker and transaction age.
  2. Capture blocker SQL text, login, host, program, wait resource, and open transaction details.
  3. Decide whether to wait, ask the owner to commit/rollback, or kill the session.
  4. If killing, estimate rollback impact. Large updates can take time to undo.
  5. After service restoration, fix the pattern: shorter transactions, better indexes, batching, RCSI, or consistent access order.

Treat blocking as a chain, not a list. The session with the most visible wait is often a victim, not the cause. The lead blocker may be sleeping, may have no active request, and may still hold locks through an open transaction.

Quick blocker query:

🔍 Diagnostic query

SELECT
blocked.session_id AS blocked_session_id,
blocked.blocking_session_id,
blocked.wait_type,
blocked.wait_time AS wait_ms,
blocked.wait_resource,
blocker_s.login_name AS blocker_login,
blocker_s.host_name AS blocker_host,
blocker_s.program_name AS blocker_program,
blocker_text.text AS blocker_sql
FROM sys.dm_exec_requests AS blocked
LEFT JOIN sys.dm_exec_sessions AS blocker_s
ON blocked.blocking_session_id = blocker_s.session_id
LEFT JOIN sys.dm_exec_requests AS blocker_r
ON blocked.blocking_session_id = blocker_r.session_id
OUTER APPLY sys.dm_exec_sql_text(blocker_r.sql_handle) AS blocker_text
WHERE blocked.blocking_session_id <> 0
ORDER BY blocked.wait_time DESC;

Slow query checklist

  • Is the query slow now, or was it slow historically?
  • Did the plan change? Check Query Store.
  • Are estimated and actual rows far apart?
  • Are predicates SARGable?
  • Are parameter data types aligned with column data types?
  • Is there a key lookup repeated many times?
  • Are there spills to TempDB?
  • Is the query waiting on locks, I/O, memory grants, or CPU?
  • Are statistics stale or sampled poorly?
  • Does the query return more columns or rows than the application needs?
  • Would a filtered, covering, or composite index match the real access pattern?
  • Is this a parameter-sensitive procedure?

When prioritizing the checklist, start with evidence that changes the remediation path: plan changed, row estimates wrong, blocking present, or spill present. Then inspect index design and query shape. Review cosmetic query cleanup later.

Do not add every missing-index recommendation during an incident. Missing-index DMVs do not understand your full index portfolio, write overhead, filtered-index opportunities, or duplicate indexes. Convert suggestions into an index review after the service is stable.

Slow query starter script:

🔍 Diagnostic query

DECLARE @session_id smallint = 0; -- Replace with target session_id.

SELECT
r.session_id,
r.status,
r.cpu_time AS cpu_ms,
r.total_elapsed_time AS elapsed_ms,
r.logical_reads,
r.reads,
r.writes,
r.wait_type,
r.wait_time AS wait_ms,
r.blocking_session_id,
qp.query_plan,
t.text AS sql_text
FROM sys.dm_exec_requests AS r
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
OUTER APPLY sys.dm_exec_query_plan(r.plan_handle) AS qp
WHERE r.session_id = @session_id;

🧠 References

Use these Microsoft Learn pages to verify syntax, supported behavior, and engine details before applying changes in production.


🧠 Flashcards

Use these recall prompts to test the incident workflow, not just memorize DMV names.


What is the first execution-plan comparison to make during a slow-query incident?
Click to reveal answer
Compare estimated rows to actual rows. Large gaps explain bad join choices, wrong access methods, memory grant problems, and spills.
1 / 6