SQL Server expert troubleshooting guide
This page is a field guide for engineers who troubleshoot SQL Server under pressure. It goes below the surface: execution plans, optimizer behavior, SQLOS, DMVs, blocking chains, TempDB allocation, and the scripts you can run in a lab before you need them in production.
Use the scripts in a non-production database such as SqlTroubleshootingLab. Several examples intentionally create slow queries, blocking, or TempDB pressure.
🧠 Architecture map
This guide follows the same path a request follows inside the engine: client call, compilation, execution, waits, storage, TempDB, and diagnostic evidence. Use the diagram as a triage map during incidents rather than treating each topic as isolated theory.
Under Sev-A pressure, keep the mental flow simple:
- Client -> compile: confirm whether the workload is sending stable, parameterized calls or ad-hoc text with mismatched data types.
- Compile -> execute: check whether the selected plan matches the data shape, parameter value, statistics, and available indexes.
- Execute -> waits: classify whether the request is burning CPU or waiting on locks, I/O, memory grants, workers, TempDB, or the client.
- Waits -> resources: map the wait to the resource before changing anything.
LCK_M_*means concurrency,PAGEIOLATCH_*means physical reads,RESOURCE_SEMAPHOREmeans memory grant pressure, andPAGELATCH_*on TempDB allocation pages means latch contention. - Resources -> diagnostics: capture DMV and Query Store evidence before cache clears, restarts, kills, or configuration changes remove the trail.
| Component | When things go wrong | Check first | Common root causes | Quick remediation |
|---|---|---|---|---|
| Parser / algebrizer | Compile errors, sudden failures after deployment, permissions errors | Error text, deployed SQL, object names, data types | Renamed objects, invalid schema binding, implicit conversion from client parameters | Fix code or parameter type; avoid emergency server changes |
| Optimizer | High CPU, scans, wrong join type, unstable duration | Actual plan, Query Store, statistics date, estimated vs actual rows | Parameter sniffing, stale stats, skewed data, non-SARGable predicates | Force last good plan if needed; update stats or rewrite predicate after evidence |
| Execution plan | Memory spills, excessive DOP, repeated lookups | Plan warnings, memory grant, DOP, key lookup count | Bad cardinality, missing covering index, row goal, implicit conversion | Add targeted index, fix estimate, tune query-level DOP or recompile strategy |
| SQLOS | High signal waits, runnable queue, THREADPOOL | sys.dm_os_schedulers, sys.dm_os_wait_stats, active workers | CPU saturation, worker starvation, connection storm, blocking fan-out | Stop runaway work, reduce concurrency, fix pooling, scale only as mitigation |
| Buffer pool / storage | PAGEIOLATCH_*, high physical reads, file latency | sys.dm_io_virtual_file_stats, logical reads, plan access methods | Scans, cold cache, undersized memory, slow data files | Reduce reads first; then tune storage or memory |
| Lock manager | Timeouts, blocked sessions, deadlocks | sys.dm_exec_requests, sys.dm_tran_locks, deadlock graph | Long transactions, missing indexes, lock escalation, isolation mismatch | Find lead blocker; shorten transactions; batch writes; consider RCSI after testing |
| TempDB | Spills, TempDB full, PAGELATCH_*, version store growth | sys.dm_db_file_space_usage, session/task space usage, snapshot transactions | Sort/hash spills, long RCSI readers, too few or uneven files | Tune spilling query; end long reader; pre-size equal files; fix allocation contention |
| DMVs / Query Store / XEvents | Missing history, unclear root cause, competing narratives | Live DMVs for now, Query Store for history, XEvents for deadlocks | Evidence lost after restart, plan eviction, incomplete snapshot | Capture first, then act; persist snapshots for intermittent incidents |
During a live incident, do not start with DBCC FREEPROCCACHE, SQL Server restart, blanket index creation, or NOLOCK changes. Those actions can remove evidence, cause wider regressions, or trade correctness for speed. Capture evidence first, restore service second, and perform root-cause tuning after the blast radius is contained.
Oracle parallel: Treat SQL Server Query Store plus DMVs as the closest day-to-day equivalent of AWR, ASH, V$SQL, and V$SESSION_WAIT. The mapping works for workload history and live diagnosis, but SQL Server exposes different wait names, plan operators, and locking semantics.
🧠 Execution plans
Use this section to read plans as troubleshooting evidence: estimates, operators, warnings, memory grants, and plan regressions.
1. Concept (simple explanation)
An execution plan is SQL Server's recipe for running a query. It shows which tables SQL Server reads, which indexes it uses, how rows are joined, where data is sorted, and where memory is needed.
Read plans as evidence, not artwork. The important question is not "does the plan look complex?" The important question is: "Where did SQL Server estimate wrong, read too much, spill to disk, or choose an operator that does not match the data?"
Common operators:
| Operator | Simple meaning | Performance signal |
|---|---|---|
| Index seek | Navigate directly to a range in a B-tree | Usually good, but can still read many rows |
| Index/table scan | Read a full structure | Fine for large ranges; bad for selective predicates |
| Key lookup | Fetch missing columns from clustered index | Expensive when repeated many times |
| Hash match | Build/probe hash table for joins or aggregates | Good for large sets; can spill if memory grant is low |
| Nested loops | For each outer row, search inner input | Good for small outer input; terrible when estimates are low |
| Sort | Order rows | Needs memory; spills to TempDB if grant is too small |
2. How it works internally
SQL Server turns a query into a plan through four main stages:
Internally:
- The parser checks syntax and builds a tree.
- The algebrizer resolves tables, columns, data types, ownership chains, and permissions.
- The optimizer uses statistics histograms, density vectors, constraints, indexes, and row goals to estimate row counts.
- The optimizer evaluates physical choices: join order, join type, access method, serial vs parallel plan, and memory grants.
- The execution engine runs the selected plan using the iterator model. Each operator asks its child operator for rows.
- The plan may be cached so the next execution avoids compile CPU.
The optimizer is cost-based, but the cost is only as good as the estimates. Bad statistics, implicit conversions, parameter sniffing, table variables, skewed data, and non-SARGable predicates can all push the optimizer toward the wrong plan.
When things go wrong, the usual chain is: bad cardinality -> bad join/access choice -> wrong memory grant or DOP -> high CPU, spills, or excessive reads. For example, a predicate estimated at 10 rows but returning 10 million rows can choose nested loops plus key lookups, burn CPU, and overload storage even though every individual operator looks valid.
Prioritize plan investigation this way:
- First: compare estimated rows vs actual rows at the first large divergence.
- Next: inspect warnings for spills, implicit conversions, missing join predicates, and excessive grants.
- Then: decide whether the problem is compile-time metadata, runtime parameter sensitivity, or missing physical access path.
- Later: consider hints. Hints can stabilize service, but they also freeze assumptions.
3. Real-world problem scenario
Contoso has an order lookup procedure that usually returns one customer. After a weekend data load, the API latency jumps from 80 ms to 8 seconds. CPU rises, reads spike, and the plan shows an index scan with a hash join. The root cause is an implicit conversion: the application sends nvarchar while the indexed column is varchar, so SQL Server converts the column and cannot seek efficiently.
4. How to diagnose
- Capture the actual execution plan, not only the estimated plan.
- Compare estimated vs actual rows on each operator.
- Look for warnings: spills, missing indexes, implicit conversions, residual predicates, and excessive memory grants.
- Check whether the plan is new by using Query Store.
- Check whether the query is parameter-sensitive: fast for one parameter, slow for another.
Use Query Store early when the question is "what changed?". It tells you whether the query regressed because the plan changed, the same plan slowed down, or the runtime volume changed. Use live DMVs when the question is "what is hurting us right now?".
Do not assume a scan is bad or a seek is good. A seek with millions of key lookups can be worse than a controlled scan. The incident signal is total work: CPU, reads, elapsed time, waits, and memory spills.
Diagnostic query for cached plans with implicit conversions:
🔍 Diagnostic query
SELECT TOP (25)
DB_NAME(st.dbid) AS database_name,
qs.execution_count,
qs.total_worker_time / 1000 AS total_cpu_ms,
qs.total_logical_reads,
SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
((CASE qs.statement_end_offset
WHEN -1 THEN DATALENGTH(st.text)
ELSE qs.statement_end_offset
END - qs.statement_start_offset) / 2) + 1) AS statement_text,
qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
WHERE CAST(qp.query_plan AS nvarchar(max)) LIKE N'%CONVERT_IMPLICIT%'
ORDER BY qs.total_worker_time DESC;
Query Store plan regression check:
🔍 Diagnostic query
SELECT TOP (20)
qsq.query_id,
qsp.plan_id,
rs.count_executions,
rs.avg_duration / 1000.0 AS avg_duration_ms,
rs.avg_cpu_time / 1000.0 AS avg_cpu_ms,
rs.avg_logical_io_reads,
qt.query_sql_text
FROM sys.query_store_runtime_stats AS rs
JOIN sys.query_store_plan AS qsp
ON rs.plan_id = qsp.plan_id
JOIN sys.query_store_query AS qsq
ON qsp.query_id = qsq.query_id
JOIN sys.query_store_query_text AS qt
ON qsq.query_text_id = qt.query_text_id
ORDER BY rs.avg_duration DESC;
5. Fix / optimisation
- If estimates are wrong, update statistics or improve predicate shape.
- If a key lookup runs thousands of times, add included columns or rewrite the query.
- If an implicit conversion blocks seeks, align parameter and column data types.
- If parameter sniffing causes plan instability, use Query Store plan forcing,
OPTION (RECOMPILE),OPTIMIZE FOR, or SQL Server 2022 Parameter Sensitive Plan optimization. - If a sort or hash spills, reduce rows, add an index that provides order, or investigate memory grants.
The fix works because it gives the optimizer better metadata or a cheaper physical access path.
6. Example scripts
Reproduce a non-SARGable predicate:
🛠 Setup script
USE tempdb;
GO
DROP TABLE IF EXISTS dbo.PlanLabOrders;
GO
CREATE TABLE dbo.PlanLabOrders
(
OrderId int IDENTITY(1,1) NOT NULL CONSTRAINT PK_PlanLabOrders PRIMARY KEY,
CustomerCode varchar(20) NOT NULL,
OrderDate date NOT NULL,
Amount money NOT NULL
);
CREATE INDEX IX_PlanLabOrders_OrderDate ON dbo.PlanLabOrders(OrderDate);
WITH n AS
(
SELECT TOP (200000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b
)
INSERT dbo.PlanLabOrders(CustomerCode, OrderDate, Amount)
SELECT
CONCAT('C', RIGHT(CONCAT('00000', rn % 5000), 5)),
DATEADD(day, -(rn % 1500), CONVERT(date, SYSDATETIME())),
(rn % 1000) + 1
FROM n;
GO
SET STATISTICS IO, TIME ON;
-- Bad: function on indexed column prevents a direct range seek.
SELECT COUNT(*) AS orders_this_year
FROM dbo.PlanLabOrders
WHERE YEAR(OrderDate) = YEAR(SYSDATETIME());
-- Better: range predicate is SARGable.
DECLARE @start date = DATEFROMPARTS(YEAR(SYSDATETIME()), 1, 1);
DECLARE @end date = DATEADD(year, 1, @start);
SELECT COUNT(*) AS orders_this_year
FROM dbo.PlanLabOrders
WHERE OrderDate >= @start
AND OrderDate < @end;
Diagnose row estimate error in the actual plan:
🔍 Diagnostic query
SET STATISTICS XML ON;
GO
SELECT COUNT(*) AS orders_this_year
FROM dbo.PlanLabOrders
WHERE YEAR(OrderDate) = YEAR(SYSDATETIME());
GO
SET STATISTICS XML OFF;
Fix with a computed column when code cannot change immediately:
🚀 Optimization query
ALTER TABLE dbo.PlanLabOrders
ADD OrderYear AS YEAR(OrderDate) PERSISTED;
GO
CREATE INDEX IX_PlanLabOrders_OrderYear ON dbo.PlanLabOrders(OrderYear);
GO
SELECT COUNT(*) AS orders_this_year
FROM dbo.PlanLabOrders
WHERE OrderYear = YEAR(SYSDATETIME());
7. Key insights (expert notes)
- Estimated cost is not elapsed time. It is an optimizer model.
- A seek is not always good. A seek that reads 80 percent of a table can be worse than a scan.
- The most useful plan comparison is estimated rows vs actual rows.
- Missing index hints are per-query suggestions. Consolidate them before creating indexes.
- Warnings on operators often explain the incident faster than the visible operator names.
🧠 Performance tuning
Use this section to move from symptom to bottleneck before choosing an index, query rewrite, isolation change, or temporary mitigation.
1. Concept (simple explanation)
Performance tuning is the discipline of reducing wasted work. SQL Server runs faster when it reads fewer pages, compiles less often, waits less, uses memory correctly, and avoids unnecessary contention.
Tuning is not only indexing. It includes query shape, schema design, statistics, isolation levels, TempDB, plan stability, memory grants, parallelism, and application call patterns.
2. How it works internally
Most performance incidents land in one of four buckets:
Inside SQL Server, a request uses a worker on a scheduler. When it cannot continue, it records a wait. Good tuning starts by identifying where time is going:
- CPU time means the worker is executing.
- Wait time means the worker is stalled on something.
- Logical reads show buffer pool work.
- Physical reads show storage access.
- Writes and log flushes show transaction log pressure.
- Spills show TempDB used because memory was insufficient.
Production tuning is triage before surgery. During a Sev-A, separate restore service actions from permanent fix actions:
- Restore service: stop or throttle the runaway query, force a known-good plan, reduce concurrency, add temporary capacity, or kill a confirmed blocker after rollback risk is understood.
- Permanent fix: correct cardinality, index design, query shape, transaction scope, application call pattern, or isolation model.
Common failure patterns:
- Parameter sniffing: one cached plan is good for small customers and disastrous for large customers.
- Memory grant pressure: large sorts/hashes wait on
RESOURCE_SEMAPHOREor spill because estimates are wrong. - Parallelism skew: high DOP uses many workers but one branch does most of the work; watch
CXPACKET,CXCONSUMER, and actual row distribution. - Lock escalation: many row/key locks become a table lock, expanding the blocking impact.
- TempDB contention: the real query issue is a spill, but the visible symptom is TempDB I/O or latch pressure.
3. Real-world problem scenario
Fabrikam runs a reporting query every 15 minutes. It used to finish in 30 seconds. It now runs for 12 minutes and blocks OLTP writes. The query joins five tables, scans a 90 GB order table, spills a hash aggregate to TempDB, and runs with degree of parallelism 8. The issue is a combination of stale statistics, a missing covering index, and the report running under read committed against hot transactional rows.
4. How to diagnose
- Confirm whether the problem is current or historical.
- For current issues, check active requests, waits, blockers, CPU, reads, writes, and grants.
- For historical issues, use Query Store and wait stats.
- Compare before/after plans if the issue is a regression.
- Validate whether scaling masks the problem or fixes the real bottleneck.
Check in this order when the bridge is crowded:
- Is SQL Server actually the bottleneck, or is the app waiting on network/client processing?
- Which sessions are consuming CPU, reads, writes, memory grants, or blocking others right now?
- Is there one dominant query/session, or is the whole workload degraded?
- Did the plan or runtime profile change compared with the last known-good window?
- What is the safest mitigation that restores service without destroying evidence?
Current expensive requests:
🔍 Diagnostic query
SELECT
r.session_id,
r.status,
r.command,
r.cpu_time AS cpu_ms,
r.total_elapsed_time AS elapsed_ms,
r.logical_reads,
r.reads,
r.writes,
r.wait_type,
r.wait_time AS wait_ms,
r.blocking_session_id,
DB_NAME(r.database_id) AS database_name,
SUBSTRING(t.text, (r.statement_start_offset / 2) + 1,
((CASE r.statement_end_offset
WHEN -1 THEN DATALENGTH(t.text)
ELSE r.statement_end_offset
END - r.statement_start_offset) / 2) + 1) AS running_statement
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID
ORDER BY r.cpu_time DESC;
Top cached queries by reads:
🔍 Diagnostic query
SELECT TOP (25)
qs.execution_count,
qs.total_logical_reads,
qs.total_logical_reads / NULLIF(qs.execution_count, 0) AS avg_logical_reads,
qs.total_worker_time / 1000 AS total_cpu_ms,
qs.total_elapsed_time / 1000 AS total_duration_ms,
DB_NAME(st.dbid) AS database_name,
SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
((CASE qs.statement_end_offset
WHEN -1 THEN DATALENGTH(st.text)
ELSE qs.statement_end_offset
END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_logical_reads DESC;
Wait profile since last restart:
🔍 Diagnostic query
SELECT TOP (30)
wait_type,
waiting_tasks_count,
wait_time_ms,
signal_wait_time_ms,
wait_time_ms - signal_wait_time_ms AS resource_wait_ms,
100.0 * wait_time_ms / SUM(wait_time_ms) OVER () AS pct_of_waits
FROM sys.dm_os_wait_stats
WHERE wait_type NOT LIKE 'SLEEP%'
AND wait_type NOT IN
(
'BROKER_TASK_STOP', 'BROKER_TO_FLUSH', 'BROKER_EVENTHANDLER',
'SQLTRACE_BUFFER_FLUSH', 'CLR_AUTO_EVENT', 'CLR_MANUAL_EVENT',
'LAZYWRITER_SLEEP', 'CHECKPOINT_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH',
'XE_TIMER_EVENT', 'XE_DISPATCHER_WAIT', 'LOGMGR_QUEUE'
)
ORDER BY wait_time_ms DESC;
5. Fix / optimisation
- High logical reads: add or redesign indexes, rewrite predicates, reduce selected columns.
- High CPU: fix bad plans, scalar UDFs, excessive parallelism, implicit conversions, and row-by-row logic.
- I/O waits: reduce reads first, then check storage latency and file layout.
- Blocking: shorten transactions, add useful indexes, change isolation to row versioning when appropriate.
- Spills: improve cardinality, indexes, memory grant feedback, or query shape.
- Plan regression: use Query Store to force a known-good plan while fixing root cause.
Do not tune from a single metric. High CPU with low reads points differently than high CPU with massive reads. WRITELOG points to commit/log throughput. ASYNC_NETWORK_IO often means the database is waiting for the client to consume rows, not that SQL Server needs more CPU.
6. Example scripts
Reproduce a key lookup problem:
🛠 Setup script
USE tempdb;
GO
DROP TABLE IF EXISTS dbo.TuningLabCustomerOrders;
GO
CREATE TABLE dbo.TuningLabCustomerOrders
(
OrderId int IDENTITY(1,1) NOT NULL CONSTRAINT PK_TuningLabCustomerOrders PRIMARY KEY,
CustomerId int NOT NULL,
OrderDate datetime2(0) NOT NULL,
StatusCode char(1) NOT NULL,
Amount money NOT NULL,
Notes varchar(200) NULL
);
CREATE INDEX IX_TuningLabCustomerOrders_CustomerId
ON dbo.TuningLabCustomerOrders(CustomerId);
WITH n AS
(
SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b
)
INSERT dbo.TuningLabCustomerOrders(CustomerId, OrderDate, StatusCode, Amount, Notes)
SELECT rn % 1000,
DATEADD(minute, -rn, SYSDATETIME()),
CASE WHEN rn % 10 = 0 THEN 'C' ELSE 'O' END,
rn % 2500,
REPLICATE('x', 50)
FROM n;
GO
SET STATISTICS IO, TIME ON;
SELECT OrderId, CustomerId, OrderDate, StatusCode, Amount
FROM dbo.TuningLabCustomerOrders
WHERE CustomerId = 42;
Diagnose missing coverage:
🔍 Diagnostic query
SELECT
migs.user_seeks,
migs.avg_total_user_cost,
migs.avg_user_impact,
mid.statement AS table_name,
mid.equality_columns,
mid.inequality_columns,
mid.included_columns
FROM sys.dm_db_missing_index_group_stats AS migs
JOIN sys.dm_db_missing_index_groups AS mig
ON migs.group_handle = mig.index_group_handle
JOIN sys.dm_db_missing_index_details AS mid
ON mig.index_handle = mid.index_handle
WHERE mid.database_id = DB_ID()
ORDER BY migs.avg_user_impact DESC;
Fix with a covering index:
🚀 Optimization query
CREATE INDEX IX_TuningLabCustomerOrders_CustomerId_Covering
ON dbo.TuningLabCustomerOrders(CustomerId)
INCLUDE (OrderDate, StatusCode, Amount);
GO
SELECT OrderId, CustomerId, OrderDate, StatusCode, Amount
FROM dbo.TuningLabCustomerOrders
WHERE CustomerId = 42;
7. Key insights (expert notes)
- Tune by bottleneck, not by habit.
SELECT *is a performance design decision. It widens indexes, memory grants, network usage, and lookups.- Query Store is often safer than clearing plan cache during an incident.
- Avoid changing five things at once. You will not know which fix worked.
- Scaling up can be a valid mitigation, but it should not be the only diagnosis.
🧠 SQL Server internals
Use this section to connect waits and resource symptoms back to SQLOS, memory, storage, logging, locking, and TempDB internals.
1. Concept (simple explanation)
SQL Server internals explain what the engine is doing when your query runs: scheduling workers, managing memory, reading 8 KB pages, writing log records, taking locks, creating row versions, and using TempDB for temporary work.
When you understand internals, wait stats become meaningful. PAGEIOLATCH_SH is not just a code. It means a worker is waiting for a data page to be read from storage into the buffer pool.
2. How it works internally
Core pieces:
| Component | What it does | What you monitor |
|---|---|---|
| SQLOS | Scheduling, workers, memory, waits | sys.dm_os_schedulers, wait stats |
| Buffer pool | Caches 8 KB data pages | PLE, physical reads, buffer descriptors |
| Plan cache | Stores compiled plans | cache size, single-use plans, compile CPU |
| Transaction log | Guarantees durability with write-ahead logging | log flush waits, VLFs, log reuse waits |
| Lock manager | Coordinates concurrent access | locks, blocking chains, deadlocks |
| TempDB | Shared workspace for temp objects and spills | file contention, version store, internal objects |
Incident lens for these components:
- SQLOS matters because every running request needs workers and schedulers. If runnable tasks pile up, adding indexes will not immediately clear CPU starvation.
- Buffer pool matters because logical reads are memory work and physical reads are storage work. A plan that reads 50 million pages can saturate either path.
- Plan cache matters because the wrong reused plan can make a good procedure fail intermittently.
- Transaction log matters because commits wait for log hardening. Slow log writes become application latency.
- Lock manager matters because one long transaction can make a healthy server look unavailable.
- TempDB matters because spills, versions, and temporary objects share a finite workspace across the whole instance.
Data access path:
Write-ahead logging means the log record must be hardened before the changed data page is written to the data file. That is why slow log storage can hurt write-heavy systems even when data files are fast.
3. Real-world problem scenario
Tailwind Traders sees periodic stalls during checkout. CPU is low, but application threads wait. SQL Server wait stats show WRITELOG, and the transaction log file has thousands of small virtual log files after years of tiny autogrowth increments. The actual issue is log throughput and poor log file growth history, not missing indexes.
4. How to diagnose
- Check wait stats first to identify the broad subsystem.
- Check scheduler pressure for runnable queue buildup.
- Check memory clerks for unusual memory consumers.
- Check database file latency.
- Check log reuse waits and VLF count for transaction log health.
When the system is unstable, take a quick baseline snapshot before changing anything: active requests, wait stats delta, file latency, scheduler runnable queue, memory grants, blockers, and top Query Store regressions. A five-minute evidence window is often enough to prevent a wrong fix.
Scheduler health:
🔍 Diagnostic query
SELECT
scheduler_id,
cpu_id,
status,
is_online,
current_tasks_count,
runnable_tasks_count,
current_workers_count,
active_workers_count,
work_queue_count
FROM sys.dm_os_schedulers
WHERE scheduler_id < 255
ORDER BY scheduler_id;
File latency:
🔍 Diagnostic query
SELECT
DB_NAME(vfs.database_id) AS database_name,
mf.type_desc,
mf.physical_name,
vfs.num_of_reads,
vfs.io_stall_read_ms,
vfs.io_stall_read_ms / NULLIF(vfs.num_of_reads, 0) AS avg_read_ms,
vfs.num_of_writes,
vfs.io_stall_write_ms,
vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
ON vfs.database_id = mf.database_id
AND vfs.file_id = mf.file_id
ORDER BY avg_write_ms DESC;
Memory clerk overview:
🔍 Diagnostic query
SELECT TOP (25)
type,
pages_kb / 1024 AS pages_mb,
virtual_memory_committed_kb / 1024 AS virtual_committed_mb
FROM sys.dm_os_memory_clerks
ORDER BY pages_kb DESC;
5. Fix / optimisation
- For
WRITELOG, reduce transaction size, improve log storage latency, avoid row-by-row commits, and pre-size log files. - For
SOS_SCHEDULER_YIELD, find CPU-heavy queries and excessive parallelism. - For
PAGEIOLATCH_*, reduce reads first; storage tuning comes after query/index tuning. - For plan cache bloat, parameterize safely and avoid ad-hoc literal storms.
- For memory pressure, set max server memory correctly and find large grants or clerks.
Avoid broad instance-level changes unless the evidence points there. Changing max server memory, MAXDOP, cost threshold, or clearing cache during a live incident can affect every workload. Prefer query-level mitigation when one query or procedure is the cause.
6. Example scripts
Reproduce plan cache bloat:
🛠 Setup script
USE tempdb;
GO
DROP TABLE IF EXISTS dbo.CacheLab;
CREATE TABLE dbo.CacheLab (Id int NOT NULL PRIMARY KEY, ValueText varchar(50) NOT NULL);
INSERT dbo.CacheLab(Id, ValueText)
SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), 'cache-test'
FROM sys.all_objects;
GO
DECLARE @i int = 1;
DECLARE @sql nvarchar(max);
WHILE @i <= 200
BEGIN
SET @sql = N'SELECT ValueText FROM dbo.CacheLab WHERE Id = ' + CONVERT(nvarchar(20), @i) + N';';
EXEC sys.sp_executesql @sql;
SET @i += 1;
END;
Diagnose single-use ad-hoc plans:
🔍 Diagnostic query
SELECT TOP (25)
cp.objtype,
cp.usecounts,
cp.size_in_bytes / 1024 AS size_kb,
st.text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE cp.objtype = 'Adhoc'
AND cp.usecounts = 1
ORDER BY cp.size_in_bytes DESC;
Fix with parameterization in application code or sp_executesql:
🚀 Optimization query
DECLARE @id int = 42;
EXEC sys.sp_executesql
N'SELECT ValueText FROM dbo.CacheLab WHERE Id = @Id;',
N'@Id int',
@Id = @id;
7. Key insights (expert notes)
- SQL Server waits are a time accounting system. They are not all problems.
- Signal wait time points to CPU scheduling pressure. Resource wait time points to the thing being waited on.
- The buffer pool caches pages, not rows.
- The transaction log is sequential, but it is often the critical path for writes.
- Clearing cache may hide the evidence you need. Capture first, clear only with intent.
🧠 Dynamic management views (DMVs)
Use this section to capture live SQL Server evidence before remediation changes remove the incident trail.
1. Concept (simple explanation)
DMVs are SQL Server's live inspection windows. They expose what is running, what has run, where time was spent, which locks exist, how much TempDB is used, and how the engine is behaving.
Think of DMVs as a toolbox. One DMV rarely gives the full answer. You usually join request, session, SQL text, plan, wait, lock, and database metadata.
In production, DMVs are your first live triage layer:
- Start with
sys.dm_exec_requeststo find what is active. - Join to
sys.dm_exec_sessionsto identify owner, host, and application. - Use
sys.dm_exec_sql_textandsys.dm_exec_query_planto capture what is running. - Use waits, locks, grants, and TempDB usage DMVs to explain why it is slow.
- Use Query Store when you need history beyond what plan cache can retain.
2. How it works internally
DMVs read engine state kept in memory. Some reset when SQL Server restarts. Some reset when a database is closed, an object is rebuilt, a plan leaves cache, or Query Store is cleaned.
That volatility matters during incidents. A restart can make dashboards look clean while the root cause remains. A plan eviction can erase the only cached plan showing the bad estimate. Capture before remediation whenever the situation allows.
Important DMV families:
| Need | DMV family |
|---|---|
| Active sessions and requests | sys.dm_exec_sessions, sys.dm_exec_requests |
| SQL text and plans | sys.dm_exec_sql_text, sys.dm_exec_query_plan |
| Cached query history | sys.dm_exec_query_stats |
| Waits | sys.dm_os_wait_stats, sys.dm_exec_session_wait_stats |
| Locks and transactions | sys.dm_tran_locks, sys.dm_tran_active_transactions |
| Missing indexes | sys.dm_db_missing_index_* |
| Index usage | sys.dm_db_index_usage_stats |
| TempDB usage | sys.dm_db_task_space_usage, sys.dm_db_session_space_usage |
| File latency | sys.dm_io_virtual_file_stats |
3. Real-world problem scenario
A production web API is timing out. The application team says "SQL Server is down." The server is online. DMVs show 120 sessions waiting behind one open transaction from a reporting tool. The lead blocker ran an update and never committed. Without DMVs, the incident looks like general slowness. With DMVs, it is one blocking session.
4. How to diagnose
Fast triage query:
Before running deeper analysis, answer four questions:
- Who is active right now?
- Who is waiting, and on what?
- Who is blocking whom?
- Which statement is associated with the largest CPU, elapsed time, reads, writes, or memory grant?
🔍 Diagnostic query
SELECT
s.session_id,
s.login_name,
s.host_name,
s.program_name,
r.status,
r.command,
r.blocking_session_id,
r.wait_type,
r.wait_time AS wait_ms,
r.cpu_time AS cpu_ms,
r.total_elapsed_time AS elapsed_ms,
r.logical_reads,
r.writes,
DB_NAME(COALESCE(r.database_id, s.database_id)) AS database_name,
t.text AS sql_text
FROM sys.dm_exec_sessions AS s
LEFT JOIN sys.dm_exec_requests AS r
ON s.session_id = r.session_id
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE s.is_user_process = 1
ORDER BY r.total_elapsed_time DESC, s.session_id;
Find top statements in cache:
🔍 Diagnostic query
SELECT TOP (20)
qs.execution_count,
qs.total_worker_time / 1000 AS total_cpu_ms,
qs.total_elapsed_time / 1000 AS total_duration_ms,
qs.total_logical_reads,
qs.total_logical_writes,
qs.max_grant_kb,
qs.max_used_grant_kb,
SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
((CASE qs.statement_end_offset
WHEN -1 THEN DATALENGTH(st.text)
ELSE qs.statement_end_offset
END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;
5. Fix / optimisation
DMVs tell you where to act:
- Kill or correct a lead blocker only after confirming rollback impact.
- Add an index only after validating the query pattern and write overhead.
- Force a plan only after identifying a good historical plan.
- Adjust server settings only after proving the setting is the bottleneck.
- Persist DMV snapshots if the problem is intermittent.
Do not treat DMV output as a final diagnosis without context. A missing index suggestion can be valid for one query and harmful for the write workload. A high wait count can be normal if the wait time is tiny. A sleeping session is harmless unless it holds locks or an open transaction.
6. Example scripts
Create a lightweight incident snapshot table:
🛠 Setup script
USE tempdb;
GO
DROP TABLE IF EXISTS dbo.DmvIncidentSnapshot;
GO
CREATE TABLE dbo.DmvIncidentSnapshot
(
SnapshotTime datetime2(0) NOT NULL DEFAULT SYSDATETIME(),
SessionId smallint NOT NULL,
BlockingSessionId smallint NULL,
WaitType nvarchar(60) NULL,
CpuMs int NULL,
ElapsedMs int NULL,
LogicalReads bigint NULL,
SqlText nvarchar(max) NULL
);
GO
INSERT dbo.DmvIncidentSnapshot
(
SessionId,
BlockingSessionId,
WaitType,
CpuMs,
ElapsedMs,
LogicalReads,
SqlText
)
SELECT
r.session_id,
NULLIF(r.blocking_session_id, 0),
r.wait_type,
r.cpu_time,
r.total_elapsed_time,
r.logical_reads,
t.text
FROM sys.dm_exec_requests AS r
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID;
SELECT *
FROM dbo.DmvIncidentSnapshot
ORDER BY SnapshotTime DESC, ElapsedMs DESC;
Diagnose index usage:
🔍 Diagnostic query
SELECT
OBJECT_SCHEMA_NAME(i.object_id) AS schema_name,
OBJECT_NAME(i.object_id) AS table_name,
i.name AS index_name,
i.type_desc,
us.user_seeks,
us.user_scans,
us.user_lookups,
us.user_updates
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS us
ON i.object_id = us.object_id
AND i.index_id = us.index_id
AND us.database_id = DB_ID()
WHERE i.object_id > 100
ORDER BY us.user_updates DESC, us.user_seeks ASC;
Convert DMV evidence into a candidate index review:
🚀 Optimization query
SELECT
'Candidate index review' AS action_type,
mid.statement AS table_name,
mid.equality_columns,
mid.inequality_columns,
mid.included_columns,
migs.user_seeks,
migs.avg_user_impact
FROM sys.dm_db_missing_index_group_stats AS migs
JOIN sys.dm_db_missing_index_groups AS mig
ON migs.group_handle = mig.index_group_handle
JOIN sys.dm_db_missing_index_details AS mid
ON mig.index_handle = mid.index_handle
WHERE mid.database_id = DB_ID()
ORDER BY migs.user_seeks DESC;
7. Key insights (expert notes)
- DMVs are volatile. Capture evidence during the incident.
sys.dm_exec_query_statsonly shows plans still in cache. Use Query Store for durable history.- Missing index DMVs do not suggest filtered indexes or index consolidation.
- Index usage stats reset after restart and can mislead after maintenance windows.
- Always join DMVs back to database, object, and query context.
🧠 Locking and blocking
Use this section to separate normal locking from harmful blocking, identify the head blocker, and choose the safest remediation.
1. Concept (simple explanation)
Locking protects data correctness. Blocking happens when one session holds a lock that another session needs. Blocking is not automatically bad. Long blocking chains are bad because they turn normal concurrency into application timeouts.
Deadlocks are different. A deadlock is a cycle where two or more sessions wait on each other, so SQL Server chooses a victim and rolls it back.
2. How it works internally
SQL Server lock manager tracks locks on resources such as rows, keys, pages, tables, allocation units, metadata, and databases.
Common lock modes:
| Lock mode | Meaning | Compatible with shared reads? |
|---|---|---|
| S | Shared read lock | Yes, with other S locks |
| U | Update lock, used before converting to X | Limited |
| X | Exclusive write lock | No |
| IS/IX | Intent locks at higher levels | Depends on target |
| Sch-S | Schema stability | Usually yes |
| Sch-M | Schema modification | No |
Blocking chain example:
Isolation level matters:
- Read committed uses shared locks for reads by default.
- Read committed snapshot isolation (RCSI) uses row versions for reads, reducing reader/writer blocking.
- Snapshot isolation uses transaction-level row versioning.
- Serializable protects ranges and can increase blocking.
Lock escalation is another failure pattern to watch. SQL Server may escalate many row or key locks to a table lock when lock memory pressure or lock count thresholds are reached. The query that was supposed to touch a small set can suddenly block the whole table if estimates are wrong, indexes are missing, or the transaction modifies too many rows.
3. Real-world problem scenario
A nightly ETL job updates 3 million customer rows in one transaction. During the update, the customer portal times out because read queries wait on LCK_M_S. The ETL is technically correct, but its transaction scope is too large for the OLTP workload.
4. How to diagnose
- Identify the lead blocker.
- Find what SQL it is running or what transaction it left open.
- Check lock resources and wait types.
- Determine whether killing it will cause a large rollback.
- For deadlocks, capture the deadlock graph from Extended Events
system_health.
Prioritize blocking like this:
- Find the head blocker, not just the session with the longest wait.
- Capture login, host, program, SQL text, transaction age, and row count if available.
- Decide whether the blocker is making progress, idle in transaction, or stuck behind another resource.
- Estimate rollback risk before issuing
KILL. - After service is restored, fix the access pattern so the same blocker does not return tomorrow.
Blocking chain query:
🔍 Diagnostic query
WITH blocking AS
(
SELECT
r.session_id,
r.blocking_session_id,
r.wait_type,
r.wait_time,
r.wait_resource,
r.database_id,
r.sql_handle
FROM sys.dm_exec_requests AS r
WHERE r.blocking_session_id <> 0
)
SELECT
b.session_id AS blocked_session_id,
b.blocking_session_id,
b.wait_type,
b.wait_time AS wait_ms,
b.wait_resource,
DB_NAME(b.database_id) AS database_name,
blocked_text.text AS blocked_sql,
blocker_text.text AS blocker_sql
FROM blocking AS b
OUTER APPLY sys.dm_exec_sql_text(b.sql_handle) AS blocked_text
LEFT JOIN sys.dm_exec_requests AS br
ON b.blocking_session_id = br.session_id
OUTER APPLY sys.dm_exec_sql_text(br.sql_handle) AS blocker_text
ORDER BY b.wait_time DESC;
Open transaction details:
🔍 Diagnostic query
SELECT
s.session_id,
s.login_name,
s.host_name,
s.program_name,
at.transaction_begin_time,
DATEDIFF(minute, at.transaction_begin_time, SYSDATETIME()) AS transaction_age_minutes,
at.transaction_type,
at.transaction_state
FROM sys.dm_tran_session_transactions AS st
JOIN sys.dm_tran_active_transactions AS at
ON st.transaction_id = at.transaction_id
JOIN sys.dm_exec_sessions AS s
ON st.session_id = s.session_id
WHERE s.is_user_process = 1
ORDER BY at.transaction_begin_time;
5. Fix / optimisation
- Add indexes so updates and reads touch fewer rows.
- Keep transactions short and predictable.
- Batch large modifications.
- Use RCSI when read/write blocking is common and the application can accept statement-level versioned reads.
- Use consistent object access order to reduce deadlock risk.
- Avoid user interaction while a transaction is open.
Do not use NOLOCK as the default fix. It can return dirty, duplicated, or missing rows and can still be blocked by schema modification locks. If readers should not block writers, evaluate RCSI with TempDB capacity and application correctness in mind.
Enable RCSI only after testing TempDB version store impact:
🚀 Optimization query
ALTER DATABASE YourDatabaseName
SET READ_COMMITTED_SNAPSHOT ON
WITH ROLLBACK IMMEDIATE;
6. Example scripts
Session 1: reproduce blocking:
🛠 Setup script
USE tempdb;
GO
DROP TABLE IF EXISTS dbo.BlockingLab;
CREATE TABLE dbo.BlockingLab
(
Id int NOT NULL CONSTRAINT PK_BlockingLab PRIMARY KEY,
StatusCode char(1) NOT NULL,
ModifiedAt datetime2(0) NOT NULL
);
INSERT dbo.BlockingLab(Id, StatusCode, ModifiedAt)
VALUES (1, 'N', SYSDATETIME()), (2, 'N', SYSDATETIME());
GO
BEGIN TRAN;
UPDATE dbo.BlockingLab
SET StatusCode = 'P', ModifiedAt = SYSDATETIME()
WHERE Id = 1;
-- Leave this transaction open while Session 2 runs.
Session 2: blocked read:
🛠 Setup script
USE tempdb;
GO
SELECT *
FROM dbo.BlockingLab
WHERE Id = 1;
Diagnose from Session 3:
🔍 Diagnostic query
SELECT
r.session_id,
r.blocking_session_id,
r.wait_type,
r.wait_time,
r.wait_resource,
t.text
FROM sys.dm_exec_requests AS r
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0;
Fix from Session 1:
🚀 Optimization query
COMMIT TRAN;
-- Or ROLLBACK TRAN if this was a test or bad transaction.
Batch update pattern:
🚀 Optimization query
WHILE 1 = 1
BEGIN
UPDATE TOP (5000) dbo.BlockingLab
SET StatusCode = 'A', ModifiedAt = SYSDATETIME()
WHERE StatusCode = 'N';
IF @@ROWCOUNT = 0 BREAK;
END;
7. Key insights (expert notes)
- The blocker may be sleeping. A sleeping session can still hold an open transaction.
- Do not kill first and diagnose later. Capture SQL text, login, host, transaction age, and rollback risk.
NOLOCKcan read uncommitted data, duplicate rows, miss rows, and fail with allocation errors. It is not a blocking strategy.- RCSI reduces reader/writer blocking but moves work to TempDB version store.
- Deadlock priority can protect critical sessions, but it does not fix the access pattern.
🧠 TempDB
Use this section to identify whether TempDB pressure comes from user objects, internal spills, version store, allocation contention, or storage latency.
1. Concept (simple explanation)
TempDB is SQL Server's shared scratch database. It stores temporary tables, table variables, worktables, sort/hash spill data, online index build work, row versions, triggers, and internal objects.
When TempDB is slow or contended, unrelated workloads can all become slow because they share the same workspace.
2. How it works internally
TempDB is recreated every time SQL Server starts. It uses simple recovery and minimal logging where possible, but it still writes to disk.
Main consumers:
| Consumer | Example |
|---|---|
| User objects | #temp tables and table variables |
| Internal objects | Sorts, hashes, spools, cursors |
| Version store | RCSI, snapshot isolation, online index operations |
| Metadata | Temp object creation and drop tracking |
Allocation contention often appears on pages such as PFS, GAM, and SGAM. Modern SQL Server versions include improvements, but file layout and workload pattern still matter.
TempDB incidents usually fall into three categories:
- Space exhaustion: TempDB fills because a query spills, a temp table grows, or version store cleanup is blocked.
- Allocation contention: many sessions create/drop temp objects or allocate pages at the same time, showing
PAGELATCH_*waits. - I/O pressure: sorts, hashes, spools, and version store reads/writes overload TempDB storage, showing
PAGEIOLATCH_*or slow file latency.
The priority is to identify the consumer before adding space. Extra space can restore service, but it does not tell you whether the cause is one report, RCSI version retention, or systemic spill pressure.
3. Real-world problem scenario
An e-commerce database enables RCSI to reduce blocking. Blocking improves, but two days later TempDB grows from 20 GB to 300 GB. A long-running reporting transaction keeps old row versions alive, so cleanup cannot remove them. The symptom is TempDB full; the root cause is a long snapshot/RCSI reader.
4. How to diagnose
- Check TempDB file sizes and free space.
- Separate user object, internal object, and version store usage.
- Identify sessions consuming TempDB.
- Check waits:
PAGELATCH_*suggests allocation contention;PAGEIOLATCH_*suggests physical I/O. - Check active snapshot transactions when version store grows.
Check in this order:
- Is TempDB full, slow, or contended?
- Is usage dominated by user objects, internal objects, or version store?
- Which session or request is allocating the most pages?
- Are long snapshot/RCSI transactions preventing version cleanup?
- Are files equally sized and growing predictably?
TempDB file usage:
🔍 Diagnostic query
USE tempdb;
GO
SELECT
name,
type_desc,
size / 128.0 AS size_mb,
FILEPROPERTY(name, 'SpaceUsed') / 128.0 AS used_mb,
(size - FILEPROPERTY(name, 'SpaceUsed')) / 128.0 AS free_mb
FROM sys.database_files;
TempDB space by session:
🔍 Diagnostic query
SELECT TOP (25)
s.session_id,
s.login_name,
s.host_name,
s.program_name,
(su.user_objects_alloc_page_count - su.user_objects_dealloc_page_count) * 8 / 1024 AS user_objects_mb,
(su.internal_objects_alloc_page_count - su.internal_objects_dealloc_page_count) * 8 / 1024 AS internal_objects_mb,
r.status,
r.wait_type,
t.text AS sql_text
FROM sys.dm_db_session_space_usage AS su
JOIN sys.dm_exec_sessions AS s
ON su.session_id = s.session_id
LEFT JOIN sys.dm_exec_requests AS r
ON su.session_id = r.session_id
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE s.is_user_process = 1
ORDER BY (su.user_objects_alloc_page_count + su.internal_objects_alloc_page_count) DESC;
Version store usage:
🔍 Diagnostic query
SELECT
SUM(version_store_reserved_page_count) * 8 / 1024 AS version_store_mb,
SUM(user_object_reserved_page_count) * 8 / 1024 AS user_objects_mb,
SUM(internal_object_reserved_page_count) * 8 / 1024 AS internal_objects_mb,
SUM(unallocated_extent_page_count) * 8 / 1024 AS free_space_mb
FROM sys.dm_db_file_space_usage;
Active snapshot transactions:
🔍 Diagnostic query
SELECT
transaction_id,
session_id,
elapsed_time_seconds,
is_snapshot,
first_snapshot_sequence_num,
max_version_chain_traversed
FROM sys.dm_tran_active_snapshot_database_transactions
ORDER BY elapsed_time_seconds DESC;
5. Fix / optimisation
- User objects: reduce temp table width, index temp tables correctly, drop when done, avoid unnecessary temp materialization.
- Internal objects: tune queries that sort, hash, spool, or spill; improve estimates and indexes.
- Version store: find long-running snapshot/RCSI readers; shorten transactions; schedule reporting safely.
- Allocation contention: use multiple equally sized TempDB data files and modern SQL Server defaults.
- I/O pressure: move TempDB to faster storage and pre-size files to avoid autogrowth storms.
Do not assume the fix is always "add more TempDB files." More equal-sized files help allocation contention, but they will not fix a spilling hash join, an unbounded temp table, or a long transaction retaining row versions.
6. Example scripts
Reproduce TempDB internal object pressure with a sort:
🛠 Setup script
USE tempdb;
GO
DROP TABLE IF EXISTS dbo.TempDbSortLab;
GO
CREATE TABLE dbo.TempDbSortLab
(
Id int IDENTITY(1,1) NOT NULL,
GroupId int NOT NULL,
Padding char(200) NOT NULL DEFAULT 'x'
);
WITH n AS
(
SELECT TOP (300000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b
)
INSERT dbo.TempDbSortLab(GroupId)
SELECT rn % 5000
FROM n;
GO
-- This sort can request memory and may spill depending on the lab machine.
SELECT GroupId, COUNT(*) AS row_count
FROM dbo.TempDbSortLab
GROUP BY GroupId
ORDER BY COUNT(*) DESC;
Diagnose task-level TempDB usage while a query runs:
🔍 Diagnostic query
SELECT
session_id,
request_id,
(internal_objects_alloc_page_count - internal_objects_dealloc_page_count) * 8 / 1024 AS internal_mb,
(user_objects_alloc_page_count - user_objects_dealloc_page_count) * 8 / 1024 AS user_mb
FROM sys.dm_db_task_space_usage
WHERE session_id <> @@SPID
ORDER BY internal_mb DESC;
Fix with an index that supports grouping:
🚀 Optimization query
CREATE INDEX IX_TempDbSortLab_GroupId ON dbo.TempDbSortLab(GroupId);
GO
SELECT GroupId, COUNT(*) AS row_count
FROM dbo.TempDbSortLab
GROUP BY GroupId
ORDER BY COUNT(*) DESC;
Check TempDB file layout:
✅ Validation query
SELECT
name,
physical_name,
size / 128.0 AS size_mb,
growth,
is_percent_growth
FROM tempdb.sys.database_files
WHERE type_desc = 'ROWS'
ORDER BY file_id;
7. Key insights (expert notes)
- TempDB is shared. One bad report can hurt the whole instance.
- Spills are query tuning problems that show up as TempDB problems.
- RCSI is useful, but long readers can retain row versions and grow TempDB.
- Equal file size matters. Unequal files can cause uneven allocation.
- Autogrowth is a safety net, not a sizing strategy.
🧠 Troubleshooting playbook
Use this playbook when you need ordered, incident-ready checks for CPU pressure, blocking, or slow-query triage.
High CPU issue steps
- Confirm CPU pressure at SQL Server level, not only VM/OS level.
- Check whether SQL Server process CPU is high and whether signal waits or runnable tasks show scheduler pressure.
- If CPU is normal but elapsed time is high, look for waits instead of tuning CPU.
- Check active CPU consumers:
🔍 Diagnostic query
SELECT TOP (20)
r.session_id,
r.cpu_time AS cpu_ms,
r.total_elapsed_time AS elapsed_ms,
r.logical_reads,
r.status,
r.command,
t.text AS sql_text
FROM sys.dm_exec_requests AS r
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID
ORDER BY r.cpu_time DESC;
- Check cached top CPU queries and compare with Query Store history.
- Inspect actual plans for scans, bad joins, scalar UDFs, implicit conversions, and excessive parallelism.
- Check waits. High signal wait or many runnable tasks means CPU scheduling pressure.
- Mitigate: force last good plan, stop runaway query, reduce DOP for the query, or scale temporarily.
- Fix root cause: indexes, query rewrite, statistics, parameter sensitivity, or app call pattern.
Do not clear the plan cache as the first response to high CPU. If the issue is parameter sniffing, a cache clear may temporarily hide it and then let the same bad plan recompile under a different parameter. Use Query Store to identify and force a known-good plan when appropriate.
Blocking troubleshooting steps
- Identify the lead blocker and transaction age.
- Capture blocker SQL text, login, host, program, wait resource, and open transaction details.
- Decide whether to wait, ask the owner to commit/rollback, or kill the session.
- If killing, estimate rollback impact. Large updates can take time to undo.
- After service restoration, fix the pattern: shorter transactions, better indexes, batching, RCSI, or consistent access order.
Treat blocking as a chain, not a list. The session with the most visible wait is often a victim, not the cause. The lead blocker may be sleeping, may have no active request, and may still hold locks through an open transaction.
Quick blocker query:
🔍 Diagnostic query
SELECT
blocked.session_id AS blocked_session_id,
blocked.blocking_session_id,
blocked.wait_type,
blocked.wait_time AS wait_ms,
blocked.wait_resource,
blocker_s.login_name AS blocker_login,
blocker_s.host_name AS blocker_host,
blocker_s.program_name AS blocker_program,
blocker_text.text AS blocker_sql
FROM sys.dm_exec_requests AS blocked
LEFT JOIN sys.dm_exec_sessions AS blocker_s
ON blocked.blocking_session_id = blocker_s.session_id
LEFT JOIN sys.dm_exec_requests AS blocker_r
ON blocked.blocking_session_id = blocker_r.session_id
OUTER APPLY sys.dm_exec_sql_text(blocker_r.sql_handle) AS blocker_text
WHERE blocked.blocking_session_id <> 0
ORDER BY blocked.wait_time DESC;
Slow query checklist
- Is the query slow now, or was it slow historically?
- Did the plan change? Check Query Store.
- Are estimated and actual rows far apart?
- Are predicates SARGable?
- Are parameter data types aligned with column data types?
- Is there a key lookup repeated many times?
- Are there spills to TempDB?
- Is the query waiting on locks, I/O, memory grants, or CPU?
- Are statistics stale or sampled poorly?
- Does the query return more columns or rows than the application needs?
- Would a filtered, covering, or composite index match the real access pattern?
- Is this a parameter-sensitive procedure?
When prioritizing the checklist, start with evidence that changes the remediation path: plan changed, row estimates wrong, blocking present, or spill present. Then inspect index design and query shape. Review cosmetic query cleanup later.
Do not add every missing-index recommendation during an incident. Missing-index DMVs do not understand your full index portfolio, write overhead, filtered-index opportunities, or duplicate indexes. Convert suggestions into an index review after the service is stable.
Slow query starter script:
🔍 Diagnostic query
DECLARE @session_id smallint = 0; -- Replace with target session_id.
SELECT
r.session_id,
r.status,
r.cpu_time AS cpu_ms,
r.total_elapsed_time AS elapsed_ms,
r.logical_reads,
r.reads,
r.writes,
r.wait_type,
r.wait_time AS wait_ms,
r.blocking_session_id,
qp.query_plan,
t.text AS sql_text
FROM sys.dm_exec_requests AS r
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
OUTER APPLY sys.dm_exec_query_plan(r.plan_handle) AS qp
WHERE r.session_id = @session_id;
🧠 References
Use these Microsoft Learn pages to verify syntax, supported behavior, and engine details before applying changes in production.
- Microsoft Learn: Execution plans
- Microsoft Learn: Query Store
- Microsoft Learn: System dynamic management views
- Microsoft Learn: SQL Server transaction locking and row versioning guide
- Microsoft Learn: TempDB database
🧠 Flashcards
Use these recall prompts to test the incident workflow, not just memorize DMV names.