Introduction to wait stats – Part 3, I/O Waits – Latches

During query execution, the storage engine needs to bring data from disk into the buffer pool, and threads or tasks need to access specific parts of the buffer. To manage this access to and from the buffer, SQL Server uses LATCHES.

Latches are short-term synchronization objects, mainly used to synchronize in-memory access to data pages. There are almost 144 different types of latches, and certain types lead to blocking of specific resources in highly concurrent systems. Diagnosing latch blocking is very difficult because there is not much information available or documented to perform the root cause analysis (RCA).

Latch waits are divided into two main groups:

i) Page latch waits

ii) Non-page latch waits

Page latch waits are subdivided into two main groups:

i) PAGELATCH_XX

ii) PAGEIOLATCH_XX

Non-page latch waits are subdivided into two main groups:

i) LATCH_XX

ii) TRAN_MARKLATCH_XX
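To see how much time an instance has accumulated in each of these latch families, you can query sys.dm_os_wait_stats. This is a minimal sketch; the LIKE filters simply pick out the four families listed above:

SELECT wait_type, waiting_tasks_count, wait_time_ms
  , wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'PAGELATCH%'
   OR wait_type LIKE 'PAGEIOLATCH%'
   OR wait_type LIKE 'LATCH%'
   OR wait_type LIKE 'TRAN_MARKLATCH%'
ORDER BY wait_time_ms DESC;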

As you can see, all the latches mentioned above come in different modes (the _XX suffix). These modes are NL, KP, SH, UP, DT and EX.

Every latch mode has a purpose. For example, SH mode means a task is trying to read a page in the buffer or from disk.

KP stands for KEEP latch and DT stands for DESTROY latch. KP is compatible with all latch modes except DT. A KP latch can be acquired by any thread that wants to read the buffer; every such thread needs to acquire one so that no other thread can destroy the buffer unless it acquires the DT latch. A DT latch is used to destroy the buffer, i.e. to free the cache so that another thread can use it for loading a page.

EX is called the exclusive latch because it is held on a page while the page is moved between the buffer and disk, or while the page is being modified, so that no other latch can be held on the page at the same time.
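Putting the modes together, the commonly documented latch compatibility matrix looks like this (Y means the two modes can be held at the same time; NL is omitted because it effectively means no latch):

        KP   SH   UP   EX   DT
  KP    Y    Y    Y    Y    N
  SH    Y    Y    Y    N    N
  UP    Y    Y    N    N    N
  EX    Y    N    N    N    N
  DT    N    N    N    N    N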

Blocking latches and latch waits

The blocking task information is known only when the latch is held in UP, EX or DT mode. The common factor among these is that they can be held only once, by a single task, whereas KP and SH can be held by multiple tasks, multiple times.

One important thing to note is that the mode specified in the wait type is the requested mode, not the mode held by the blocker.
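Because the blocker is exposed only for UP, EX and DT, a quick way to see who is holding up whom is the blocking_session_id column of sys.dm_os_waiting_tasks. A minimal sketch:

SELECT session_id, blocking_session_id, wait_type
  , wait_duration_ms, resource_description
FROM sys.dm_os_waiting_tasks
WHERE wait_type LIKE '%LATCH%';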

Latch grant order

Latches are granted almost in FIFO order, but they also try to avoid starvation by allowing two exceptions to the grant order:

i) KP requests never need to wait unless the current mode is DT.

ii) When the wait list is processed, all requests compatible with the newly granted mode are granted together, regardless of FIFO order. For example, if the current mode is UP and the waiters are, in order, UP, EX, SH, KP, SH: when the UP latch is released, the first UP waiter is granted along with all the SH and KP requests, even though the EX request arrived before them, to prevent starvation.

Latch wait time: the wait time reported in sys.dm_os_wait_stats and sys.dm_os_waiting_tasks is the time a task sits idle waiting for the latch named in the wait type.

Page Latches

Page latches are used to synchronize physical access to the individual pages stored in the buffer pool; access to the on-disk data pages is controlled through the buffer pool.

The difference between a PAGELATCH wait and a PAGEIOLATCH wait is actually very small: it depends on whether there is a pending I/O request for the page.

To read a page a task needs an SH page latch; to write a page an EX latch is held; and row versioning and tempdb usage require UP mode.

PAGEIOLATCH_EX wait problems

If there are ten tasks requesting a single page for modification, the first task may wait to acquire PAGEIOLATCH_EX while the other nine tasks wait until the first task brings the data into the buffer. After a timeout they may try to reload the data themselves, which leads to contention.

Case study: Resolving PAGELATCH Contention on a Highly Concurrent INSERT Workload

Recently, we performed a lab test that had a large OLTP workload in the Microsoft Enterprise Engineering Center. The purpose of this lab was to take an intensive Microsoft SQL Server workload and see what happened when we scaled it up from 64 processors to 128 processors. (Note: This configuration is supported as part of the Microsoft SQL Server 2008 R2 release.). The workload had highly concurrent insert operations going to a few large tables.

As we began to scale this workload up to 128 cores, the wait stats captured were dominated by PAGELATCH_UP and PAGELATCH_EX. The average wait times were tens of milliseconds, and there were a lot of waits. These waits were not expected; we had expected them to be a few milliseconds at most.

In this TechNote we will describe how we first diagnosed the problem and how we then used table partitioning to work around it.

Diagnosing the Problem

When you see large waits for PAGELATCH in sys.dm_os_wait_stats, you will want to do the following. Start your investigation with sys.dm_os_waiting_tasks and locate a task waiting for PAGELATCH, like this:

SELECT session_id, wait_type, resource_description
FROM sys.dm_os_waiting_tasks
WHERE wait_type LIKE 'PAGELATCH%'


The resource_description column lists the exact page being waited for in the format: <database_id>:<file_id>:<page_id>.
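As a quick manual check, you can feed those three numbers to the well-known (but undocumented) DBCC PAGE command; with trace flag 3604 enabled, the page header output includes the owning object's ID. The values 6, 3 and 70133 below are made-up examples:

DBCC TRACEON(3604);
DBCC PAGE(6, 3, 70133, 0);   -- database_id, file_id, page_id, print option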

Using the resource_description column, you can now write this rather complex query that looks up all these waiting pages:

SELECT wt.session_id, wt.wait_type, wt.wait_duration_ms
  , s.name AS schema_name
  , o.name AS object_name
  , i.name AS index_name
FROM sys.dm_os_buffer_descriptors bd
JOIN (
  SELECT *
  , CHARINDEX(':', resource_description) AS file_index
  , CHARINDEX(':', resource_description
    , CHARINDEX(':', resource_description) + 1) AS page_index
  , resource_description AS rd
  FROM sys.dm_os_waiting_tasks wt
  WHERE wait_type LIKE 'PAGELATCH%'
  ) AS wt
    ON bd.database_id = SUBSTRING(wt.rd, 1, wt.file_index - 1)
    AND bd.file_id = SUBSTRING(wt.rd, wt.file_index + 1, wt.page_index - wt.file_index - 1)
    AND bd.page_id = SUBSTRING(wt.rd, wt.page_index + 1, LEN(wt.rd))
JOIN sys.allocation_units au ON bd.allocation_unit_id = au.allocation_unit_id
JOIN sys.partitions p ON au.container_id = p.partition_id
JOIN sys.indexes i ON p.index_id = i.index_id AND p.object_id = i.object_id
JOIN sys.objects o ON i.object_id = o.object_id
JOIN sys.schemas s ON o.schema_id = s.schema_id

The query shows that the page we are waiting for is in a clustered index, enforcing the primary key, of a table with this structure:

CREATE TABLE HeavyInsert ( 
  ID INT PRIMARY KEY CLUSTERED  
  , col1 VARCHAR(50)
) ON [PRIMARY] 

What is going on here? Why are we waiting to access a data page in the index?

Background Information

To diagnose what was happening in our large OLTP workload, it’s important to understand how SQL Server handles the insertion of a new row into an index. When a new row is inserted into an index, SQL Server will use the following algorithm to execute the modification:

  1. Record a log entry that the row has been modified.
  2. Traverse the B-tree to locate the correct page to hold the new record.
  3. Latch the page with PAGELATCH_EX, preventing others from modifying it.
  4. Add the row to the page and, if needed, mark the page as dirty.
  5. Unlatch the page. 

Eventually, the page will also have to be flushed to disk by a checkpoint or lazy write operation.

However, what happens if all the inserted rows go to the same page? In that case, you can see a queue building up on that page. Even though a latch is a very lightweight semaphore, it can still be a contention point if the workload is highly concurrent. In this customer case, the first, and only, column in the index was a continuously increasing key. Because of this, every new insert went to the same page at the end of the B-tree, until that page was full. Workloads that use IDENTITY or other sequentially increasing value columns as primary keys may run into this same issue at high concurrency too.
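To make the pattern concrete, here is a minimal, hypothetical repro (the table name and row counts are made up). Because IDENTITY produces an ever-increasing key, every concurrent session inserts into the same last page of the B-tree:

CREATE TABLE HeavyInsert_Repro (
  ID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED
  , col1 VARCHAR(50)
);

-- Run this batch from many sessions at once (for example, with ostress.exe -n64):
SET NOCOUNT ON;
DECLARE @i INT = 0;
WHILE @i < 100000
BEGIN
  INSERT INTO HeavyInsert_Repro (col1) VALUES ('x');
  SET @i += 1;
END;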

Solution

Whenever many threads need synchronized access to a single resource, contention can occur. The solution is typically to create more of the contended resource. In this case, the contended resource is the last page in the B-tree.

One way to avoid contention on a single page is to choose a leading column in the index that is not continually increasing. However, this would have required an application change in the customer’s system. We had to look for a solution that could be implemented within the database.

Remember that the contention point is a single page in a B-tree. If only there was a way to get more B-trees in the table. Fortunately, there IS a way to get this: Partition the table. The table can be partitioned in such a way that the new rows get spread over multiple partitions.

First, create the partition function and scheme:

CREATE PARTITION FUNCTION pf_hash (TINYINT) AS RANGE LEFT FOR VALUES (0,1,2) 

CREATE PARTITION SCHEME ps_hash AS PARTITION pf_hash ALL TO ([PRIMARY]) 

This example uses four partitions. The number of partitions you need depends on the amount of INSERT activity happening on the table. There is a drawback to hash-partitioning the table like this: Whenever you select rows from the table, you have to touch all partitions. This means that you need to access more than one B-tree – you will not get partition elimination. There is a CPU cost and latency cost to this, so keep the number of partitions as small as possible (while still avoiding PAGELATCH). In our particular customer case, we had plenty of spare CPU cycles, so we could afford to sacrifice some time on SELECT statements, as long as it helped us increase the INSERT rate.

Second, you need a column to partition on, one that spreads the inserts over the four partitions. There was no column available in the table for this in the Microsoft Enterprise Engineering Center scenario. However, it is easy to create one. Taking advantage of the fact that the ID column is constantly increasing in increments of one, here is a simple hash function of the row:

CREATE TABLE HeavyInsert_Hash( 
  ID INT NOT NULL 
  , col1 VARCHAR(50) 
  , HashID AS CAST(ABS(ID % 4) AS TINYINT)  PERSISTED NOT NULL)

With the HashID column, you can cycle the inserts between the four partitions. Create the clustering index in this way:

CREATE UNIQUE CLUSTERED INDEX CIX_Hash
ON HeavyInsert_Hash (ID, HashID) ON ps_hash(HashID) 
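After loading some data, you can sanity-check that the inserts really are spread across the four partitions by looking at sys.partitions (index_id 1 is the clustered index):

SELECT partition_number, rows
FROM sys.partitions
WHERE object_id = OBJECT_ID('HeavyInsert_Hash') AND index_id = 1
ORDER BY partition_number;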

By using this new, partitioned table instead of the original table, we managed to get rid of the PAGELATCH contention and increase the insertion rate, because we spread the high concurrency across many pages in several partitions, each with its own B-tree structure. We managed to increase the INSERT rate by 15 percent for this customer, with the PAGELATCH waits going away on the hot index in one table. But even then, we had CPU cycles to spare, so we could have optimized further by applying a similar trick to other tables with high insert rates.

Strictly speaking, this optimization trick is a logical change in the primary key of the table. However, because the new key is just extended with the hash value of the original key, duplicates in the ID column are avoided.

Single-column unique indexes on a table are typically the worst offenders if you are experiencing PAGELATCH contention. But even if you eliminate this, there may be other, nonclustered indexes on the table that suffer from the same problem. Typically, the problem occurs with single-column unique keys, where every insert ends up on the same page. If you have other indexes on the table that suffer from PAGELATCH contention, you can apply this partition trick to them too, using the same hash key as the primary key.
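For example, a hot nonclustered index can be aligned to the same scheme by appending the hash column to its key. This is a sketch, assuming a hypothetical index on col1 that suffers the same last-page pattern:

CREATE NONCLUSTERED INDEX IX_HeavyInsert_Hash_col1
ON HeavyInsert_Hash (col1, HashID)
ON ps_hash(HashID);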

Not all applications can be modified, something that is a challenge for ISVs. However, if you DO have the option of modifying the queries in the system, you can add an additional filter to queries seeking on the primary key.

Example: To get partition elimination, change this:

      SELECT * FROM HeavyInsert_Hash 
      WHERE ID = 42


To this:

      SELECT * FROM HeavyInsert_Hash
      WHERE ID = 42 AND HashID = CAST(ABS(42 % 4) AS TINYINT)

With partition elimination, the hash partitioning trick is almost a free treat. You will still add one byte to each row of the clustered index.
