Problem solution approach for parallel query processing- CxPacket,Insufficient Memory,Data Skews,Statistics


as mentioned in the above article about parallel query processing , not only benefits but there are some problems with MAXDOP and parallel query processing


1) Insufficient Memory

2) Over head of synchronizing the worker threads (which leads CXPACKET waits)

3) Statistical Estimation Errors

4) Data Skew

>> Insufficient Memory during execution of parallel query plans

before we get into the detail we will first understand little bit about memory management in SQL server

as we know SQL server requires physical memory to execute queries

Total amount of memory required by the query to is divided in to two

         1) Required Memory

         2) Additional Memory

Required Memory is the must have memory by the SQL server in order to execute the query

Additional Memory is the memory that the query needs in order to fit additional data

lets learn this by an example

sorting a 1 million rows , lets say the size of one row is 10 bytes the memory required to perform one sort is 512KB

512 kb is the Required Memory

10 Mb is the Additional Memory required for the query to perform well

Now lets talk about execution plans for parallel query plans , as we know in order to have parallel query for a query the execution plan should have Parallel query operators called Exchange operators Distribute Streams, Repartition Streams, and Gather Streams

exchange operators An exchange operator is an operator in a query execution plan that provides process management, data redistribution, and flow control, basically these operators exchange rows between threads. A exchange operator internally has two iterators Producers and consumers  buffers are used to transfer batch rows between iterators  Producers reads the data from the subtree and assembles them in to packets and routes them to the consumer buffers , most of the times the number of required by the query are (DOP)²

So the required memory for the is quite higher in parallel execution plan when compared to the serial execution plan for the same query




Query degree of parallelism and memory grants can be monitored using the XML query plan. The MemoryFractions element in the run-time XML showplan gives the relative memory usage among operators in the query plan. The MemoryGrant attribute of the QueryPlan element also provides information regarding memory grants for a specific query. After a query is executed, the DegreeOfParallelism attribute of the QueryPlan element in the actual XML showplan, provides information of the actual degree of parallelism used

Using the DMV


  • This DMV contain all queries requesting Memory grants and one which are granted , Queries which are waiting for the memory grants will have NULL value in the grant tim column

                     SELECT *
                    FROM sys.dm_exec_query_memory_grants
                   WHERE is_next_candidate in (0,1)
                  ORDER BY is_next_candidate desc, queue_id, wait_order;

  • sys.dm_os_wait_stats: This DMV shows wait statistics for all server objects. Since memory grants use RESOURCE_SEMAPHORE type waits, you might notice significant waits on this wait type for complex memory intensive queries.

Worker Thread Balance

 Effective number of worker threads and Inter-thread communication will have an effective utilization of CPu

there should be a right balance on the max number of worker thread , too many worker thread or too less worker threads will degrade the performance which will take us to a solid point that proper testing should be done in testing environment in order achieve required results

Distribution of data

 when we are achieving the parallelism the data will be divided between the threads for the processing but the problem is the rows or data will not be divided equally across all threads for some reasons

one known reason is the Selectivity of the data or (Skewed data)

Lets look at the simple example how distribution of data affects the parallelism and query execution


  100 rows  —   Serial Execution —-  100 secs

  100 rows  — Parallel Execution (Two threads)

              Thread 1  50  rows  1 to 50 secs

              Thread 2 50 rows 1 to 50 secs

1——————50 secs 

1——————50 rows thread 1

50—————–100 rows thread 2

Total time for the query is 50 secs

But sometimes this is not the case in reality

1——————75 secs

1—-25                            rows thread 1

25—————-100 rows thread 2

Now the total time for the query is 75 secs

one real world example is when an index is created on 2.5 million row  but one thread have been allocated with 1.5 millon rows , reason came out be the rest 1 million rows have null strings

the sort operator could not be truly parallelized and increasing the number of parallel worker threads did not improve performance


partitioned indexes, Filtered indexes 

One of the biggest peoblem is


In the case of exchange operators, synchronization is required between producer and consumer threads that are accessing the same exchange buffers. The CXPACKET lock is used for this purpose, and it provides exclusive buffer access to requesting producer and consumer threads. In most parallel execution cases, the synchronization between producer and consumer threads seems to perform well. However, there are cases (for example, when either the producer or consumer thread is waiting for a resource such as I/O, CPU, or memory grants) in which excessive CXPACKET waits can occur. Other possible causes for high CXPACKET waits are underestimation of the number of rows flowing through an operator and highly skewed data.

There is no way that we can eleminate the total CXPACKET wait but we only need to worry if it is too high which may degrade the performance




If you reduce the degree of parallelism (setting MAXDOP to an explicit value greater than 0), you reduce the number of producer and consumer buffers by a quadratic factor. This change results in fewer CXPACKET locks and consequently fewer CXPACKET waits. The ideal MAXDOP setting for a system is highly dependent on a number of factors such as workload, data access, and data distribution patterns of the query. For OLTP workloads, generally, a MAXDOP value of 1 is reliable. For data warehousing workloads, MAXDOP values ranging between ¼ and ½ the number of logical processors provided by the machine typically works well. In addition, this again depends on a number of factors such as workload, data access, and data distribution. We recommend that you not modify the degree of parallelism value on production systems without sufficient testing.

As an example, if you are experiencing unpredictable multi-user parallel query performance on a machine with 32 logical processors, consider setting the max degree of parallelism sp_configure option to 16 or even 8. For large queries that need the best possible response time, consider reducing the number of concurrently running queries (for example, during a batch window that you control) and running them with a MAXDOP value of 32.

Using the DMV

 Select * from sys.dm_os_waiting_tasks where wait_type =’CXPACKET’

we can alos look the wait time for cxpacket


One thought on “Problem solution approach for parallel query processing- CxPacket,Insufficient Memory,Data Skews,Statistics

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s