as mentioned in the above article about parallel query processing , not only benefits but there are some problems with MAXDOP and parallel query processing
Problems
1) Insufficient Memory
2) Over head of synchronizing the worker threads (which leads CXPACKET waits)
3) Statistical Estimation Errors
4) Data Skew
>> Insufficient Memory during execution of parallel query plans
before we get into the detail we will first understand little bit about memory management in SQL server
as we know SQL server requires physical memory to execute queries
Total amount of memory required by the query to is divided in to two
1) Required Memory
2) Additional Memory
Required Memory is the must have memory by the SQL server in order to execute the query
Additional Memory is the memory that the query needs in order to fit additional data
lets learn this by an example
sorting a 1 million rows , lets say the size of one row is 10 bytes the memory required to perform one sort is 512KB
512 kb is the Required Memory
10 Mb is the Additional Memory required for the query to perform well
Now lets talk about execution plans for parallel query plans , as we know in order to have parallel query for a query the execution plan should have Parallel query operators called Exchange operators Distribute Streams, Repartition Streams, and Gather Streams
exchange operators An exchange operator is an operator in a query execution plan that provides process management, data redistribution, and flow control, basically these operators exchange rows between threads. A exchange operator internally has two iterators Producers and consumers buffers are used to transfer batch rows between iterators Producers reads the data from the subtree and assembles them in to packets and routes them to the consumer buffers , most of the times the number of required by the query are (DOP)²
So the required memory for the is quite higher in parallel execution plan when compared to the serial execution plan for the same query
Solution:
Query degree of parallelism and memory grants can be monitored using the XML query plan. The MemoryFractions element in the run-time XML showplan gives the relative memory usage among operators in the query plan. The MemoryGrant attribute of the QueryPlan element also provides information regarding memory grants for a specific query. After a query is executed, the DegreeOfParallelism attribute of the QueryPlan element in the actual XML showplan, provides information of the actual degree of parallelism used
Using the DMV
SYS.DM_EXEC_QUERY_MEMORY_GRANTS
- This DMV contain all queries requesting Memory grants and one which are granted , Queries which are waiting for the memory grants will have NULL value in the grant tim column
SELECT *
FROM sys.dm_exec_query_memory_grants
WHERE is_next_candidate in (0,1)
ORDER BY is_next_candidate desc, queue_id, wait_order;
- sys.dm_os_wait_stats: This DMV shows wait statistics for all server objects. Since memory grants use RESOURCE_SEMAPHORE type waits, you might notice significant waits on this wait type for complex memory intensive queries.
Worker Thread Balance
Effective number of worker threads and Inter-thread communication will have an effective utilization of CPu
there should be a right balance on the max number of worker thread , too many worker thread or too less worker threads will degrade the performance which will take us to a solid point that proper testing should be done in testing environment in order achieve required results
Distribution of data
when we are achieving the parallelism the data will be divided between the threads for the processing but the problem is the rows or data will not be divided equally across all threads for some reasons
one known reason is the Selectivity of the data or (Skewed data)
Lets look at the simple example how distribution of data affects the parallelism and query execution
Example
100 rows — Serial Execution —- 100 secs
100 rows — Parallel Execution (Two threads)
Thread 1 50 rows 1 to 50 secs
Thread 2 50 rows 1 to 50 secs
1——————50 secs
1——————50 rows thread 1
50—————–100 rows thread 2
Total time for the query is 50 secs
But sometimes this is not the case in reality
1——————75 secs
1—-25 rows thread 1
25—————-100 rows thread 2
Now the total time for the query is 75 secs
one real world example is when an index is created on 2.5 million row but one thread have been allocated with 1.5 millon rows , reason came out be the rest 1 million rows have null strings
the sort operator could not be truly parallelized and increasing the number of parallel worker threads did not improve performance
Solution:
partitioned indexes, Filtered indexes
One of the biggest peoblem is
CXPACKET Waits
In the case of exchange operators, synchronization is required between producer and consumer threads that are accessing the same exchange buffers. The CXPACKET lock is used for this purpose, and it provides exclusive buffer access to requesting producer and consumer threads. In most parallel execution cases, the synchronization between producer and consumer threads seems to perform well. However, there are cases (for example, when either the producer or consumer thread is waiting for a resource such as I/O, CPU, or memory grants) in which excessive CXPACKET waits can occur. Other possible causes for high CXPACKET waits are underestimation of the number of rows flowing through an operator and highly skewed data.
There is no way that we can eleminate the total CXPACKET wait but we only need to worry if it is too high which may degrade the performance
Solution:
If you reduce the degree of parallelism (setting MAXDOP to an explicit value greater than 0), you reduce the number of producer and consumer buffers by a quadratic factor. This change results in fewer CXPACKET locks and consequently fewer CXPACKET waits. The ideal MAXDOP setting for a system is highly dependent on a number of factors such as workload, data access, and data distribution patterns of the query. For OLTP workloads, generally, a MAXDOP value of 1 is reliable. For data warehousing workloads, MAXDOP values ranging between ¼ and ½ the number of logical processors provided by the machine typically works well. In addition, this again depends on a number of factors such as workload, data access, and data distribution. We recommend that you not modify the degree of parallelism value on production systems without sufficient testing.
As an example, if you are experiencing unpredictable multi-user parallel query performance on a machine with 32 logical processors, consider setting the max degree of parallelism sp_configure option to 16 or even 8. For large queries that need the best possible response time, consider reducing the number of concurrently running queries (for example, during a batch window that you control) and running them with a MAXDOP value of 32.
Using the DMV
Select * from sys.dm_os_waiting_tasks where wait_type =’CXPACKET’
we can alos look the wait time for cxpacket
One thought on “Problem solution approach for parallel query processing- CxPacket,Insufficient Memory,Data Skews,Statistics”