Caret7:Development/OpenMP tricks

From Van Essen Lab


Revision as of 21:55, 7 September 2011

This page describes quirks discovered while using OpenMP in caret, as well as some general parallelization guidelines.

Quirks

  • OpenMP's default loop scheduler is static, with a chunk size equal to (number of indexes) / (number of threads). This means each thread is assigned an equal-sized contiguous chunk of indexes, and will STOP and wait for the other threads when it is finished. For things like correlation, where the workload is unequal among indexes, this is inefficient: as soon as the lightest-loaded thread finishes, the job runs on one fewer thread than is available. The fix for such jobs is to add the clause "schedule(dynamic, 1)" (use dynamic scheduling with a chunk size of 1) to the "for" pragma line; this makes each thread work on one index at a time and claim a new index as soon as it finishes, so no thread sits idle until all indexes are claimed. Larger chunk sizes may be used to reduce overhead if you know the job will always contain many indexes. If the job has extremely fast inner loops, dynamic scheduling may slow execution somewhat due to the overhead of mutexes and scheduling logic, so for jobs with equal workloads the default static scheduling is probably fine.
  • The "private" clause is broken in some compilers (notably, the Mac compiler in a recent version of Xcode), so instead, use the "parallel" pragma by itself first, declare inside that region any variables that need to be private but persist through the loop, then use the "for" pragma on the loop itself.

General guidelines

  • Parallelizing at the lowest level that doesn't introduce much overhead is generally a good idea: if you can parallelize an operation on a single column of a metric, that is better than going parallel only when you have a multi-column metric, since the operation then always runs in parallel.