Caret7:Development/OpenMP tricks

This page describes quirks discovered while using OpenMP in caret, as well as some general parallelization guidelines.

Quirks

  • OpenMP's default loop schedule is static, with a chunk size equal to (number of indexes) / (number of threads). This means each thread is assigned an equal-sized contiguous chunk of indexes, and will STOP and wait for the other threads when it is finished. For things like correlation, where the workload is unequal across indexes, this is inefficient: as soon as the most lightly loaded thread finishes, the job runs with one less thread than is available. The way to fix this for such jobs is to add the clause "schedule(dynamic, 1)" to the "for" pragma line (use dynamic scheduling with a chunk size of 1; the default chunk size for dynamic scheduling is 1, so technically you don't need to specify it). This causes each thread to work on one index at a time and receive a new index as soon as it finishes, so no thread sits idle until all indexes are claimed (a run-time-selectable variant is sketched after this example):
//#pragma omp parallel for //this pragma will result in poor processor utilization
#pragma omp parallel for schedule(dynamic, 1) //this pragma will utilize all cores fully
for (int i = 0; i < max; ++i)
{//process only upper triangular
   for (int j = i; j < max; ++j)
   {
      doProcessing(i, j, matrix);
   }
}
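When the best schedule isn't obvious, standard OpenMP also offers "schedule(runtime)", which reads the schedule from the OMP_SCHEDULE environment variable, so different schedules can be compared without recompiling. A minimal sketch, using the same loop as above:
//run with e.g. OMP_SCHEDULE="dynamic,1" or OMP_SCHEDULE="guided,4"
#pragma omp parallel for schedule(runtime) //schedule is chosen at run time from OMP_SCHEDULE
for (int i = 0; i < max; ++i)
{//process only upper triangular
   for (int j = i; j < max; ++j)
   {
      doProcessing(i, j, matrix);
   }
}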
  • Other schedule kinds should also solve the problem, such as "guided" (uses large chunks initially, but tapers down to the specified minimum chunk size near the end, again assigning chunks as threads request them; see the sketch after this example). Larger chunk sizes may be better if you know the job will always contain a lot of indexes, in order to reduce scheduling overhead. If the job has extremely fast inner loops, dynamic scheduling may slow execution slightly due to the overhead of its mutexes and scheduling logic, so for jobs with equal workloads, the default static scheduling is probably fine:
//#pragma omp parallel for schedule(dynamic, 1) reduction(+:sum) //this pragma will work, but since the workload is even, dynamic scheduling isn't needed
#pragma omp parallel for reduction(+:sum) //static scheduling is fine for this usage pattern; reduction(+:sum) avoids a data race on sum
for (int i = 0; i < max; ++i)
{//process entire square matrix
   for (int j = 0; j < max; ++j)
   {
      sum += matrix[i][j];
   }
}
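For completeness, here is what the guided approach mentioned above might look like on the uneven upper-triangular loop (a minimal sketch; the minimum chunk size of 4 is an arbitrary example value):
#pragma omp parallel for schedule(guided, 4) //large chunks at first, tapering down, but never smaller than 4
for (int i = 0; i < max; ++i)
{//process only upper triangular
   for (int j = i; j < max; ++j)
   {
      doProcessing(i, j, matrix);
   }
}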
  • The "private" clause is broken in some compilers (notably, the mac compiler for a recent version of XCode), so instead, use the "parallel" pragma by itself first, declare what needs to be private but persist through the loop, then use the "for" pragma on the loop itself:
#pragma omp parallel //do this to make variables thread private if they are declared inside the block
{
   vector<float> myScratchSpace;
//#pragma omp parallel for private(myScratchSpace) //broken in some compilers
#pragma omp for //so do this, but we need private copies, so see the first pragma
   for (int i = 0; i < numColumns; ++i)
   {
      doProcessing(i, myScratchSpace);
   }
} //omp parallel
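If the scratch object does not need to persist between iterations, there is a simpler alternative: a variable declared inside the loop body is automatically private to each thread, with no clause or enclosing block needed. A minimal sketch:
#pragma omp parallel for //no private clause needed
for (int i = 0; i < numColumns; ++i)
{
   vector<float> myScratchSpace; //constructed fresh each iteration, so each thread has its own
   doProcessing(i, myScratchSpace);
}
The pattern above with the enclosing "parallel" block is still preferable when construction is expensive, since it builds the scratch object once per thread rather than once per iteration.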

General guidelines

  • Parallelizing at the lowest level that doesn't introduce much overhead is generally a good idea. If you can parallelize an operation on a single column of a metric, that is better than going parallel only when you have a multi-column metric, since it allows the operation to run in parallel even when there is only one column (see also the note after this example):
//#pragma omp parallel for //not ideal, what if there are only 1 or 2 columns?
for (int i = 0; i < numColumns; ++i)
{
   processColumn(i);
}

void processColumn(int whichColumn)
{
#pragma omp parallel for //good, allows a single column to be processed in parallel
   for (int i = 0; i < numNodes; ++i)
   {
      m_myMetricOut.setValue(i, whichColumn, someFunction(m_myMetricIn.getValue(i, whichColumn)));
   }
}
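A related point (standard OpenMP behavior, not specific to caret): parallelizing at the low level stays safe even if a caller is itself inside a parallel region, because nested parallel regions run with a team of one thread unless nesting is explicitly enabled. A minimal sketch:
#pragma omp parallel for //if someone parallelizes the calling loop anyway...
for (int i = 0; i < numColumns; ++i)
{
   processColumn(i); //...the "parallel for" inside processColumn simply runs serially here, with no oversubscription
}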