Parallel Counting Sort | Dr Dobb's
inline void CountingSort_TBB( unsigned char* a, unsigned long a_size )
{
if ( a_size < 2 ) return;
const unsigned long numberOfCounts = 256;
CountingType count( a ); // contains the count array, which is initialized to all zeros
// Scan the array and count the number of times each value appears
// for( unsigned long i = 0; i < a_size; i++ )
// count[ a[ i ] ]++;
parallel_reduce( blocked_range< unsigned long >( 0, a_size ), count );
// Fill the array with the number of 0's that were counted, followed by the number of 1's, and then 2's and so on
unsigned long n = 0;
for( unsigned long i = 0; i < numberOfCounts; i++ )
for( unsigned long j = 0; j < count.my_count[ i ]; j++, n++ )
a[ n ] = (unsigned char)i;
}
class CountingType
{
unsigned char* my_input_array; // a local copy to the array being counted
// to provide a pointer to each parallel task
public:
static const unsigned long numberOfCounts = 256;
unsigned long my_count[ numberOfCounts ]; // the count for this task
CountingType( unsigned char a[] ) : my_input_array( a ) // constructor, which copies the pointer to the array being counted
{
for( unsigned long i = 0; i < numberOfCounts; i++ ) // initialized all counts to zero, since the array may not contain all values
my_count[ i ] = 0;
}
// Method that performs the core work of counting
void operator()( const blocked_range< unsigned long >& r )
{
unsigned char* a = my_input_array; // these local variables are used to help the compiler optimize the code better
size_t end = r.end();
unsigned long* count = my_count;
for( unsigned long i = r.begin(); i != end; ++i )
count[ a[ i ] ]++;
}
// Splitter (splitting constructor) required by the parallel_reduce
// Takes a reference to the original object, and a dummy argument to distinguish this method from a copy constructor
CountingType( CountingType& x, split ) : my_input_array( x.my_input_array )
{
for( unsigned long i = 0; i < numberOfCounts; i++ ) // initialized all counts to zero, since the array may not contain all values
my_count[ i ] = 0;
}
// Join method required by parallel_reduce
// Combines a result from another task into this result
void join( const CountingType& y )
{
for( unsigned long i = 0; i < numberOfCounts; i++ )
my_count[ i ] += y.my_count[ i ];
}
};
Parallel_reduce() splits the input array into sub-arrays, which are then processed by each computational core in parallel. To do this efficiently, however, each sub-array needs to have its own independent count[] array that is initialized to zeros and then counts the sub-array elements. In other words, instead of sharing the count[] array among multiple threads, thecount[] array is duplicated so that each thread has and operates on its own copy (as illustrated below).
TBB parallel_reduce() recursively splits the input array into P sub-arrays, where P is the number of processors available. During each recursive split, the splitter method is called. After the algorithm completes processing each sub-array, the join method is called, to combine the results from two sub-arrays. This process is illustrated by the diagram below.
The splitter method, which looks like a copy constructor with a dummy split second argument, creates a copy of CountingType variable by copying the pointer to the input array and initializing the count[] array to zeros. This work is necessary to process a new sub-array independently (as shown in the upper half of the graph above -- note count[] and count_1[]are separate arrays).
Read full article from Parallel Counting Sort | Dr Dobb's