Parallel Counting Sort | Dr Dobb's
inline void CountingSort_TBB( unsigned char* a, unsigned long a_size ){ if ( a_size < 2 ) return; const unsigned long numberOfCounts = 256; CountingType count( a ); // contains the count array, which is initialized to all zeros // Scan the array and count the number of times each value appears // for( unsigned long i = 0; i < a_size; i++ ) // count[ a[ i ] ]++; parallel_reduce( blocked_range< unsigned long >( 0, a_size ), count ); // Fill the array with the number of 0's that were counted, followed by the number of 1's, and then 2's and so on unsigned long n = 0; for( unsigned long i = 0; i < numberOfCounts; i++ ) for( unsigned long j = 0; j < count.my_count[ i ]; j++, n++ ) a[ n ] = (unsigned char)i;}class CountingType{ unsigned char* my_input_array; // a local copy to the array being counted // to provide a pointer to each parallel taskpublic: static const unsigned long numberOfCounts = 256; unsigned long my_count[ numberOfCounts ]; // the count for this task CountingType( unsigned char a[] ) : my_input_array( a ) // constructor, which copies the pointer to the array being counted { for( unsigned long i = 0; i < numberOfCounts; i++ ) // initialized all counts to zero, since the array may not contain all values my_count[ i ] = 0; } // Method that performs the core work of counting void operator()( const blocked_range< unsigned long >& r ) { unsigned char* a = my_input_array; // these local variables are used to help the compiler optimize the code better size_t end = r.end(); unsigned long* count = my_count; for( unsigned long i = r.begin(); i != end; ++i ) count[ a[ i ] ]++; } // Splitter (splitting constructor) required by the parallel_reduce // Takes a reference to the original object, and a dummy argument to distinguish this method from a copy constructor CountingType( CountingType& x, split ) : my_input_array( x.my_input_array ) { for( unsigned long i = 0; i < numberOfCounts; i++ ) // initialized all counts to zero, since the array may not contain all values my_count[ i ] = 0; } // Join method required by parallel_reduce // Combines a result from another task into this result void join( const CountingType& y ) { for( unsigned long i = 0; i < numberOfCounts; i++ ) my_count[ i ] += y.my_count[ i ]; }};
Parallel_reduce() splits the input array into sub-arrays, which are then processed by each computational core in parallel. To do this efficiently, however, each sub-array needs to have its own independent count[] array that is initialized to zeros and then counts the sub-array elements. In other words, instead of sharing the count[] array among multiple threads, thecount[] array is duplicated so that each thread has and operates on its own copy (as illustrated below).
TBB parallel_reduce() recursively splits the input array into P sub-arrays, where P is the number of processors available. During each recursive split, the splitter method is called. After the algorithm completes processing each sub-array, the join method is called, to combine the results from two sub-arrays. This process is illustrated by the diagram below.
The splitter method, which looks like a copy constructor with a dummy split second argument, creates a copy of CountingType variable by copying the pointer to the input array and initializing the count[] array to zeros. This work is necessary to process a new sub-array independently (as shown in the upper half of the graph above -- note count[] and count_1[]are separate arrays).
Read full article from Parallel Counting Sort | Dr Dobb's