http://www.lst.inf.ethz.ch/teaching/lectures/ss10/24/slides/recitation/week04/mm.pdf
Parallel matrix multiplication
Each thread computes its share of the output C
Partition C by columns
http://stackoverflow.com/questions/5484204/parallel-matrix-multiplication-in-java-6
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;

// Algorithm is the poster's benchmark interface (not shown in the question).
public class Java6MatrixMultiply implements Algorithm {

    private static final int SIZE = 2048;
    private static final int THRESHOLD = 64;
    private static final int MAX_THREADS = Runtime.getRuntime().availableProcessors();
    private final ExecutorService executor = Executors.newFixedThreadPool(MAX_THREADS);

    private float[][] a = new float[SIZE][SIZE];
    private float[][] b = new float[SIZE][SIZE];
    private float[][] c = new float[SIZE][SIZE];

    @Override
    public void initialize() {
        init(a, b, SIZE);
    }

    @Override
    public void execute() {
        MatrixMultiplyTask mainTask = new MatrixMultiplyTask(a, 0, 0, b, 0, 0, c, 0, 0, SIZE);
        Future<?> future = executor.submit(mainTask);
        try {
            future.get();
        } catch (Exception e) {
            System.out.println("Error: " + e.getMessage());
        }
    }
    @Override
    public void printResult() {
        check(c, SIZE);
        for (int i = 0; i < SIZE && i <= 10; i++) {
            for (int j = 0; j < SIZE && j <= 10; j++) {
                if (j == 10) {
                    System.out.print("...");
                } else {
                    System.out.print(c[i][j] + " ");
                }
            }
            if (i == 10) {
                System.out.println();
                for (int k = 0; k < 10; k++) System.out.print(" ... ");
            }
            System.out.println();
        }
        System.out.println();
    }
    // To simplify checking, fill with all 1's. Answer should be all n's.
    static void init(float[][] a, float[][] b, int n) {
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                a[i][j] = 1.0F;
                b[i][j] = 1.0F;
            }
        }
    }

    static void check(float[][] c, int n) {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (c[i][j] != n) {
                    throw new Error("Check Failed at [" + i + "][" + j + "]: " + c[i][j]);
                }
            }
        }
    }
    public class Seq implements Runnable {
        private final MatrixMultiplyTask a;
        private final MatrixMultiplyTask b;

        public Seq(MatrixMultiplyTask a, MatrixMultiplyTask b) {
            this.a = a;
            this.b = b;
        }

        public void run() {
            a.run();
            b.run();
        }
    }
    private class MatrixMultiplyTask implements Runnable {
        private final float[][] A; // Matrix A
        private final int aRow;    // first row of current quadrant of A
        private final int aCol;    // first column of current quadrant of A
        private final float[][] B; // Similarly for B
        private final int bRow;
        private final int bCol;
        private final float[][] C; // Similarly for result matrix C
        private final int cRow;
        private final int cCol;
        private final int size;

        public MatrixMultiplyTask(float[][] A, int aRow, int aCol,
                float[][] B, int bRow, int bCol,
                float[][] C, int cRow, int cCol, int size) {
            this.A = A;
            this.aRow = aRow;
            this.aCol = aCol;
            this.B = B;
            this.bRow = bRow;
            this.bCol = bCol;
            this.C = C;
            this.cRow = cRow;
            this.cCol = cCol;
            this.size = size;
        }

        public void run() {
            if (size <= THRESHOLD) {
                multiplyStride2();
            } else {
                int h = size / 2;
                // Each Seq accumulates one quadrant of C in two sequential
                // steps, e.g. C11 += A11*B11 followed by C11 += A12*B21.
                Seq seq1 = new Seq(
                        new MatrixMultiplyTask(A, aRow, aCol,         // A11
                                B, bRow, bCol,                        // B11
                                C, cRow, cCol,                        // C11
                                h),
                        new MatrixMultiplyTask(A, aRow, aCol + h,     // A12
                                B, bRow + h, bCol,                    // B21
                                C, cRow, cCol,                        // C11
                                h));
                Seq seq2 = new Seq(
                        new MatrixMultiplyTask(A, aRow, aCol,         // A11
                                B, bRow, bCol + h,                    // B12
                                C, cRow, cCol + h,                    // C12
                                h),
                        new MatrixMultiplyTask(A, aRow, aCol + h,     // A12
                                B, bRow + h, bCol + h,                // B22
                                C, cRow, cCol + h,                    // C12
                                h));
                Seq seq3 = new Seq(
                        new MatrixMultiplyTask(A, aRow + h, aCol,     // A21
                                B, bRow, bCol,                        // B11
                                C, cRow + h, cCol,                    // C21
                                h),
                        new MatrixMultiplyTask(A, aRow + h, aCol + h, // A22
                                B, bRow + h, bCol,                    // B21
                                C, cRow + h, cCol,                    // C21
                                h));
                Seq seq4 = new Seq(
                        new MatrixMultiplyTask(A, aRow + h, aCol,     // A21
                                B, bRow, bCol + h,                    // B12
                                C, cRow + h, cCol + h,                // C22
                                h),
                        new MatrixMultiplyTask(A, aRow + h, aCol + h, // A22
                                B, bRow + h, bCol + h,                // B22
                                C, cRow + h, cCol + h,                // C22
                                h));

                final FutureTask<Void> s1Task = new FutureTask<Void>(seq2, null);
                final FutureTask<Void> s2Task = new FutureTask<Void>(seq3, null);
                final FutureTask<Void> s3Task = new FutureTask<Void>(seq4, null);
                executor.execute(s1Task);
                executor.execute(s2Task);
                executor.execute(s3Task);
                seq1.run();
                // "Help join": FutureTask.run() is a no-op once the task has
                // started elsewhere, so these calls let this thread execute
                // any subtask still waiting in the pool's queue.
                s1Task.run();
                s2Task.run();
                s3Task.run();
                try {
                    s1Task.get();
                    s2Task.get();
                    s3Task.get();
                } catch (Exception e) {
                    System.out.println("Error: " + e.getMessage());
                    executor.shutdownNow();
                }
            }
        }
        public void multiplyStride2() {
            for (int j = 0; j < size; j += 2) {
                for (int i = 0; i < size; i += 2) {
                    float[] a0 = A[aRow + i];
                    float[] a1 = A[aRow + i + 1];
                    float s00 = 0.0F;
                    float s01 = 0.0F;
                    float s10 = 0.0F;
                    float s11 = 0.0F;
                    for (int k = 0; k < size; k += 2) {
                        float[] b0 = B[bRow + k];
                        s00 += a0[aCol + k] * b0[bCol + j];
                        s10 += a1[aCol + k] * b0[bCol + j];
                        s01 += a0[aCol + k] * b0[bCol + j + 1];
                        s11 += a1[aCol + k] * b0[bCol + j + 1];
                        float[] b1 = B[bRow + k + 1];
                        s00 += a0[aCol + k + 1] * b1[bCol + j];
                        s10 += a1[aCol + k + 1] * b1[bCol + j];
                        s01 += a0[aCol + k + 1] * b1[bCol + j + 1];
                        s11 += a1[aCol + k + 1] * b1[bCol + j + 1];
                    }
                    C[cRow + i][cCol + j] += s00;
                    C[cRow + i][cCol + j + 1] += s01;
                    C[cRow + i + 1][cCol + j] += s10;
                    C[cRow + i + 1][cCol + j + 1] += s11;
                }
            }
        }
    }
}
http://www.cs.indiana.edu/classes/b673-bram/notes/matrix_mult.html

1D Block Layout Algorithm
First consider layout by block columns, with the matrix partitioning given with s = p above. Process k owns all the data in A(:,k), B(:,k), and C(:,k). Let m = myrank, the MPI identifier of my process number. The algorithm relies on a "block daxpy" form of matrix-matrix multiply:
C(:,i) = C(:,i) + A*B(:,i) = C(:,i) + sum_j A(:,j)*B(j,i)

where the sum runs from j = 0 to p-1. (The original notes include a picture of this for block column i = 1.)
Assume the "owner computes" rule: the process that owns a piece of data carries out the computations (and concomitant communication) needed to produce it. Then process m has all of C and B that it needs, but it must access all of A. The basic idea is to first let each process multiply the part it owns,

C(:,m) = C(:,m) + A(:,m)*B(m,m)

where m = myrank, and then send its block column of A to the "next" process, in a shift with wrap-around. The corresponding parallel algorithm is
C(:,m) = C(:,m) + A(:,m)*B(m,m)
Copy A(:,m) to buffer S
for i = 1:p-1
    send S to process (m+1) mod p
    receive S from process (m-1) mod p
    C(:,m) = C(:,m) + S*B((m-i) mod p, m)
end

Quick reality check: which, if any, of the send/receive pair above should be blocking? Which need to be nonblocking?
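As a sanity check on this schedule, here is a single-process Java simulation of the ring algorithm (no MPI: the array S plays the role of each process's buffer, the rotation of S stands in for the send/receive pair, and all names are illustrative):

```java
import java.util.Arrays;

// Single-process simulation of the 1D block-column algorithm.
// p "processes", each owning an n x w block column, with w = n/p.
public class Ring1D {
    static float[][] multiply(float[][] A, float[][] B, int p) {
        int n = A.length, w = n / p;
        float[][] C = new float[n][n];
        // S[m] holds the block column of A currently sitting at process m.
        float[][][] S = new float[p][][];
        for (int m = 0; m < p; m++) S[m] = blockCol(A, m * w, w);
        for (int i = 0; i < p; i++) {
            for (int m = 0; m < p; m++) {
                int j = Math.floorMod(m - i, p);  // S[m] is currently A(:,j)
                // C(:,m) += A(:,j) * B(j,m)
                for (int r = 0; r < n; r++)
                    for (int c = 0; c < w; c++) {
                        float s = 0f;
                        for (int k = 0; k < w; k++)
                            s += S[m][r][k] * B[j * w + k][m * w + c];
                        C[r][m * w + c] += s;
                    }
            }
            // shift with wrap-around: each S moves to process (m+1) mod p
            float[][] last = S[p - 1];
            for (int m = p - 1; m > 0; m--) S[m] = S[m - 1];
            S[0] = last;
        }
        return C;
    }

    static float[][] blockCol(float[][] A, int c0, int w) {
        int n = A.length;
        float[][] blk = new float[n][w];
        for (int r = 0; r < n; r++)
            for (int c = 0; c < w; c++) blk[r][c] = A[r][c0 + c];
        return blk;
    }

    public static void main(String[] args) {
        int n = 8, p = 4;
        float[][] A = new float[n][n], B = new float[n][n];
        for (float[] r : A) Arrays.fill(r, 1f);
        for (float[] r : B) Arrays.fill(r, 1f);
        float[][] C = multiply(A, B, p);
        System.out.println(C[0][0] + " " + C[n - 1][n - 1]); // 8.0 8.0
    }
}
```

With all-ones input the result should be all n's, mirroring the check used in the Java listing above.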
Now apply a timing model to this. Assume that communication follows the linear model
t(k) = tlat + k*tbw

where k is the number of double words sent, tlat is the latency time, and tbw is the per-word time associated with the bandwidth of the communication channel. Each computation step will require
(2n/p) * (n/p) * n = 2n^3/p^2 flops

Each iteration of the for loop involves
t = tlat + (n^2/p)*tbw

of communication time. So the total time is
2n^3/p^2 + (p-1)*[2n^3/p^2 + tlat + (n^2/p)*tbw] = 2n^3/p + (p-1)*tlat + (n^2(p-1)/p)*tbw

Notice that if the total time consisted only of the first term, the algorithm would have perfect speedup, since that term is the total number of flops (which we assume take one time unit each) divided by the number of processors. Hence the other two terms give the overhead due to parallelism. The theoretical speedup is obtained by dividing the above time into 2n^3 (why?), and the parallel efficiency is obtained by dividing the speedup by p, giving
E = 1 / ( 1 + p(p-1)*tlat/(2n^3) + (p-1)*tbw/(2n) )

Note that as n increases, the efficiency approaches 1. This indicates that high efficiencies should be attainable with this algorithm. What could prevent that from happening? The above communication pattern has a natural mapping to which kind(s) of interconnect networks for a distributed memory machine? Although MPI provides an interconnect-independent view, it is important that you be able to "see" natural mappings like these. The way to do it is by looking at the communications involved in the algorithm.
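To see the n-dependence concretely, here is a throwaway evaluation of the efficiency formula (the tlat and tbw values are made up for illustration, measured in units of one flop-time):

```java
public class Efficiency1D {
    // E = 1 / (1 + p(p-1)*tlat/(2n^3) + (p-1)*tbw/(2n))
    static double efficiency(double n, double p, double tlat, double tbw) {
        return 1.0 / (1.0 + p * (p - 1) * tlat / (2 * n * n * n)
                          + (p - 1) * tbw / (2 * n));
    }

    public static void main(String[] args) {
        double tlat = 1000, tbw = 2;   // illustrative values only
        double p = 16;
        for (double n : new double[] {256, 1024, 4096})
            System.out.printf("n=%5.0f  E=%.3f%n", n, efficiency(n, p, tlat, tbw));
        // efficiency rises toward 1 as n grows, for fixed p
    }
}
```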
Also note that the cost associated with computation is O(n^3), while the amount of communication is O(n^2). This is the key to another definition of a "scalable" distributed memory algorithm: the communication costs grow more slowly than the computation costs. Here, however, the definition of scalable has undergone a subtle shift from that given earlier in the course, where it meant "parallel efficiency is bounded below by a positive number, as p grows".
2D Block Layout Algorithm
As you should have guessed above, the 1D algorithm maps naturally to a ring of processors. Of course, since a ring interconnect network is embedded in a rectangular mesh with wrap-around and in a hypercube, it should still work all right on such machines. But it does not fully use the available interconnections, which suggests a better algorithm is possible.

Suppose now that p is a perfect square, and s^2 = p. The matrices are partitioned and assigned to an s x s mesh of processes. This version of matrix-matrix multiply uses the formula for the matrix-matrix product that has a block inner product as the innermost loop:
C(i,j) = C(i,j) + sum_k A(i,k)*B(k,j)

where the summation goes from k = 0 to k = s-1. For the algorithm below, note that we can perform the summation in any order desired (strictly speaking, this relies on the associativity of addition, which floating-point arithmetic only approximates), so it can instead be written as
C(i,j) = C(i,j) + sum_k A(i, (i+j+k) mod s)*B((i+j+k) mod s, j)

The algorithm has a preliminary skewing phase, followed by a multiplication/shift phase indicated by the shift of indices above. Initially suppose that s = 3, so that A looks like
    [ A(0,0) A(0,1) A(0,2) ]
A = [ A(1,0) A(1,1) A(1,2) ]
    [ A(2,0) A(2,1) A(2,2) ]

and B, C are similarly partitioned.
Phase 1A:
for i = 0:s-1
    left circular shift row i of A by amount i:  A(i,j) <-- A(i, (j+i) mod s)
end for

At the end of Phase 1A, the matrix A looks like:
    [ A(0,0) A(0,1) A(0,2) ]
A = [ A(1,1) A(1,2) A(1,0) ]
    [ A(2,2) A(2,0) A(2,1) ]
Phase 1B:
for j = 0:s-1
    upward circular shift column j of B by amount j:  B(i,j) <-- B((i+j) mod s, j)
end for

At the end of Phase 1B, the matrix B looks like:
    [ B(0,0) B(1,1) B(2,2) ]
B = [ B(1,0) B(2,1) B(0,2) ]
    [ B(2,0) B(0,1) B(1,2) ]

After this skewing, the actual computation begins, with a circular shift on each step:
Phase 2:
for k = 0:s-1
    for i = 0:s-1 and j = 0:s-1
        C(i,j) = C(i,j) + A(i,j)*B(i,j)
    end
    left circular shift each row of A by 1:      A(i,j) <-- A(i, (j+1) mod s)
    upward circular shift each column of B by 1: B(i,j) <-- B((i+1) mod s, j)
end for

The original notes give a set of pictures for this process, for s = 4 (not reproduced here).
The timing model for this is straightforward. First, note that in the skewing Phase 1A each row can proceed independently. The last row has the furthest to shift blocks, requiring (in a mesh-connected network) s-1 messages to get a block where it belongs. Since each process in the row can be sending a message simultaneously, the communication time for Phase 1A is

(s-1)*(tlat + (n/s)^2*tbw)

and clearly Phase 1B takes the same time. In Phase 2, each shift costs tlat + (n/s)^2*tbw, and each local multiplication takes 2(n/s)^3 flops. Each loop iteration in the "for i = 0:s-1 and j = 0:s-1" can be done in parallel. Since p = s^2, the theoretical time is
Tp = 2n^3/p + 3*sqrt(p)*tlat + (3n^2/sqrt(p))*tbw

and the corresponding speedup and efficiency are easily obtained. Note that, like the first version, there is no redundant arithmetic, but now the communication cost is a factor of sqrt(p) smaller. Why is this to be expected?
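The skew-and-shift schedule above can be checked with a small sequential simulation in Java (blocks stored as an s x s grid; all names are illustrative, and the shifts here are array copies standing in for messages):

```java
// Sequential simulation of the 2D block algorithm on an s x s grid of
// w x w blocks; the three phases mirror the pseudocode in the text.
public class Cannon2D {
    static float[][] multiply(float[][] A, float[][] B, int s) {
        int n = A.length, w = n / s;
        float[][][][] a = toBlocks(A, s, w), b = toBlocks(B, s, w);
        float[][] C = new float[n][n];
        // Phase 1A: left circular shift row i of A by amount i.
        for (int i = 0; i < s; i++) a[i] = shiftLeft(a[i], i);
        // Phase 1B: upward circular shift column j of B by amount j.
        for (int j = 0; j < s; j++) shiftUpCol(b, j, j);
        // Phase 2: multiply the co-resident blocks, then shift by one.
        for (int k = 0; k < s; k++) {
            for (int i = 0; i < s; i++)
                for (int j = 0; j < s; j++)
                    addProduct(C, a[i][j], b[i][j], i * w, j * w, w);
            for (int i = 0; i < s; i++) a[i] = shiftLeft(a[i], 1);
            for (int j = 0; j < s; j++) shiftUpCol(b, j, 1);
        }
        return C;
    }

    static float[][][][] toBlocks(float[][] M, int s, int w) {
        float[][][][] blk = new float[s][s][w][w];
        for (int i = 0; i < s * w; i++)
            for (int j = 0; j < s * w; j++)
                blk[i / w][j / w][i % w][j % w] = M[i][j];
        return blk;
    }

    static float[][][] shiftLeft(float[][][] row, int amt) {
        int s = row.length;
        float[][][] out = new float[s][][];
        for (int j = 0; j < s; j++) out[j] = row[(j + amt) % s];
        return out;
    }

    static void shiftUpCol(float[][][][] b, int j, int amt) {
        int s = b.length;
        float[][][] col = new float[s][][];
        for (int i = 0; i < s; i++) col[i] = b[(i + amt) % s][j];
        for (int i = 0; i < s; i++) b[i][j] = col[i];
    }

    static void addProduct(float[][] C, float[][] a, float[][] b,
                           int r0, int c0, int w) {
        for (int i = 0; i < w; i++)
            for (int j = 0; j < w; j++) {
                float sum = 0f;
                for (int k = 0; k < w; k++) sum += a[i][k] * b[k][j];
                C[r0 + i][c0 + j] += sum;
            }
    }

    public static void main(String[] args) {
        int n = 6, s = 3;
        float[][] A = new float[n][n], B = new float[n][n];
        for (float[] r : A) java.util.Arrays.fill(r, 1f);
        for (float[] r : B) java.util.Arrays.fill(r, 1f);
        float[][] C = multiply(A, B, s);
        System.out.println(C[0][0] + " " + C[n - 1][n - 1]); // 6.0 6.0
    }
}
```

Again the all-ones check applies: every entry of C should come out equal to n, confirming that the shifted index schedule visits each k exactly once.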
http://stackoverflow.com/questions/5254294/using-executorservice-with-a-tree-of-tasks-to-perform
Java 7 has the concept of a ForkJoinPool, which allows a task to "fork" off another task by submitting it to the same executor. The forking task can later attempt to "help join" the forked task by running it itself if it has not yet been started.
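The quadrant decomposition from the Java 6 listing above maps naturally onto this. Here is a hedged sketch (the class name MulTask and the plain triple-loop base case are illustrative, not from the original post) of the same computation as a RecursiveAction:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Illustrative recast of the quadrant task as a Java 7 RecursiveAction.
class MulTask extends RecursiveAction {
    static final int THRESHOLD = 32;
    final float[][] A, B, C;
    final int aR, aC, bR, bC, cR, cC, size;

    MulTask(float[][] A, int aR, int aC, float[][] B, int bR, int bC,
            float[][] C, int cR, int cC, int size) {
        this.A = A; this.aR = aR; this.aC = aC;
        this.B = B; this.bR = bR; this.bC = bC;
        this.C = C; this.cR = cR; this.cC = cC;
        this.size = size;
    }

    @Override
    protected void compute() {
        if (size <= THRESHOLD) {            // base case: C quadrant += A*B
            for (int i = 0; i < size; i++)
                for (int j = 0; j < size; j++) {
                    float s = 0f;
                    for (int k = 0; k < size; k++)
                        s += A[aR + i][aC + k] * B[bR + k][bC + j];
                    C[cR + i][cC + j] += s;
                }
            return;
        }
        int h = size / 2;
        // Round 1: Cxy += Ax1 * B1y, all four quadrants in parallel.
        invokeAll(new MulTask(A, aR, aC, B, bR, bC, C, cR, cC, h),
                  new MulTask(A, aR, aC, B, bR, bC + h, C, cR, cC + h, h),
                  new MulTask(A, aR + h, aC, B, bR, bC, C, cR + h, cC, h),
                  new MulTask(A, aR + h, aC, B, bR, bC + h, C, cR + h, cC + h, h));
        // Round 2: Cxy += Ax2 * B2y; starts only after round 1 completes,
        // so no two concurrent tasks ever write the same quadrant.
        invokeAll(new MulTask(A, aR, aC + h, B, bR + h, bC, C, cR, cC, h),
                  new MulTask(A, aR, aC + h, B, bR + h, bC + h, C, cR, cC + h, h),
                  new MulTask(A, aR + h, aC + h, B, bR + h, bC, C, cR + h, cC, h),
                  new MulTask(A, aR + h, aC + h, B, bR + h, bC + h, C, cR + h, cC + h, h));
    }

    public static void main(String[] args) {
        int n = 64;
        float[][] a = new float[n][n], b = new float[n][n], c = new float[n][n];
        for (float[] r : a) java.util.Arrays.fill(r, 1f);
        for (float[] r : b) java.util.Arrays.fill(r, 1f);
        new ForkJoinPool().invoke(new MulTask(a, 0, 0, b, 0, 0, c, 0, 0, n));
        System.out.println(c[0][0] + " " + c[n - 1][n - 1]); // 64.0 64.0
    }
}
```

Note how invokeAll replaces the hand-rolled FutureTask submit-then-run pattern of the Java 6 version: the calling task helps execute its forked children instead of blocking a pool thread.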
http://www.cse.buffalo.edu/faculty/miller/Courses/CSE633/Ortega-Fall-2012-CSE633.pdf
http://www.slideshare.net/pkpramit/matrix-multiplicationan-example-of-concurrent-programming
http://www.cs.berkeley.edu/~yelick/cs267-sp04/lectures/13/lect13-pmatmul-6x.pdf