http://www.lst.inf.ethz.ch/teaching/lectures/ss10/24/slides/recitation/week04/mm.pdf
Parallel matrix multiplication
Each thread computes its share of the output C
Partition C by columns
http://stackoverflow.com/questions/5484204/parallel-matrix-multiplication-in-java-6
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;

// Algorithm is the poster's benchmark interface (not shown in the question).
public class Java6MatrixMultiply implements Algorithm {

    private static final int SIZE = 2048;
    private static final int THRESHOLD = 64;
    private static final int MAX_THREADS = Runtime.getRuntime().availableProcessors();
    private final ExecutorService executor = Executors.newFixedThreadPool(MAX_THREADS);

    private float[][] a = new float[SIZE][SIZE];
    private float[][] b = new float[SIZE][SIZE];
    private float[][] c = new float[SIZE][SIZE];

    @Override
    public void initialize() {
        init(a, b, SIZE);
    }

    @Override
    public void execute() {
        MatrixMultiplyTask mainTask = new MatrixMultiplyTask(a, 0, 0, b, 0, 0, c, 0, 0, SIZE);
        Future<?> future = executor.submit(mainTask);
        try {
            future.get();
        } catch (Exception e) {
            System.out.println("Error: " + e.getMessage());
        }
    }
    @Override
    public void printResult() {
        check(c, SIZE);
        for (int i = 0; i < SIZE && i <= 10; i++) {
            for (int j = 0; j < SIZE && j <= 10; j++) {
                if (j == 10) {
                    System.out.print("...");
                } else {
                    System.out.print(c[i][j] + " ");
                }
            }
            if (i == 10) {
                System.out.println();
                for (int k = 0; k < 10; k++) System.out.print(" ... ");
            }
            System.out.println();
        }
        System.out.println();
    }
    // To simplify checking, fill with all 1's. Answer should be all n's.
    static void init(float[][] a, float[][] b, int n) {
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                a[i][j] = 1.0F;
                b[i][j] = 1.0F;
            }
        }
    }

    static void check(float[][] c, int n) {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (c[i][j] != n) {
                    throw new Error("Check Failed at [" + i + "][" + j + "]: " + c[i][j]);
                }
            }
        }
    }
    public class Seq implements Runnable {
        private final MatrixMultiplyTask a;
        private final MatrixMultiplyTask b;

        public Seq(MatrixMultiplyTask a, MatrixMultiplyTask b) {
            this.a = a;
            this.b = b;
        }

        public void run() {
            a.run();
            b.run();
        }
    }
    private class MatrixMultiplyTask implements Runnable {
        private final float[][] A; // Matrix A
        private final int aRow;    // first row of current quadrant of A
        private final int aCol;    // first column of current quadrant of A
        private final float[][] B; // Similarly for B
        private final int bRow;
        private final int bCol;
        private final float[][] C; // Similarly for result matrix C
        private final int cRow;
        private final int cCol;
        private final int size;

        public MatrixMultiplyTask(float[][] A, int aRow, int aCol,
                float[][] B, int bRow, int bCol,
                float[][] C, int cRow, int cCol, int size) {
            this.A = A;
            this.aRow = aRow;
            this.aCol = aCol;
            this.B = B;
            this.bRow = bRow;
            this.bCol = bCol;
            this.C = C;
            this.cRow = cRow;
            this.cCol = cCol;
            this.size = size;
        }

        public void run() {
            if (size <= THRESHOLD) {
                multiplyStride2();
            } else {
                int h = size / 2;
                // Each Seq accumulates one quadrant of C in two sequential
                // steps, e.g. C11 += A11*B11 followed by C11 += A12*B21.
                Seq seq1 = new Seq(
                        new MatrixMultiplyTask(A, aRow, aCol,         // A11
                                B, bRow, bCol,                        // B11
                                C, cRow, cCol,                        // C11
                                h),
                        new MatrixMultiplyTask(A, aRow, aCol + h,     // A12
                                B, bRow + h, bCol,                    // B21
                                C, cRow, cCol,                        // C11
                                h));
                Seq seq2 = new Seq(
                        new MatrixMultiplyTask(A, aRow, aCol,         // A11
                                B, bRow, bCol + h,                    // B12
                                C, cRow, cCol + h,                    // C12
                                h),
                        new MatrixMultiplyTask(A, aRow, aCol + h,     // A12
                                B, bRow + h, bCol + h,                // B22
                                C, cRow, cCol + h,                    // C12
                                h));
                Seq seq3 = new Seq(
                        new MatrixMultiplyTask(A, aRow + h, aCol,     // A21
                                B, bRow, bCol,                        // B11
                                C, cRow + h, cCol,                    // C21
                                h),
                        new MatrixMultiplyTask(A, aRow + h, aCol + h, // A22
                                B, bRow + h, bCol,                    // B21
                                C, cRow + h, cCol,                    // C21
                                h));
                Seq seq4 = new Seq(
                        new MatrixMultiplyTask(A, aRow + h, aCol,     // A21
                                B, bRow, bCol + h,                    // B12
                                C, cRow + h, cCol + h,                // C22
                                h),
                        new MatrixMultiplyTask(A, aRow + h, aCol + h, // A22
                                B, bRow + h, bCol + h,                // B22
                                C, cRow + h, cCol + h,                // C22
                                h));

                final FutureTask<Void> s1Task = new FutureTask<Void>(seq2, null);
                final FutureTask<Void> s2Task = new FutureTask<Void>(seq3, null);
                final FutureTask<Void> s3Task = new FutureTask<Void>(seq4, null);
                executor.execute(s1Task);
                executor.execute(s2Task);
                executor.execute(s3Task);
                seq1.run();
                // "Help join": FutureTask.run() is a no-op once the task has
                // started elsewhere, so these calls let this thread execute
                // any subtask still waiting in the pool's queue.
                s1Task.run();
                s2Task.run();
                s3Task.run();
                try {
                    s1Task.get();
                    s2Task.get();
                    s3Task.get();
                } catch (Exception e) {
                    System.out.println("Error: " + e.getMessage());
                    executor.shutdownNow();
                }
            }
        }
        public void multiplyStride2() {
            for (int j = 0; j < size; j += 2) {
                for (int i = 0; i < size; i += 2) {
                    float[] a0 = A[aRow + i];
                    float[] a1 = A[aRow + i + 1];
                    float s00 = 0.0F;
                    float s01 = 0.0F;
                    float s10 = 0.0F;
                    float s11 = 0.0F;
                    for (int k = 0; k < size; k += 2) {
                        float[] b0 = B[bRow + k];
                        s00 += a0[aCol + k] * b0[bCol + j];
                        s10 += a1[aCol + k] * b0[bCol + j];
                        s01 += a0[aCol + k] * b0[bCol + j + 1];
                        s11 += a1[aCol + k] * b0[bCol + j + 1];
                        float[] b1 = B[bRow + k + 1];
                        s00 += a0[aCol + k + 1] * b1[bCol + j];
                        s10 += a1[aCol + k + 1] * b1[bCol + j];
                        s01 += a0[aCol + k + 1] * b1[bCol + j + 1];
                        s11 += a1[aCol + k + 1] * b1[bCol + j + 1];
                    }
                    C[cRow + i][cCol + j] += s00;
                    C[cRow + i][cCol + j + 1] += s01;
                    C[cRow + i + 1][cCol + j] += s10;
                    C[cRow + i + 1][cCol + j + 1] += s11;
                }
            }
        }
    }
}
http://www.cs.indiana.edu/classes/b673-bram/notes/matrix_mult.html

1D Block Layout Algorithm
First consider layout by block columns, with the matrix partitioning given with s = p above. Process k owns all the data in A(:,k), B(:,k), and C(:,k). Let m = myrank, the MPI identifier of my process number. The algorithm relies on a "block daxpy" form of matrix-matrix multiply:
C(:,i) = C(:,i) + A*B(:,i) = C(:,i) + sum_j A(:,j)*B(j,i)

where the sum runs from j = 0 to p-1. (The original notes include a picture of this for block column i = 1.)
Assume the "owner computes" rule: the process that owns a piece of data carries out the computations (and concomitant communication) needed to produce it. Then process m has all of C and B that it needs, but it must access all of A. The basic idea is to first let each process multiply the part it owns,

C(:,m) = C(:,m) + A(:,m)*B(m,m)

where m = myrank, and then send its block column of A to the "next" process, in a shift with wrap-around. The corresponding parallel algorithm is
C(:,m) = C(:,m) + A(:,m)*B(m,m)
Copy A(:,m) to buffer S
for i = 1:p-1
    send S to process (m+1) mod p
    receive S from process (m-1) mod p
    C(:,m) = C(:,m) + S*B((m-i) mod p, m)
end

Quick reality check: which, if any, of the send/receive pair above should be blocking? Which need to be nonblocking?
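As a sanity check on this schedule, here is a single-process Java simulation of the ring algorithm (no MPI: the array S plays the role of each process's buffer, the rotation of S stands in for the send/receive pair, and all names are illustrative):

```java
import java.util.Arrays;

// Single-process simulation of the 1D block-column algorithm.
// p "processes", each owning an n x w block column, with w = n/p.
public class Ring1D {
    static float[][] multiply(float[][] A, float[][] B, int p) {
        int n = A.length, w = n / p;
        float[][] C = new float[n][n];
        // S[m] holds the block column of A currently sitting at process m.
        float[][][] S = new float[p][][];
        for (int m = 0; m < p; m++) S[m] = blockCol(A, m * w, w);
        for (int i = 0; i < p; i++) {
            for (int m = 0; m < p; m++) {
                int j = Math.floorMod(m - i, p);  // S[m] is currently A(:,j)
                // C(:,m) += A(:,j) * B(j,m)
                for (int r = 0; r < n; r++)
                    for (int c = 0; c < w; c++) {
                        float s = 0f;
                        for (int k = 0; k < w; k++)
                            s += S[m][r][k] * B[j * w + k][m * w + c];
                        C[r][m * w + c] += s;
                    }
            }
            // shift with wrap-around: each S moves to process (m+1) mod p
            float[][] last = S[p - 1];
            for (int m = p - 1; m > 0; m--) S[m] = S[m - 1];
            S[0] = last;
        }
        return C;
    }

    static float[][] blockCol(float[][] A, int c0, int w) {
        int n = A.length;
        float[][] blk = new float[n][w];
        for (int r = 0; r < n; r++)
            for (int c = 0; c < w; c++) blk[r][c] = A[r][c0 + c];
        return blk;
    }

    public static void main(String[] args) {
        int n = 8, p = 4;
        float[][] A = new float[n][n], B = new float[n][n];
        for (float[] r : A) Arrays.fill(r, 1f);
        for (float[] r : B) Arrays.fill(r, 1f);
        float[][] C = multiply(A, B, p);
        System.out.println(C[0][0] + " " + C[n - 1][n - 1]); // 8.0 8.0
    }
}
```

With all-ones input the result should be all n's, mirroring the check used in the Java listing above.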
Now apply a timing model to this. Assume that communication follows the linear model
t(k) = tlat + k*tbw

where k is the number of double words sent, tlat is the latency time, and tbw is the per-word time associated with the bandwidth of the communication channel. Each computation step will require
(2n/p) * (n/p) * n = 2n^3/p^2 flops

Each iteration of the for loop involves
t = tlat + (n^2/p)*tbw

of communication time. So the total time is
2n^3/p^2 + (p-1)*[2n^3/p^2 + tlat + (n^2/p)*tbw] = 2n^3/p + (p-1)*tlat + (n^2(p-1)/p)*tbw

Notice that if the total time consisted only of the first term, the algorithm would have perfect speedup, since that term is the total number of flops (which we assume take one time unit each) divided by the number of processors. Hence the other two terms give the overhead due to parallelism. The theoretical speedup is obtained by dividing the above time into 2n^3 (why?), and the parallel efficiency is obtained by dividing the speedup by p, giving
E = 1 / ( 1 + p(p-1)*tlat/(2n^3) + (p-1)*tbw/(2n) )

Note that as n increases, the efficiency approaches 1. This indicates that high efficiencies should be attainable with this algorithm. What could prevent that from happening? The above communication pattern has a natural mapping to which kind(s) of interconnect networks for a distributed memory machine? Although MPI provides an interconnect-independent view, it is important that you be able to "see" natural mappings like these. The way to do it is by looking at the communications involved in the algorithm.
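To see the n-dependence concretely, here is a throwaway evaluation of the efficiency formula (the tlat and tbw values are made up for illustration, measured in units of one flop-time):

```java
public class Efficiency1D {
    // E = 1 / (1 + p(p-1)*tlat/(2n^3) + (p-1)*tbw/(2n))
    static double efficiency(double n, double p, double tlat, double tbw) {
        return 1.0 / (1.0 + p * (p - 1) * tlat / (2 * n * n * n)
                          + (p - 1) * tbw / (2 * n));
    }

    public static void main(String[] args) {
        double tlat = 1000, tbw = 2;   // illustrative values only
        double p = 16;
        for (double n : new double[] {256, 1024, 4096})
            System.out.printf("n=%5.0f  E=%.3f%n", n, efficiency(n, p, tlat, tbw));
        // efficiency rises toward 1 as n grows, for fixed p
    }
}
```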
Also note that the cost associated with computation is O(n^3), while the amount of communication is O(n^2). This is the key to another definition of a "scalable" distributed memory algorithm: the communication costs grow more slowly than the computation costs. Here, however, the definition of scalable has undergone a subtle shift from that given earlier in the course, where it meant "parallel efficiency is bounded below by a positive number, as p grows".
2D Block Layout Algorithm
As you should have guessed above, the 1D algorithm maps naturally to a ring of processors. Of course, since a ring interconnect network is embedded in a rectangular mesh with wrap-around and in a hypercube, it should still work all right on such machines. But it does not fully use the available interconnections, which suggests a better algorithm is possible.

Suppose now that p is a perfect square, and s^2 = p. The matrices are partitioned and assigned to an s x s mesh of processes. This version of matrix-matrix multiply uses the formula for the matrix-matrix product that has a block inner product as the innermost loop:
C(i,j) = C(i,j) + sum_k A(i,k)*B(k,j)

where the summation goes from k = 0 to k = s-1. For the algorithm below, note that we can perform the summation in any order desired (strictly speaking, this relies on the associativity of addition, which floating-point arithmetic only approximates), so it can instead be written as
C(i,j) = C(i,j) + sum_k A(i, (i+j+k) mod s)*B((i+j+k) mod s, j)

The algorithm has a preliminary skewing phase, followed by a multiplication/shift phase indicated by the shift of indices above. Initially suppose that s = 3, so that A looks like
    [ A(0,0) A(0,1) A(0,2) ]
A = [ A(1,0) A(1,1) A(1,2) ]
    [ A(2,0) A(2,1) A(2,2) ]

and B, C are similarly partitioned.
Phase 1A:
for i = 0:s-1
    left circular shift row i of A by amount i:  A(i,j) <-- A(i, (j+i) mod s)
end for

At the end of Phase 1A, the matrix A looks like:
    [ A(0,0) A(0,1) A(0,2) ]
A = [ A(1,1) A(1,2) A(1,0) ]
    [ A(2,2) A(2,0) A(2,1) ]
Phase 1B:
for j = 0:s-1
    upward circular shift column j of B by amount j:  B(i,j) <-- B((i+j) mod s, j)
end for

At the end of Phase 1B, the matrix B looks like:
    [ B(0,0) B(1,1) B(2,2) ]
B = [ B(1,0) B(2,1) B(0,2) ]
    [ B(2,0) B(0,1) B(1,2) ]

After this skewing, the actual computation begins, with a circular shift on each step:
Phase 2:
for k = 0:s-1
    for i = 0:s-1 and j = 0:s-1
        C(i,j) = C(i,j) + A(i,j)*B(i,j)
    end
    left circular shift each row of A by 1:      A(i,j) <-- A(i, (j+1) mod s)
    upward circular shift each column of B by 1: B(i,j) <-- B((i+1) mod s, j)
end for

The original notes give a set of pictures for this process, for s = 4 (not reproduced here).
The timing model for this is straightforward. First, note that in the skewing Phase 1A each row can proceed independently. The last row has the furthest to shift blocks, requiring (in a mesh-connected network) s-1 messages to get a block where it belongs. Since each process in the row can be sending a message simultaneously, the communication time for Phase 1A is

(s-1)*(tlat + (n/s)^2*tbw)

and clearly Phase 1B takes the same time. In Phase 2, each shift costs tlat + (n/s)^2*tbw, and each local multiplication takes 2(n/s)^3 flops. Each loop iteration in the "for i = 0:s-1 and j = 0:s-1" can be done in parallel. Since p = s^2, the theoretical time is
Tp = 2n^3/p + 3*sqrt(p)*tlat + (3n^2/sqrt(p))*tbw

and the corresponding speedup and efficiency are easily obtained. Note that, like the first version, there is no redundant arithmetic, but now the communication cost is a factor of sqrt(p) smaller. Why is this to be expected?
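The skew-and-shift schedule above can be checked with a small sequential simulation in Java (blocks stored as an s x s grid; all names are illustrative, and the shifts here are array copies standing in for messages):

```java
// Sequential simulation of the 2D block algorithm on an s x s grid of
// w x w blocks; the three phases mirror the pseudocode in the text.
public class Cannon2D {
    static float[][] multiply(float[][] A, float[][] B, int s) {
        int n = A.length, w = n / s;
        float[][][][] a = toBlocks(A, s, w), b = toBlocks(B, s, w);
        float[][] C = new float[n][n];
        // Phase 1A: left circular shift row i of A by amount i.
        for (int i = 0; i < s; i++) a[i] = shiftLeft(a[i], i);
        // Phase 1B: upward circular shift column j of B by amount j.
        for (int j = 0; j < s; j++) shiftUpCol(b, j, j);
        // Phase 2: multiply the co-resident blocks, then shift by one.
        for (int k = 0; k < s; k++) {
            for (int i = 0; i < s; i++)
                for (int j = 0; j < s; j++)
                    addProduct(C, a[i][j], b[i][j], i * w, j * w, w);
            for (int i = 0; i < s; i++) a[i] = shiftLeft(a[i], 1);
            for (int j = 0; j < s; j++) shiftUpCol(b, j, 1);
        }
        return C;
    }

    static float[][][][] toBlocks(float[][] M, int s, int w) {
        float[][][][] blk = new float[s][s][w][w];
        for (int i = 0; i < s * w; i++)
            for (int j = 0; j < s * w; j++)
                blk[i / w][j / w][i % w][j % w] = M[i][j];
        return blk;
    }

    static float[][][] shiftLeft(float[][][] row, int amt) {
        int s = row.length;
        float[][][] out = new float[s][][];
        for (int j = 0; j < s; j++) out[j] = row[(j + amt) % s];
        return out;
    }

    static void shiftUpCol(float[][][][] b, int j, int amt) {
        int s = b.length;
        float[][][] col = new float[s][][];
        for (int i = 0; i < s; i++) col[i] = b[(i + amt) % s][j];
        for (int i = 0; i < s; i++) b[i][j] = col[i];
    }

    static void addProduct(float[][] C, float[][] a, float[][] b,
                           int r0, int c0, int w) {
        for (int i = 0; i < w; i++)
            for (int j = 0; j < w; j++) {
                float sum = 0f;
                for (int k = 0; k < w; k++) sum += a[i][k] * b[k][j];
                C[r0 + i][c0 + j] += sum;
            }
    }

    public static void main(String[] args) {
        int n = 6, s = 3;
        float[][] A = new float[n][n], B = new float[n][n];
        for (float[] r : A) java.util.Arrays.fill(r, 1f);
        for (float[] r : B) java.util.Arrays.fill(r, 1f);
        float[][] C = multiply(A, B, s);
        System.out.println(C[0][0] + " " + C[n - 1][n - 1]); // 6.0 6.0
    }
}
```

Again the all-ones check applies: every entry of C should come out equal to n, confirming that the shifted index schedule visits each k exactly once.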
http://stackoverflow.com/questions/5254294/using-executorservice-with-a-tree-of-tasks-to-perform
Java 7 has the concept of a ForkJoinPool, which allows a task to "fork" off another task by submitting it to the same executor. The forking task can later attempt to "help join" the forked task by running it itself if it has not yet been started.
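The quadrant decomposition from the Java 6 listing above maps naturally onto this. Here is a hedged sketch (the class name MulTask and the plain triple-loop base case are illustrative, not from the original post) of the same computation as a RecursiveAction:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Illustrative recast of the quadrant task as a Java 7 RecursiveAction.
class MulTask extends RecursiveAction {
    static final int THRESHOLD = 32;
    final float[][] A, B, C;
    final int aR, aC, bR, bC, cR, cC, size;

    MulTask(float[][] A, int aR, int aC, float[][] B, int bR, int bC,
            float[][] C, int cR, int cC, int size) {
        this.A = A; this.aR = aR; this.aC = aC;
        this.B = B; this.bR = bR; this.bC = bC;
        this.C = C; this.cR = cR; this.cC = cC;
        this.size = size;
    }

    @Override
    protected void compute() {
        if (size <= THRESHOLD) {            // base case: C quadrant += A*B
            for (int i = 0; i < size; i++)
                for (int j = 0; j < size; j++) {
                    float s = 0f;
                    for (int k = 0; k < size; k++)
                        s += A[aR + i][aC + k] * B[bR + k][bC + j];
                    C[cR + i][cC + j] += s;
                }
            return;
        }
        int h = size / 2;
        // Round 1: Cxy += Ax1 * B1y, all four quadrants in parallel.
        invokeAll(new MulTask(A, aR, aC, B, bR, bC, C, cR, cC, h),
                  new MulTask(A, aR, aC, B, bR, bC + h, C, cR, cC + h, h),
                  new MulTask(A, aR + h, aC, B, bR, bC, C, cR + h, cC, h),
                  new MulTask(A, aR + h, aC, B, bR, bC + h, C, cR + h, cC + h, h));
        // Round 2: Cxy += Ax2 * B2y; starts only after round 1 completes,
        // so no two concurrent tasks ever write the same quadrant.
        invokeAll(new MulTask(A, aR, aC + h, B, bR + h, bC, C, cR, cC, h),
                  new MulTask(A, aR, aC + h, B, bR + h, bC + h, C, cR, cC + h, h),
                  new MulTask(A, aR + h, aC + h, B, bR + h, bC, C, cR + h, cC, h),
                  new MulTask(A, aR + h, aC + h, B, bR + h, bC + h, C, cR + h, cC + h, h));
    }

    public static void main(String[] args) {
        int n = 64;
        float[][] a = new float[n][n], b = new float[n][n], c = new float[n][n];
        for (float[] r : a) java.util.Arrays.fill(r, 1f);
        for (float[] r : b) java.util.Arrays.fill(r, 1f);
        new ForkJoinPool().invoke(new MulTask(a, 0, 0, b, 0, 0, c, 0, 0, n));
        System.out.println(c[0][0] + " " + c[n - 1][n - 1]); // 64.0 64.0
    }
}
```

Note how invokeAll replaces the hand-rolled FutureTask submit-then-run pattern of the Java 6 version: the calling task helps execute its forked children instead of blocking a pool thread.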
http://www.cse.buffalo.edu/faculty/miller/Courses/CSE633/Ortega-Fall-2012-CSE633.pdf
http://www.slideshare.net/pkpramit/matrix-multiplicationan-example-of-concurrent-programming
http://www.cs.berkeley.edu/~yelick/cs267-sp04/lectures/13/lect13-pmatmul-6x.pdf