Chapter 27
Multithreaded Algorithms

Document last modified: 

Resources

Intel Cilk++ Overview - http://software.intel.com/sites/products/evaluation-guides/docs/cilk-plus-evaluation-guide.pdf

Windows - Intel Cilk++ automatically integrates with Visual Studio 8.

  1. Download and install Cilk++ - http://software.intel.com/en-us/articles/download-intel-cilk-sdk/
  2. Depending upon the Parallel Studio and Visual Studio versions, you may need to download the updates also.

    Examples

    Official Intel examples, fail in converting to Visual Studio 10.

    • Extract examples from Intel\Cilk directory, this will create a directory with several Cilk++ examples and VS solutions - cilkexamples.msi
    • Open one of the solutions in the examples directory.
    Same examples that convert to Visual Studio 10.
    • Download and extract the examples.
    • Open one of the solutions in the examples directory.
    Or, after installing Cilk++, create a C++ Windows Console application, copy the following to Maximum.cilk and execute
    Maximum.cilk
    #include <cilk.h>
    #include <stdio.h>
    #include <math.h>
    
    int Pmax(int A[], int l, int r) {
       int m, x, y;
    
       if(l>=r) return A[l];
       m=(int)floor((double)((l+r)/2));
    
       x = cilk_spawn Pmax(A, l, m);
       y = cilk_spawn Pmax(A, m+1, r);
       cilk_sync;
    
       if (x > y) return x;
       else return y;
    }
    
    int cilk_main(int argc, char *argv[]) {
       int A[8] = {6, 3, 5, 7, 8, 1, 4, 2};
       int x;
    
       x = cilk_spawn Pmax(A, 0, 7);
    
       cilk_sync;
    
       printf("maximum: %d\n", x);
       return 0;
    }

Mac - Cilk from MIT

Compiler as installed on Mac OSX - Cilktest.zip

  1. Open terminal window, change directory to Cilktest.
  2. sudo mv /usr/include/mach/i386/_structs.h /usr/include/mach/i386/_structs.h.original
  3. sudo cp _structs.h /usr/include/mach/i386

Compiler - http://sourceforge.net/projects/cilkformac/files/cilk-5.4.6.tar.gz/download

Installing Cilk on Mac OSX - http://d.hatena.ne.jp/yuyarin/20101026/1288036789

Compile/Execute commands at terminal shell

alias cilkc='cilkc -D__CILK'   Create alias for changes to: /usr/include/secure/_string.h
cilkc -O2 examples/fib.cilk -o fib   Compile fib.cilk example
time ./fib --nproc 1 40
  Result: 102334155
  real 0m6.833s
  user 0m5.849s
  sys 0m0.009s
  Timed execution on one core.
time ./fib --nproc 2 40
  Result: 102334155
  real 0m3.507s
  user 0m5.904s
  sys 0m0.033s
  Timed execution on two cores.

Fibonacci execution comparison using C++, Open MP and Cilk++ - http://myxman.org/dp/node/182

Cilk lecture notes Rice University - http://www.clear.rice.edu/comp422/lecture-notes/comp422-2011-Lecture4-Cilk.pdf

Overview

Current multicore processors generally support shared-memory parallel computation, where multiple threads can access a common state in parallel.

 Figure at right illustrates a dual core processor sharing access to common memory with algorithm instructions or data. Each thread (processor core) can execute code independent of other, concurrent threads.

Static-threading, where threads generally exist for life of computation, programmer responsible for partitioning problem to balance load between threads, communication between threads, and scheduling thread execution.

Dynamic-multithreading allows programmers to specify parallelism without specifying communication protocol, load balancing, scheduling, etc. Though still evolving, main features are:

Model advantages

  1. Multithreading extension of serial programming

    parallel - executes for iterations concurrently.

    spawn - executes spawned function in parallel with caller.

    sync - waits for any spawned threads to complete.
     

  2. Quantify parallelism by work and span.
     
  3. Analysis of recursive parallel algorithms using recurrences.
     
  4. Faithful to parallel computing evolution, e.g. multi-core technology.

 

Ideal parallel computer and sequentially consistency

 

Question 27.0

  1. Do both sum array A = <1,2,3,4,5,6,7,8>?
  2. What is the recurrence for sum(A, l, r)? T(n) = ?

  3. Draw the recursion tree for sum(A, l, r).

  4. What is the upper bound of each sum?

  5. What is the recursion tree height, in terms of n?

  6. Suppose each call to sum(A, l, r) was executed on a new processor.

    How many total processors would we need to sum the 8 values?

    Upper bound execution time, in terms of n?

  7. What is 1024 / lg 210?
sum(A, n)
  1. result = 0
  2. for i = 1 to n
  3.    result  = result + A[ i ]
  4. return result
sum(A, l, r)
  1. if l ≥ r return A[ l ]
  2. m=⌊(l+r)/2⌋
  3. return sum(A, l, m) + sum(A, m+1, r)

 

 

27.1 The basics of dynamic multithreading

A first look

Recursion tree - each box represents a processor and call sum(A, l, r).

The top level processor is result of initial sum( <1,2,3,4,5,6,7,8>, 0, 7)

Subsequent processors are result of spawn.

Serial
sum(A, l, r)
  1. if l ≥ r return A[ l ]  -- size 1 problem
  2. m=⌊(l+r)/2⌋
  3. return sum(A, l, m) + sum(A, m+1, r)
 

 

Work                             Master Theorem Case 1

T1(n) =  2T1(n/2) + Θ(1)
         = Θ( n )                    

              

Parallel
sum(A, l, r)
  1. if l ≥ r return A[ l ]   -- size 1 problem
  2. m = ⌊(l+r)/2⌋
  3.     x = spawn sum(A, l, m)
  4.     y = spawn sum(A, m+1, r)
  5. sync
  6. return  x + y

Span                                              Master Theorem Case 2

T(n) = T(n/2) + Θ( 1 )         T(n/2) running 2 threads
          = Θ( lg n ) + Θ( 1 )         
          = Θ( lg n )

Parallelism

T1(n) / T(n) = Θ( n ) / Θ( lg n ) = Θ( n / lg n )

Question 27.1.1

  1. The number of subproblems double but the size is reduced by half for each. Upper bound for execution on one processor?
     
  2. How many processors are used?
     
  3. Recursion tree height in terms of n?
     
  4. Upper bound guess for the parallel version?

 

Fibonacci

Fibonacci (serial version)
F0 = 0

F1 = 1

Fi = Fi-1 + Fi-2           for i ≥ 2 

Fib( n )
  1. if n 1
  2.     return n
  3. else x = Fib(n-1)
  4.         y = Fib(n-2)
  5.         return x + y

Analysis T(n) = Θ( φn )

T(n) = T(n-1) + T(n-2) + Θ(1)

Assume T(n) aFn - b        a > 1 and b > 0

T(n) = T(n-1) + T(n-2) + Θ(1)  
  aFn-1 - b + aFn-2 - b + Θ(1) Substitute
  = a(Fn-1 + Fn-2) - 2b + Θ(1)  
  = aFn - b  -(b - Θ(1)) Fibonacci definition
  aFn - b for b ≥ Θ(1)

T(n) = Θ( φn ) where φ = 1.61803... and Fn = φn (see equation 3.25)

Fibonacci (serial version)

Fibonacci (parallel version)

spawn P-Fib(n-1) - spawns execution of P-Fib(n-1) that may execute concurrently with y = P-Fib(n-2)

sync - waits for completion of all executions spawned from current execution.

Fibonacci (parallel version)
P-Fib( n )
  1. if n 1
  2.     return n
  3. else x = spawn P-Fib(n-1)
  4.         y = P-Fib(n-2)
  5.         sync
  6.         return x + y

Strands

strand - Group of instruction chains having no parallel control (spawn, sync or return from spawn). These instructions are all executed serially in a single thread.

Following identifies instructions comprising the three types of strands:

Black - base case or procedure instructions up to spawn

Gray - instructions following spawn and up to sync

White - instructions following sync up to return, that is x+y

Logically in series - when directed path from strand u to strand v. Below, strands 1 → 2, 1 → 3, 1 → 4, 2 → 3, 2 → 4, 3 → 4 are in series.

Logically in parallel - when no directed path from strand u to strand v. Below, strands 5 and 3 are parallel.

Critical path - longest path in a DAG, passing through strands 1, 2, 3 and 4, of length 4.

Fibonacci (parallel version)
P-Fib( 2 )

if 2 1

    return 2

else x = spawn P-Fib(1)

        y = P-Fib(0)

        sync

        return x + y

Question 27.1.2 - In above, which strands are parallel or serial?

  1. 1 and 5
     
  2. 2 and 5
     
  3. 4 and 5

Critical path marked by heavy edges passing through 8 strands of execution.

Because strands 1-8 in same path, execution is logically serial.  Defines the longest execution time, upper bound.

Assuming each strand takes unit time, 17 time units of work (each circle is one unit).

Question 27.2 

Does the left subroutine exhibit more parallelism (do more work in less time) than the right or just use more processors?

Draw the execution of strands for the left similar to that below at right.

P-Fib( n )
  1. if n 1
  2.     return n
  3. else x = spawn P-Fib(n-1)
  4.         y = spawn P-Fib(n-2)
  5.         sync
  6.         return x + y
P-Fib( n )
  1. if n 1
  2.     return n
  3. else x = spawn P-Fib(n-1)
  4.         y = P-Fib(n-2)
  5.         sync
  6.         return x + y

Following identifies instructions comprising the three types of strands:

Black - base case or procedure instructions up to spawn

Gray - instructions following spawn and up to sync

White - instructions following sync up to return, that is x+y

 

Performance measures

Serial: When two sub-computations joined in series, work is sum of their work and the span is the sum of their spans.

Parallel: When joined in parallel, work is sum of their work and the span is the maximum of their spans.

work - total time to execute on one processor. Figure below marks 17 time units (strands in this case) of work.

span - longest time to execute in any path in DAG. Figure below marks 8 time units (strands in this case) shaded edges in critical path.

T1 - work, running time on 1 processor

Tp - running time on P processors

T- span, running time on ∞ processors

work law - Tp ≥ T1/P

P processors can complete at most P units of work in one step.

At best, P processors requires time T1/P or linear improvement.

T1 = 100s

P = 5

T1/P = 100s /5 = 20s

span law - Tp ≥ T

An ∞-processor machine can emulate a P-processor machine, the time on a P-processor machine will always be equal or greater than a ∞-processor machine.

speedup - T1/Tp

Ratio number of times faster on P-processor machine.

T1/T5 = 20s / 10s = 2 indicates a 5-processor machine produces a 2 times speedup.

T1/Tp   P (work law) so speedup of P-processors at most P.

T1/Tp = P = Θ(P) is perfect or linear speedup.

parallelism - T1/T

Ratio of work to the span. Can be viewed as:

  1. Average amount of work performed in parallel along the critical path.

  2. As upper bound, the maximum possible speedup on ∞-processor machine.

  3. Limit on obtaining linear speedup.

    After number of processors exceeds algorithm parallelism, additional processors cannot improve speedup.

Example       P-Fib(4 )

T1 = 17        work

T = 8          span

T1/T = 17/8 = 2.125 parallelism

No matter the number of processors, the algorithm parallelism limits speedup to 2.125 times for P-Fib(4).

Larger n for Fibonacci allows greater parallelism hence greater speedup possible by additional processors.

Question 27.3 

  1. T1/T = parallelism.  What happens as parallel execution time T decreases?

  2. What does a larger number for parallelism mean?

  3. Consider the recursion tree for Fibonacci and that each node is a separate processor.

    What is obvious that supports the statement: "Larger n allows greater parallelism hence greater speedup possible by additional processors."

 

parallel slackness - (T1/T)/P = T1/PT

Factor that algorithm parallelism exceeds number of processors.

T1/PT decreases as P increases.  

 T1/PT < 1 cannot achieve linear speedup, means number of processors already exceeds parallelism.

T1/PT > 1 number of processors is limiting factor.

T1 = 100s

T = 5s

P = 10

parallelism - T1/T = 100s /5s = 20

T1/PT = 100s/ 5s*10 = 2    more processors would reduce parallel slackness.

 

 

Scheduling - online, centralized, greedy

Schedulers determine which strands are executed on a given processor.

Text's multithreaded model does not specify which strands execute on which processor. Rather than the programmer, the scheduler must then determine in order to balance processors' load.

Without any a priori knowledge of strands' execution - scheduler must operate online - always available.

Centralized, greedy scheduler:

Best time possible on P processors

Tp = T1/P             linear speedup

T1/P for complete step - when P strands available to be assigned to P-processors.

Tp = T                plenty of processors

T for incomplete step - when less than P strands available to be assigned to P-processors.

Theorem 27.1 - On an ideal P-processor computer, a greedy scheduler executes a multithreaded computation with work T1 and T span in time:

Tp   T1/P + T

Tp is computation time on P-processors.

T1/P + T is upper-bound of a greedy scheduler. The time to execute all complete and incomplete steps.

Corollary 27.2 - Running time Tp of any multithreaded computation scheduled by a greedy scheduler on ideal parallel P-processor computer is within a factor of 2 of optimal.

 

Analyzing multithreaded algorithms

Fib( n )
  1. if n 1
  2.     return n
  3. else x = Fib(n-1)
  4.         y = Fib(n-2)
  5.         return x + y

 

P-Fib( n )
  1. if n 1
  2.     return n
  3. else x = spawn P-Fib(n-1)
  4.         y = P-Fib(n-2)
  5.         sync
  6.         return x + y

Serial           Fib( n )

T(n) = Θ( φn )

Parallel         P-Fib( n )

T1(n) = T(n) = Θ( φn )

T(n) = max( T(n-1), T(n-2) ) + Θ( 1 )             T(n-1) is worst (max) case.

         = T(n-1) + Θ( 1 )

         = Θ( n )                                                         by Chip & Conquer

Parallelism increases as n increases

T1(n) / T(n) = Θ( φn ) / Θ( n ) = Θ( φn / n )

limit φn / n → ∞

Fibonacci execution comparison using C++, Open MP and Cilk++ - http://myxman.org/dp/node/182

For small values of n

T1(n) < T(n)

T1(n)/PT(n) < 1 cannot achieve linear speedup, means number of processors already exceeds parallelism.

parallelism less than 1, serial execution time less

2 processors

serial 0.08s
parallel 0.49s

Cutoff

Execution tests showed a cutoff at n 30 to use serial Fibonacci function and parallel for n > 30 yielded significant speedup over pure serial or pure parallel.

Use of cutoff, an example of improving performance by coarsening parallelism.

CILK_NPROC=1 ./a.out 35
Serial fib(35) = 9227465 Time: 0.090000
Parallel fib(35) = 9227465 Time: 0.730000
Parallel fib(35) cutoff(30) = 9227465 Time: 0.070000

CILK_NPROC=2 ./a.out 35
Serial fib(35) = 9227465 Time: 0.080000
Parallel fib(35) = 9227465 Time: 0.490000
Parallel fib(35) cutoff(30) = 9227465 Time: 0.040000

CILK_NPROC=4 ./a.out 35
Serial fib(35) = 9227465 Time: 0.130000
Parallel fib(35) = 9227465 Time: 0.260000
Parallel fib(35) cutoff(30) = 9227465 Time: 0.020000

Question 27.4

The following finds the maximum of an array using divide-and-conquer.

P-max(A, l, r)

   if l ≥ r return A[ l ]
   m=⌊(l+r)/2⌋

   x = spawn P-max(A, l, m)
   y = spawn P-max(A, m+1, r)

   sync
   if x > y return x
   else return y

For: P-max( <6, 3, 5, 7, 8, 1, 4, 2>, 0, 7)

  1. Draw the serial recursion tree by ignoring the spawn and sync.

  2. Are the parallel threads the same as recursion tree?

  3. What is the length of the Critical Path?

  4. What is lg 8 = lg 23?

  5. Determine the parallelism.

 

Simulated parallel execution of P-max for large n (1000) shows near linear speedup and parallelism = 121.06     

Our parallelism analysis for n=1024 yields:

1024 / lg 210 = 102


Work
                        
T1(n) = 2T1(n/2)+Θ(1)
         = Θ(n)

Span Critical path

T(n) = T(n/2)+Θ(1)
         = Θ(lg n)

  Case 1 Master Theorem




T(n/2) not 2T(n/2)
because running 2 threads
Case 2 Master 2

Parallelism

T1(n) / T(n)
           =Θ(n) / Θ(lg n)
           =Θ(n / lg n)

            
Recursion tree height = lg n

Number recursions = 2lg n+1-1

P-max(A, l, r)

   if l ≥ r return A[ l ]
   m=⌊(l+r)/2⌋

   x = spawn P-max(A, l, m)
   y = spawn P-max(A, m+1, r)

   sync
   if x > y return x
   else return y
 

Question 27.5

Give a parallel version that multiplies each element of an array by 2 using divide-and-conquer.

P-double( <6, 3, 5, 7, 8, 1, 4, 2>, 0, 7)

Number of calls as function of n?

Recursion tree height as function of n?

T1(n) =

T(n) =

T1(n) / T(n) =

 

Coarsening

Spawns are relatively costly compared to simply calling a subroutine but, in divide-and-conquer algorithms, for example, the cost can be amortized over parallel spawns.

At each node, there are two spawns but the overall spawn cost is constant for the next level while amount of work completed doubles.

For small n problems, the percentage of cost due to spawns is higher because there are fewer parallel threads over which to spread the cost. 

Small n problems can often be solved more efficiently serially than in parallel.

Solving small problems serially and larger problems in parallel is called parallelism coarsening.

As an example, P-max has been coarsened to solve problems of size 8 or smaller serially, it has a cutoff of 8.

Fibonacci had a cutoff of 30.

P-max(A, l, r)

   if l ≥ r return A[ l ]
   m=⌊(l+r)/2⌋

   x = spawn P-max(A, l, m)
   y = spawn P-max(A, m+1, r)

   sync
   if x > y return x
   else return y
P-max(A, l, r)

   if r-l 8
       max = A[ l ]
       for i = l+1 to r
            if A[ i ] > max
               max = A[ i ]
       return max

   m=⌊(l+r)/2⌋

   x = spawn P-max(A, l, m)
   y = spawn P-max(A, m+1, r)

   sync
   if x > y return x
   else return y

 

Parallel loops

Multiply n element vector by 2.

VecDouble( A )
  1. n = A.length
  2. parallel for i = 1 to n
  3.      Ai = Ai * 2
A 1 2 3 4
  20 22 24 26

=
A 1 2 3 4
  10 11 12 13

* 2

Line 3: n parallel executions, each with a different value for i: Ai = Ai * 2

Therefore each loop iteration uses a different array variable, one assignment is independent of all others, can occur concurrently, that is in any order.

A1 =  A1 * 2

A3 =  A3 * 2


gives same results as
A3 =  A3 * 2

A1 =  A1 * 2

parallel for can be implemented by compiler using divide-and-conquer.

VecDouble( A )
  1. n = A.length
  2. parallel for i = 1 to n
  3.      Ai = Ai * 2
VecDouble( A )
  1. n = A.length
  2. spawn P-VecDouble(A, 1, n)
  3. sync

 

 

P-VecDouble(A, i, i')
  1. if i == i'              -- Single array element
  2.   Ai = Ai * 2
  3. else mid = ⌊(i+i')/2⌋
  4.     spawn P-VecDouble(A, i, mid)
  5.     spawn P-VecDouble(A, mid+1, i')
  6.     sync
  7. return
Work    

T1(n) = 2T1(n/2) + Θ( 1 )
         = Θ( n ) 

Span

T(n) = T(n/2) + Θ( 1 )
         =  Θ( lg n )

Parallelism

T1(n) / T(n) = Θ( n ) / Θ( lg n )
                     =  Θ( n / lg n)

Question 27.6.1

1. Give a parallel version using parallel for that searches an array for a value and returns the index if found or -1.

LinearSearch( <16, 13, 15, 17, 15, 12, 15, 14>, 15) returns index 3.

2. Give a parallel version using divide-and-conquer that searches an array for a value and returns the index if found or -1.

LinearSearch(A,x,n)
  1. location = -1
  2. for i = 1 to n
  3.    if A[ i ] == x
  4.        location = i
  5. return location
Hint: For analysis, recall that the parallel for operating on an array can be synthesized by a parallel divide-and-conquer of the array.

3. For the parallel algorithm, what is:

T1(n)

T(n), the span of parallel for (recall that parallel for is equivalent to divide-and-conquer)

T1(n) / T(n)

4. Is there a best or worst case for your algorithm?

5. Does your algorithm always return the same value, given the same data?

Example

Add two n x n arrays A and B.

Mat-Vec( A, B )
  1. n = A.rows
  2. let y be a new n x n array
  3. parallel for i == 1 to n
  4.    parallel for j = 1 to n
  5.      yij =  Bij + Aij
  6. return y
y 1 2 3
1 30 32 34
2 36 38 40
3 42 44 46
 

=

A 1 2 3
1 10 11 12
2 13 14 15
3 16 17 18
 

+

B 1 2 3
1 20 21 22
2 23 24 25
3 26 27 28

Line 5: Notice that each addition is independent of other additions, can occur concurrently, that is in any order.

y22 =  B22 + A22

y23 =  B23 + A23


gives same results as
y23 =  B23 + A23

y22 =  B22 + A22

Example

Multiply n x n array A by n-vector x.

Mat-Vec( A, x )                   Correct
  1. n = A.rows
  2. let y be a new vector of length n
  3. parallel for i = 1 to n
  4.    yi = 0
  5. parallel for i = 1 to n
  6.    for j = 1 to n
  7.      yi = yi + Aij * xj
  8. return y
Mat-Vec( A, x )                   Wrong
  1. n = A.rows
  2. let y be a new vector of length n
  3. parallel for i = 1 to n
  4.    yi = 0
  5. parallel for i = 1 to n
  6.    parallel for j = 1 to n
  7.      yi = yi + Aij * xj
  8. return y
 
y 1 2 3
  6 12 18
 

=

A 1 2 3
1 1 1 1
2 2 2 2
3 3 3 3
 

*

 
x 1 2 3
  1 2 3

Line 6: Why not use parallel for j = 1 to n

Line 7: Because allows two or more processors to access same variable in shared memory.

When an instruction that is non-atomic:

yi = yi + Aij * xj

reads and writes a shared variable yi, race conditions can occur that produce results different than serial execution.

Correct Wrong
Processor
5
y5 = y5 + A5j * xj
Processor
9
y9 = y9 + A9j * xj
Processor
5
y5 = y9 + A5j * xj
Processor
9
y9 = y5 + A9j * xj

Example

In following, what can happen at Line 7 of the wrong version, each processor must first read value of y into processor register.

No guaranteed order in which processors will access shared memory or complete execution.

Serial
y ← 0

y ← y + 2

Parallel
P1 starts and ends, then
P2 starts and ends.

P1:  rP1 ← 0

       y ← rP1

       rP1 ← y

       rP1 ← rP1 + 2

       y ← rP1

P2:  rP2 ← 0

       y ← rP2

       rP2 ← y

       rP2 ← rP2 + 2

       y ← rP2


can give different results

 

 

Question 27.6.2

What is y at the end of each?

 

P1 and P2 intermix execution.

P1:  rP1 ← 0

       y ← rP1

       rP1 ← y

P2:  rP2 ← 0

       y ← rP2

P1:  rP1 ← rP1 + 2

       y ← rP1

P2:  rP2 ← y

       rP2 ← rP2 + 2

       y ← rP2

Analysis

Mat-Vec( A, x )                  
  1. n = A.rows
  2. let y be a new vector of length n
  3. parallel for i = 1 to n
  4.    yi = 0
  5. parallel for i = 1 to n
  6.    for j = 1 to n
  7.      yi = yi + Aij * xj
  8. return y

T1(n), work of Mat-Vec(A: n x n matrix, x), is serial running time by replacing parallel for with for.

T1(n) = Θ( n2 )

parallel for implemented as divide-and-conquer subroutine using nested parallelism.

Mat-Vec( A, x )                 
  1. n = A.rows
  2. let y be a new vector of length n
  3. Mat-Vec-Init-Loop(y, 1, n)
  4. Mat-Vec-Main-Loop(A, x, y, n, 1, n)
  5. return y

 

Mat-Vec-Main-Loop(A, x, y, n, i, i')
  1. if i == i'                        -- Single row of matrix
  2.    for j = 1 to n             -- serial for
  3.       yi = yi + Aij * xj
  4. else mid = ⌊(i+i')/2⌋
  5.     spawn Mat-Vec-Main-Loop(A, x, y, n, i, mid)
  6.     Mat-Vec-Main-Loop(A, x, y, n, mid+1, i')
  7.     sync
  8. return

Mat-Vec-Init-Loop(y, i, i')

  1. if i == i'                        -- Single array element
  2.   yi = 0
  3. else mid = ⌊(i+i')/2⌋
  4.     spawn Mat-Vec-Init-Loop(y, i, mid)
  5.     Mat-Vec-Init-Loop(y, mid+1, i')
  6.     sync
  7. return

T(n), span of parallel for, as implemented using divide-and-conquer.

lg n - depth of recursion of parallel for i = 1 to n

iter(i) span of ith iteration

T(n) = Θ( lg n ) = Θ( lg n ) + Θ( 1 ), for Mat-Vec-Init-Loop(y, 1, n)

Θ( lg n ) recursive spawning dominates the constant work, Θ( 1 ), of yi = 0 and other constant time parts of subroutine

T(n) = Θ( n ) = Θ( lg n ) + Θ( n ) + Θ( 1 ), for Mat-Vec-Main-Loop(A, x, y, n, 1, n)

Θ( lg n ) the recursive spawning tree height.

Θ( n ) the maximum iter(i), since all n have same span executing Lines 2 & 3

for i = 1 to n    yi = yi + Aij * xj.

Θ( 1 ) the constant time parts of subroutine.

parallelism

T1(n) / T(n) = Θ( n2 ) / Θ( n ) = Θ( n )

Counting strands does not work here because most of time spent in serial execution.

Question 27.6.3 - For below DAG:

What is the work, T1(n)?

What is the Critical Path of the following execution DAG?

What is its length, T(n)?

Do those agree with our analysis, T1(n) / T(n) = Θ( n )?

Mat-Vec(A: 8 x 8 matrix, x) DAG is:

Mat-Vec-Main-Loop(A, x, y, n, i, i')
  1. if i == i'                        -- Single row of matrix
  2.    for j = 1 to n
  3.       yi = yi + Aij * xj
  4. else mid = ⌊(i+i')/2⌋
  5.     spawn Mat-Vec-Main-Loop(A, x, y, n, i, mid)
  6.     Mat-Vec-Main-Loop(A, x, y, n, mid+1, i')
  7.     sync
  8. return

 

Race conditions

Determinacy race when two or more logically parallel instructions access same memory location and at least one performs a memory write.

Two parallel strands executing:

x = x + 1

non-atomic, composed of instructions:

  1. read x into a register
  2. increment register
  3. write register to x
Race-Example
  1. x = 0
  2. parallel for i = 1 to 2
  3.        x = x + 1
  4. print x

Following diagrams how parallel execution produces wrong result due to race condition.

The wrong implementation achieves a span of Θ( lg n ) but introduces a race condition, at Line 7.

Line 5 and 6 allow n2 concurrent threads, with the parallel for j threads attempting to execute:  yi = yi + Aij * xj

Mat-Vec( A, x )                   Correct
  1. n = A.rows
  2. let y be a new vector of length n
  3. parallel for i == 1 to n
  4.    yi =0
  5. parallel for i == 1 to n
  6.    for j = 1 to n
  7.      yi = yi + Aij * xj
  8. return y

 

Mat-Vec( A, x )                   Wrong
  1. n = A.rows
  2. let y be a new vector of length n
  3. parallel for i == 1 to n
  4.    yi =0
  5. parallel for i == 1 to n
  6.    parallel for j = 1 to n
  7.      yi = yi + Aij * xj
  8. return y

 

Question 27.7

Below is our parallel version of linear search using parallel for that searches an array for a value and returns the index if found or -1.

  1. Does a race condition exist? If so, where?

  2. Does P-search( <16, 13, 15, 17, 15, 12, 15, 19>, 15, 8) return index 3, 5 or 7?

  3. Will the answer always be the same, given the same array? If not, how is another answer possible?
P-search(A, x, l, r)

   if l ≥ r then
       if A[ l ] = x return l
       else return -1

   m=⌊(l+r)/2⌋

   a = spawn P-search(A, x, l, m)
   b = spawn P-search(A, x, m+1, r)

   sync
   if
a != -1 return a
   else return b
P-search(A, x, n)
  1. location ← -1
  2. parallel for i ← 1 to n
  3.    if A[i] = x
  4.        location ← i
  5. return location

 

27.3 Multithreaded merge sort

Serial Merge_Sort and Merge
Merge (A, p, q, r)
-- alters A
-- preserves p, q, r
-- pre: p ≤ q < r  &  A[p..q] & A[(q + 1)..r] sorted
-- post: A[p..r] sorted
--         forall m; p≤m<r; A[m]≤A[m+1]
1 n1 ← q - p + 1
2 n2 ← r - q
3 -- Copy L[1..n1]←A[p..q]
--         R[1..n2]←A[q+1..r]
4 for i ← 1 to n1 do
5    L[i] ← A[p + i - 1]
6 for j ← 1 to n2 do
7    R[j] ← A[q + j]
8 L[n1 + 1] ← ∞
9 R[n2 + 1] ← ∞
10 i ← 1
11 j ← 1
  -- forall m; p≤m<k-1; A[m] ≤ A[m+1]
12 for k ← p to r do
13    if L[i] ≤ R[j] then
14       A[k] ← L[i]
15       i ← i + 1
16    else
17       A[k] ← R[j]
18       j ← j + 1
Merge_Sort (A, p, r)
-- alters A
-- post: A[p..r] is sorted
--   forall m; p≤m<r; A[m]≤A[m+1]
1 if p < r then
2    q ← ⌊(p + r) / 2⌋
3    Merge_Sort (A, p, q)
4    Merge_Sort (A, q + 1, r)
5    Merge (A, p, q, r)

Example

Merge_Sort(A, 9, 16)

Merge_Sort(A, 9, 12)

Merge_Sort(A, 13, 16)

Merge(A, 9, 12, 16)

Merge_Sort(A, 9, 16) result

Work     T1(n) =  2T1(n/2) + Θ(n) = Θ(n lg n)            Merge is Θ(n)

Serial version
Merge_Sort (A, p, r)
1 if p < r then
2    q ← floor ((p + r) / 2)
3    Merge_Sort (A, p, q)
4    Merge_Sort (A, q + 1, r)
5    Merge (A, p, q, r)
Parallel version
Merge_Sort' (A, p, r)
1 if p < r then
2    q ← ⌊(p + r) / 2⌋
3    spawn Merge_Sort' (A, p, q)
4    Merge_Sort' (A, q + 1, r)
5    sync
6    Merge (A, p, q, r)

Span       

T(n) = T(n/2) + Θ( n )                            T(n/2) not 2T(n/2) because running 2 threads
          = Θ( n )                                             by Master Theorem case 3

                                                                    f(n) > nlog21 = 1

                                                                    f(n) = Θ( n ) → f(n) = Ω(nlog21+1 )

parallelism

T1(n) / T(n) = Θ(n lg n) / Θ( n ) = Θ( lg n )

Not great parallelism, lg 1024 = lg 210 = 10

Reducing the span, T(n), increases parallelism.

T(n) = Θ( n ) due to serial Merge.

 

Multithreaded Merge

Text defines a multithreaded P-Merge resulting in P-Merge-Sort parallelism of:  Θ( n/lg2 n )

P-Merge-Sort(A, p, r, B, s)
  1. n = r - p + 1
  2. if n == 1
  3.    B[s] = A[p]
  4. else T[1..n] = new array
  5.    q = ⌊(r + p) / 2⌋  
  6.    q' = q - p + 1
  7.    spawn P-Merge-Sort(A, p, r, T, 1)
  8.                P-Merge-Sort(A, q+1, r, T, q'+1)
  9.    sync
  10.    P-Merge(T, 1, q', q'+1, n, B, s)
P-Merge(T, p1, r1, p2, r2, A, p3)
  1. n1 = r1 - p1 - 1
  2. n2 = r2 - p2 - 1
  3. if n1 < n2
  4.    p1 ↔ p2
  5.    r1 ↔ r2
  6.    n1 ↔ n2
  7. if n1 == 0
  8.    return
  9. else  q1 = ⌊(r1 + p1) / 2⌋
  10.          q2 = BinarySearch(T[ q1 ], T, p2, r2)
  11.          q3 = p3 + (q1 - 1) + (q2 - p2)
  12.          A[ q3 ] = T[ q1 ]
  13.          spawn P-Merge(T, p1, q1-1, p2, q2-1, A, p3)
  14.                      P-Merge(T, q1+1, r1, q2, r2,  A, q3+1)
  15.          sync

How P-Merge works

A is merged into B array.

Merges two arrays of length n1 and n2 using divide-and-conquer by reducing array size to n = 0.

1-6. Ensures n1 ≥ n2.
7-8. Base case, arrays are empty.
9. Lower part of T contains one sorted array, upper part of T the other.

The middle element, T[ q1 ], of lower sorted array is median, all elements below T[ q1 ] are less than or equal and all elements above are greater than or equal.

10. Using BinarySearch, find where T[ q1 ] could be inserted into upper sorted array, index q2, and array still be sorted. 
11. q3 is index where to copy T[ q1 ] into A to maintain sorted order.
12. Copy T[ q1 ] into A[ q3 ]
13. Recursively merge lower half.
14. Recursively merge upper half.

Analysis of P-Merge

BinarySearch is serial, Θ(lg n)

PM(n) = PM(3n/4) + Θ(lg n)
            = Θ(lg2 n)                              Exercise 4.6-2

PM1(n) = Θ(n)

PM1(n) / PM(n) = Θ(n) / Θ(lg2 n) = Θ(n / lg2 n)

Analysis of P-Merge-Sort

PMS1(n) = 2PMS1(n/2) + PM1(n)
              = 2PMS1(n/2) + Θ(n)
              = Θ(n lg n)                            Master Theorem Case 2

PMS(n) = PMS(n/2) + PM(n)
              = PMS(n/2) + Θ(lg2 n)        
              = Θ(lg3 n)                             Exercise 4.6-2

PMS1(n) / PMS(n) = Θ(n lg n) / Θ(lg3 n)
                             = Θ(n / lg2 n)

 

Comparison of parallelism: P-Merge-Sort using Serial and Parallel Merge

Very little difference for small values of n.

Serial merge with parallel sort does not improve parallelism as n increases, utilizing only a few processors for large n.

Parallel merge does improve parallelism as n increases, utilizing relatively more processors for large n.

Serial            T1(n) / T(n) = Θ(n lg n) / Θ( n ) = Θ( lg n )

                         lg 1024 = lg 210 = 10

                         lg 4294967296 = lg 232 = 32

Parallel          PMS1(n) / PMS(n) = Θ(n lg n) / Θ(lg3 n) = Θ(n / lg2 n)

                      1024/lg2 1024 = 1024/(lg 210)2 = 1024/100 = 10.2

         4294967296/lg2 4294967296 =
            4294967296/(lg 4294967296)2  =
            4294967296/(lg 232)2 =
            4294967296/322 =
            4294967296/1024 = 4194304

 

Examples

Analysis - Maximum of an array using divide-and-conquer - coded for MIT Cilk

Work    

T1(n) =  2T1(n/2) + Θ(1)
         = Θ( n )                          by Master Theorem Case 1

Span

T(n) = T(n/2) + Θ( 1 )           
                                                T(n/2) running 2 threads
         = Θ( lg n ) + Θ( 1 )          by Master Theorem Case 2
         = Θ( lg n )

Parallelism

T1(n) / T(n) = Θ( n ) / Θ( lg n ) = Θ( n / lg n )

Speedup - measured results

When executed for n=1000 and increasing
numbers of processors, the speedup follows
a near linear increase with the number
of processors.

Simulation shows near linear speedup and parallelism = 121.06     

Our parallelism analysis for n=1024 yields:

1024 / lg 210 = 102

 

 

#include <cilk-lib.cilkh>
#include <stdio.h>
#include <math.h>

cilk int Pmax(int A[], int l, int r) {
   int m, x, y;

   if(l>=r) return A[l];
   m=(int)floor((l+r)/2);

   x = spawn Pmax(A, l, m);
   y = spawn Pmax(A, m+1, r);

   sync;
   if (x > y) return x;
   else return y;
}

cilk int cilk_main(int argc, char *argv[]) {
   int A[8] = {6, 3, 5, 7, 8, 1, 4, 2};
   int x;

   x = spawn Pmax(A, 0, 7);
   sync;
   printf("maximum: %d\n", x);

   return 0;
}

Intel Cilk++ is slightly different from the examples of the text. Below is the same maximum solution and a parallel search using the Intel syntax, the search uses cilkview to display parallel execution measures. See the resources listed at the beginning of this page for documentation on Intel Cilk++.

Array maximum by divide-and-conquer
#include <cilk.h>
#include <stdio.h>
#include <math.h>

int Pmax(int A[], int l, int r) {
   int m, x, y;

   if(l>=r) return A[l];
   m=(int)floor((double)((l+r)/2));

   x = cilk_spawn Pmax(A, l, m);
   y = cilk_spawn Pmax(A, m+1, r);
   cilk_sync;

   if (x > y) return x;
   else return y;
}

int cilk_main(int argc, char *argv[]) {
   int A[8] = {6, 3, 5, 7, 8, 1, 4, 2};
   int x;

   x = cilk_spawn Pmax(A, 0, 7);

   cilk_sync;

   printf("maximum: %d\n", x);
   return 0;
}

 

Array search by parallel for
#include <cilk.h>
#include <cilkview.h>
#include <stdio.h>
#include <math.h>

int search(int A[], int x, int n) {
   int location=-1;

   cilk_for(int i=0;i<n; i++)
	if( A[i]==x) location = i;
   return location;
}

int cilk_main(int argc, char *argv[]) {
   int A[1000];
   int x;
   cilk::cilkview cv;

   for(int i=0;i<1000;i++)	A[i]=0;

   A[12]=13;
   A[789]=13;
   
   cv.start();
   x = search(A, 13, 1000);
   cv.stop();
   cv.dump("search results", true);
   printf("location: %d\n", x);
   printf("%f8 seconds\n", 
              cv.accumulated_milliseconds() / 1000.f );
   return 0;

Speedup as measured at execution.